climate-mirror / datasets

For tracking data mirroring progress

SciTech Connect from the DOE #315

Open schwartzray opened 7 years ago

schwartzray commented 7 years ago

SciTech Connect links to freely available full text - over 500,000 objects.

“SciTech Connect Full-Text MARC Records include all the records from SciTech Connect that contain links to freely available full-text. For the purpose of this service, full-text includes textual material, multimedia files, and datasets. The subject disciplines covered by SciTech Connect include physics, chemistry, materials, biology, environmental sciences, energy technologies, engineering, computer and information science, renewable energy, and other topics of interest related to the DOE mission.” https://www.osti.gov/home/marcrecords.html

StephWo commented 7 years ago

mirrored at http://176.9.83.61/315 hash: http://176.9.83.61/315/315_hashdeep.txt

Filesize: 304 MByte

Probably low priority, but also small size
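For anyone who wants to check a downloaded copy against the published hash file, here is a minimal sketch in Python. It assumes the manifest is in hashdeep's default output format (a `%%%%` header line naming the columns, `##` comment lines, then one comma-separated line per file) and that it includes a `sha256` column; neither has been confirmed for 315_hashdeep.txt. The function names are hypothetical.

```python
import hashlib


def parse_hashdeep(text):
    """Yield one dict per file entry, keyed by the column names in the header."""
    columns = None
    for line in text.splitlines():
        if line.startswith("%%%% ") and "filename" in line:
            # Header such as "%%%% size,md5,sha256,filename"
            columns = line[5:].strip().split(",")
        elif line and not line.startswith(("%", "#")) and columns:
            # The filename comes last and may itself contain commas.
            values = line.split(",", len(columns) - 1)
            yield dict(zip(columns, values))


def file_matches(entry, path):
    """Compare a local file with one manifest entry (size plus SHA-256).

    Assumes the manifest has "size" and "sha256" columns.
    """
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
            size += len(chunk)
    return size == int(entry["size"]) and digest.hexdigest() == entry["sha256"]
```

Calling `file_matches(entry, local_path)` for each entry from `parse_hashdeep(manifest_text)` gives a quick yes/no per file; mapping the manifest's recorded paths onto local paths is left to the caller.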

schwartzray commented 7 years ago

But BauerPiepenbrink, have the 500,000+ links within the file been harvested?

StephWo commented 7 years ago

good point. didn't even think of opening it. not a huge fan of unknown zip file sources.

Let's see if that is going to make a new issue

StephWo commented 7 years ago

The links from the MARC file cannot be processed automatically and don't point to downloadable content. From 5 documents I pulled manually I got two invoices, one paper about neutrino scattering and two related best-practice guides for stainless steel pipes in wastewater extraction. Still not sure about the priority of this one...

markuslaker commented 7 years ago

Let's flesh out @BauerPiepenbrink's contributions a bit. The file, after unzipping, is in MARC format. After a quick sudo aptitude install yaz on Debian, it becomes possible to run yaz-marcdump -o line allRecords.mrc, whose output looks, in part, like this:

msl@james:~/downloads/climate-mirror/www_osti_gov_MARC/extracted$ yaz-marcdump -o line allRecords.mrc | regrep '^[5-8]' | regrep -m 1 -B 99 ^8
500    $a Published through SciTech Connect.
500    $a 11/06/2001.
500    $a "dview"
500    $a " 005067mltpl00"
500    $a Dobos, Aron; Christensen, Craig; Horowitz, Scottt; Jerome, Eric; Kasberg, Michel; Janzou, Steve.
520 3  $a DView is a time series data visualization tool that provides several different ways to plot time series datasets. It is particularly well suited for browsing the results of energy systems simulation programs such as BeOpt, SAM, and PVWatts.
710 2  $a National Renewable Energy Laboratory (U.S.). $4 res
710 1  $a United States. $b Dept. of Energy. $4 spn
710 1  $a United States. $b Dept. of Energy. $b Office of Scientific and Technical Information. $4 dst
720 1  $a Dobos, Aron $4 aut
720 1  $a Christensen, Craig $4 aut
720 1  $a Horowitz, Scottt $4 aut
720 1  $a Jerome, Eric $4 aut
720 1  $a Kasberg, Michel $4 aut
720 1  $a Janzou, Steve $4 aut
856 40 $u http://www.osti.gov/scitech/biblio/1334276
msl@james:~/downloads/climate-mirror/www_osti_gov_MARC/extracted$

A program could extract some kind of author and title from this lot. Unfortunately, http://www.osti.gov/scitech/biblio/1334276 is a link to a download page, not to the document itself. To download, you have to answer twenty questions and submit a request, which gets processed by hand.
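As a rough illustration of the extraction idea, here is a minimal Python sketch that reads lines in the `yaz-marcdump -o line` layout shown above and pulls out the authors (720 `$a` with relator `aut`), the abstract (520 `$a`) and the link (856 `$u`). The function names are hypothetical. Field 245 is the standard MARC title field, but it does not appear in the filtered excerpt, so its presence in these records is an assumption. The subfield splitting is naive and would be confused by a value containing something like "$5 " in running text.

```python
import re
from collections import defaultdict

# A data-field line: three-digit tag, whitespace, then indicators and subfields.
LINE_RE = re.compile(r"^(\d{3})\s+(.*)$")
# One subfield: "$" plus a one-character code, then text up to the next subfield.
SUBFIELD_RE = re.compile(r"\$(\w)\s+(.*?)(?=\s+\$\w\s|$)")


def parse_fields(lines):
    """Return {tag: [{subfield_code: value, ...}, ...]} for one record's lines."""
    fields = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match:
            continue  # leader, shell prompt, or blank line
        tag, rest = match.groups()
        # If a code repeats within a field (e.g. two $b), the last value wins.
        subfields = dict(SUBFIELD_RE.findall(rest))
        if subfields:
            fields[tag].append(subfields)
    return fields


def summarize(lines):
    """Pick out title, authors, abstract and URLs from one record."""
    fields = parse_fields(lines)
    titles = [f["a"] for f in fields.get("245", []) if "a" in f]
    abstracts = [f["a"] for f in fields.get("520", []) if "a" in f]
    return {
        "title": titles[0] if titles else None,  # 245 assumed, not seen above
        "authors": [f["a"] for f in fields.get("720", [])
                    if "a" in f and f.get("4") == "aut"],
        "abstract": abstracts[0] if abstracts else None,
        "urls": [f["u"] for f in fields.get("856", []) if "u" in f],
    }
```

Feeding it the lines of a single record returns a small dictionary; splitting the full dump into records is left out here because the record separator is not visible in the excerpt.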

There are other resources, such as https://www.osti.gov/scitech/biblio/4124693, that are available immediately and anonymously, but they're clearly designed to defeat automated downloading. I therefore question the ethics of doing so, even if it should prove possible. Rather than trying to circumvent the user interface and grab the data, we might try asking nicely whether we can mirror it for them.

StephWo commented 7 years ago

@markuslaker Don't get me wrong, I don't want to make any decisions about priorities or the importance of any issues, and of course asking nicely to mirror that site is a great way to go. But I thought we were trying to get as much climate data as we can before it might get altered. I haven't found a single thing about greenhouse gases in any of the documents I've seen from this source yet.

The two examples you give are, respectively, a link to a data visualization tool (which I guess is why you have to answer all those questions and have the request processed manually) and an original report on centrifuge efficiency, probably written at Los Alamos during the Manhattan Project or at Oak Ridge. GREAT stuff, and I will add it to my personal collection because I really love it.

It just seems like a lot of effort to save the whole database because there might be climate-related documents somewhere in there, while we have so many issues which are still unmirrored and consist of pure climate science data.

Just sayin.

gabefair commented 7 years ago

1 vote to put on hold.