climate-mirror / datasets

For tracking data mirroring progress

Oak Ridge National Laboratory DAAC #316

Open JeremiahCurtis opened 7 years ago

JeremiahCurtis commented 7 years ago

Would anyone be interested in posting the various ORNL datasets as separate issues with size estimates (available from ORNL's order pages)? I'm worried that we're missing large portions of ORNL data; maybe I'm wrong.

https://daac.ornl.gov/get_data.shtml

Typical issue could read:

ISLSCP Initiative II (ISLSCP II) Data Sets

https://daac.ornl.gov/cgi-bin/dataset_lister.pl?p=29

198843.59 MB
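For anyone posting many of these, a minimal sketch of a formatter for the issue text follows. The function name `format_issue` and the decimal MB-to-GB conversion are assumptions made for illustration; the thread does not say whether ORNL counts in decimal or binary units.

```python
def format_issue(name: str, url: str, size_mb: float) -> str:
    """Build the three-line issue text: dataset name, listing URL, size."""
    # Assumption: 1 GB = 1000 MB (decimal units).
    size_gb = size_mb / 1000
    return f"{name}\n{url}\n{size_mb:.2f} MB (about {size_gb:.1f} GB)"


print(format_issue(
    "ISLSCP Initiative II (ISLSCP II) Data Sets",
    "https://daac.ornl.gov/cgi-bin/dataset_lister.pl?p=29",
    198843.59,
))
```

The size still has to be read off ORNL's order pages by hand; this only keeps the posted issues consistent.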

I would do this myself, but I am working on several downloads that have gone sideways.

ghost commented 7 years ago

Azimuth Backup has taken a careful look at supplementing our coverage of this site. You are correct, we do not have this data at present. Gathering of this data is hindered by a requirement that the user register on the site, and define a passcode. Thereafter, signing in is a prerequisite for downloading any data. While the "dataset identifiers" for all datasets are known (to us, and anyone else who cares to know them), trying to download them directly using httrack, or wget, or even Lynx, using the username and passcode fails.

The only way we now know of gathering these is to bring up a browser on a server with a lot of disk space, and to work down the list manually.

Alternatively, a product designed to test Web pages, or provide general scripting, like AutoIt (see https://www.autoitscript.com/site/) or Sikuli (see http://www.sikuli.org/), could do this, and we may take a crack at using Sikuli.

But for now, we have nothing.

(Sent by email in reply to JeremiahCurtis's post of Mon, Feb 6, 2017, at 10:46.)

JeremiahCurtis commented 7 years ago

I have found that ordering all datasets within a particular collection (say, checking all four datasets on the hydroclimatology collections page at https://daac.ornl.gov/cgi-bin/dataset_lister.pl?p=10) and then adding them to the cart yields the total size and number of files within said collection. After one receives a link at the email address attached to one's Earthdata account, clicking the link takes one immediately to an HTTPS directory of all datasets contained within the collection.

A good HTTPS download manager gets everything in the HTTPS directory provided. Internet Download Manager (IDM) is working just fine for me; no problems with the site grabber. Under "tasks" in the upper left-hand corner, just click on "run site grabber" and enter the HTTPS URL (e.g. https://daac.ornl.gov/orders/902e67515b706a52bbf9e4d60e798429/) as the start page on the grabber. No authorization is needed if you're logged into ORNL on the same connection as the IDM grabber.

I found that when logging into ORNL via Earthdata, it helps to choose the option on the login page to stay logged in (on a home server or something of the sort); this seems to avert login problems with IDM (I don't know about others such as httrack). Logging into ORNL from Earthdata, opening the email links to the datasets, and running the download manager/site grabber all from the same connection (while staying logged into ORNL) seems to do the trick.

Also, when running the grabber, make sure to give the project a unique name and save the project AFTER it has started running, in order to be able to resume downloading the directory from the last point before an unexpected shutdown. I learned this the hard way when beginning to use IDM.
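For those not using IDM, resuming a partial file can be sketched with a plain HTTP Range request. This is not a description of how IDM resumes; it is a generic sketch that assumes the server honours Range requests, and it leaves out the ORNL/Earthdata login entirely. The names `build_resume_request` and `resume_download` are made up for illustration.

```python
import os
import urllib.request


def build_resume_request(url: str, partial_path: str) -> urllib.request.Request:
    """Return a request that asks only for the bytes not yet on disk."""
    already = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    req = urllib.request.Request(url)
    if already > 0:
        # Standard HTTP Range header; a supporting server replies with 206.
        req.add_header("Range", f"bytes={already}-")
    return req


def resume_download(url: str, partial_path: str, chunk: int = 1 << 20) -> None:
    """Append the remaining bytes to partial_path (no retries, no auth)."""
    req = build_resume_request(url, partial_path)
    with urllib.request.urlopen(req) as resp:
        # 206: the server sent only the missing range, so append.
        # 200: the server ignored the range and sent the whole file, so overwrite.
        mode = "ab" if resp.status == 206 else "wb"
        with open(partial_path, mode) as out:
            while True:
                block = resp.read(chunk)
                if not block:
                    break
                out.write(block)
```

If the file on disk is already complete, a server will typically answer with an error (416) rather than data, so a real script would want to catch that case.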

I am working on the Regional/Global collections right now. I did select "all files of the website except web pages and images", so the "guide" folders within the dataset directories do not contain the HTML guides, but all the "data" folders match so far.
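The "all files except web pages and images" filter can be approximated outside IDM. The sketch below parses the HTML of a directory listing and keeps only links under the order directory that are not pages or images. The extension list, the class name `LinkCollector`, and the function `data_links` are assumptions for illustration; the actual layout of ORNL's order directories is not described in this thread.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Extensions treated as "web pages and images"; this list is an assumption.
SKIP_EXTENSIONS = (".html", ".htm", ".png", ".jpg", ".jpeg", ".gif")


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags in a directory-listing page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.hrefs.append(value)


def data_links(listing_html: str, base_url: str) -> list:
    """Return absolute URLs under base_url, skipping pages and images."""
    parser = LinkCollector()
    parser.feed(listing_html)
    keep = []
    for href in parser.hrefs:
        if href.startswith("?"):
            continue  # column-sort links found in some server listings
        absolute = urljoin(base_url, href)
        if not absolute.startswith(base_url):
            continue  # parent directory or off-site link
        if absolute.lower().endswith(SKIP_EXTENSIONS):
            continue  # web pages and images
        keep.append(absolute)  # files, plus subdirectories ending in "/"
    return keep
```

Subdirectory links (those ending in "/") are returned as-is, so a full mirror would call `data_links` again on each of them. As with the site grabber, this skips the HTML guides.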

Finished:

River Discharge (RIVDIS) 1.4 MB
Net Primary Productivity (NPP) 63 MB
Hydroclimatology Collections 433 MB

Running:

ISLSCP II 198 GB
Climate Collections 495 MB

On deck:

Vegetation Collections 144 GB
VEMAP 14 GB
Soil Collections 1.9 GB

One important note: I am working from a home server, so my connection speed is much slower than what some others have reported here on GitHub. I rarely hit 1 MB/s total across all the applications I'm running (wget, IDM, FileZilla, etc.), and I have trouble getting much more than 10-20 GB per day, so if someone wants to jump in on ISLSCP II and the vegetation collections, it would help immensely.
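To put that rate in perspective, here is a rough estimate of how long ISLSCP II and the vegetation collections would take at 10-20 GB per day. It uses the full sizes quoted above and ignores whatever has already been downloaded, so treat it as an upper bound; the helper name `days_to_mirror` is made up for illustration.

```python
def days_to_mirror(size_gb: float, gb_per_day: float) -> float:
    """Days needed to pull size_gb at a steady gb_per_day."""
    if gb_per_day <= 0:
        raise ValueError("gb_per_day must be positive")
    return size_gb / gb_per_day


# Sizes quoted above: ISLSCP II (198 GB) plus the vegetation collections (144 GB).
remaining_gb = 198 + 144

for rate in (10, 20):  # the 10-20 GB per day range reported above
    print(f"{rate} GB/day: about {days_to_mirror(remaining_gb, rate):.0f} days")
```

That works out to roughly 17 to 34 days for these two collections alone on a single home connection.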

If the ORNL data is indeed some of the most immediately threatened, I think we should make this a high-priority issue. It seems like we have some issues with copious participants and others with very few; that's probably inevitable on a project of this magnitude.

markuslaker commented 7 years ago

I'm concerned about the ethics, and possibly the legality, of taking and publishing data that the owner has chosen to place behind a registration wall. Perhaps we should instead approach ORNL and talk to them about mirroring the data outside the US.

JeremiahCurtis commented 7 years ago

As far as I know, the ORNL servers are publicly funded, and there is no charge for requesting the data. My guess is that registration is required in order to limit the possibility of overloading ORNL's servers. We are not publishing data; we're distributing it. There is a difference.

JeremiahCurtis commented 7 years ago

FWIW, the data from field campaigns add up to just under 5TB:

157600 files totalling 4782436.84 MB (655 datasets)

I will try to get all the regional/global collections.

If we had ample connections, we could possibly get the field campaigns. On a confounded slow home connection, I've pulled over 12 GB in about 13 hours from the regional/global collections.

Given that ORNL orders are limited to 1 TB per request, and that the HTTPS links provided in the requests remain available for one week, I would guess that ORNL's servers could hypothetically permit 1 TB of downloads to a single server in that timeframe. Of course, I could be way off base here; maybe someone else would know?
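A back-of-the-envelope check of those numbers, as a sketch: it assumes decimal units (1 TB = 1,000,000 MB) and assumes orders can be packed right up to the 1 TB limit, neither of which is confirmed here. The function names are made up for illustration.

```python
import math

SECONDS_PER_WEEK = 7 * 24 * 3600
MB_PER_TB = 1_000_000  # decimal units; an assumption about how ORNL counts


def required_rate_mb_s(size_mb: float, seconds: float) -> float:
    """Sustained rate in MB/s needed to move size_mb within seconds."""
    return size_mb / seconds


def orders_needed(total_mb: float, limit_mb: float = MB_PER_TB) -> int:
    """Minimum number of orders if each order is capped at limit_mb."""
    return math.ceil(total_mb / limit_mb)


# Rate needed to finish one full 1 TB order before its link expires.
print(f"1 TB in a week: {required_rate_mb_s(MB_PER_TB, SECONDS_PER_WEEK):.2f} MB/s")

# Rate observed above: about 12 GB in about 13 hours.
print(f"observed: {required_rate_mb_s(12_000, 13 * 3600):.3f} MB/s")

# Field campaigns: 4782436.84 MB in total.
print(f"field campaign orders: at least {orders_needed(4782436.84)}")
```

So one full order per week needs a sustained rate of about 1.65 MB/s, against roughly 0.26 MB/s seen on my connection, and the field campaigns would take at least five orders.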

markuslaker commented 7 years ago

@JeremiahCurtis, you're doing a great job here, an important job, and I don't want to derail it by arguing. I won't talk about my qualms any further. But I want to stay squeaky clean, and so I'll concentrate what resources I have on other repositories that offer unambiguously open access. I'll see you in other Climate Mirror issues. :-)

JeremiahCurtis commented 7 years ago

Thanks, no offense taken. :-)

FWIW, has anyone tried reaching out to ORNL for copies of the DAAC data?

Edit: Internet Download Manager is starting to crap out at 99.99% on a lot of the downloads. Reloading the grabber and moving the temp files doesn't seem to work. I'm going to have to try something else, it seems.

JeremiahCurtis commented 7 years ago

Finished ISLSCP II (minus the various "guides" subfolders, which appear to contain various HTML links with no appreciable data). With my limited connection speed (about 600-800 KB/s), I am going to try to fill in some of the gaps on the newftp issue.