climate-mirror / datasets

For tracking data mirroring progress

NCDC FTP Site #112

Open nickrsan opened 7 years ago

nickrsan commented 7 years ago

Name: NOAA Full NCDC Site
Organization: NOAA NCDC
Description URL:
Download URL: https://www1.ncdc.noaa.gov/pub/data/
File Types:
Size:
Status: In progress - mirroring by non-GitHub user

NCEI pub data mirrored by Azimuth Project: https://bitbucket.org/azimuth-backup/azimuth-inventory/issues/40/noaa-ncei-complete-pub-directory

bkirkbri commented 7 years ago

nClimGrid subset is issue #116

ftp://ftp.ncdc.noaa.gov/pub/data/climgrid/

bkirkbri commented 7 years ago

Local Climatological Data (LCD) subset is issue #117

ftp://ftp.ncdc.noaa.gov/pub/data/lcd/

bkirkbri commented 7 years ago

Paleoclimatology subset is issue #17

ftp://ftp.ncdc.noaa.gov/pub/data/paleo/

ghost commented 7 years ago

We received a report that https://www1.ncdc.noaa.gov/pub/ is 12 TB. We grabbed 620 GB, but we don't know the source of the report, one Nick Gregory, or why he would know the size. Can anyone vouch for that number?

ghost commented 7 years ago

I just did a full directory walk of ncdc.noaa.gov/pub and got 29.620 TB, 1325435 files, and 11686 folders. Thanks for any help people intended.

We cannot mirror all of this ourselves. Does it make sense to divide it up? Please send advice to climate -at- mm -dot- st. Thanks!
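A minimal sketch of the kind of directory walk described above, tallying bytes, files, and folders. The thread does not say which tool produced the totals, so everything here is an assumption: `walk_totals`, `Totals`, and the `list_dir` callback are hypothetical names, and `FAKE_TREE` is made-up data standing in for the real server. In practice `list_dir` would wrap whatever client is used to list a remote directory.

```python
from typing import Callable, Iterable, NamedTuple, Tuple

# (name, is_dir, size_in_bytes) for one directory entry.
# The size value is ignored for directories.
Entry = Tuple[str, bool, int]


class Totals(NamedTuple):
    total_bytes: int
    files: int
    folders: int


def walk_totals(root: str, list_dir: Callable[[str], Iterable[Entry]]) -> Totals:
    """Walk the tree below root and tally bytes, files, and folders."""
    total_bytes = files = folders = 0
    pending = [root]  # explicit stack, so deep trees do not hit the recursion limit
    while pending:
        path = pending.pop()
        for name, is_dir, size in list_dir(path):
            if is_dir:
                folders += 1
                pending.append(path.rstrip("/") + "/" + name)
            else:
                files += 1
                total_bytes += size
    return Totals(total_bytes, files, folders)


# Hypothetical in-memory listing; not real NCDC data.
FAKE_TREE = {
    "/pub": [("data", True, 0), ("README.txt", False, 100)],
    "/pub/data": [("lcd", True, 0), ("a.csv", False, 2000)],
    "/pub/data/lcd": [("b.csv", False, 30000)],
}

if __name__ == "__main__":
    print(walk_totals("/pub", lambda p: FAKE_TREE.get(p, [])))
```

Passing the listing function in as a parameter keeps the tally logic independent of the protocol, so the same walk could sit on top of an FTP or HTTP directory lister.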

bkirkbri commented 7 years ago

I posted some subsets above. I agree it's best to break up what's left. I can claim some of them. Do you have sizes for top-level directories?

Thanks!

ghost commented 7 years ago

I can get these tomorrow. Tracking as Azimuth Backup Kickstarter Project Issue #77.

ghost commented 7 years ago

I am awaiting the /pub/data total but here, in the interim, is what I have. It's been running since mid-afternoon.

Note that this is probably a lower bound. I received a number of 500 error codes while running du against these directories, so some file sizes were missed. I will update when I have the final figure. The number above for /pub/data was another 30 TB, but we'll see.

Attachment: noaa-ncdc-ncei-ftp-subdir-sizes-2017-01-17_170639
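To illustrate why errors during the run make the reported sizes a lower bound, here is a hedged sketch of a per-subdirectory tally that skips any directory whose listing fails and counts the failures. It is not the du run used above. `subdir_sizes`, `list_dir`, and `fake_list_dir` are hypothetical names, the tree is made-up data, and `OSError` merely stands in for whatever exception a real client raises on a 500 response.

```python
from typing import Callable, Dict, Iterable, Tuple

Entry = Tuple[str, bool, int]  # (name, is_dir, size_in_bytes)


def subdir_sizes(
    root: str, list_dir: Callable[[str], Iterable[Entry]]
) -> Tuple[Dict[str, int], int]:
    """Return ({top-level subdirectory: bytes seen}, number of failed listings)."""
    sizes: Dict[str, int] = {}
    failures = 0
    base = root.rstrip("/")
    for name, is_dir, _size in list_dir(root):
        if not is_dir:
            continue  # files directly under root belong to no subdirectory
        sizes[name] = 0
        pending = [base + "/" + name]
        while pending:
            path = pending.pop()
            try:
                entries = list(list_dir(path))
            except OSError:
                # Assumption: a server error surfaces as OSError. Everything
                # below this path is missed, so the total undercounts.
                failures += 1
                continue
            for child, child_is_dir, size in entries:
                if child_is_dir:
                    pending.append(path + "/" + child)
                else:
                    sizes[name] += size
    return sizes, failures


# Hypothetical listing with one directory that always fails.
FAKE_TREE = {
    "/pub/data": [("lcd", True, 0), ("paleo", True, 0), ("index.html", False, 10)],
    "/pub/data/lcd": [("a.csv", False, 500)],
    "/pub/data/paleo": [("ok.txt", False, 70), ("broken", True, 0)],
}


def fake_list_dir(path: str):
    if path == "/pub/data/paleo/broken":
        raise OSError("simulated 500 from server")
    return FAKE_TREE.get(path, [])


if __name__ == "__main__":
    print(subdir_sizes("/pub/data", fake_list_dir))
```

A nonzero failure count is the signal that the sizes should be read as "at least this much", and the failed paths could be retried later to tighten the estimate.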

mejackreed commented 7 years ago

I can potentially grab some. What is left?

ghost commented 7 years ago

The FTP site remains in a "being copied" state. That said, it is not clear exactly how far along we are. We do have 3.9 TB of it.

mejackreed commented 7 years ago

Ok, let me know if you need me to grab anything specific.

ghost commented 7 years ago

@mejackreed I think someone should make a run at Climate Mirror issue #42. No one, as far as I know, has even started it. We made a start, but it's really incomplete, and the server does not always cooperate. I don't know if we are being throttled or what. I was/am trying:

wget -N -c --dns-timeout=10 --connect-timeout=300 --read-timeout=120 --wait=5 --mirror -e robots=off --random-wait --page-requisites --retry-connrefused --prefer-family=IPv4 --tries=40 --timestamping=on --recursive --level=8 --no-remove-listing --follow-ftp -nv --mirror --append-output=daac-ornl-gov-get-data.log --no-check-certificate https://daac.ornl.gov/
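The command above repeats several options: `--mirror` appears twice, and per the wget manual it is equivalent to `-r -N -l inf --no-remove-listing`, so `-N`, `--timestamping=on`, `--recursive`, and `--no-remove-listing` add nothing. Below is a sketch of assembling a de-duplicated argument list from a script. `build_wget_command` is a hypothetical helper, not part of any mirroring tool used in this thread. It also leaves out `--level=8` on the assumption that the infinite depth from `--mirror` is what is wanted; add it back if a depth cap is intended.

```python
import shlex
from typing import List


def build_wget_command(url: str, log_file: str) -> List[str]:
    """Build a wget argument list for a polite, resumable mirror run."""
    return [
        "wget",
        "--mirror",              # implies -r -N -l inf --no-remove-listing
        "--continue",            # long form of -c
        "--dns-timeout=10",
        "--connect-timeout=300",
        "--read-timeout=120",
        "--wait=5",
        "--random-wait",
        "--tries=40",
        "--retry-connrefused",
        "--execute", "robots=off",   # long form of -e robots=off
        "--page-requisites",
        "--prefer-family=IPv4",
        "--follow-ftp",
        "--no-verbose",          # long form of -nv
        "--append-output=" + log_file,
        "--no-check-certificate",
        url,
    ]


if __name__ == "__main__":
    cmd = build_wget_command("https://daac.ornl.gov/", "daac-ornl-gov-get-data.log")
    # Print the command for review; pass cmd to subprocess.run() to execute it.
    print(shlex.join(cmd))
```

Keeping the arguments as a list avoids shell quoting problems, and printing the command before running it makes it easy to compare against what was pasted in the thread.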

JeremiahCurtis commented 7 years ago

Sorry, I'm a newcomer here (been writing a book that is taking some time), but I was just wondering whether anyone is working on ftp://ftp.nodc.noaa.gov/pub/. Thanks.

JeremiahCurtis commented 7 years ago

I'm willing to grab whatever is needed to get a complete mirror, if anyone has any idea where we stand on this. Thanks.