climate-mirror / datasets

For tracking data mirroring progress

https://www1.ncdc.noaa.gov/pub/data/ #298

Open JeremiahCurtis opened 7 years ago

JeremiahCurtis commented 7 years ago

https://www1.ncdc.noaa.gov/pub/data/

Contains several directories not appearing on the ftp site

markuslaker commented 7 years ago

This looks like a huge repo with a very large number of files. I tried `du -sh' in lftp and gave up waiting after about three days. (Transatlantic lag doesn't help.) So I've been tackling the subdirectories piecemeal and in parallel, measuring their sizes individually and, where even that isn't practical, trying to estimate them by extrapolating from subsets of the data. The sizes I've worked out so far are below. I already have parallel lftp jobs running for all the unknown sizes, and I'll fill in the gaps when I know more. All I can say at this stage is that subdirectories with no sizes against them are likely to be large.
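The extrapolation mentioned above is simple arithmetic, and a rough sketch may make it concrete. This is only an illustration in Python, not the procedure actually used for the figures in this thread: the function name `extrapolate_total` and the numbers in the usage line are hypothetical, and it assumes the unmeasured items are roughly the same size on average as the measured ones.

```python
def extrapolate_total(sampled_sizes, total_items):
    """Estimate the total size of a directory from a measured subset.

    sampled_sizes: sizes (any consistent unit) of the items actually measured.
    total_items:   how many comparable items the directory holds in all.
    Assumption: unmeasured items average the same size as measured ones.
    """
    if not sampled_sizes:
        raise ValueError("need at least one measured size")
    if total_items < len(sampled_sizes):
        raise ValueError("total_items is smaller than the sample")
    measured = sum(sampled_sizes)
    mean = measured / len(sampled_sizes)
    unmeasured_count = total_items - len(sampled_sizes)
    # Keep the measured part exact; only the remainder is estimated.
    return measured + mean * unmeasured_count


# Hypothetical example: three yearly subdirectories measured, twenty in total.
estimate_gib = extrapolate_total([31.0, 28.5, 35.2], total_items=20)
```

An estimate like this is only as good as the sample, which is why the extrapolated figures are quoted as ranges or upper bounds rather than single numbers.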

Anyone who was able to run `du -sh' locally would save me a lot of time.

Any knowledge of which subdirectories to prioritise would also be helpful.

As for mirroring: I can do the smaller bits and pieces, but my domestic ADSL connection and 2009-era PC aren't beefy enough to handle a repository of this size. We should look for ways to break up the work once we know how much is there. Watch this space.
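One possible way to break up the work, once sizes are known, is to hand the largest directories out first, always to whoever has the least assigned so far. The sketch below is an assumption about how that could be done, not something agreed in this thread; `split_work` and the idea of a fixed number of volunteers are hypothetical. The sizes in the usage example are the measured GiB figures reported for those directories.

```python
import heapq


def split_work(sizes, n_workers):
    """Greedily assign directories to workers, largest first.

    sizes:     dict mapping directory name -> size (any consistent unit).
    n_workers: number of volunteers (hypothetical parameter).
    Returns a list of name lists, one per worker.
    """
    if n_workers < 1:
        raise ValueError("need at least one worker")
    # Min-heap of (current load, worker index): the least-loaded worker pops first.
    heap = [(0, i) for i in range(n_workers)]
    heapq.heapify(heap)
    assignments = [[] for _ in range(n_workers)]
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(heap)
        assignments[idx].append(name)
        heapq.heappush(heap, (load + size, idx))
    return assignments


# Measured sizes in GiB for five of the larger directories.
big = {"ua": 609, "globaldatabank": 232, "paleo": 215, "uscrn": 201, "cpo": 175}
plan = split_work(big, n_workers=2)
```

This greedy split does not guarantee a perfectly even balance, and it ignores differences in each volunteer's bandwidth and disk space, which would matter in practice.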

15min_precip-3260: 261MiB, measured
109020: 154MiB, measured
ASOS_Station_Photos: 212MiB, measured
EngineeringWeatherData_CDROM: 96MiB, measured
Impact: 6.1GiB, measured
access.del: nothing but stored error messages
aewc-v1: 1.2GiB, measured
airsea: 2.1MiB, measured
annualreports: 76MiB, measured
anomalie: 533MiB, measured
anomalies: 525KiB, measured
asos-fivemin: below 200GiB, extrapolated
asos-onemin: between 400GiB and 900GiB, extrapolated
blizzard: 4.6MiB, measured
ccd-data: 9.9MiB, measured
cdmp: 6.2GiB
cdo: 6.4MiB, measured
cirs: 1.4GiB, measured
climgrid: empty
cmb: 18GiB, measured
coastal: 6.3GiB, measured
cpo: 175GiB, measured
crdr: 21MiB, measured
documentlibrary: 305MiB, measured
download: 28GiB, measured
ecosystems: 253MiB, measured
extremeevents: 590MiB, measured
gcos: 752MiB, measured
ghcn: 4.1TiB, measured by donbright
globaldatabank: 232GiB, measured
gpcp: 2.2GiB, measured
gridded-nw-pac: 1.8GiB, measured
gruan: 28GiB, measured
gsn: 85MiB, measured
gsod: 7.2GiB, measured by donbright
hazards: 64MiB, measured
hidden: 4.6MiB, measured
homr: 66KiB, measured
hourly_precip-3240: 269MiB, measured
hpd: 14GiB, measured
igra: 98GiB, measured
images: 2.8GiB, measured
inventories: 202MiB, measured
ish: 3.7MiB, measured
ispd: 37GiB, measured
john: 357KiB, measured
jrennie: 397MiB, measured
lcd: 1014MiB, measured
madis: empty
mcdw: 621MiB, measured
metadata: 657MiB, measured
mlost: 108MiB, measured
ncep_gts: 3.2GB, measured
news media: 87MiB
nidis: 5.4GiB, measured
noaa: 224GiB, measured by donbright
noaaglobaltemp: 56MiB, measured
normals: 16GiB, measured
nsrdb-solar: 60GiB, measured
nwshly: 985MiB, measured
oi-daily: 36MiB, measured
oisst: 1.4GiB, measured
olstore: 1.9MiB, measured
paleo: 215GiB, measured
papers: 824MiB, measured
pmorpts_py: 1.9MiB, measured
radar: 4.2GiB, measured
ratpac: 25MiB, measured
req201509: 32MiB, measured
satellite:
scpub201506: 515MiB, measured
sds: 13GiB, but with `403 Forbidden' at sds/cdr/Scripts/
sensor_study: 355MiB, measured
snowmonitoring: 650MiB, measured
software: 56MiB, measured
special: 118MiB, measured
stations: 323MiB, measured
swdi: 53GiB, measured by donbright
techrpts: 443MiB, measured
ua: 609GiB, measured
uscrn: 201GiB, measured
ushcn: 765MiB, measured
usp: 487MiB, measured
usrcrn: 2.1GiB, measured
vosclim: 6.5GiB, measured
w_pacific_typhoon_aircraft_fixes: 893KiB, measured
wct: 19GiB, measured
williams: 1.7GiB, measured
wksst: 105MiB, measured
ww-ii-data: 17MiB, measured
wwr: 107MiB, measured

Not in subdirectories: about 400MiB.
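For anyone who wants a running total of what has been sized so far, entries written as `name: size, note` can be parsed and summed. The following is a minimal Python sketch under stated assumptions: the helper `parse_measured` is hypothetical, it expects one entry per line, and it deliberately skips entries that are empty, bounded ("below ...", "between ..."), or not written with an exact `KiB`/`MiB`/`GiB`/`TiB` unit, so the result is a lower bound only.

```python
import re

# Binary unit multipliers, in bytes.
UNITS = {"KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

# Matches "name: 12.3GiB ..." where the number follows the colon directly.
SIZE_RE = re.compile(
    r"^(?P<name>[^:]+):\s*(?P<num>\d+(?:\.\d+)?)(?P<unit>[KMGT]iB)\b"
)


def parse_measured(lines):
    """Return {directory name: size in bytes} for entries with an exact size."""
    sizes = {}
    for line in lines:
        m = SIZE_RE.match(line.strip())
        if m:  # bounded, empty, or oddly-united entries are skipped
            sizes[m["name"]] = float(m["num"]) * UNITS[m["unit"]]
    return sizes


# A few entries as they are written in this thread.
sample = [
    "airsea: 2.1MiB, measured",
    "climgrid: empty",
    "asos-fivemin: below 200GiB, extrapolated",
    "cmb: 18GiB, measured",
]
known = parse_measured(sample)
total_gib = sum(known.values()) / UNITS["GiB"]
```

Because the extrapolated and still-unknown directories are left out, the real total will be larger than whatever this sum reports.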

donbright commented 7 years ago

@markuslaker your counts match my counts. . . . i have info for some of those gaps

4.1T  ghcn
7.2G  gsod
224G  noaa
53G   swdi

my lftp died after a few days(!) of running so i don't have any more info

gabefair commented 7 years ago

I have created a new ticket for the /ghcn data: #331, assuming it's the same data.

gabefair commented 7 years ago

You can find the pub/data/normals/ data at #286