climate-mirror / datasets

For tracking data mirroring progress

Carbon Dioxide Information Analysis Center (CDIAC) FTP Archive #291

Closed: donbright closed this issue 1 year ago

donbright commented 7 years ago

Responsible Agency: Department of Energy (DOE)
Agency division: Oak Ridge National Laboratory (ORNL)
Agency subdivision: Carbon Dioxide Information Analysis Center (CDIAC)
Datasets Bulk Download Link: ftp://cdiac.ornl.gov
Data Type: from /README: "This archive contains selected data sets relevant to studies of greenhouse gases and climate."

From the CDIAC website: "CDIAC's data holdings include estimates of carbon dioxide emissions from fossil-fuel consumption and land-use changes; records of atmospheric concentrations of carbon dioxide and other radiatively active trace gases; carbon cycle and terrestrial carbon management datasets and analyses; and global/regional climate data and time series."

Sizes:

13G     pub   
4.0G    pub10
8.8G    pub11                                                                  
8.0G    pub12                                                                  
8.4G    pub2                                                                   
5.3G    pub4                                                                   
62G     pub6
26G     pub8                                                                   
5.9G    pub9                                                                   
140G    total

Related issues: The corresponding website for CDIAC is held by archive.org, as described in issue #7. This issue might contain the datasets in #225 and #153, which are apparently accessing an FTP site via HTTP.

RoundWorld commented 7 years ago

in progress

blueacid commented 7 years ago

Mirroring

ivanstegic commented 7 years ago

In progress, and will mirror too.

donbright commented 7 years ago

Offline copy complete.

Command used: lftp -e "mirror --parallel=8 -v" ftp://cdiac.esd.ornl.gov
Total: 4320 directories, 160307 files, 432 symlinks                            
New: 160307 files, 432 symlinks
149838645735 bytes transferred in 507434 seconds (288.4 KiB/s)
167 errors detected
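
For reference, a hedged retry sketch: re-running the same mirror with -c (continue) should skip files that already transferred and resume partial ones, though whether it clears those 167 errors depends on whether they were server-side failures.

lftp -e "mirror -c --parallel=8 -v" ftp://cdiac.esd.ornl.gov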

Update per @bkirkbri's request, find/md5sum hash: d2c070088f3809259e5ee68b9c45d296 (2/2: added sort -k 2, per other issue comments)

find cdiac.esd.ornl.gov/ -type f -exec md5sum {} \; > /tmp/ff
cat /tmp/ff | grep -v .listing > /tmp/ff2
cat /tmp/ff2 | sort -k 2 > /tmp/ff3
cat /tmp/ff3 | md5sum


hashdeep: offline, contact me if the data is needed (90 megabytes)

bkirkbri commented 7 years ago

@donbright Thanks! Please post the output of find cdiac.esd.ornl.gov/ -type f -exec md5sum {} \; | grep -v \\.listing | md5sum when you can. @ivanstegic @RoundWorld @blueacid Please update with status and md5sum when your mirrors complete. Thanks!

bkirkbri commented 7 years ago

This is a superset of #225, which has been mirrored. Also a subset of #153, which has been closed.

blueacid commented 7 years ago

I'm only 41GB in... I think being in the UK means the overall transfer speed drops right off when handling many smaller files; the latency to start each file transfer becomes significant. Combined with the fact that these servers often limit connections to 1-2 per IP, I go a bit slower than most. It's a pity; my 100 Mbit connection is trundling along at around 20-25 Mbit utilisation.

bkirkbri commented 7 years ago

@blueacid Understood, thank you. @donbright Thanks for the hashes.

nrdufour commented 7 years ago

Downloading, and will be available at the URL http://climate.nemoworld.info/cdiac.ornl.gov/.

nrdufour commented 7 years ago

Downloaded and available at the URL http://climate.nemoworld.info/cdiac.ornl.gov/

ivanstegic commented 7 years ago

I got it too, all 140GB, I will publish it to a mirror tomorrow.

donbright commented 7 years ago

@empirical-bayesian at Azimuth Data Project has a copy as well:

https://bitbucket.org/azimuth-backup/azimuth-inventory/issues/3

pkclsoft commented 7 years ago

Downloading.

StephWo commented 7 years ago

Mirrored at http://176.9.83.61/291, kind of.

hashdeep compare with @donbright says:

hashdeep: Audit failed
Input files examined: 0
Known files expecting: 0
Files matched: 160270
Files partially matched: 0
Files moved: 28
New files found: 61
Known files not found: 349047

I checked the server twice; I don't know why those files are missing. I got a lot of 550 errors though (Access failed: 550 Failed to open file).

Total size: 140G, or 146724668 KB

pkclsoft commented 7 years ago

Mine is still downloading. Very slow, only at 15.5GB so far, but no errors.

donbright commented 7 years ago

@BauerPiepenbrink I'm going to re-run my hashdeep; it's apparently way bigger than it should be.

donbright commented 7 years ago

@BauerPiepenbrink I have updated my hashdeep... apparently the difference was that my copy had the symbolic links for pub3, pub5, pub7 (which are linked to pub2, pub4, and so on), and hashdeep was counting those as separate directories.

So I removed those 3 symbolic links from my filesystem and the hashdeep is much smaller now.

http://67.205.151.30/291/hashdeep.audit.txt

StephWo commented 7 years ago

@donbright OK, I think we are good. I rechecked your hashfile against my mirror and again I have about 150,000 entries with "Known file not used", but as far as I can see those are also linked files in /pub"X"/data/level"X"/Sites_ByID/ linked to /pub"X"/data/level"X"/Sites_ByName

I would rather mirror those links too, as they seem quite handy. Maybe I'll try to get those later.

Also, I have 48 new files. It looks like some folders have been renamed on the server and some others have been moved.

Files:
http://176.9.83.61/291/291_hashdeep.txt - my hashfile
http://176.9.83.61/291/291_diff_donbright.txt - differences if I compare my files to your hashfile

Report:

hashdeep: Audit failed
Input files examined: 0
Known files expecting: 0
Files matched: 203211
Files partially matched: 0
Files moved: 0
New files found: 48
Known files not found: 149235

gabefair commented 7 years ago

Here is the current status of this issue:

Confirmed mirror: @nrdufour, @donbright, and @BauerPiepenbrink

Downloaded: @ivanstegic and @donbright

Unknown: @RoundWorld, @blueacid, @pkclsoft

pkclsoft commented 7 years ago

[update] Still downloading.

donbright commented 7 years ago

@BauerPiepenbrink I counted the symlinks in my tree; there are 429, using the command below. I used 'lftp mirror' in standard form, which copies symlinks, so any method that doesn't copy symlinks will have a hugely different hashdeep than mine, since hashdeep apparently follows them in its standard form.

climir@vecher:~/offline/291$ find cdiac.esd.ornl.gov/  -type l | wc -l
429

Still, I am going to re-run my hashdeep a third and fourth time just to make sure.

StephWo commented 7 years ago

@donbright me too, I used lftp -c mirror ftp://...

Anyway, I only have 26 links (see http://176.9.83.61/291/291_symlinks.txt if interested). As I said, I would rather back up all the other links too, as they seem to be very handy for station-finding. When we reach the point where we agree that we downloaded the correct amount of data and all the files have the correct hashsums, I might ask you to send me the file tree with symlinks, if I may.

We just have to verify how many unique files, excluding symlinks, we have to download.

I have a second hashfile made with "hashdeep -rl -of" so it should only hash regular files and folders, no symlinks and no linked files.

edit: http://176.9.83.61/291/291_hashdeep_noLinks.txt /edit

If you have the spare time, maybe you could do the same? The file count of that hashdeep file is 160,321 actual unique files. Let's see what yours is :)

donbright commented 7 years ago

Thanks for figuring this out @BauerPiepenbrink, I am not sure what I did wrong.

hashdeep without symlinks: http://67.205.151.30/291/hashdeep.nosyms.txt

results of for i in `find cdiac.esd.ornl.gov -type l | sort`; do ls -l $i; done > symlinks.find.txt: http://67.205.151.30/291/symlinks.find.txt

results of find cdiac.esd.ornl.gov | sort: http://67.205.151.30/291/all.find.txt

results of find -L cdiac.esd.ornl.gov | sort: http://67.205.151.30/291/followlinks.find.txt

line counts:

don@vecher:~$ wc -l hashdeep.nosyms.txt 
160270 hashdeep.nosyms.txt
don@vecher:~$ wc -l all.find.txt 
165018 all.find.txt
don@vecher:~$ wc -l symlinks.find.txt 
432 symlinks.find.txt
climir@vecher:~/offline/291$ wc -l followlinks.find.txt 
524569 followlinks.find.txt

size:

climir@vecher:~/offline/291$ du -hs  cdiac.esd.ornl.gov/
140G    cdiac.esd.ornl.gov/
climir@vecher:~/offline/291$ du -s  cdiac.esd.ornl.gov/
146654708       cdiac.esd.ornl.gov/

You know... the thing I did on this one that I don't usually do is run lftp mirror in parallel mode. Not sure how that would have made a difference, but maybe it did.

StephWo commented 7 years ago

So, comparing to your nosym hashfile, @donbright:

Input files examined: 0
Known files expecting: 0
Files matched: 160219
Files partially matched: 0
Files moved: 42992
New files found: 48
Known files not found: 46

So some renaming and moving was going on on the server, as we said, and then there are a few new files. If I take your nosyms.txt line count (160270) and remove the header made by hashdeep (5 lines), we end up with 160265 lines in your file and 160267 files in my mirror. Looks good to me.
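
As a command, that line-count check is roughly this (a sketch assuming hashdeep's default 5-line header):

tail -n +6 hashdeep.nosyms.txt | wc -l   # data lines only, header stripped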

I will remake the file with the differences between our mirrors and tidy that folder up a bit, but I would say we got it :)

pkclsoft commented 7 years ago

So for me, my download is at about 46%. Given that there are now established mirrors, should I continue, or move on?

StephWo commented 7 years ago

@pkclsoft As my dataset is a bit different to donbright's, I would appreciate a third hashfile if you can spare the bandwidth and storage space. But maybe I'm over-cautious.

pkclsoft commented 7 years ago

@BauerPiepenbrink no problem. I can spare it. Personally, I don't think we can be too cautious. I've just had a little rant on slack about that.

gabefair commented 7 years ago

Why risk it? Better safe than sorry. Thanks @pkclsoft

pkclsoft commented 7 years ago

OK. Download finally completed. Have we settled on a specific set of commands to verify it, or compare it against other mirrors?

ivanstegic commented 7 years ago

I am happy to report that the full FTP site of 140GB is now available on the Our Data Our Hands server and will remain there permanently. You can save this in your bookmarks: https://data.ourdataourhands.org/cdiac.ornl.gov/

We're working to get as much data mirrored on our servers (and soon global grid!) as possible.

ivanstegic commented 7 years ago

Vote +1 to close

StephWo commented 7 years ago

@pkclsoft As everybody else here was already using it, I looked into hashdeep, which is in the default repos of most Linux distros and seems perfect for the job. The command I use is

hashdeep -rl -of path/mirrored.files/ | tee hashdeepfilename.txt

which will take the hash sums of every single file in that folder recursively (-r), with relative paths (-l), takes only regular files into account, no symlinks (-o f), counts them, and puts all that into the txt file you chose to create.

For the other way around, you can download another hashfile, mine for example, and do a

hashdeep -arl -vv -k other.hashfile.name.txt path/to.files.to.compare

which will compare (-a) recursively (-r) and with relative paths (-l) against a given hashdeep file (-k other.hashfile.name.txt) while being very verbose (-vv, optional). The output of that command will summarize and count changed files, moved files, found files, and lost files.

pkclsoft commented 7 years ago

Thanks @BauerPiepenbrink. I'll kick jobs off with that command now. I've got two datasets complete now; just need to build the hashes. Once that's verified, I'll create torrents for them, as it's the only method I have to make them public.

BTW, the link http://176.9.83.61/291/291_hashfile_noLinks.txt to your file is returning 'not found'.

StephWo commented 7 years ago

@pkclsoft That was a typo. Thanks for pointing that out. Link is corrected to http://176.9.83.61/291/291_hashdeep_noLinks.txt

pkclsoft commented 7 years ago

When I mirrored for this issue, I grabbed cdiac.ornl.gov, not cdiac.esd.ornl.gov, so my hash is not matching that of @donbright (funnily enough). Looking back over this issue, I see both URLs mentioned. So are both correct? There are an awful lot of files in both that apparently match hashes but are in different directories (lots of moved files).

Does anyone have a hash of cdiac.ornl.gov? Here is mine:

http://pkclsoft.com/downloads/cdiac.ornl.gov.hash.txt

StephWo commented 7 years ago

@pkclsoft So what I did is:

I got:

Input files examined: 0
Known files expecting: 0
Files matched: 160028
Files partially matched: 0
Files moved: 43231
New files found: 0
Known files not found: 9948

note that "moved files" means identical file on another location. mainly those are .doc- files which are identical to same-named .txt files in the subdirectories. Those get recognised as "moved" although we both have the .txt and .doc files in our mirror. And then there are folders that moved. For example from /pub/ to /pub9/ or vice versa. so nothing to worry about. in terms of moved files.

The "Known files not found" are files that show up in your hashdeep file but not in my folders. I counted and checked: Every single filename of those 9948 files starts with a . Many of those are, for example .listing files which wget creates to list the contents of a remote folder. I'm sure we don't need those. And then there are some files like cdiac.esd.ornl.gov/pub10/XML_Maggart/.cdiac.amf.CR-Lse.m.txt or cdiac.esd.ornl.gov/pub8/oceans/VOS_Atlantic_Companion_Line/COM_2011/._COM_11_17_metadata.txt (note the . as first character of the actual filename)

lftp didn't download those and I don't know if I should add them to my mirror. But all in all, if you forget about the files starting with a full stop, we have the same number of files and they have not been changed. So I'm quite confident we have covered this for now; we should regularly check for updates, I guess.
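
If the dot-files were skipped because the FTP server leaves them out of directory listings, a re-run along these lines might pick them up (a hedged sketch; lftp's ftp:list-options setting appends options to the LIST command, but whether this particular server honors LIST -a is an assumption):

lftp -e 'set ftp:list-options -a; mirror -c -v; quit' ftp://cdiac.esd.ornl.gov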

pkclsoft commented 7 years ago

Thanks. That's great then. All those files beginning with '.' are hidden files on the unix file system, so perhaps that explains why lftp didn't get them down. I don't know if they are important or not. I guess the thing is that we at least know we have one mirror of them.

Thanks for looking into it. I was thinking of writing a script that would take two hashdeep files and do the comparison using your steps. This might be faster for checking two mirrors.

I now need to work out how to serve the datasets. I've realised that I'm not able to seed a torrent from where I have the data at the moment, so I'll think about that side of it over the next few days. I'll get some more datasets if I can fit them just for the purpose of protecting them for now.

StephWo commented 7 years ago

Great! A script to compare hashfiles would be helpful, I guess. Even for this relatively small dataset it takes about 20 minutes to compare a hashfile with the actual data. I attempted to compare hashfiles with diff or sdiff, but without some scripting to organize the output it's no use. And I already get the scares when I think about verifying the parts of issue 162 I'm downloading. One of the folders has 1.3 TB of files that are no bigger than 4 MB each. Creating a hash sum of each and every one of them seems crazy.
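
A minimal sketch of such a comparison script, assuming hashdeep's default output format (size,md5,sha256,filename after a 5-line header) and no commas in filenames; the script name and temp paths are placeholders:

#!/bin/sh
# cmphash.sh: compare two hashdeep files by checksum only, ignoring paths
# usage: ./cmphash.sh mine.txt theirs.txt
hashes() { tail -n +6 "$1" | cut -d, -f2 | sort -u; }
hashes "$1" > /tmp/a.md5
hashes "$2" > /tmp/b.md5
echo "content only in $1:"; comm -23 /tmp/a.md5 /tmp/b.md5
echo "content only in $2:"; comm -13 /tmp/a.md5 /tmp/b.md5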

DanTheMan827 commented 7 years ago

@pkclsoft create a torrent file of the cdiac.ornl.gov folder and add these web seeds to it:

http://176.9.83.61/291/
http://climate.nemoworld.info/
http://data.ourdataourhands.org/
https://data.ourdataourhands.org/
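
A hedged sketch of that, assuming mktorrent is available (its -w flag adds a web seed URL and can be repeated; the tracker URL is a placeholder):

mktorrent -a udp://tracker.example.org:1337/announce \
  -w http://176.9.83.61/291/ \
  -w http://climate.nemoworld.info/ \
  -w http://data.ourdataourhands.org/ \
  -w https://data.ourdataourhands.org/ \
  -o cdiac.ornl.gov.torrent cdiac.ornl.gov/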

h1z1 commented 7 years ago

Apologies in advance, maybe the wrong place to ask, but I can't help wondering whether something like MirrorBrain and metalinks has ever been discussed? Many download managers have supported them for quite a while.

StephWo commented 7 years ago

I vote to remove the single-mirror tag and the In Progress tag, add the Multiple-Mirrors tag, and close the issue.

StephWo commented 6 years ago

Be advised: because of changes in my hardware demands, I won't be able to host this or the other datasets any longer after April 2018. Please create a copy if necessary before the end of April. The full list of dataset issue numbers that are mirrored on my server and will not be hosted after April:

162 175 176 184 185 279 291 362

Find all these datasets at http://176.9.83.62 or http://climatemirror1.space