climate-mirror / datasets

For tracking data mirroring progress

Dataset at ftp://podaac.jpl.nasa.gov/allData/seasat #209

Open nickrsan opened 7 years ago

nickrsan commented 7 years ago

ftp://podaac.jpl.nasa.gov/allData/seasat

Suggested in a large email containing many URLs.

nubenum commented 7 years ago

Downloading this one as an offline mirror and possibly as an online mirror, depending on final size.

erikfriesen commented 7 years ago

downloading. looks like the data is approximately 40GB

erikfriesen commented 7 years ago

Download completed overnight. file size looks correct. generating md5sum

nubenum commented 7 years ago

My hashes are:

find seasat -type f -exec md5sum {} \; | sort -k 2 | md5sum

04c3a559a045e661db60fcffff30bfe3

find podaac.jpl.nasa.gov/allData/seasat -type f -exec md5sum {} \; | sort -k 2 | md5sum

9fa74f27a5467254645104602000009e

Hope we'll agree! (I'm on Windows, but used the Ubuntu Bash that comes with Windows 10. I hope this won't be an issue.)

erikfriesen commented 7 years ago

@nubenum my hash turned out different. probably isn't related to OS. might be because i ran a slightly different command:

find podaac-ftp.jpl.nasa.gov/allData/seasat/ -type f -exec md5sum {} \; | grep -v '.listing' | sort -k 2 | md5sum

adf6466deb5099cdb6564d0913a0310e

hash is probably different because of slight difference in directory naming. will try one matching the first command that you ran.

nubenum commented 7 years ago

I'll do yours in the meantime... The hashes are bound to differ because md5sum prints each file's relative path next to its hash, and this entire list is what gets hashed. So there's still hope...
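
A minimal sketch of one way to make the combined hash independent of the download directory, assuming Python 3 is available: record each file's path relative to the dataset root, sort by that path, and hash the resulting list. The function names and the `skip_names` parameter are illustrative and not from this thread. The sketch skips only files named exactly `.listing`, which is narrower than the `grep -v '.listing'` filter used above. Its digests are not expected to match the hashes posted here, since those were computed over lists with different path prefixes.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of one file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_lines(root, skip_names=(".listing",)):
    """List '<hash>  <relative path>' lines, sorted by path, for files under root."""
    root = Path(root)
    entries = []
    for path in root.rglob("*"):
        if path.is_file() and path.name not in skip_names:
            # Paths are taken relative to root, so the name of the
            # download directory never enters the manifest.
            relative = path.relative_to(root).as_posix()
            entries.append((relative, file_md5(path)))
    entries.sort()
    return [f"{md5}  {relative}" for relative, md5 in entries]


def manifest_digest(root, skip_names=(".listing",)):
    """Hash the whole manifest text, similar to piping the sorted list into md5sum."""
    text = "".join(line + "\n" for line in manifest_lines(root, skip_names))
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

With this approach, a copy stored under podaac.jpl.nasa.gov/allData/seasat and one stored under podaac-ftp.jpl.nasa.gov/allData/seasat should give the same digest, provided that file contents and relative names agree.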

nubenum commented 7 years ago

Still not the same. Maybe there's actually a difference between podaac-ftp and podaac?

find podaac-ftp.jpl.nasa.gov/allData/seasat/ -type f -exec md5sum {} \; | grep -v '.listing' | sort -k 2 | md5sum

a1e5c29f0dd3e77126eff4b07b370c6d

axlecrusher commented 7 years ago

Maybe dump the sorted md5sum list to a file and diff the two files?
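
A rough sketch of that comparison, assuming both manifests are plain md5sum output (one `<hash>  <path>` pair per line) and that Python 3 is at hand. The function names are made up for illustration. Unlike a line-by-line diff, it reports which paths are missing on either side and which paths have different hashes.

```python
def parse_manifest(text):
    """Map path -> hash from md5sum-style lines ('<hash>  <path>')."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        md5, path = line.split(maxsplit=1)
        entries[path] = md5
    return entries


def compare_manifests(mine, theirs):
    """Return (paths only in mine, paths only in theirs, paths whose hashes differ)."""
    only_mine = sorted(set(mine) - set(theirs))
    only_theirs = sorted(set(theirs) - set(mine))
    changed = sorted(
        path for path in set(mine) & set(theirs) if mine[path] != theirs[path]
    )
    return only_mine, only_theirs, changed
```

Each manifest file can be read with `open(name).read()` and passed to `parse_manifest`. If the "only in" lists are long while the "changed" list is empty, the file contents probably agree and only the path names differ.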

erikfriesen commented 7 years ago

i think we're going to have to try @axlecrusher's suggestion.

find seasat -type f -exec md5sum {} \; | sort -k 2 | md5sum

returned 5f56cabc7501dfeb0171b15ef5ed5c04

nubenum commented 7 years ago

The sorted md5sum list: seasat-hashes.txt

By the way, my dataset has 42,807,436,366 bytes. However, I've found some hidden (EDIT: empty) .snapshot directories that I vaguely remember having seen on the server as well, but which aren't there any more. Are we sure that the files on the server don't get altered? Otherwise, it might well be an error on my side.

EDIT: find podaac-ftp.jpl.nasa.gov/allData/seasat/ -type f -exec md5sum {} \; | grep -v '.listing' | sort -k 2 > hashes.txt

erikfriesen commented 7 years ago

alright, ran a diff, no idea if i did it right. hashes is my file, seasat-hashes.txt is @nubenum's file.

diff hashes seasat-hashes.txt

Looks significantly different. I wonder if the files on the server are indeed being altered.

hashdiff.txt

Cubytus commented 7 years ago

Currently downloading this. I'm pretty new to this but will try to follow others' comments.

Suggestion: does GitHub allow for changing an issue's title? It would be great to have a rough idea of the disk space required. There is not much point in attempting to copy too large a dataset.

nubenum commented 7 years ago

Ok, I think I found the problem. The md5sums in the diff are all the same; only the paths differ:

podaac-ftp.jpl.nasa.gov/allData/seasat/retired/L2/smmr/SEASMR20_001/
podaac-ftp.jpl.nasa.gov/allData/seasat/retired/L2/smmr/seasmr20_001/

So it's only about the capitalization of the directory, which means that WINDOWS is actually the problem. Both SEASMR20_001 and seasmr20_001 exist on the server, and since Windows treats names that differ only in case as the same directory, they got merged on my machine. I apologize for using Windows. I'll try to fix that later, since I'm not at home at the moment. But if I haven't missed anything, we can already be quite sure that our datasets are actually identical!
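
A small sketch of how such collisions could be spotted before downloading to a case-insensitive filesystem, assuming a list of server-side paths is available (for example, from a manifest). The function name and the file names in the usage note are hypothetical. It checks every directory level, because two files can have different names and still sit in directories that differ only in case.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def case_collisions(paths):
    """Group paths or parent directories that differ only in letter case.

    Each returned group holds names that are distinct on a case-sensitive
    filesystem but would map to one name on a case-insensitive one.
    """
    variants = defaultdict(set)
    for raw in paths:
        path = PurePosixPath(raw)
        # Check the path itself and every ancestor directory.
        for node in (path, *path.parents):
            variants[str(node).casefold()].add(str(node))
    return sorted(sorted(names) for names in variants.values() if len(names) > 1)
```

Called with two hypothetical files such as `retired/L2/smmr/SEASMR20_001/a.dat` and `retired/L2/smmr/seasmr20_001/b.dat`, it reports the two directory names as one colliding group, even though the file names themselves differ.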

nubenum commented 7 years ago

@erikfriesen Ok great I got the same hash now! adf6466deb5099cdb6564d0913a0310e

I simply changed the paths in my hashdump since I obviously can't create these two directories at once. I should definitely switch to Linux...

@axlecrusher Was a great idea, helped a lot to find out that Windows is stupid!

@Cubytus That's the reason why I won't be downloading much else: I don't have that much storage, and it is rather pointless to have FileZilla index for half an hour only to find out that the dataset is too large for me.

erikfriesen commented 7 years ago

oh wow! good to know that about windows. I thought I was going mad.

Cubytus commented 7 years ago

Idea: for those who can spend it, hubiC offers 10TB of storage for 50€ a year. It's not an FTP server, of course, but that is a lot of space at 5€ per TB per year. Much cheaper than physical HDDs. One would have to check their ToS, though.

Would this be a good idea? Apart from scientists and people interested in science, perhaps, not everybody can or wants to spend hundreds of dollars on storage installed at home.

nubenum commented 7 years ago

I discovered that a tar.gz of this dataset is small enough for me to mirror online. So here you go.

http://mirror.nubenum.de/podaac.jpl.nasa.gov/allData/seasat/

I'll keep it up. The Windows glitch that the first upload included is fixed, and the hash is confirmed with @erikfriesen.

@Cubytus I think the folks of climatemirror.org are currently collecting donations to buy online storage because they want to keep it all in one place and they acknowledge that not many are able or willing to operate private mirrors.

gabefair commented 7 years ago

@Cubytus, Amazon S3 storage is probably cheaper.

Total size: 40G, according to lftp:

lftp :~> du -h ftp://podaac.jpl.nasa.gov/allData/seasat > seasat_size.txt

entr0p1 commented 7 years ago

Grabbing this now

Cubytus commented 7 years ago

@gabefair 10TB on regular S3 storage would be around $230 a month / $2700 a year. However, Amazon Cloud Drive says "unlimited" for $60 a year. One would have to understand the limitations, though.
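
To put the storage offers mentioned in this thread side by side, here is a short sketch of the per-terabyte arithmetic. It uses only the figures quoted in the comments (50€ a year for 10TB, and $230 a month for 10TB), which have not been checked against any current price list. The function name is illustrative, and the two currencies are left unconverted.

```python
def cost_per_tb_per_year(total_per_year, terabytes):
    """Yearly price of one terabyte, given a yearly total for a capacity."""
    return total_per_year / terabytes


# Figures as quoted in this thread, not verified against provider pricing.
hubic_eur = cost_per_tb_per_year(50, 10)    # 50 EUR per year for 10 TB
s3_usd = cost_per_tb_per_year(230 * 12, 10)  # 230 USD per month for 10 TB

print(f"hubiC: {hubic_eur:.2f} EUR per TB per year")
print(f"S3:    {s3_usd:.2f} USD per TB per year")
```

On these numbers, twelve months at $230 come to $2,760, which is in line with the rounded $2700 a year above.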

entr0p1 commented 7 years ago

Done

Checksums: https://gateway.ipfs.io/ipfs/QmXPRgouXVPKMMUTQBNEzcmoVQ6PQMaMsbe88pDL6s8fB6
Root Directory: https://gateway.ipfs.io/ipfs/Qmbb2zrYZKFKJDMXDPAXd3WNK1hsRFdgtuXkF8xjB4omTy
Size: 43GB
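
The sizes quoted in this thread (approximately 40GB, 40G, 43GB, and 42,807,436,366 bytes) may all describe the same amount of data counted in different units. A minimal sketch of the arithmetic, assuming that `du -h` reports binary units (powers of 1024) and that the 43GB figure uses decimal units; the function name is illustrative:

```python
def describe_size(num_bytes):
    """Express a byte count in decimal gigabytes and binary gibibytes."""
    gb = num_bytes / 10**9   # decimal: 1 GB = 1,000,000,000 bytes
    gib = num_bytes / 2**30  # binary: 1 GiB = 1,073,741,824 bytes
    return f"{gb:.2f} GB (decimal), {gib:.2f} GiB (binary)"


# Byte count reported earlier in this thread for one local copy.
print(describe_size(42_807_436_366))
# 42.81 GB (decimal), 39.87 GiB (binary)
```

Under those assumptions, a copy of 42,807,436,366 bytes would show up as roughly 40G in `du -h` and as roughly 43GB in a tool that counts in powers of 1000.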