Open nickrsan opened 7 years ago
Downloading this one as offline mirror and possibly as online mirror, depending on final size.
downloading. looks like the data is approximately 40GB
Download completed overnight. file size looks correct. generating md5sum
My hashes are
find seasat -type f -exec md5sum {} \; | sort -k 2 | md5sum
04c3a559a045e661db60fcffff30bfe3
find podaac.jpl.nasa.gov/allData/seasat -type f -exec md5sum {} \; | sort -k 2 | md5sum
9fa74f27a5467254645104602000009e
Hope we'll agree! (I'm on Windows, but used the Ubuntu Bash that comes with Windows 10. I hope this won't be an issue.)
@nubenum my hash turned out different. probably isn't related to OS. might be because i ran a slightly different command:
find podaac-ftp.jpl.nasa.gov/allData/seasat/ -type f -exec md5sum {} \; | grep -v '.listing' | sort -k 2 | md5sum
adf6466deb5099cdb6564d0913a0310e
hash is probably different because of slight difference in directory naming. will try one matching the first command that you ran.
I'll do yours in the meanwhile... Definitely different because md5sum always lists with the relative file path and this entire list gets hashed. So there's still hope...
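A minimal sketch of one way to sidestep the path issue described above: run `md5sum` from inside the dataset directory, so every listed path starts with `./` no matter what the parent directories are called. The function name `manifest_hash` is made up for illustration, and forcing `LC_ALL=C` for the sort is an assumption meant to keep the ordering identical across machines. This is not the command anyone in this thread ran, so its output will not match the hashes posted here.

```shell
# Hypothetical helper: hash a directory tree independently of where it lives.
# The body runs in a subshell, so the cd does not affect the caller.
manifest_hash() (
    cd "$1" || exit 1
    # List "<md5>  ./relative/path" for every file, sort by path, then hash the list.
    find . -type f -exec md5sum {} + | LC_ALL=C sort -k 2 | md5sum | cut -d ' ' -f 1
)
```

Called as `manifest_hash seasat` on one machine and `manifest_hash podaac-ftp.jpl.nasa.gov/allData/seasat` on another, this should print the same value when the file contents and the relative layout match.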
Still not the same. Maybe there's actually a difference between podaac-ftp and podaac?
find podaac-ftp.jpl.nasa.gov/allData/seasat/ -type f -exec md5sum {} \; | grep -v '.listing' | sort -k 2 | md5sum
a1e5c29f0dd3e77126eff4b07b370c6d
Maybe dump the sorted md5sum list to a file and difference the 2 files?
i think we're going to have to try @axlecrusher's suggestion.
find seasat -type f -exec md5sum {} \; | sort -k 2 | md5sum
returned
5f56cabc7501dfeb0171b15ef5ed5c04
The sorted md5sum list: seasat-hashes.txt By the way, my dataset has 42,807,436,366 bytes, however, I've found some hidden (EDIT: empty) .snapshot directories that I vaguely remember to have seen on the server as well, but now aren't there any more. Are we sure that the files on the server don't get altered? Otherwise, it might well be an error on my side.
EDIT:
find podaac-ftp.jpl.nasa.gov/allData/seasat/ -type f -exec md5sum {} \; | grep -v '.listing' | sort -k 2 > hashes.txt
alright, ran a diff, no idea if i did it right. hashes is my file, seasat-hashes.txt is @nubenum's file.
diff hashes seasat-hashes.txt
Looks significantly different. I wonder if the files on the server are indeed being altered.
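When a full diff of two hash lists looks very different, one way to narrow it down is to compare only the checksum column and ignore the paths. The sketch below assumes both files are plain `md5sum` output (a 32-character checksum, two separator characters, then the path). The function name `hash_only_diff` is hypothetical.

```shell
# Hypothetical helper: print checksums that appear in only one of two
# md5sum manifests, ignoring the path column entirely.
hash_only_diff() (
    a=$(mktemp) && b=$(mktemp) || exit 1
    # Keep only the 32-character checksum from each line.
    cut -c 1-32 "$1" | LC_ALL=C sort > "$a"
    cut -c 1-32 "$2" | LC_ALL=C sort > "$b"
    # comm -3 suppresses lines common to both inputs; what is left is unique to one side.
    LC_ALL=C comm -3 "$a" "$b"
    rm -f "$a" "$b"
)
```

For example, `hash_only_diff hashes.txt seasat-hashes.txt` printing nothing would suggest that the file contents agree and that the remaining differences are in the paths only.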
Currently downloading this. I'm pretty new to this but will try to follow others' comments.
Suggestion: does GitHub allow for changing an issue's title? It would be great to have a rough idea of the disk space required. There is not much point in attempting to copy too large a dataset.
Ok, I think I found the problem. The md5sums of the diffs are all correct/same, it's only the path:
podaac-ftp.jpl.nasa.gov/allData/seasat/retired/L2/smmr/SEASMR20_001/
podaac-ftp.jpl.nasa.gov/allData/seasat/retired/L2/smmr/seasmr20_001/
So it's only about the capitalization of the directory, which means that actually WINDOWS is the problem. Both SEASMR20_001 and seasmr20_001 exist on the server, so they got merged on my machine. So I apologize for using Windows. I'll try to fix that later, not at home at the moment. But if I haven't missed anything, we can already be quite sure that our datasets are actually identical!
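A rough sketch of how directory names that differ only in capitalization could be spotted in advance, given a hash list produced on a case-sensitive system. It assumes plain `md5sum` output where the path starts at column 35. The function name `case_collisions` is made up for illustration.

```shell
# Hypothetical helper: print (lowercased) directory paths that occur more than
# once in a manifest when case is ignored, i.e. directories that would be
# merged on a case-insensitive filesystem.
case_collisions() {
    # Steps: drop checksum and separator, strip the file name to get the
    # directory, de-duplicate exact directory names, lowercase, and report
    # any name that now appears twice.
    cut -c 35- "$1" |
        sed 's|/[^/]*$||' |
        LC_ALL=C sort -u |
        tr '[:upper:]' '[:lower:]' |
        LC_ALL=C sort |
        uniq -d
}
```

Run against a list made on Linux, for example `case_collisions hashes.txt`, this prints one line per colliding directory and nothing if there are none. A list made on the affected Windows machine would not show the collision, because the directories have already been merged there.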
@erikfriesen Ok great I got the same hash now!
adf6466deb5099cdb6564d0913a0310e
I simply changed the paths in my hashdump since I obviously can't create these two directories at once. I should definitely switch to Linux...
@axlecrusher Was a great idea, helped a lot to find out that Windows is stupid! @Cubytus That's the reason why I won't be downloading much else because I don't have that much storage and it is quite superfluous to have FileZilla index for half an hour only to find out that the dataset is too large for me.
oh wow! good to know that about windows. I thought I was going mad.
Idea: for those who can spend it, hubiC offers 10TB worth of storage for 50€ a year. Of course, it's not an FTP server, but that's lots of space at 5€ per TB per year. Much cheaper than physical HDDs. One would have to know their ToS, though.
Would this be a good idea? Except maybe scientists and science-interested persons, not everybody can / is willing to spend hundreds of $ on storage to be installed at home.
I discovered that a tar.gz of this dataset is small enough for me to mirror online. So here you go.
http://mirror.nubenum.de/podaac.jpl.nasa.gov/allData/seasat/
I'll keep that up until there is another online mirror (Windows glitch included). EDIT: I'll keep it up. Windows glitch fixed. Hash confirmed with @erikfriesen.
@Cubytus I think the folks of climatemirror.org are currently collecting donations to buy online storage because they want to keep it all in one place and they acknowledge that not many are able or willing to operate private mirrors.
@Cubytus, Amazon S3 storage is probably cheaper.
Total size: 40G
results of lftp :~> du -h ftp://podaac.jpl.nasa.gov/allData/seasat > seasat_size.txt
Grabbing this now
@gabefair 10TB on regular S3 storage would be around $230 a month / $2700 a year. However, Amazon Cloud Drive says "unlimited" for $60 a year. One would have to understand the limitations, though.
Done
Checksums: https://gateway.ipfs.io/ipfs/QmXPRgouXVPKMMUTQBNEzcmoVQ6PQMaMsbe88pDL6s8fB6 Root Directory: https://gateway.ipfs.io/ipfs/Qmbb2zrYZKFKJDMXDPAXd3WNK1hsRFdgtuXkF8xjB4omTy Size: 43GB
ftp://podaac.jpl.nasa.gov/allData/seasat
Suggested in a large email containing many urls