climate-mirror / datasets

For tracking data mirroring progress

NASA PODAAC AQUARIUS Satellite Data #236

Open nickrsan opened 7 years ago

nickrsan commented 7 years ago

ftp://podaac-ftp.jpl.nasa.gov/allData/aquarius

Suggested in a large email containing many URLs.

AdamBunn commented 7 years ago

I've started downloading this, but I don't have hosting at the moment

kenXengineering commented 7 years ago

I've started downloading aquarius as well, will host on Amazon Cloud Drive and IPFS.

astrobackup commented 7 years ago

Started downloading as well. If anyone has finished it yet, can you create a torrent, adding the ftp as a web seed?

kenXengineering commented 7 years ago

Status Update: Downloaded 622 GB, have 481 GB to go. It looks like each file transfer is limited to about 700-800 KB/s, and with FileZilla I have 10 going atm.

I don't know if I'll get this all added to ACD; it will prob take a long time to upload (10 Mb upload). I will add it to IPFS and will work on making a torrent at the least.

bkirkbri commented 7 years ago

@chosenken @astrobackup @AdamBunn Thanks to all of you for your help. Please update when your mirrors are complete.

astrobackup commented 7 years ago

I am 200GB in. It is quite slow so far but the issue is on my side of the connection.

kenXengineering commented 7 years ago

My download completed, and I am now verifying the files. With this data set, they provided an md5 file for each data file, so I wrote a quick and dirty script to go through each folder and check the md5sum. This will take some time. It's already found a few files that failed (most likely from when I had to restart the transfer a few times). BTW, the data set clocks in at 1.07 TB. I don't know if I can get this fully uploaded to Amazon Cloud Drive, I'm already at 3.4 TB with them. I will be adding it to IPFS though, and creating a torrent with some web seeds set up. Should have this done in a day or two, it takes a while to validate all the files.
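For anyone who wants to repeat this kind of check, here is a minimal Python sketch of the idea, not the script mentioned above. It assumes each data file `X` sits next to a sidecar named `X.md5` whose first whitespace-separated token is the hex digest; that naming and layout are assumptions, and `md5_of` and `verify_tree` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tree(root):
    """Check every file under root that has a '<name>.md5' sidecar.

    The sidecar naming is an assumption about the data set layout.
    Returns a list of (path, reason) tuples for files that failed.
    """
    failures = []
    for sidecar in sorted(Path(root).rglob("*.md5")):
        data_file = sidecar.with_suffix("")  # strip the trailing '.md5'
        if not data_file.is_file():
            failures.append((str(data_file), "missing data file"))
            continue
        tokens = sidecar.read_text(errors="replace").split()
        if not tokens:
            failures.append((str(data_file), "empty md5 file"))
            continue
        # Assumed format: the digest is the first token in the sidecar.
        if md5_of(data_file) != tokens[0].lower():
            failures.append((str(data_file), "md5 mismatch"))
    return failures
```

Calling `verify_tree("podaac-ftp.jpl.nasa.gov/allData/aquarius")` would return the files worth re-downloading; an empty list means every file with a sidecar matched.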

bkirkbri commented 7 years ago

@chosenken Thank you! Please post hashdeep -rl ./podaac-ftp.jpl.nasa.gov/allData/aquarius output when you can.

bkirkbri commented 7 years ago

@astrobackup Still downloading?

astrobackup commented 7 years ago

@bkirkbri yes, I have 506GB on my disk now. It takes ages but that's mainly because I am downloading it on a Samba drive and FileZilla only transfers a file there after it is downloaded. My current rate is around 80GB/day so hopefully ~10 days left.

kenXengineering commented 7 years ago

@bkirkbri I ran the hashdeep command on the data directory. It took quite a while to run, and generated a 100+ MB file with all the hashes. I've 7zip'ed it up and it is available here

As for adding the data to IPFS, I had to move my IPFS data directory off of my NAS as it was causing issues. It is still transferring to my local disk (it has close to a million files in it already). I have about 400+ GB added to IPFS, so I am about half way done. Once the transfer is complete I can start adding to IPFS once again. I am expecting about a week to get everything done (it took 3 days to get 400GB added). There is a known issue with IPFS that it will not scale well with 8000+ files, and the Aquarius dataset is at 567,203 files on my system. They have a possible fix, so I'm going to build off of that and see if it helps.

EDIT:

Ok, so I got tired of waiting for IPFS to finish, and my slow upload will take forever, so I just spun up a DO box to download all the data and upload to ACD. The data can be accessed here. Once it is completed I will post the updated hashdeep output.

kenXengineering commented 7 years ago

Data uploaded to Amazon Cloud Drive, can be accessed here. Still working on getting data added to IPFS.

bkirkbri commented 7 years ago

@astrobackup Were you able to get this one? Thanks! @chosenken Thanks!

astrobackup commented 7 years ago

Still working on it, 740GB so far. Is there a way to rsync from ACD through the command line?

bkirkbri commented 7 years ago

@astrobackup Thanks for the update. Please update once your mirror completes. You are mirroring from the source, not from @chosenken's ACD, right? Just checking. In any case, it looks like there are two FUSE drivers for accessing ACD:

astrobackup commented 7 years ago

Yes, I am downloading from the source. I wanted to check what I already have against @chosenken's data and thus do an rsync dry-run. But mounting that as a local drive means I'll need to download it anyway. I will work with the hashdeep file instead.

bkirkbri commented 7 years ago

@astrobackup great, thanks. There is definitely a need for a hashdeep comparison script. Someone on slack mentioned that.
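A rough sketch of what such a comparison script could look like, in Python. It assumes both listings came from the default `hashdeep -rl` output (a `%%%% size,md5,sha256,filename` column header followed by comma-separated rows) and that both were run from the same relative root so the filenames line up. `load_hashdeep` and `compare` are hypothetical names, not an existing tool.

```python
def load_hashdeep(path):
    """Parse a hashdeep listing into {filename: (size, md5)}."""
    columns = None
    entries = {}
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            # The column header looks like '%%%% size,md5,sha256,filename'.
            if line.startswith("%%%% ") and "filename" in line:
                columns = line[5:].split(",")
                continue
            # Skip blank lines, the version banner, and '##' comment lines.
            if not line or line.startswith(("%%%%", "##")):
                continue
            if columns is None:
                raise ValueError("no column header found before data lines")
            # Split at most len(columns) - 1 times so that commas inside
            # the filename (the last column) are preserved.
            fields = line.split(",", len(columns) - 1)
            record = dict(zip(columns, fields))
            entries[record["filename"]] = (int(record["size"]), record["md5"])
    return entries


def compare(reference, local):
    """Return (missing, extra, changed) filename lists, each sorted."""
    ref_names, local_names = set(reference), set(local)
    missing = sorted(ref_names - local_names)
    extra = sorted(local_names - ref_names)
    changed = sorted(
        name for name in ref_names & local_names
        if reference[name] != local[name]
    )
    return missing, extra, changed
```

With a reference listing from a completed mirror and one generated locally, `compare(load_hashdeep("reference.txt"), load_hashdeep("mine.txt"))` gives the files still to fetch, the files that exist only locally, and the files whose size or md5 differ. Both file names here are placeholders.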

kenXengineering commented 7 years ago

Ok, I got all the data added to IPFS. It took about 4 days to get everything added. My IPFS data store is on a NAS, and the number of small files was giving it issues so it took a while. The root hash is Qmb86zba6KhGyfXP45fWn34WGt8CA6BufhjJGW816rxvKW. You can access it at

https://gateway.ipfs.io/ipfs/Qmb86zba6KhGyfXP45fWn34WGt8CA6BufhjJGW816rxvKW or

http://localhost:8080/ipfs/Qmb86zba6KhGyfXP45fWn34WGt8CA6BufhjJGW816rxvKW

if you are running IPFS locally. I captured the output into a text file, and it contains all the hashes for all the files. It's 24MB and extracts to 94MB. Note that the directory is 1.2 TB according to IPFS, so if you want to pin it you will need to configure IPFS with more storage (I think the default is 10GB; I set mine to 2048GB).

For my next act I'm going to work on creating a torrent backed by IPFS (I think I can use IPFS as an HTTP source, will see on that).

NOTE: If you already have some of the data and want to speed up pinning in IPFS, you can add it yourself to IPFS and hopefully get the same hash back. As a reference, my folder structure was \podaac.jpl.nasa.gov\allData\aquarius, and I added the data with the command ipfs add -r podaac.jpl.nasa.gov. If you follow the same structure and add what data you have to IPFS, it should give the same hash (I hope).

kenXengineering commented 7 years ago

Ok, after a long day (this actually took like 6 hours to do so it kinda was) I have created 30 torrent files for the aquarius data set. Due to the number of files in the set, I was not able to create a single all-encompassing torrent. Instead I had to create torrents for specific directories, as some directories had over 100,000 files, which caused issues with the torrent software I was using. I found it could handle 40,000 files before it gave up.
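For illustration only, the per-directory split described above could be planned with something like the Python sketch below. It is not how these torrents were actually made. The 40,000 limit comes from the comment; `plan_groups`, `count_files`, and the "tree"/"files" labels are hypothetical.

```python
import os


def count_files(path):
    """Count all files under path, recursively."""
    return sum(len(files) for _, _, files in os.walk(path))


def plan_groups(root, limit=40_000):
    """Split a tree into groups of at most `limit` files each.

    Each group is (directory, kind, file_count). Kind "tree" means the
    whole subtree fits in one torrent; kind "files" means only the files
    directly inside that directory (in chunks of at most `limit`).
    """
    total = count_files(root)
    if total == 0:
        return []
    if total <= limit:
        return [(root, "tree", total)]
    groups = []
    with os.scandir(root) as entries:
        listing = sorted(entries, key=lambda entry: entry.name)
        direct = [entry for entry in listing if entry.is_file()]
        subdirs = [entry.path for entry in listing
                   if entry.is_dir(follow_symlinks=False)]
    # Files sitting directly in an oversized directory get their own groups.
    for start in range(0, len(direct), limit):
        groups.append((root, "files", len(direct[start:start + limit])))
    # Each subdirectory is planned independently.
    for subdir in subdirs:
        groups.extend(plan_groups(subdir, limit))
    return groups
```

The sketch recounts files at every level, which is slow on a tree with hundreds of thousands of files but keeps the logic short. Each returned group would then be handed to whatever torrent creation tool is in use.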

Included in the zip is a README, a script to check md5sums, and the output from hashdeep. It also has the root files (mainly text files) and the software (sw) folder. With this one zip, you can recreate the entire data directory. All torrents should be backed by IPFS web seeds, though it is possible I may have messed up a URL on one or two. If that is the case, please let me know and I can fix them.

You can get the zip file here. The file is 41.6MB in size as the hashdeep output is quite large. Please let me know if you have any questions. I plan on seeding this for as long as I can. I can't guarantee 100% uptime, as I may need to stop uploads if my bandwidth is getting hit too hard, but I plan to run it for the foreseeable future. And since the torrents should be backed by IPFS, if it is cached on the gateway then you can still pull it.

kenXengineering commented 7 years ago

Hey, I just wanted to send out a quick update. I had an issue with IPFS and ended up having to reset everything on my machine. Luckily, adding the data back was easy as they have added a --nocopy flag, so IPFS doesn't create duplicate copies of the data, and instead just reads the data from the original files. I have reuploaded all the data and it is now under a new hash, QmcgqJRxgLJ5eUqZHeP5ftVQ1T5Y5eDnif4FkjV42dYXsm. This contains all of aquarius and saral as of April 1st.

I'm also working on creating a new torrent with everything tar'd up instead of multiple torrent files. No one seemed to download the torrents yet so that shouldn't be an issue. Downside is I can't use IPFS as a web seed.