max-mapper opened this issue 7 years ago
https://earthdata.nasa.gov/nasa-data-policy
You may need an Earthdata login to access some of the data; registration is free.
Also, here is a list of all the FTP and HTTP servers from data.gov, which includes many NASA FTP servers: https://gist.github.com/maxogden/9885244926c1ab576287ff5047dd0e5f
Working on Goddard Space Flight Center. Mr. Google sent me here:
https://disc.gsfc.nasa.gov/recipes/?q=recipes/How-to-Download-Data-Files-from-HTTP-Service-with-wget
And they encourage wget!!
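For anyone following along, the recipe boils down to roughly this pattern (a sketch from memory of that page, not the recipe itself; USERNAME/PASSWORD and the granule URL are placeholders, and the authoritative flags are on the linked page):

```bash
# One-time setup: store Earthdata (URS) credentials so wget can authenticate.
# USERNAME and PASSWORD are placeholders for your own Earthdata login.
echo "machine urs.earthdata.nasa.gov login USERNAME password PASSWORD" >> ~/.netrc
chmod 600 ~/.netrc

# Download a file, keeping the URS session cookies across the auth redirects.
wget --load-cookies ~/.urs_cookies --save-cookies ~/.urs_cookies \
     --keep-session-cookies --content-disposition \
     "https://disc.gsfc.nasa.gov/path/to/some/granule"
```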
I can code this up ... do we want to put it up on a server somewhere?
https://genelab-data.ndc.nasa.gov/genelab/projects
A very nice database for genetic research done IN SPACE!
Sam and I are doing NSIDC
For the Earth Sciences' Level-1 and Atmosphere Archive and Distribution System (LAADS) DAAC, all of the data is archived on both FTP and HTTP sites: ftp://ladsweb.modaps.eosdis.nasa.gov and https://ladsweb.nascom.nasa.gov/archive
A useful README describing what data is contained and how to access it is here: https://ladsweb.nascom.nasa.gov/archive/README
Actually, it looks like all the DAACs' data is contained in the Common Metadata Repository: https://wiki.earthdata.nasa.gov/display/CMR/CMR+Client+Partner+User+Guide. Based on this, would we only need one scraper to pull all the data from this system?
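If so, a quick way to size the job up front could be the hit count the search API returns (assuming the production endpoint and the CMR-Hits response header behave the way the CMR docs describe):

```bash
# Ask CMR how many collections match without downloading them all:
# -D - dumps the response headers, where the total lands in CMR-Hits.
curl -s -o /dev/null -D - \
  "https://cmr.earthdata.nasa.gov/search/collections.json?page_size=1" \
  | grep -i '^cmr-hits'
```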
I've got dibs on crawling https://opendap.larc.nasa.gov/opendap/ 🚀
I took a look at the CMR page and started parsing the metadata provided at https://cmr.sit.earthdata.nasa.gov/search/collections.json.
I put together a script that traces the files linked there with curl and outputs their final location after redirects: https://gist.github.com/crhallberg/eebc86dd74ec36e9f2f522ac1559cb7b.
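The core curl trick, for anyone who doesn't want to dig through the gist, looks like this (my minimal version; links.txt is a hypothetical one-URL-per-line input file):

```bash
# Print "final-url <tab> original-url" for every link in links.txt.
# -I fetches headers only, -L follows redirects, -s keeps curl quiet,
# and %{url_effective} is the URL curl ends up at after all redirects.
while read -r url; do
  final=$(curl -sIL -o /dev/null -w '%{url_effective}' "$url")
  printf '%s\t%s\n' "$final" "$url"
done < links.txt
```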
That's just the bare-bones version. I also have one that does a lot more (saves collections.json; separates files into data, webpage, and broken categories; has status output) if needed.
@crhallberg awesomeness, do you have an idea of how many datasets are available under that collections endpoint? is each collection a big group of datasets? do you have an example of the metadata that your script produces?
I'm glad you asked, because I'm still very new to this. There is a LOT more info here than I thought. My initial thought was that what I was parsing was an update feed. Turns out I was on page 1 of 19,590 items. I still don't know how many there are in total. A part of the documentation I just found says "You can not page past the 1 millionth item." so there is (obviously) a heck of a lot.
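For what it's worth, paging through the endpoint can be scripted directly, assuming the search API's page_size and page_num parameters work as documented (2000 is my understanding of the maximum page size, and the 100-page cap below is arbitrary):

```bash
# Pull the collections feed page by page, saving each page to its own file.
base="https://cmr.sit.earthdata.nasa.gov/search/collections.json"
for page in $(seq 1 100); do   # arbitrary cap for this sketch
  curl -s "${base}?page_size=2000&page_num=${page}" \
    -o "collections-page-${page}.json"
done
```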
Do you have any examples of good metadata that I can aim for as I iterate on this?
@crhallberg hah! that's a lot of data :) if you wanna see the data.gov metadata, the gold standard in my opinion, check out this guide i wrote last month: https://github.com/jsonlines/guide. the main idea is you have a JSON object for each dataset, and that object has an array of resource URLs, one for each data file.
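for the CMR feed, getting into that shape could start with something like this (just a sketch: i'm assuming the response keeps its Atom-style feed.entry wrapper and that you have jq around; each line would still need its resource URLs pulled out of the entry's links):

```bash
# flatten one page of CMR collections into JSON Lines:
# jq -c prints one compact JSON object per line, one per dataset.
curl -s "https://cmr.sit.earthdata.nasa.gov/search/collections.json?page_size=50" \
  | jq -c '.feed.entry[]' > collections.jsonl
```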
Is this related to this tweet: https://twitter.com/denormalize/status/838550043397234691 ? I was wondering if you found a solution to the parallel FTP problem.
Update: I've identified 48,126 links. Some are invalid and some are FTP folders; I'm weeding through them now by checking headers. After I've separated the wheat links from the chaff links, I'll reconcile the list with the original metadata.
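In case it's useful to anyone doing similar weeding, the header check can be a simple status-code sort (a sketch: links.txt is a hypothetical input file, and ftp:// links would need separate handling since curl reports FTP reply codes there rather than HTTP statuses):

```bash
# Sort links into good/broken lists by response status.
# -I asks for headers only, -L follows redirects, --max-time avoids hangs.
while read -r url; do
  code=$(curl -s -o /dev/null -I -L --max-time 15 -w '%{http_code}' "$url")
  if [ "$code" -ge 200 ] && [ "$code" -lt 400 ]; then
    echo "$url" >> good-links.txt
  else
    echo "$code $url" >> broken-links.txt
  fi
done < links.txt
```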
I will place a link here when I have a centralized place to show and tell progress: https://github.com/crhallberg/nasa-cmr-scraper.
I wasn't sure where else to push this, so I just made a new repository: https://github.com/crhallberg/nasa-cmr-scraper
If you are looking for data to scrape, here are some NASA acronyms to get you started:
- https://data.nasa.gov (We learned yesterday that everything on data.nasa.gov is also on data.gov)
- GCMD
- ECHO
- CMR
- DAACs
- OPeNDAP
- NSIDC
- EOSDIS

NASA data locations: Goddard, Huntsville, Oak Ridge, JPL (LA), Ames
Our goal with these is to:
Comment below if you are working on one of these repositories