daniellecrobinson / Data-Rescue-PDX

Volunteer guide, and other materials for DATA RESCUE PDX

NASA Scraper List #54

Open · max-mapper opened this issue 7 years ago

max-mapper commented 7 years ago

If you are looking for data to scrape, here are some NASA acronyms to get you started:

https://data.nasa.gov (We learned yesterday that everything on data.nasa.gov is also on data.gov)

Acronyms: GCMD, ECHO, CMR, DAACs, OPeNDAP, NSIDC, EOSDIS

NASA data locations: Goddard, Huntsville, Oak Ridge, JPL (LA), Ames

Our goal is to get a scraper going for each of these. Comment below if you are working on one of these repositories.

max-mapper commented 7 years ago

https://earthdata.nasa.gov/nasa-data-policy

You may need an Earthdata login to access some of the data; registration is free.

Also, here is a list of all the FTP and HTTP servers from data.gov, which includes many NASA FTP servers: https://gist.github.com/maxogden/9885244926c1ab576287ff5047dd0e5f

znmeb commented 7 years ago

Working on Goddard Space Flight Center. Mr. Google sent me here:

https://daac.gsfc.nasa.gov/

And they encourage wget!!

https://disc.gsfc.nasa.gov/recipes/?q=recipes/How-to-Download-Data-Files-from-HTTP-Service-with-wget

I can code this up ... do we want to put it up on a server somewhere?
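For anyone who'd rather script this than shell out to wget, here's a minimal Python sketch of the same idea. The directory name is a placeholder and the URL handling is deliberately simple, not the GSFC recipe itself:

```python
import os
import urllib.request

def url_to_filename(url):
    """Derive a local filename from the last path segment of a URL."""
    return url.rstrip("/").split("/")[-1] or "index.html"

def download_files(urls, dest_dir="gsfc_data"):
    """Download each URL into dest_dir, skipping files already present."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in urls:
        path = os.path.join(dest_dir, url_to_filename(url))
        if not os.path.exists(path):
            # urlretrieve follows HTTP redirects, much like wget does
            urllib.request.urlretrieve(url, path)
        saved.append(path)
    return saved
```

The skip-if-present check makes re-runs cheap, which matters when a crawl dies halfway through a big archive.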

jmicrobe commented 7 years ago

https://genelab-data.ndc.nasa.gov/genelab/projects

A very nice database for genetic research done IN SPACE!

sckott commented 7 years ago

Sam and I are doing NSIDC

samavar14 commented 7 years ago

For the Level-1 and Atmosphere Archive and Distribution System (LAADS) DAAC, they have archived all of their data on both FTP and HTTP sites: ftp://ladsweb.modaps.eosdis.nasa.gov https://ladsweb.nascom.nasa.gov/archive

A useful README covering what data is contained and how to access it is here: https://ladsweb.nascom.nasa.gov/archive/README

samavar14 commented 7 years ago

Actually, it looks like all the DAACs' data is contained in the Common Metadata Repository: https://wiki.earthdata.nasa.gov/display/CMR/CMR+Client+Partner+User+Guide. Based on this, would we only need one scraper to pull all the data from this system?
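If CMR really does cover everything, a single pager over the collections endpoint might be enough. Here's a sketch: the `page_size`/`page_num` query parameters and the Atom-style `feed`/`entry`/`links` JSON shape match what the collections.json endpoint serves, but the page size here is an arbitrary choice and the fetching itself is left out:

```python
CMR_BASE = "https://cmr.earthdata.nasa.gov/search/collections.json"

def page_url(page_num, page_size=100):
    """Build the URL for one page of CMR collection results."""
    return f"{CMR_BASE}?page_size={page_size}&page_num={page_num}"

def extract_links(page):
    """Pull every resource href out of one parsed page of results."""
    links = []
    for entry in page.get("feed", {}).get("entry", []):
        for link in entry.get("links", []):
            if "href" in link:
                links.append(link["href"])
    return links
```

You'd loop `page_num` upward, fetch each `page_url`, and feed the parsed JSON through `extract_links` until a page comes back empty.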

shawnbot commented 7 years ago

I've got dibs on crawling https://opendap.larc.nasa.gov/opendap/ 🚀

crhallberg commented 7 years ago

I took a look at the CMR page and started parsing the metadata provided at https://cmr.sit.earthdata.nasa.gov/search/collections.json.

I put together a script that traces the files linked there with curl and outputs their final location after redirects: https://gist.github.com/crhallberg/eebc86dd74ec36e9f2f522ac1559cb7b.

That's just the bare-bones version. I also have one that does a lot more (saves collections.json, separates files into data, webpage, and broken, has status output) if needed.
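For anyone curious what the redirect tracing boils down to, here's an offline Python sketch of the logic. The real script uses curl against live servers; the `redirects` mapping here just stands in for the Location headers it would see:

```python
def resolve_redirects(url, redirects, max_hops=10):
    """Follow a chain of redirects (url -> next Location) to the final URL."""
    seen = set()
    while url in redirects:
        if url in seen or len(seen) >= max_hops:
            break  # redirect loop or too many hops: stop where we are
        seen.add(url)
        url = redirects[url]
    return url
```

The loop guard matters in practice: misconfigured servers do bounce crawlers in circles.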

max-mapper commented 7 years ago

@crhallberg awesomeness, do you have an idea of how many datasets are available under that collections endpoint? is each collection a big group of datasets? do you have an example of the metadata that your script produces?

crhallberg commented 7 years ago

I'm glad you asked, because I'm still very new to this. There is a LOT more info here than I thought. My initial thought was that what I was parsing was an update feed. Turns out I was on page 1 of 19,590 items. I still don't know how many there are in total. A part of the documentation I just found says "You can not page past the 1 millionth item." so there is (obviously) a heck of a lot.

Do you have any examples of good metadata that I can aim for as I iterate on this?

max-mapper commented 7 years ago

@crhallberg hah! that's a lot of data :) if you wanna check out the data.gov metadata, the gold standard in my opinion, check out this guide i wrote last month https://github.com/jsonlines/guide. the main idea is you have a JSON object for each dataset, and that object has an array of resource URLs, one for each data file.
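To make the JSON Lines idea concrete, here's a small round-trip sketch. The record fields (`title`, `resources`) follow the shape described above but aren't a fixed schema:

```python
import json

def to_jsonl(datasets):
    """Serialize dataset records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(ds, sort_keys=True) + "\n" for ds in datasets)

def from_jsonl(text):
    """Parse JSON Lines text back into a list of records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

One object per line means you can append as you crawl and stream the file back later without loading the whole thing, which is the point of the format for big scrapes.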

nichoth commented 7 years ago

Is this related to the tweet https://twitter.com/denormalize/status/838550043397234691 ? I was wondering if you found a solution to the parallel ftp problem.

crhallberg commented 7 years ago

Update: I've identified 48,126 links. Some are invalid, some are ftp folders, I'm weeding through now by checking headers. After I've separated the wheat links from the chaff links, I'll reconcile it with the original metadata.

I will place a link here when I have a centralized place to show and tell progress: https://github.com/crhallberg/nasa-cmr-scraper.
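A sketch of what that header-based weeding could look like as a pure function. The bucket names echo the data/webpage/broken split mentioned earlier, but the rules here are my guesses, not the actual scraper code:

```python
def classify_link(url, status=None, content_type=""):
    """Bucket one crawled link by scheme, status code, and Content-Type."""
    if url.startswith("ftp://"):
        return "ftp"          # FTP needs a different client entirely
    if status is None or status >= 400:
        return "broken"       # no response, or a 4xx/5xx error
    if "text/html" in content_type:
        return "webpage"      # landing page, not a data file
    return "data"
```

Checking with HEAD requests before classifying keeps this cheap, since you never download the body just to sort a link.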

crhallberg commented 7 years ago

I wasn't sure where else to push this, so I just made a new repository: https://github.com/crhallberg/nasa-cmr-scraper