POLDER-Crew / polder-federated-search

A federated search project for POLDER.
BSD 3-Clause "New" or "Revised" License

Index the Global Change Master Directory: repositories that contain more than just polar data #166

Open yemoski opened 1 year ago

yemoski commented 1 year ago

GCMD

Top priority relevant repositories that only contain polar data

Relevant repositories that contain data that needs to be scoped down to polar data

yemoski commented 1 year ago

You can get the URLs for all of the data centers in one big query by repeating the data_center parameter once per center, like this: https://cmr.earthdata.nasa.gov/search/collections?data_center=AU/AADC&data_center=WGMS
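
A minimal sketch of building that kind of query in Python, using only the standard library. The function name `build_data_center_query` is made up for illustration, and the two data center names are the ones from the URL above; nothing here calls the API, it only assembles the URL.

```python
from urllib.parse import urlencode

CMR_COLLECTIONS = "https://cmr.earthdata.nasa.gov/search/collections"


def build_data_center_query(data_centers):
    """Return one collections URL that repeats data_center once per center.

    Hypothetical helper, not part of this repository.
    """
    # A list of (key, value) pairs lets the same key appear several times.
    params = [("data_center", name) for name in data_centers]
    # safe="/" leaves names like AU/AADC unescaped, as in the URL above.
    query = urlencode(params, safe="/")
    return f"{CMR_COLLECTIONS}?{query}"


url = build_data_center_query(["AU/AADC", "WGMS"])
print(url)
```

Passing the full first list of data centers instead of these two would produce the single "big query" described above.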

yemoski commented 1 year ago

OK, our next step is to make a query that grabs all the URLs for the datasets from the data centers in the first list, and then turn them into a sitemap that Gleaner can crawl.
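
A rough sketch of the sitemap half of that step, assuming the dataset URLs have already been collected into a list. `build_sitemap` is a hypothetical helper and the `example.org` URLs are placeholders, not real dataset pages. The output follows the sitemaps.org 0.9 schema (`urlset` / `url` / `loc`); whether Gleaner needs anything beyond `loc` is not covered here.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(dataset_urls):
    """Return sitemap XML (as a string) with one <url><loc> per dataset URL."""
    # Register the sitemap namespace as the default so tags have no prefix.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for dataset_url in dataset_urls:
        url_el = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # ElementTree escapes characters such as & when serializing.
        ET.SubElement(url_el, f"{{{SITEMAP_NS}}}loc").text = dataset_url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Placeholder URLs for illustration only.
sitemap_xml = build_sitemap([
    "https://example.org/datasets/1",
    "https://example.org/datasets/2?format=html&lang=en",
])
print(sitemap_xml)
```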

yemoski commented 1 year ago

https://cmr.earthdata.nasa.gov/search/keywords/providers?pretty=true lists all the providers, so step 1 is to cross-reference them with our list of polar repositories so that we only get datasets from the ones we want. @oluwayemisi4 is working on this.
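
One possible shape for that cross-reference, as a sketch only. It assumes the provider identifiers have already been pulled out of the providers listing into plain strings (the structure of that response is not shown in this thread, so parsing it is left out). It also assumes that names can be matched after folding case and treating `/` and `_` alike, which is a guess based on `AU/AADC` and `AU_AADC` both appearing in this issue. The function names and the sample lists are placeholders.

```python
def normalize(name):
    """Fold case and treat '/' and '_' alike, e.g. AU/AADC vs AU_AADC.

    Assumption: this is enough to line up the two naming styles.
    """
    return name.strip().upper().replace("/", "_")


def cross_reference(provider_ids, polar_repositories):
    """Return the provider IDs whose normalized name is in the polar list."""
    wanted = {normalize(name) for name in polar_repositories}
    # Keep the provider's own spelling so it can be reused in later queries.
    return [pid for pid in provider_ids if normalize(pid) in wanted]


# Placeholder inputs, not the real provider listing or the real polar list.
providers = ["AU_AADC", "WGMS", "SOME_OTHER_PROVIDER"]
polar = ["AU/AADC", "wgms"]
matches = cross_reference(providers, polar)
print(matches)
```

Names that differ by more than case or separator would still need a hand-maintained mapping.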

yemoski commented 1 year ago

OK, here's how to do this: https://cmr.earthdata.nasa.gov/search/collections?provider=AU_AADC returns something that's like a sitemap for AADC datasets. We can request that, turn it into an actual sitemap (like we did for BAS), and crawl it.
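
A sketch of pulling dataset links out of that response, under a stated assumption: that the default response is XML in which each dataset appears as a `reference` element with a `location` child holding its URL. The `SAMPLE` document below is hand-written to match that assumed shape with placeholder IDs; it is not real output from the API, and `extract_locations` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def extract_locations(cmr_xml):
    """Return the text of every <location> element in the response XML."""
    root = ET.fromstring(cmr_xml)
    return [el.text.strip() for el in root.iter("location") if el.text]


# Hand-written stand-in for a response; IDs and names are placeholders.
SAMPLE = """<results>
  <hits>2</hits>
  <references>
    <reference>
      <name>Placeholder dataset A</name>
      <id>C0000000001-AU_AADC</id>
      <location>https://cmr.earthdata.nasa.gov/search/concepts/C0000000001-AU_AADC</location>
    </reference>
    <reference>
      <name>Placeholder dataset B</name>
      <id>C0000000002-AU_AADC</id>
      <location>https://cmr.earthdata.nasa.gov/search/concepts/C0000000002-AU_AADC</location>
    </reference>
  </references>
</results>"""

locations = extract_locations(SAMPLE)
print(locations)
```

The resulting list is what would be written out as sitemap `loc` entries. A real run would also need to handle paging, since a single request may not return every dataset for a provider.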

What else is in the GCMD that we want?

yemoski commented 1 year ago

This is the same situation as the AMD: there's no JSON-LD in here, but there is API access, so we're going to have to figure something out.