Open pdurbin opened 4 years ago
Just leaving questions to follow up on:
There's a notice at https://www.data.gov/developers/harvesting that says:
Data.gov also syndicates data from state and local governments. However, non-federal data sources are governed by different terms of service and often different licenses than Federal data. When using or harvesting data from Data.gov, please note this distinction. When harvesting large volumes of data or metadata through Data.gov, we recommend you filter for Federal sources and separate non-federal sources to avoid commingling metadata without making this distinction.
Does that apply to the dataset records they publish over OAI-PMH? If so, how would someone harvesting from them make that distinction? The way the datasets are organized into sets doesn't seem helpful for this. There are two sets, one called dataset and the other called interactiveResources. Each contains exactly 1,569,065 records. So do they both have the same records? If so, where are the other ~500k?
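One way to check whether the two sets really hold the same records would be to harvest the identifiers from each with the standard OAI-PMH ListIdentifiers verb and compare them. Below is a minimal sketch, not a tested harvester. It assumes the setSpec values match the set names above (dataset, interactiveResources), that the endpoint accepts the oai_dc metadata prefix, and that the mode=oaipmh parameter has to stay on every request. The function names and the `fetch` callable are hypothetical helpers introduced for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Standard OAI-PMH 2.0 XML namespace.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
BASE_URL = "https://catalog.data.gov/csw?mode=oaipmh"


def list_identifiers_url(set_spec, token=None, base_url=BASE_URL):
    """Build a ListIdentifiers request URL for one set (or a follow-up page)."""
    if token:
        # Per OAI-PMH, a resumptionToken replaces the other request arguments.
        params = {"verb": "ListIdentifiers", "resumptionToken": token}
    else:
        # Assumption: the server supports oai_dc and this setSpec value.
        params = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc", "set": set_spec}
    return base_url + "&" + urlencode(params)


def parse_identifiers(xml_text):
    """Return (set of record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    ids = {
        el.text.strip()
        for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)
        if el.text
    }
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text.strip() if token_el is not None and token_el.text else None
    return ids, token


def harvest_identifiers(set_spec, fetch, max_pages=5):
    """Collect identifiers for a set, following resumption tokens.

    `fetch` is any callable that takes a URL and returns the response XML.
    `max_pages` caps the number of requests so a trial run stays small.
    """
    ids, token = set(), None
    for _ in range(max_pages):
        page_ids, token = parse_identifiers(fetch(list_identifiers_url(set_spec, token)))
        ids |= page_ids
        if not token:
            break
    return ids
```

With a real `fetch` (for example one built on `urllib.request.urlopen`), comparing `harvest_identifiers("dataset", fetch)` against `harvest_identifiers("interactiveResources", fetch)` over the first few pages would give a quick hint about whether the sets overlap. The resumptionToken element may also carry a completeListSize attribute, which could be compared against the 1,569,065 figure.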
I tried harvesting into demo.dataverse and got an error:
https://catalog.data.gov/csw?mode=oaipmh: Invalid URL. Failed to establish connection and receive a valid server response.
We should consider harvesting these, especially since OAI-PMH is supported. Here's a screenshot of the output from the "Identify" verb at https://catalog.data.gov/csw?mode=oaipmh&verb=Identify
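For anyone who wants to inspect the Identify response programmatically rather than from a screenshot, here is a rough sketch. It assumes the response follows the standard OAI-PMH 2.0 schema. `with_verb` and `parse_identify` are hypothetical helper names, not part of Dataverse or pycsw. Because the base URL already carries a query string (mode=oaipmh), the verb is merged into the existing query rather than appended after a second "?".

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Standard OAI-PMH 2.0 XML namespace.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def with_verb(base_url, verb):
    """Add an OAI-PMH verb to a base URL, keeping any existing query parameters."""
    parts = urlsplit(base_url)
    query = parse_qsl(parts.query) + [("verb", verb)]
    return urlunsplit(parts._replace(query=urlencode(query)))


def parse_identify(xml_text):
    """Extract a few standard fields from an Identify response into a dict."""
    root = ET.fromstring(xml_text)
    identify = root.find("oai:Identify", OAI_NS)
    if identify is None:
        raise ValueError("response has no Identify element")
    fields = ("repositoryName", "baseURL", "protocolVersion",
              "earliestDatestamp", "granularity")
    info = {}
    for name in fields:
        el = identify.find(f"oai:{name}", OAI_NS)
        if el is not None and el.text:
            info[name] = el.text.strip()
    return info
```

Calling `with_verb("https://catalog.data.gov/csw?mode=oaipmh", "Identify")` yields the same URL as the one linked above. The XML fetched from that URL can then be passed to `parse_identify`.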
Docs, should we need them, have been kindly provided by @kalxas at https://twitter.com/tzotsos/status/1220770998146031619 and can be found at http://docs.pycsw.org/en/stable/oaipmh.html
There is also some discussion at https://github.com/GSA/data.gov/issues/888