SuLab / schemaSpider

create prototype spider to harvest/visualize structured metadata for datasets #1

Open andrewsu opened 5 years ago

andrewsu commented 5 years ago

Google dataset search works by harvesting structured metadata on datasets. More background reading here and here. For example, this dataset record in Harvard Dataverse https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/G4TBLF has structured data that can be inspected using Google's Structured Data Testing Tool, e.g., https://search.google.com/structured-data/testing-tool/u/0/#url=https%3A%2F%2Fdataverse.harvard.edu%2Fdataset.xhtml%3FpersistentId%3Ddoi%3A10.7910%2FDVN%2FG4TBLF
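
A rough sketch of the harvesting step, assuming the repository embeds its schema.org metadata as JSON-LD in `<script type="application/ld+json">` tags (other encodings such as microdata or RDFa, and JSON-LD `@graph` wrappers, are not handled here). The names `JsonLdExtractor` and `extract_datasets` are made up for illustration; only the Python standard library is used, and fetching the page is left out.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        # Script bodies can arrive in several pieces, so buffer them.
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._chunks))
            except json.JSONDecodeError:
                return  # skip malformed blocks rather than abort the crawl
            self.records.extend(parsed if isinstance(parsed, list) else [parsed])


def is_dataset(record):
    """True if the record's @type is (or includes) "Dataset"."""
    types = record.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return "Dataset" in types


def extract_datasets(html):
    """Return all schema.org Dataset records embedded in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [r for r in parser.records if isinstance(r, dict) and is_dataset(r)]
```

Calling `extract_datasets(page_html)` on the HTML of a dataset landing page should give roughly what the testing tool shows, as a list of plain dicts.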

This ticket would do some very light spidering of a few data repositories (see below) to create a cross-repository index of datasets. I would like to do some analytics on which fields from schema.org/Dataset are being used. I'd also like to create a simple faceted browser for this dataset index.
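
One possible shape for the analytics and the facet index, assuming each harvested record is a plain dict of schema.org/Dataset properties (as parsed from JSON-LD). The function names are hypothetical, and only top-level properties are counted.

```python
from collections import Counter, defaultdict


def count_field_usage(records):
    """Count how many records use each top-level property (JSON-LD @keys skipped)."""
    usage = Counter()
    for record in records:
        usage.update(key for key in record if not key.startswith("@"))
    return usage


def facet_labels(value):
    """Normalize a property value to a list of string labels.

    Nested objects (e.g. a funder Organization) are reduced to their "name".
    """
    items = value if isinstance(value, list) else [value]
    labels = []
    for item in items:
        if isinstance(item, dict):
            if item.get("name") is None:
                continue
            labels.append(str(item["name"]))
        else:
            labels.append(str(item))
    return labels


def build_facet(records, field):
    """Map each distinct value of `field` to the indexes of records that have it."""
    facet = defaultdict(list)
    for index, record in enumerate(records):
        if field not in record:
            continue
        for label in facet_labels(record[field]):
            facet[label].append(index)
    return dict(facet)
```

`count_field_usage(records).most_common()` gives a quick ranking of which properties the repositories actually populate, and `build_facet(records, "keywords")` is the kind of lookup a simple faceted browser could sit on top of.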

General purpose repositories

Biology-specific repositories

(*) already have some embedded structured metadata

andrewsu commented 5 years ago

Maybe also include SciCrunch resources (e.g., https://scicrunch.org/resources/Any/record/nlx_144509-1/SCR_010250/resolver), which already have structured metadata and are indexed by Google dataset search (https://toolbox.google.com/datasetsearch/search?docid=TxwJwTw3pNWr0zULAAAAAA%3D%3D). SciCrunch has API documentation here https://scicrunch.org/browse/api-docs/index.html?url=/swagger-docs/#/dataservices but I'm not sure there is an equivalent of "show all resource records".

andrewsu commented 5 years ago

Noting that Harvard Dataverse now exposes funder as structured metadata according to schema.org/Dataset. For example, see this record and the extracted info from the structured data testing tool: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/G4TBLF https://search.google.com/structured-data/testing-tool/u/0/#url=https%3A%2F%2Fdataverse.harvard.edu%2Fdataset.xhtml%3FpersistentId%3Ddoi%3A10.7910%2FDVN%2FG4TBLF

Zenodo also has records with funding information, but it does not currently expose that info in the metadata:

https://zenodo.org/record/2556641#.XINShShKhaQ https://search.google.com/structured-data/testing-tool/u/0/#url=https%3A%2F%2Fzenodo.org%2Frecord%2F2556641%23.XINShShKhaQ
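
To track this across repositories, something like the following could report what fraction of harvested records expose a `funder`. This is only a sketch: it assumes records are dicts parsed from JSON-LD and that `funder` is either a string, an object with a `name`, or a list of those. The function names and the `records_by_repo` layout are invented for illustration.

```python
def funder_names(record):
    """Return the funder names found in one schema.org/Dataset record."""
    funders = record.get("funder", [])
    if not isinstance(funders, list):
        funders = [funders]
    names = []
    for funder in funders:
        if isinstance(funder, dict) and funder.get("name"):
            names.append(funder["name"])
        elif isinstance(funder, str) and funder:
            names.append(funder)
    return names


def funder_coverage(records_by_repo):
    """Fraction of records per repository that expose at least one funder.

    `records_by_repo` maps a repository label to its list of records;
    repositories with no records are left out of the result.
    """
    return {
        repo: sum(1 for record in records if funder_names(record)) / len(records)
        for repo, records in records_by_repo.items()
        if records
    }
```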