ESIPFed / science-on-schema.org

science-on-schema.org - providing guidance for publishing schema.org as JSON-LD for the sciences
Apache License 2.0

recommendation for indicating authoritative copy of dataset #37

Open mbjones opened 4 years ago

mbjones commented 4 years ago

Many datasets are present in multiple catalogs, including the original provider, the current host of the dataset, and multiple aggregator sites that might maintain landing pages (e.g., DataONE, data.gov, Cinergi). Aggregators like Google Dataset Search harvest these entries from their multiple landing pages and show in their listings where the dataset can be accessed. However, there is no indication of which site maintains the authoritative copy of the dataset. For example, here's a view that shows three locations but displays the DataCite logo, even though the Arctic Data Center is the authoritative holder of these data.

[Image: gds-arctic-data]

While our dataset schema.org entry can specify includedInCatalog, that doesn't indicate which catalog or repository is authoritative for the dataset. There is also some ambiguity over what publisher means for these entries when the same dataset can be published by multiple organizations. I'm also unclear which fields are used to generate the "Dataset provided by" display on Google Dataset Search, which sometimes lists one of the locations and sometimes lists several. In the example above, only two of the three replica locations are shown. I suggest that we need a specific field such as authoritativePublisher or authoritativeRepository, unless there is an existing term that plays that role. What is our recommendation for this concept?

amoeba commented 4 years ago

This is a good topic to bring up.

> There is also some ambiguity over what the meaning of publisher is for these entries when the same data set can be published by multiple organizations.

There are tricky semantics here, especially given that Schema.org's definition is pretty slim:

> http://schema.org/publisher: The publisher of the creative work.

As a side note, DataCite does a nice job in their JSON-LD (which Google seems to ignore here). In the JSON-LD for your example, available at https://api.datacite.org/dois/application/vnd.schemaorg.ld+json/10.18739/a2dz03215, I see:

  "publisher": {
    "@type": "Organization",
    "name": "Arctic Data Center"
  },
  "provider": {
    "@type": "Organization",
    "name": "DataCite"
  }

So maybe publisher is a good fit for describing the authority and provider can be used for any copies?

smrgeoinfo commented 4 years ago

The trick is the binding between the URL for accessing the resource and the publisher (authoritative, if we decide to make that the convention) or provider. This could be done by putting the information in distribution/DataDownload/publisher or distribution/DataDownload/provider (on different distributions) to indicate 'authoritative' and 'alternate' sources.
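
A minimal sketch of that idea, in Python for convenience. It assumes the convention floated in this thread (publisher on a distribution means authoritative, provider means alternate copy), which is not an agreed recommendation. The dataset name, both URLs, "Example Aggregator", and the helper authoritative_urls are hypothetical placeholders.

```python
import json

# Hypothetical record: the name, URLs, and "Example Aggregator" are placeholders.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example dataset",
    "distribution": [
        {
            "@type": "DataDownload",
            "contentUrl": "https://repository.example.org/data/abc.csv",
            # Assumed convention: publisher marks the authoritative source.
            "publisher": {"@type": "Organization", "name": "Arctic Data Center"},
        },
        {
            "@type": "DataDownload",
            "contentUrl": "https://aggregator.example.org/mirror/abc.csv",
            # Assumed convention: provider marks an alternate copy.
            "provider": {"@type": "Organization", "name": "Example Aggregator"},
        },
    ],
}


def authoritative_urls(record):
    """Return the contentUrl of every distribution that carries a publisher."""
    return [
        dist["contentUrl"]
        for dist in record.get("distribution", [])
        if "publisher" in dist and "contentUrl" in dist
    ]


# Serialize to the JSON-LD that would be embedded in a landing page.
print(json.dumps(dataset, indent=2))
print(authoritative_urls(dataset))
```

Keeping the organization on each DataDownload ties it to a specific access URL, which is the binding that a dataset-level publisher or provider cannot express on its own.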

ashepherd commented 4 years ago

See the related BoF: https://docs.google.com/document/d/17hrcLpxcAA3_U3MZ3sWHrbaeNVa1k8yaigBzbjmxFHk/edit

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity.

ashepherd commented 4 years ago

@ashepherd to invite the ESIP Data Citation WG to weigh in on this issue.

mbjones commented 3 years ago

As a data point, a paper on how Google Dataset Search handles the publisher field provides the following:

> Providers: There is some ambiguity in schema.org on how to specify the source of a dataset. We use the so#publisher and so#provider properties to identify the organization that provided the dataset. As with other properties, the value may be a string or an Organization object. Wherever possible, we reconcile the organization to the corresponding entity in the Google Knowledge Graph.

See:

@inproceedings{49385,
  title     = {Google Dataset Search by the Numbers},
  author    = {Omar Benjelloun and Shiyu Chen and Natasha Noy},
  year      = {2020},
  url       = {https://arxiv.org/abs/2006.06894},
  booktitle = {International Semantic Web Conference (ISWC-2020), In-Use Track}
}
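
As a rough illustration of what that passage implies for consumers of the markup, here is a sketch in Python that reads publisher and provider and accepts either a plain string or an Organization object. The helper name organization_names is hypothetical, the handling of lists is my own assumption (JSON-LD allows array values), and the Knowledge Graph reconciliation step is not modeled. This is not Google's implementation.

```python
def organization_names(record):
    """Collect organization names from the publisher and provider properties.

    Each value may be a string, an Organization-like dict with a "name",
    or (an assumption, not stated in the paper) a list of either.
    """
    names = []
    for prop in ("publisher", "provider"):
        value = record.get(prop)
        if value is None:
            continue
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, str):
                name = item
            elif isinstance(item, dict):
                name = item.get("name")
            else:
                name = None
            # Skip missing names and avoid listing an organization twice.
            if name and name not in names:
                names.append(name)
    return names


# The DataCite example from earlier in this thread, with provider as a string.
example = {
    "publisher": {"@type": "Organization", "name": "Arctic Data Center"},
    "provider": "DataCite",
}
print(organization_names(example))
```

Because both properties feed the same list, a harvester that works this way cannot tell the authoritative holder from a copy unless a convention distinguishes the two terms.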

mbjones commented 3 years ago

The display of the original dataset that I used as an example has now changed at Google Dataset Search, and it correctly lists the Arctic Data Center as the provider. We didn't change our schema.org metadata, so maybe something changed on the Google harvesting end.

[Image: data-provider]

ashepherd commented 3 years ago

TO-DOs (ESIP Summer Meeting)

1) Clarify the meaning of the publisher and provider terms, then look at use cases in which two different copies of the data are actually the same dataset and decide how to represent that in schema.org markup

2) Review paper: https://datascience.codata.org/articles/10.5334/dsj-2021-012/

3) In the guidelines, make clear the difference between copies of the same dataset and derived datasets, which need the provenance relationships to express derivation

ashepherd commented 3 years ago

Notes from 8/26/2021 meeting: