ESIPFed / science-on-schema.org

science-on-schema.org - providing guidance for publishing schema.org as JSON-LD for the sciences
Apache License 2.0

SOSO should recommend how to specify identifier for the metadata record #210

Open smrgeoinfo opened 2 years ago

smrgeoinfo commented 2 years ago

In harvesting/federated metadata systems, there needs to be an identifier for the metadata record (in parallel to the identifier for the resource it describes), so that harvesters can compare time stamps and metadata identifiers to determine whether they need to reharvest a record. Using the @id property in the JSON-LD object is the obvious solution, but SOSO should recommend that this identifier be stable and bound to the metadata for a particular resource. Looking at what we've been harvesting for the EarthCube GeoCODES, this is NOT the case with current metadata.
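
To make the reharvest check concrete, here is a minimal sketch of the decision a harvester could make once a stable record identifier exists. The function name `needs_reharvest`, the `seen` mapping, and the use of ISO 8601 strings for the record's modification time are all assumptions for illustration; nothing here is prescribed by SOSO or taken from GeoCODES.

```python
from datetime import datetime


def needs_reharvest(record_id: str, modified: str, seen: dict[str, datetime]) -> bool:
    """Return True if the record is new or has changed since the last harvest.

    record_id: the metadata record identifier (hypothetically the JSON-LD @id).
    modified:  ISO 8601 timestamp of the record, with an explicit UTC offset.
    seen:      hypothetical harvester state, record_id -> last harvested timestamp.
    """
    stamp = datetime.fromisoformat(modified)
    previous = seen.get(record_id)
    # Unknown identifier: harvest. Known identifier: harvest only if newer.
    return previous is None or stamp > previous


# Example harvester state with one previously harvested record.
seen = {
    "https://example.org/metadata/abc": datetime.fromisoformat("2021-03-01T00:00:00+00:00"),
}

unchanged = needs_reharvest("https://example.org/metadata/abc", "2021-03-01T00:00:00+00:00", seen)
updated = needs_reharvest("https://example.org/metadata/abc", "2021-06-15T12:00:00+00:00", seen)
brand_new = needs_reharvest("https://example.org/metadata/xyz", "2021-01-01T00:00:00+00:00", seen)
```

The check only works if the identifier stays the same across harvests. If a provider emits a different @id each time the page is rendered, every record looks new and the harvester accumulates duplicates.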

mbjones commented 2 years ago

@smrgeoinfo We discussed linking to associated metadata records and added guidelines in the 1.2 release to cover this case:

https://github.com/ESIPFed/science-on-schema.org/blob/master/guides/Dataset.md#metadata

We're using that in DataONE to follow the SO record to the more detailed ISO/EML/FGDC records that might already exist. Is that sufficient for your use case?

smrgeoinfo commented 2 years ago

@mbjones thanks, but that's not the issue. We're gleaning schema.org metadata from dataset landing pages, and finding that we're ending up with duplicate records for the same dataset because there's no identifier for the metadata record. Just because they're about the same dataset doesn't mean they are the same metadata record.

njarboe commented 2 years ago

MagIC has this issue because we allow people to update a dataset. Updates are needed to fix errors in the dataset, to include more data than was originally added, or to accommodate new fields that MagIC adds to the data model. We mint a data DOI for each version, but all of those DOIs point to the same page, which highlights the most recent version and also lists previous versions, which remain available for download.

mbjones commented 2 years ago

@smrgeoinfo thanks for clarifying

@njarboe We have the same issue in DataONE. We solved it by distinguishing between the Persistent Identifier (PID), which maps to a specific content-immutable version of a file or package, and the Series Identifier (SID), which maps to the most recent version in a chain of versions. More details are in the DataONE API docs. When we harvest from a SO provider, we use the checksum of the canonicalized JSON-LD as the PID and the provided dc:identifier as the SID. When the repository modifies a record, that results in a new checksum (and a new PID), and we then update the SID to point at that most recent version. This allows us to maintain the version history of all objects from the schema.org harvests, while also directing search results to only the most recent published version. I wonder if other aggregators could do the same?
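
For other aggregators weighing this approach, the PID/SID bookkeeping can be sketched roughly as follows. This is not DataONE's implementation: the class `VersionIndex` and its methods are hypothetical, SHA-256 is an assumed choice of checksum, and sorted-key JSON serialization is only a stand-in for real JSON-LD canonicalization (which operates on the RDF graph, so this shortcut will not treat all equivalent JSON-LD documents as equal).

```python
import hashlib
import json


def canonical_bytes(doc: dict) -> bytes:
    # Stand-in canonical form (assumption): sorted keys, compact separators.
    # Proper JSON-LD canonicalization would normalize the underlying RDF graph.
    text = json.dumps(doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def compute_pid(doc: dict) -> str:
    # PID = checksum of the canonicalized record; SHA-256 is an assumed choice.
    return "sha256:" + hashlib.sha256(canonical_bytes(doc)).hexdigest()


class VersionIndex:
    """Hypothetical in-memory index of harvested records."""

    def __init__(self) -> None:
        self.objects: dict[str, dict] = {}       # pid -> immutable record content
        self.series: dict[str, list[str]] = {}   # sid -> pids, oldest first

    def ingest(self, doc: dict, sid: str) -> str:
        """Store a harvested record under the series identifier the provider gave."""
        pid = compute_pid(doc)
        chain = self.series.setdefault(sid, [])
        if chain and chain[-1] == pid:
            return pid  # content unchanged since the last harvest
        self.objects[pid] = doc
        chain.append(pid)  # the SID now resolves to this newest version
        return pid

    def latest(self, sid: str) -> str:
        """Resolve a SID to the PID of its most recent version."""
        return self.series[sid][-1]


index = VersionIndex()
sid = "doi:10.0000/example"  # placeholder identifier, not a real DOI

pid_v1 = index.ingest({"@type": "Dataset", "name": "Example"}, sid)
# Same content with keys in a different order yields the same PID.
pid_v1_again = index.ingest({"name": "Example", "@type": "Dataset"}, sid)
# Modified content yields a new PID, and the SID moves to it.
pid_v2 = index.ingest({"@type": "Dataset", "name": "Example, revised"}, sid)
```

With this layout, search results can be limited to `index.latest(sid)` for each series, while `index.series[sid]` keeps the full version history.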