ESIPFed / science-on-schema.org

science-on-schema.org - providing guidance for publishing schema.org as JSON-LD for the sciences
Apache License 2.0

Standard way of specifying a well-defined identifier #13

Closed johardi closed 4 years ago

johardi commented 5 years ago

I found inconsistencies in how the guides document specifies a well-defined identifier, and I'm wondering which form you would endorse.

In my mind, I'm thinking a structure that's something like this:

"identifier": {
    "@type": "PropertyValue",
    "propertyID*)": "datacite:doi",
    "name**)": "DOI",
    "value": "10.1575/1912/bco-dmo.665253",
    "url": "https://doi.org/10.1575/1912/bco-dmo.665253"
}

*) I found an example that uses datacite:usesIdentifierScheme
**) Instead of propertyID field

ashepherd commented 5 years ago

Thanks @johardi. During our P418 project, we first tried using propertyID, but the definition of that field doesn't expand CURIEs like datacite:doi to their fully qualified URIs. If publishers weren't using the same prefix, you wouldn't get consistent results. We then tried writing code to do the expansion during the harvest, but then it's hard to know which fields were intended to be expanded. So we decided to punt on the issue and use a property that explicitly defines the identifier scheme until further discussion, like your issue here!
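To make the expansion problem concrete, here is a minimal Python sketch. The prefix tables and the `expand_curie` helper are hypothetical and are not part of P418 or schema.org. The point is only that a plain-string propertyID carries no shared prefix table, so two publishers can name the same scheme with strings that do not compare equal.

```python
# Hypothetical prefix tables: each publisher picks its own prefix for the
# same namespace, and neither table travels with a plain-string propertyID.
PUBLISHER_A = {"datacite": "http://purl.org/spar/datacite/"}
PUBLISHER_B = {"dc": "http://purl.org/spar/datacite/"}


def expand_curie(value, prefixes):
    """Expand 'prefix:local' to a full URI when the prefix is known.

    Values with an unknown prefix (or no colon at all) are returned
    unchanged, because a harvester cannot tell whether expansion was intended.
    """
    prefix, sep, local = value.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return value


raw_a = "datacite:doi"
raw_b = "dc:doi"

# The raw strings differ even though both publishers mean the same scheme.
print(raw_a == raw_b)
# With each publisher's own table, both expand to the same URI.
print(expand_curie(raw_a, PUBLISHER_A) == expand_curie(raw_b, PUBLISHER_B))
# With the wrong table, the value stays an opaque string.
print(expand_curie(raw_b, PUBLISHER_A))
```

Using a full URI as the propertyID sidesteps the problem, since no prefix table is needed to compare values.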

My main issue with the Datacite vocabulary is that it doesn't seem to have had much adoption or advertising from Datacite, so I'm hesitant to say it's the best vocabulary for our community for describing identifiers. On the other hand, I think they did a good job of modeling the problem.

Any thoughts on a different vocabulary and/or arguments to change the guidelines to go in a different direction here?

@mbjones

steingod commented 4 years ago

I am not sure whether this fits here, but it is not obvious to me how multiple identifiers are supported for e.g. datasets. One concrete example from the meteorological perspective is that datasets often have multiple identifiers depending on the framework they are shared through, in addition to local (host) identifiers. Are there specific recommendations on this?

johardi commented 4 years ago

You can use the schema:identifier to facilitate multiple identifiers (see the description and examples here: https://schema.org/identifier).

This is how I applied it for an article in PubMed:

{
    "@context": "http://schema.org",
    "@type": "MedicalScholarlyArticle",
    "@id": "http://identifiers.org/pubmed/29674413",
    "identifier": [
      {
        "@type": "PropertyValue",
        "propertyID": "pubmed",
        "value": "29674413"
      },
      {
        "@type": "PropertyValue",
        "propertyID": "pii",
        "value": "ajnr.A5646"
      },
      {
        "@type": "PropertyValue",
        "propertyID": "doi",
        "value": "10.3174/ajnr.A5646"
      }
    ],
...
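
On the consuming side, the identifier property may hold a single value or an array, so a harvester usually has to normalize it first. The following is a minimal Python sketch of that step. The helper name `as_identifier_list` is hypothetical, and the sample node only repeats the values from the example above.

```python
def as_identifier_list(node):
    """Return the 'identifier' values of a JSON-LD node as a list.

    A single value (string or object) is wrapped in a list; a missing
    property yields an empty list.
    """
    value = node.get("identifier")
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


article = {
    "@type": "MedicalScholarlyArticle",
    "@id": "http://identifiers.org/pubmed/29674413",
    "identifier": [
        {"@type": "PropertyValue", "propertyID": "pubmed", "value": "29674413"},
        {"@type": "PropertyValue", "propertyID": "pii", "value": "ajnr.A5646"},
        {"@type": "PropertyValue", "propertyID": "doi", "value": "10.3174/ajnr.A5646"},
    ],
}

# Print each scheme label with its value.
for ident in as_identifier_list(article):
    print(ident["propertyID"], ident["value"])
```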

mbjones commented 4 years ago

After review, it seems we need to update our proposed ADR and Guidelines to reflect the examples shown here. The ADR is 13-schemaorg-identifier-as-PropertyValue.md.

rduerr commented 4 years ago

Currently the ADR states:

"We will encourage the use of schema.org/PropertyValue when describing persistent identifiers (PIDs)."

This is not a strong enough statement. This is one of the cases where specifying the controlled vocabulary for the type and the exact form of the value to use is important so that strong validators can be created.

Using the members of http://purl.org/spar/datacite/ResourceIdentifierScheme seems appropriate as does the format specification for the PID in the DataCite-MetadataKernel_v4.3.pdf document, though there are a few types where the datacite format is ambiguous and should be tightened up. Suggest a table be added to the documentation for the types and their appropriate formats.
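
As a rough idea of what such a table could drive, here is a minimal Python sketch of a format check keyed by scheme URI. The two regular expressions are illustrative assumptions for this sketch only; they are not taken from the DataCite Metadata Kernel document, and `check_format` is a hypothetical name.

```python
import re

# Illustrative table only: the patterns are assumptions for this sketch and
# are not taken from the DataCite Metadata Kernel document.
FORMATS = {
    "http://purl.org/spar/datacite/doi": re.compile(r"^10\.\d{4,9}/\S+$"),
    "http://purl.org/spar/datacite/pmid": re.compile(r"^\d+$"),
}


def check_format(scheme_uri, value):
    """Return True or False for a known scheme, None when no rule exists."""
    pattern = FORMATS.get(scheme_uri)
    if pattern is None:
        return None
    return pattern.match(value) is not None


print(check_format("http://purl.org/spar/datacite/doi", "10.1575/1912/bco-dmo.665253"))
print(check_format("http://purl.org/spar/datacite/pmid", "ajnr.A5646"))
```

Returning None for unknown schemes keeps "no rule yet" distinct from "value is malformed", which matters if the table starts small and grows.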

mbjones commented 4 years ago

Over in PR #79, @datadavev asked the following, which I think we should resolve here:

In the third example, is the intent really to assert that the graph node of type identifier has id of "https://doi.org/10.1234/56789"? Seems that id value should really be on the Dataset class.

mbjones commented 4 years ago

The SPAR DataCite ontology is not resolving at the moment, but you can also view it on GitHub: ResourceIdentifierScheme. The values for the identifier schemes have specific URIs (albeit not currently resolving), but note that these URIs use a different capitalization than the names listed in the DataCite Kernel Metadata Spec. For example, the DataCite value ARK corresponds to the SPAR DataCite URI http://purl.org/spar/datacite/ark. Also, the SPAR vocabulary has more identifier types than the DataCite specification.

The current DataCite PID vocabulary lists:

<xs:restriction base="xs:string">
    <xs:enumeration value="ARK"/>
    <xs:enumeration value="arXiv"/>
    <xs:enumeration value="bibcode"/>
    <xs:enumeration value="DOI"/>
    <xs:enumeration value="EAN13"/>
    <xs:enumeration value="EISSN"/>
    <xs:enumeration value="Handle"/>
    <xs:enumeration value="IGSN"/>
    <xs:enumeration value="ISBN"/>
    <xs:enumeration value="ISSN"/>
    <xs:enumeration value="ISTC"/>
    <xs:enumeration value="LISSN"/>
    <xs:enumeration value="LSID"/>
    <xs:enumeration value="PMID"/>
    <xs:enumeration value="PURL"/>
    <xs:enumeration value="UPC"/>
    <xs:enumeration value="URL"/>
    <xs:enumeration value="URN"/>
    <xs:enumeration value="w3id"/>
</xs:restriction>

Neither of these lists is complete. A much more comprehensive list is maintained at the identifiers.org registry. In addition, we need a way to specify an identifier string that is unique within a specific system but is not in the above lists (e.g., an NCEI Accession number).
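
The capitalization difference mentioned above can be bridged with a small mapping step. This Python sketch assumes that every DataCite name maps to its SPAR URI by lowercasing; only the ARK case is confirmed in this thread, so anything else should be checked against the ontology. `spar_uri` is a hypothetical helper name.

```python
SPAR_BASE = "http://purl.org/spar/datacite/"


def spar_uri(datacite_name):
    """Build a SPAR DataCite scheme URI from a DataCite identifier type name.

    Assumption: the URI's local part is the lowercased DataCite name.
    Only ARK -> .../ark is confirmed in the discussion above.
    """
    return SPAR_BASE + datacite_name.lower()


print(spar_uri("ARK"))
```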

smrgeoinfo commented 4 years ago

Over in PR #79, @datadavev asked the following, which I think we should resolve here:

In the third example, is the intent really to assert that the graph node of type identifier has id of "https://doi.org/10.1234/56789"? Seems that id value should really be on the Dataset class.

I'm guessing this is referring to this kind of encoding:

{
    "@context": "http://schema.org",
    "@type": "MedicalScholarlyArticle",
    "@id": "http://identifiers.org/pubmed/29674413",
    "identifier": [
      {
        "@type": "PropertyValue",
        "propertyID": "pubmed",
        "value": "29674413"
      },
      {
        "@type": "PropertyValue",
        "propertyID": "pii",
        "value": "ajnr.A5646"
      }]
}

I would think that if we were assigning an identifier to the sdo:identifier element, it would look like this:

{
    "@context": "http://schema.org",
    "@type": "MedicalScholarlyArticle",
    "@id": "http://identifiers.org/pubmed/29674413",
    "identifier": 
          "@id":"http://some.uri.com/ldskdgjrsr",
         [
         {
           "@type": "PropertyValue",
           "propertyID": "pubmed",
           "value": "29674413"
          }, ...

I don't see a problem with @johardi 's example (except for the propertyID identifiers...); the @type PropertyValue objects are values of the identifier element, consistent with the values expected for sdo:identifier.

johardi commented 4 years ago

@smrgeoinfo I'm not sure your example creates a valid JSON document.
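
A quick way to check this is to run a reduced version of the fragment through a JSON parser. The Python sketch below uses the standard library `json` module. The fragment is a shortened stand-in for the example above (closed off so that only the structure in question is tested), and `is_valid_json` is a hypothetical helper.

```python
import json

# Reduced version of the fragment: a second key/value pair placed directly
# after a key, followed by a bare array.
fragment = """
{
    "identifier":
        "@id": "http://some.uri.com/ldskdgjrsr",
        [ {"@type": "PropertyValue", "propertyID": "pubmed", "value": "29674413"} ]
}
"""


def is_valid_json(text):
    """Return True if the text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json(fragment))  # False: "identifier" is followed by a second key
```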

johardi commented 4 years ago

I'd like to correct my example of using PropertyValue for specifying identifiers by referring back to my original post.

"identifier": [
      {
        "@type": "PropertyValue",
        "propertyID": "http://purl.org/spar/datacite/pmid",
        "name": "PMID",
        "value": "29674413",
        "url": "http://www.ncbi.nlm.nih.gov/pubmed/29674413"
      },
      {
        "@type": "PropertyValue",
        "propertyID": "http://purl.org/spar/datacite/pii",
        "name": "PII",
        "value": "ajnr.A5646"
      },
      {
        "@type": "PropertyValue",
        "propertyID": "http://purl.org/spar/datacite/doi",
        "name": "DOI",
        "value": "10.3174/ajnr.A5646",
        "url": "https://doi.org/10.3174/ajnr.A5646"
      }

Notes:

  1. @ashepherd: Schema.org specifies that the propertyID field can be a prefixed string, a non-prefixed string, or a URL that points to an external vocabulary or a web resource (see: https://schema.org/propertyID). Example 8 and Example 9 demonstrate how to use the propertyID field in PropertyValue.
  2. @mbjones: Schema.org also specifies that the identifier field can be used to provide additional or alternative identifiers (see: https://schema.org/docs/datamodel.html#identifierBg) for a resource that already has an @id field.

mbjones commented 4 years ago

That looks great to me @johardi . It makes both the identifier value and resolution URI explicit, and links the identifier to a formal type which helps machines understand the resolution semantics properly. So I would be supportive of the syntax you list as being the best practice guideline. Although we should add in the final closing square bracket.

@rduerr stated that we should be more explicit in our guidance about the controlled vocabularies to use, so can we also come to agreement on that? I would propose that we state that providers:

  1. SHOULD provide all identifiers used for the dataset as objects of type schema:PropertyValue
  2. SHOULD include a schema:propertyID for each identifier that links back to the identifier scheme using URIs drawn from the http://purl.org/spar/datacite/IdentifierScheme vocabulary or from identifiers.org registered prefixes from https://registry.identifiers.org/registry. If the identifier type does not exist in the SPAR datacite vocabulary or identifiers.org, use the best canonical URI for the identifier scheme that can be found.
  3. SHOULD include properties for the identifier, including both the value of the identifier (e.g., 10.3174/ajnr.A5646), and the url format, which can be repeated if multiple URIs exist (e.g., https://doi.org/10.3174/ajnr.A5646 and https://identifiers.org/doi:10.3174/ajnr.A5646). When possible, the value property should be expressed using its Compact URI format.

So, here are two examples that use common identifier schemes not in the DataCite vocabulary, and use the Compact URI format for values:

"identifier": [
      {
        "@type": "PropertyValue",
        "propertyID": "https://registry.identifiers.org/registry/paleodb",
        "name": "PALEODB",
        "value": "paleodb:83088",
        "url": "https://identifiers.org/paleodb:83088"
      }
]

"identifier": [
      {
        "@type": "PropertyValue",
        "propertyID": "https://registry.identifiers.org/registry/pdb",
        "name": "PDB",
        "value": "pdb:2gc4",
        "url": "https://identifiers.org/pdb:2gc4"
      }
]

Anything else? Is that strong enough? Change SHOULD to MUST?
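
The three proposed items can be sketched as a small checker. This is a minimal Python illustration, not an agreed validator: `check_identifier` is a hypothetical name, the URI test is a simple scheme-prefix check, and every finding is reported as a warning because the items are currently worded as SHOULD.

```python
def check_identifier(ident):
    """Return a list of warnings for one entry of the identifier array."""
    # Item 1: each identifier is an object of type PropertyValue.
    if not isinstance(ident, dict) or ident.get("@type") != "PropertyValue":
        return ["identifier is not an object of type PropertyValue"]

    warnings = []
    # Item 2: propertyID links to the identifier scheme with a URI.
    property_id = str(ident.get("propertyID", ""))
    if not property_id.startswith(("http://", "https://")):
        warnings.append("propertyID is not a URI for the identifier scheme")
    # Item 3: both the identifier value and a url are present.
    for key in ("value", "url"):
        if not ident.get(key):
            warnings.append("missing " + key)
    return warnings


good = {
    "@type": "PropertyValue",
    "propertyID": "https://registry.identifiers.org/registry/paleodb",
    "name": "PALEODB",
    "value": "paleodb:83088",
    "url": "https://identifiers.org/paleodb:83088",
}
weak = {"@type": "PropertyValue", "propertyID": "pubmed", "value": "29674413"}

print(check_identifier(good))
print(check_identifier(weak))
```

If some items are later promoted to MUST, the same checks could return errors instead of warnings without changing the structure.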

datadavev commented 4 years ago

Perhaps incorporate the imperative keywords from RFC 2119: Key words for use in RFCs to Indicate Requirement Levels, along with RFC 8174: Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words.

Then MUST applies to requirements that, if not met, would render the subject useless for its intended purpose. Requirements with the SHOULD imperative add to the utility of the subject, but do not necessarily break systems if such recommendations are not met.

Is there a valid reason why an instance of identifier would not be of class PropertyValue? Similarly, are there valid reasons why the additional properties of the PropertyValue may not be included?

rduerr commented 4 years ago

Thinking about this more, perhaps the MUSTs really only apply to things that would inhibit interoperability and re-use (the I and R in FAIR). I am also thinking that the guidance/specification and best practice text really should be clearly separated and identified as such (per #45). That means that for the most part the guidance can say SHOULD, while the Best Practice would say MUST. The point of the Best Practices (leading practices?) would be to maximize interoperability and re-use.

ashepherd commented 4 years ago

Proposing:

"identifier": [
      {
        "@type": "PropertyValue",
        "propertyID": "https://registry.identifiers.org/registry/pubmed",
        "name": "Pubmed ID #16333295",
        "value": "pubmed:16333295",
        "url": "http://www.ncbi.nlm.nih.gov/pubmed/16333295"
      },
      {
        "@type": "PropertyValue",
        "propertyID": "https://registry.identifiers.org/registry/doi",
        "name": "DOI: 10.3174/ajnr.A5646",
        "value": "doi:10.3174/ajnr.A5646",
        "url": "https://doi.org/10.3174/ajnr.A5646"
      },
      {
        "@type": "PropertyValue",
        "propertyID": "https://registry.identifiers.org/registry/paleodb",
        "name": "Paleo Database ID #83088",
        "value": "paleodb:83088",
        "url": "https://identifiers.org/paleodb:83088"
      },
      {
        "@type": "PropertyValue",
        "propertyID": "https://registry.identifiers.org/registry/pdb",
        "name": "Protein Data Bank 2gc4",
        "value": "pdb:2gc4",
        "url": "https://identifiers.org/pdb:2gc4"
      }
]

schema:value is the prefixed identifier value. This is a standardized format.
schema:propertyID is the registry.identifiers.org URI for the identifier scheme. This is a standardized format.
schema:url is some resolvable url for that identifier. This is a standardized format.
schema:name (optional) should not be just the name of the ID scheme (e.g., "DOI"), but something more descriptive of this specific identifier.

mbjones commented 4 years ago

Discussion on call on 2020-02-27. Decided that value should include the namespace prefix (e.g., pdb:1234 or doi:10.xxxx/alksjdskj)
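
Under that decision, an identifier entry can be assembled from a registry prefix and a local ID. The Python sketch below is one way to do it, assuming the identifiers.org resolver is used for url; the proposal above also shows scheme-specific resolvers such as https://doi.org/, which this sketch does not cover. `make_identifier` is a hypothetical helper.

```python
import json


def make_identifier(prefix, local_id, name=None):
    """Build a PropertyValue whose value carries the namespace prefix.

    Assumption: the url points at the identifiers.org resolver.
    """
    curie = f"{prefix}:{local_id}"
    ident = {
        "@type": "PropertyValue",
        "propertyID": f"https://registry.identifiers.org/registry/{prefix}",
        "value": curie,
        "url": f"https://identifiers.org/{curie}",
    }
    if name is not None:
        # name is optional and should describe this specific identifier.
        ident["name"] = name
    return ident


print(json.dumps(make_identifier("pdb", "2gc4", name="Protein Data Bank 2gc4"), indent=2))
```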

ashepherd commented 4 years ago

dr-shorthair commented 4 years ago

ADMS is probably relevant here too.

ashepherd commented 4 years ago

https://github.com/ESIPFed/science-on-schema.org/blob/feature_13_specifying_identifiers/guides/Dataset.md#identifier

ashepherd commented 4 years ago

I'm not sure I can argue for the ROR and GRID IDs yet: looking at their websites, it's unclear how to qualify for one. I tested requesting a GRID for BCO-DMO and will report back. For now, I think we punt that specific change into a separate issue.