Closed johardi closed 4 years ago
Thanks @johardi. During our P418 project, we first tried using propertyID, but the definition of that field doesn't expand CURIEs like datacite:doi to their fully qualified URIs. If publishers weren't using the same prefix, you wouldn't get consistent results. We then tried writing code to do the expansion during the harvest, but then it's hard to know which fields were intended to be expanded and which were not. So we decided to punt on the issue and use a property that explicitly defined the identifier scheme until further discussion, like your issue here!
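To illustrate the expansion problem described above, here is a minimal sketch of harvest-time CURIE expansion. The prefix map and the function name `expand_curie` are hypothetical, not part of the P418 code; the point is only that any prefix missing from the harvester's map is left unexpanded, so results depend on publishers agreeing on prefixes.

```python
# Hypothetical prefix map: a real harvester would need an agreed-upon
# registry of prefixes, which is exactly what was missing.
PREFIXES = {
    "datacite": "http://purl.org/spar/datacite/",
}


def expand_curie(value, prefixes=PREFIXES):
    """Expand 'prefix:local' to a full URI when the prefix is known.

    Values that already look like full URIs, have no prefix, or use an
    unknown prefix are returned unchanged.
    """
    if "://" in value:
        return value  # already a fully qualified URI
    prefix, sep, local = value.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return value


print(expand_curie("datacite:doi"))  # known prefix: expanded
print(expand_curie("dc:doi"))        # unknown prefix: left as-is
print(expand_curie("doi"))           # no prefix: left as-is
```

The last two calls show why this approach gave inconsistent results: the harvester cannot tell whether an unexpanded string was meant to be a CURIE at all.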
My main issue with the DataCite vocabulary is that it doesn't seem to have had much adoption or advertising from DataCite, so I'm hesitant to say it's the best vocabulary for our community for describing identifiers. On the other hand, I think they did a good job of modeling the problem.
Any thoughts on a different vocabulary, and/or arguments to change the guidelines to go in a different direction here?
@mbjones
I am not sure whether this fits here, but it is not obvious to me how multiple identifiers are supported for e.g. datasets. One concrete example from the meteorological perspective is that datasets often have multiple identifiers depending on the framework they are shared through, in addition to local (host) identifiers. Are there specific recommendations on this?
You can use schema:identifier to provide multiple identifiers (see the description and examples here: https://schema.org/identifier).
This is how I applied it for an article in PubMed:
{
"@context": "http://schema.org",
"@type": "MedicalScholarlyArticle",
"@id": "http://identifiers.org/pubmed/29674413",
"identifier": [
{
"@type": "PropertyValue",
"propertyID": "pubmed",
"value": "29674413"
},
{
"@type": "PropertyValue",
"propertyID": "pii",
"value": "ajnr.A5646"
},
{
"@type": "PropertyValue",
"propertyID": "doi",
"value": "10.3174/ajnr.A5646"
}
],
...
After review, it seems we need to update our proposed ADR and Guidelines to reflect the examples shown here. The ADR is 13-schemaorg-identifier-as-PropertyValue.md.
Currently the ADR states:
"We will encourage the use of schema.org/PropertyValue when describing persistent identifiers (PIDs)."
This is not a strong enough statement. This is one of the cases where specifying the controlled vocabulary for the type and the exact form of the value to use is important so that strong validators can be created.
Using the members of http://purl.org/spar/datacite/ResourceIdentifierScheme seems appropriate as does the format specification for the PID in the DataCite-MetadataKernel_v4.3.pdf document, though there are a few types where the datacite format is ambiguous and should be tightened up. Suggest a table be added to the documentation for the types and their appropriate formats.
Over in PR #79, @datadavev asked the following, which I think we should resolve here:
In the third example, is the intent really to assert that the graph node of type identifier has id of "https://doi.org/10.1234/56789"? Seems that id value should really be on the Dataset class.
The SPAR DataCite ontology is not resolving at the moment, but you can also view it on GitHub: ResourceIdentifierScheme. The values for the identifier schemes have specific URIs (albeit not currently resolving), but note that these URIs use a different capitalization than the names listed in the DataCite Kernel Metadata Spec. For example, the DataCite value ARK corresponds to the SPAR DataCite URI http://purl.org/spar/datacite/ark. Also, the SPAR vocabulary has more identifier types than the DataCite specification.
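The capitalization difference can be sketched as a small mapping helper. This is only an illustration: the function name `spar_scheme_uri` is hypothetical, and the lower-casing rule is an assumption generalized from the ARK example above (the doi and pmid URIs that appear elsewhere in this thread follow the same pattern). Each result should be checked against the SPAR vocabulary before being relied on.

```python
SPAR_DATACITE = "http://purl.org/spar/datacite/"


def spar_scheme_uri(datacite_name):
    """Map a DataCite identifier type name to a candidate SPAR DataCite URI.

    ASSUMPTION: the SPAR local name is the lower-cased DataCite name.
    This holds for ARK, DOI and PMID as used in this thread, but it is
    not verified here for every type in the DataCite list.
    """
    return SPAR_DATACITE + datacite_name.strip().lower()


for name in ("ARK", "DOI", "PMID"):
    print(name, "->", spar_scheme_uri(name))
```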
The current DataCite PID vocabulary lists:
<xs:restriction base="xs:string">
<xs:enumeration value="ARK"/>
<xs:enumeration value="arXiv"/>
<xs:enumeration value="bibcode"/>
<xs:enumeration value="DOI"/>
<xs:enumeration value="EAN13"/>
<xs:enumeration value="EISSN"/>
<xs:enumeration value="Handle"/>
<xs:enumeration value="IGSN"/>
<xs:enumeration value="ISBN"/>
<xs:enumeration value="ISSN"/>
<xs:enumeration value="ISTC"/>
<xs:enumeration value="LISSN"/>
<xs:enumeration value="LSID"/>
<xs:enumeration value="PMID"/>
<xs:enumeration value="PURL"/>
<xs:enumeration value="UPC"/>
<xs:enumeration value="URL"/>
<xs:enumeration value="URN"/>
<xs:enumeration value="w3id"/>
</xs:restriction>
None of these lists is complete. A much more comprehensive list is maintained at the identifiers.org registry. In addition, we need a way to specify an identifier string that is unique within a specific system but does not belong to any scheme in the above lists (e.g., an NCEI Accession number).
Over in PR #79, @datadavev asked the following, which I think we should resolve here:
In the third example, is the intent really to assert that the graph node of type identifier has id of "https://doi.org/10.1234/56789"? Seems that id value should really be on the Dataset class.
I'm guessing this is referring to this kind of encoding:
{
"@context": "http://schema.org",
"@type": "MedicalScholarlyArticle",
"@id": "http://identifiers.org/pubmed/29674413",
"identifier": [
{
"@type": "PropertyValue",
"propertyID": "pubmed",
"value": "29674413"
},
{
"@type": "PropertyValue",
"propertyID": "pii",
"value": "ajnr.A5646"
}]
}
I would think that if we were assigning an identifier to the sdo:identifier element, it would look like this:
{
"@context": "http://schema.org",
"@type": "MedicalScholarlyArticle",
"@id": "http://identifiers.org/pubmed/29674413",
"identifier":
"@id":"http://some.uri.com/ldskdgjrsr",
[
{
"@type": "PropertyValue",
"propertyID": "pubmed",
"value": "29674413"
}, ...
I don't see a problem with @johardi 's example (except for the propertyID identifiers...); the @type PropertyValue objects are values of the identifier element, consistent with the values expected for sdo:identifier.
@smrgeoinfo I'm not sure your example creates a valid JSON document.
I'd like to correct my example of using PropertyValue for specifying identifiers by referring back to my original post.
"identifier": [
{
"@type": "PropertyValue",
"propertyID": "http://purl.org/spar/datacite/pmid",
"name": "PMID",
"value": "29674413",
"url": "http://www.ncbi.nlm.nih.gov/pubmed/29674413"
},
{
"@type": "PropertyValue",
"propertyID": "http://purl.org/spar/datacite/pii",
"name": "PII",
"value": "ajnr.A5646"
},
{
"@type": "PropertyValue",
"propertyID": "http://purl.org/spar/datacite/doi",
"name": "DOI",
"value": "10.3174/ajnr.A5646",
"url": "https://doi.org/10.3174/ajnr.A5646"
}
Notes:
- The propertyID field can be a prefixed string, a non-prefixed string, or a URL that points to an external vocabulary or a web resource (see: https://schema.org/propertyID). Example 8 and Example 9 demonstrate how to use the propertyID field in PropertyValue.
- The identifier field can be used to provide additional or alternative identifiers (see: https://schema.org/docs/datamodel.html#identifierBg) for a resource that already has an @id field.
That looks great to me @johardi. It makes both the identifier value and the resolution URI explicit, and links the identifier to a formal type, which helps machines understand the resolution semantics properly. So I would be supportive of the syntax you list as the best practice guideline, although we should add in the final closing square bracket.
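As a rough sketch of how a publisher might generate the corrected structure above programmatically, the helper below builds one PropertyValue entry per identifier and serializes the record, which also guarantees the closing bracket is present. The helper name `identifier_entry` is hypothetical; the field names and values are taken from the example in this thread.

```python
import json


def identifier_entry(scheme_uri, name, value, url=None):
    """Build one schema.org PropertyValue describing an identifier."""
    entry = {
        "@type": "PropertyValue",
        "propertyID": scheme_uri,
        "name": name,
        "value": value,
    }
    if url is not None:
        entry["url"] = url  # only included when a resolution URL is known
    return entry


record = {
    "@context": "http://schema.org",
    "@type": "MedicalScholarlyArticle",
    "@id": "http://identifiers.org/pubmed/29674413",
    "identifier": [
        identifier_entry("http://purl.org/spar/datacite/pmid", "PMID",
                         "29674413",
                         "http://www.ncbi.nlm.nih.gov/pubmed/29674413"),
        identifier_entry("http://purl.org/spar/datacite/pii", "PII",
                         "ajnr.A5646"),
        identifier_entry("http://purl.org/spar/datacite/doi", "DOI",
                         "10.3174/ajnr.A5646",
                         "https://doi.org/10.3174/ajnr.A5646"),
    ],
}

print(json.dumps(record, indent=2))
```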
@rduerr stated that we should be more explicit in our guidance about the controlled vocabularies to use, so can we also come to agreement on that? I would propose that we state that providers:
- use schema:PropertyValue for each identifier
- provide a schema:propertyID for each identifier that links back to the identifier scheme, using URIs drawn from the http://purl.org/spar/datacite/IdentifierScheme vocabulary or from identifiers.org registered prefixes at https://registry.identifiers.org/registry. If the identifier type does not exist in the SPAR DataCite vocabulary or identifiers.org, use the best canonical URI for the identifier scheme that can be found.
- provide the value of the identifier (e.g., 10.3174/ajnr.A5646) and the url format, which can be repeated if multiple URIs exist (e.g., https://doi.org/10.3174/ajnr.A5646 and https://identifiers.org/doi:10.3174/ajnr.A5646). When possible, the value property should be expressed using its Compact URI format.
So, here are two examples that use common identifier schemes not in the DataCite vocabulary, and use the Compact URI format for values:
"identifier": [
{
"@type": "PropertyValue",
"propertyID": "https://registry.identifiers.org/registry/paleodb",
"name": "PALEODB",
"value": "paleodb:83088",
"url": "https://identifiers.org/paleodb:83088"
}
]
"identifier": [
{
"@type": "PropertyValue",
"propertyID": "https://registry.identifiers.org/registry/pdb",
"name": "PDB",
"value": "pdb:2gc4",
"url": "https://identifiers.org/pdb:2gc4"
}
]
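Both examples follow the same pattern, so the entry can be derived from the compact identifier alone. The sketch below assumes that the registry page and resolver URLs are formed exactly as in the two examples above; the function name `from_compact_identifier` and the default of upper-casing the prefix for name are illustrative choices, not part of any guideline.

```python
REGISTRY_BASE = "https://registry.identifiers.org/registry/"
RESOLVER_BASE = "https://identifiers.org/"


def from_compact_identifier(curie, name=None):
    """Build a PropertyValue entry from a 'prefix:accession' string.

    ASSUMPTION: propertyID and url follow the URL patterns used in the
    paleodb and pdb examples in this thread.
    """
    prefix, sep, accession = curie.partition(":")
    if not (sep and prefix and accession):
        raise ValueError("expected 'prefix:accession', got %r" % curie)
    return {
        "@type": "PropertyValue",
        "propertyID": REGISTRY_BASE + prefix,
        "name": name if name is not None else prefix.upper(),
        "value": curie,
        "url": RESOLVER_BASE + curie,
    }


print(from_compact_identifier("paleodb:83088"))
print(from_compact_identifier("pdb:2gc4"))
```

A value without a prefix, such as "2gc4", is rejected with a ValueError rather than guessed at, since the scheme cannot be inferred from the accession alone.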
Anything else? Is that strong enough? Change SHOULD to MUST?
Perhaps incorporate the imperative keywords from RFC 2119: Key words for use in RFCs to Indicate Requirement Levels, along with RFC 8174: Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words.
Then, MUST applies to requirements that if not met, would render the subject useless for its intended purpose. Requirements with the SHOULD imperative add to the utility of the subject, but do not necessarily break systems should such recommendations not be met.
Is there a valid reason why an instance of identifier would not be of class PropertyValue? Similarly, are there valid reasons why the additional properties of the PropertyValue may not be included?
Thinking about this more, perhaps the MUSTs really only apply to things that would inhibit interoperability and re-use (the I and R in FAIR). I am also thinking that the guidance/specification text and the best practice text should be clearly separated and identified as such (per #45). That means that, for the most part, the guidance can say SHOULD while the Best Practice would say MUST. The point of the Best Practices (leading practices?) would be to maximize interoperability and re-use.
Proposing:
"identifier": [
{
"@type": "PropertyValue",
"propertyID": "https://registry.identifiers.org/registry/pubmed",
"name": "Pubmed ID #16333295",
"value": "pubmed:16333295",
"url": "http://www.ncbi.nlm.nih.gov/pubmed/16333295"
},
{
"@type": "PropertyValue",
"propertyID": "https://registry.identifiers.org/registry/doi",
"name": "DOI: 10.3174/ajnr.A5646",
"value": "doi:10.3174/ajnr.A5646",
"url": "https://doi.org/10.3174/ajnr.A5646"
},
{
"@type": "PropertyValue",
"propertyID": "https://registry.identifiers.org/registry/paleodb",
"name": "Paleo Database ID #83088",
"value": "paleodb:83088",
"url": "https://identifiers.org/paleodb:83088"
},
{
"@type": "PropertyValue",
"propertyID": "https://registry.identifiers.org/registry/pdb",
"name": "Protein Data Bank 2gc4",
"value": "pdb:2gc4",
"url": "https://identifiers.org/pdb:2gc4"
}
]
- schema:value is the prefixed identifier value. This is a standardized format.
- schema:propertyID is the registry.identifiers.org URI for the identifier scheme. This is a standardized format.
- schema:url is some resolvable URL for that identifier. This is a standardized format.
- schema:name (optional) should not be just the name of the ID scheme (e.g., "DOI"), but something more descriptive of this specific identifier.
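The four rules above lend themselves to a simple automated check. The following is a minimal sketch, not an agreed validator: the function name `check_identifier` and the wording of the messages are hypothetical, and it deliberately does not decide which rules are MUST and which are SHOULD.

```python
REGISTRY_BASE = "https://registry.identifiers.org/registry/"


def check_identifier(entry):
    """Return a list of problems found in one identifier entry (empty if none)."""
    problems = []
    if entry.get("@type") != "PropertyValue":
        problems.append("@type is not PropertyValue")

    # value must be 'prefix:accession'
    value = entry.get("value", "")
    prefix, sep, accession = value.partition(":")
    has_prefix = bool(sep and prefix and accession)
    if not has_prefix:
        problems.append("value has no namespace prefix")

    # propertyID must be the registry URI for the same prefix
    property_id = entry.get("propertyID", "")
    if not property_id.startswith(REGISTRY_BASE):
        problems.append("propertyID is not a registry.identifiers.org URI")
    elif has_prefix and property_id != REGISTRY_BASE + prefix:
        problems.append("propertyID does not match the value prefix")

    if not entry.get("url", "").startswith(("http://", "https://")):
        problems.append("url is not an http(s) URL")

    # name is optional, but should say more than the scheme name
    name = entry.get("name")
    if name is not None and has_prefix and name.strip().lower() == prefix.lower():
        problems.append("name only repeats the scheme name")
    return problems


good = {
    "@type": "PropertyValue",
    "propertyID": "https://registry.identifiers.org/registry/pdb",
    "name": "Protein Data Bank 2gc4",
    "value": "pdb:2gc4",
    "url": "https://identifiers.org/pdb:2gc4",
}
print(check_identifier(good))
```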
Discussion on call on 2020-02-27. Decided that value should include the namespace prefix (e.g., pdb:1234 or doi:10.xxxx/alksjdskj).
ADMS is probably relevant here too.
Not sure I can argue for the ROR and GRID IDs yet: looking at their websites, it's unclear how to qualify for one. I tested requesting a GRID for BCO-DMO and will report back. For now, I think we punt that specific change into a separate issue.
I found inconsistencies in how the guides document specifies a well-defined identifier, and I'm wondering which approach you would endorse.
In my mind, I'm thinking of a structure that's something like this:
*) I found an example that uses datacite:usesIdentifierScheme
**) Instead of the propertyID field