geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org

PMID vs. PUBMED prefix for CURIE form of identifiers from https://pubmed.ncbi.nlm.nih.gov/ #2373

Open sierra-moxon opened 5 days ago

sierra-moxon commented 5 days ago

There is an open bioregistry issue: the NCBI PubMed resource would like its prefix to be 'pubmed' rather than 'PMID' (the issue has a lot of discussion for review). More votes are encouraged, and if anyone wants to take up the issue at Bioregistry or with NCBI, that would be helpful.

This leads to ambiguous namespace/identifier resolution for publication identifiers when GO software engineers try to use standard libraries (like curies or prefixmaps) to expand or contract PubMed URIs or CURIEs.

We do have a couple of controls that help us work around situations like this:

1) bioregistry has the concept of a "preferred" prefix and a set of synonyms. However, in the case of the PubMed resource, the bioregistry has chosen "pubmed" as the preferred prefix. This won't help us.

2) in the prefixmaps library, we create simple maps that represent prefix expansion rules inclusive of bioregistry synonyms. This does help us, for now.

For example, for PubMed, this is the prefixmaps merged representation of all the possible expansions of the prefix in question. We use the fourth column in this CSV to designate whether an expansion should be considered an alternative expansion ("prefix_alias") or a "canonical" expansion. An alternative expansion is useful when ingesting data that might use URIs instead of CURIEs and we want to contract the URI to the correct CURIE. The canonical expansion is useful when we are expanding CURIEs to their URI forms for use in RDF stores where the persistent identifier is a URI.

merged,PUBMED,http://bio2rdf.org/pubmed:,prefix_alias,bioregistry
merged,PUBMED,http://bioregistry.io/MEDLINE:,prefix_alias,bioregistry
merged,PUBMED,http://bioregistry.io/PMID:,prefix_alias,bioregistry
merged,PUBMED,http://bioregistry.io/PubMed:,prefix_alias,bioregistry
merged,PUBMED,http://europepmc.org/abstract/MED/,prefix_alias,bioregistry
merged,PUBMED,http://identifiers.org/pubmed:,prefix_alias,bioregistry
merged,PUBMED,http://linkedlifedata.com/resource/pubmed/id/,prefix_alias,bioregistry
merged,PUBMED,http://n2t.net/pubmed:,prefix_alias,bioregistry
merged,PUBMED,http://pubmed.ncbi.nlm.nih.gov/,prefix_alias,bioregistry
merged,PUBMED,http://purl.uniprot.org/citations/,prefix_alias,bioregistry
merged,PUBMED,http://purl.uniprot.org/pubmed/,prefix_alias,bioregistry
merged,PUBMED,http://rdf.ncbi.nlm.nih.gov/pubchem/reference/,prefix_alias,bioregistry
merged,PUBMED,http://scholia.toolforge.org/pubmed/,prefix_alias,bioregistry
merged,PUBMED,http://www.hubmed.org/display.cgi?uids=,prefix_alias,bioregistry
merged,PUBMED,http://www.ncbi.nlm.nih.gov/pubmed/,prefix_alias,bioregistry
merged,PUBMED,https://bio2rdf.org/pubmed:,prefix_alias,bioregistry
merged,PUBMED,https://bioregistry.io/MEDLINE:,prefix_alias,bioregistry
merged,PUBMED,https://bioregistry.io/PMID:,prefix_alias,bioregistry
merged,PUBMED,https://bioregistry.io/PubMed:,prefix_alias,bioregistry
merged,PUBMED,https://europepmc.org/abstract/MED/,prefix_alias,bioregistry
merged,PUBMED,https://identifiers.org/pubmed/,prefix_alias,bioregistry
merged,PUBMED,https://identifiers.org/pubmed:,prefix_alias,bioregistry
merged,PUBMED,https://linkedlifedata.com/resource/pubmed/id/,prefix_alias,bioregistry
merged,PUBMED,https://n2t.net/pubmed:,prefix_alias,bioregistry
merged,PUBMED,https://pubmed.ncbi.nlm.nih.gov/,prefix_alias,bioregistry
merged,PUBMED,https://purl.uniprot.org/citations/,prefix_alias,bioregistry
merged,PUBMED,https://purl.uniprot.org/pubmed/,prefix_alias,bioregistry
merged,PUBMED,https://rdf.ncbi.nlm.nih.gov/pubchem/reference/,prefix_alias,bioregistry
merged,PUBMED,https://scholia.toolforge.org/pubmed/,prefix_alias,bioregistry
merged,PUBMED,https://www.hubmed.org/display.cgi?uids=,prefix_alias,bioregistry
merged,PUBMED,https://www.ncbi.nlm.nih.gov/pubmed/,prefix_alias,bioregistry
merged,pubmed,http://bio2rdf.org/pubmed_vocabulary:,prefix_alias,prefixcc
merged,PUBMED,http://identifiers.org/pubmed/,namespace_alias,bioregistry
merged,PMID,http://identifiers.org/pubmed/,canonical,go
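To make the role of the status column concrete, here is a minimal stdlib-only sketch using four rows copied from the listing above. It is not the prefixmaps or curies implementation and does not use their APIs; the function names (`build_maps`, `expand`, `contract`) are hypothetical, and letting a `canonical` row win over alias rows for the same namespace is an assumption made for illustration.

```python
import csv
import io

# Rows copied from the merged listing above.
# Columns: context, prefix, namespace (URI prefix), status, source
ROWS = """\
merged,PUBMED,http://www.ncbi.nlm.nih.gov/pubmed/,prefix_alias,bioregistry
merged,PUBMED,https://pubmed.ncbi.nlm.nih.gov/,prefix_alias,bioregistry
merged,PUBMED,http://identifiers.org/pubmed/,namespace_alias,bioregistry
merged,PMID,http://identifiers.org/pubmed/,canonical,go
"""


def build_maps(text):
    """Build (expand_map, contract_map) from CSV rows (hypothetical helper)."""
    expand_map = {}    # prefix -> namespace, canonical rows only
    contract_map = {}  # namespace -> prefix, all rows, canonical wins
    for _ctx, prefix, namespace, status, _src in csv.reader(io.StringIO(text)):
        if status == "canonical":
            expand_map[prefix] = namespace
            contract_map[namespace] = prefix  # overrides any alias row
        else:
            contract_map.setdefault(namespace, prefix)
    return expand_map, contract_map


def expand(curie, expand_map):
    """Expand a CURIE to a URI using canonical rows only; None if unknown."""
    prefix, _, local_id = curie.partition(":")
    if prefix not in expand_map:
        return None
    return expand_map[prefix] + local_id


def contract(uri, contract_map):
    """Contract a URI to a CURIE using the longest matching namespace."""
    matches = [ns for ns in contract_map if uri.startswith(ns)]
    if not matches:
        return None
    namespace = max(matches, key=len)
    return f"{contract_map[namespace]}:{uri[len(namespace):]}"


EXPAND_MAP, CONTRACT_MAP = build_maps(ROWS)
```

With only these rows, `expand("PMID:1234", EXPAND_MAP)` uses the canonical namespace, while contracting a URI that only appears in an alias row yields a `PUBMED:` CURIE, which is the ambiguity described in this ticket.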

For GO software that uses prefixmaps to do URI/CURIE expansion/contraction, we should always instantiate prefixmaps with the "go" context, which simply reflects the db-xrefs.yaml annotations:

go,PMID,http://identifiers.org/pubmed/,canonical

I'm opening this ticket because bioregistry has a lot of uptake in our community, and we need to keep an eye out for data coming into the GO with one of these alternate expansions (e.g. coming in with a PUBMED: CURIE, or with a URI like http://identifiers.org/pubmed/PUBMED:1234). These should fail our QC checks, but they could become more common as resources move further towards bioregistry. (An analogous example that does not impact GO: Alliance just moved all their instances of OMIM: prefixes to MIM: prefixes based on bioregistry discussions with OMIM.)
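As a rough sketch of the kind of check meant here (not the actual GO QC rule), the following stdlib-only function accepts only `PMID:` CURIEs with a numeric local identifier and reports why anything else is rejected. The function name, the messages, and the alias set (taken from the prefixes visible in the listing above) are assumptions for illustration.

```python
import re

# Assumption: a valid reference is exactly "PMID:" followed by digits.
PMID_CURIE = re.compile(r"PMID:[0-9]+")

# Alternate prefixes seen in the merged listing, compared case-insensitively.
KNOWN_ALIASES = {"PUBMED", "MEDLINE", "PMID"}


def check_pubmed_reference(value):
    """Return (ok, message) for a candidate PubMed reference (hypothetical)."""
    if PMID_CURIE.fullmatch(value):
        return True, "ok"
    if value.startswith(("http://", "https://")):
        return False, "URI given where a PMID: CURIE is expected"
    prefix, sep, _local_id = value.partition(":")
    if sep and prefix == "PMID":
        return False, "PMID local identifier must be digits only"
    if sep and prefix.upper() in KNOWN_ALIASES:
        return False, f"non-canonical prefix {prefix!r}; expected 'PMID'"
    return False, "not a recognised PubMed identifier"
```

For example, `check_pubmed_reference("PUBMED:1234")` and `check_pubmed_reference("http://identifiers.org/pubmed/PUBMED:1234")` both return a failing result, each with a different message.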

It may be that we want to have a discussion at some point about how to ask for PubMed identifiers in our ingest files.

kltm commented 5 days ago

Also, as always, noting the difference between linking (mostly what db-xrefs.yaml is concerned with) and identifiers, which are not always the same thing.

Tagging @pgaudet to make sure this is on your radar as well, but no concrete action at this point.