biopragmatics / bioregistry

📮 An integrative registry of biological databases, ontologies, and nomenclatures.
https://bioregistry.io
MIT License

PubMed preferred prefix should be PMID #323

Open matentzn opened 2 years ago

matentzn commented 2 years ago

As per conventions and pubmed itself, pubmed IDs should be prefixed with PMID:

Example page with a PMID: https://pubmed.ncbi.nlm.nih.gov/35189623/

Blocked By

cthoyt commented 2 years ago

There are several reservations about this request:

  1. The preferred prefix can only be a capitalization variant of the canonical prefix, so we'd have to change both. This is technically possible, so this is more of a note about the ramifications of the request.
  2. The Bioregistry's "good prefix guidelines" encourage clear, transparent prefixes when possible, and explicitly discourage using a redundant "ID" as part of a prefix. https://github.com/biopragmatics/bioregistry/blob/main/docs/CONTRIBUTING.md#choosing-a-good-prefix
  3. It doesn't appear that PubMed has a very principled approach to prefixes or writing CURIEs. I would not consider what's shown on a given article page as their recommendation. Two major issues:
    • They include a space in their "CURIE"
    • They write the "prefix" for PubMed Central Identifiers as PMCID, which does not make much sense either.
  4. The NCBI has a track record of making poor choices for its identifier recommendations, cf. the NCBITaxon debacle: they explicitly recommend writing the CURIE for Homo sapiens as NCBI:txid9606. I think everyone agrees that there are issues with this, and that we don't always have to follow the recommendations.
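
The constraint in point 1 can be read as a simple invariant. Here is a minimal sketch, assuming the rule is exactly "equal when compared case-insensitively"; the function name is hypothetical and is not taken from the Bioregistry codebase:

```python
def is_capitalization_variant(canonical: str, preferred: str) -> bool:
    """Return True if `preferred` differs from `canonical` only by letter case."""
    return canonical.casefold() == preferred.casefold()


# "PubMed" only changes the case of "pubmed", so it would pass the check.
print(is_capitalization_variant("pubmed", "PubMed"))
# "PMID" is a different string, so the canonical prefix would have to change too.
print(is_capitalization_variant("pubmed", "PMID"))
```

Under this reading, making PMID the preferred prefix implies renaming the canonical prefix to pmid as well.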

Assessing the Community

I think it's very difficult to say if there actually is a consensus on what prefix should be used for PubMed, and if so, what that is. There are lots of different camps for PMID vs. pmid vs. pubmed (and even some who use MEDLINE).

We can take a look at the Bioregistry's page for PubMed to see which registries use which prefix. GO is actually the only external registry Bioregistry aligns on that uses PMID and not pubmed as the prefix. Identifiers.org, Prefix Commons, N2T, and others use pubmed. The Bioregistry primarily inherits from Identifiers.org, so this is why pubmed is the existing Bioregistry prefix.

Anyone who standardized on Identifiers.org will therefore be using pubmed as their prefix for PubMed. A few specific communities come to mind (I will update this list):

Further, anyone who has already started standardizing based on the Bioregistry will be using pubmed.

Aside: PubChem's RDF uses reference as the prefix for PubMed (ref: https://pubchem.ncbi.nlm.nih.gov/docs/rdf-uri). I haven't been able to find any NIH RDF resources other than PubChem that include PubMed.

Impact of Change

A big concern remains: if we change something so widely used in the Bioregistry, then all of these people would have to update their data too.

Blockers

NCBI Involvement

The Bioregistry does not list a contact person for PubMed. I think it would be valuable to identify an individual from the NCBI who can participate in this discussion and authoritatively speak on the issue.

Alternate Solutions

For people who want to immediately use PMID as the default prefix (or in any other case where you have a disagreement with the Bioregistry's defaults), there are several different ways to generate a custom extended prefix map from the Bioregistry:
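As a dependency-free illustration of the idea (not the Bioregistry's own API), the sketch below takes a small, hand-written record in the general shape of an extended prefix map and swaps in PMID as the preferred prefix while keeping pubmed as a synonym. The record contents, the URI prefix, and both function names are assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical, hand-written excerpt of an extended prefix map.
# The synonyms are the variants mentioned in this thread; the URI prefix is illustrative.
RECORDS: List[dict] = [
    {
        "prefix": "pubmed",
        "prefix_synonyms": ["PMID", "pmid", "MEDLINE"],
        "uri_prefix": "https://pubmed.ncbi.nlm.nih.gov/",
    },
]


def override_preferred(records: List[dict], overrides: Dict[str, str]) -> List[dict]:
    """Return new records with the preferred prefix swapped; the old one becomes a synonym."""
    result = []
    for record in records:
        new_prefix = overrides.get(record["prefix"], record["prefix"])
        names = [record["prefix"], *record["prefix_synonyms"]]
        result.append(
            {
                "prefix": new_prefix,
                "prefix_synonyms": [name for name in names if name != new_prefix],
                "uri_prefix": record["uri_prefix"],
            }
        )
    return result


def standardize_curie(curie: str, records: List[dict]) -> str:
    """Rewrite a CURIE so that it uses the preferred prefix of the matching record."""
    prefix, _, identifier = curie.partition(":")
    for record in records:
        if prefix == record["prefix"] or prefix in record["prefix_synonyms"]:
            # strip() tolerates the "PMID: 123" style with a space after the colon
            return f"{record['prefix']}:{identifier.strip()}"
    raise KeyError(f"unknown prefix: {prefix!r}")


custom = override_preferred(RECORDS, {"pubmed": "PMID"})
print(standardize_curie("pubmed:35189623", custom))
print(standardize_curie("PMID: 35189623", RECORDS))
```

Because the original list is left untouched, the default and the customized maps can coexist, and each consumer decides which one to standardize against.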

matentzn commented 2 years ago

Hmm.. Usually your arguments convince me more.. You are trying to pit personal aesthetic preference against current practice and even turn against PubMed as a whole and their own choice of prefix. I am not swayed (yet). orcid also has the ID in it. I still think PMID should be canonical, but I am happy to change my mind if new arguments arise.

sierra-moxon commented 11 months ago

I also would strongly favor adding PMID as the preferred prefix. It’s just the prefix that has been used in many applications forever and publicized on PubMed itself.

caufieldjh commented 11 months ago

People in Zhiyong Lu's group at NCBI should be qualified to comment, Rob Leaman in particular: https://www.ncbi.nlm.nih.gov/research/bionlp/Team

sierra-moxon commented 11 months ago

@cthoyt - I just want to be clear that while I think we could advocate for PMID as the canonical prefix for PubMed, I just really want at least the preferred prefix to be PMID (accepting that the Bioregistry will have strict naming conventions for the prefix itself, but allowing the historical prefix to be used computationally in a generic way, without having another source of truth -- e.g. many bits of code everywhere to convert PMID -> pubmed or vice versa). Does adding this as a preferred prefix make sense?

cthoyt commented 10 months ago

I chatted in PM with @sierra-moxon and she agreed to take the lead in writing up a more structured set of arguments to support changing from pubmed to pmid. After that appears and someone can get the appropriate NCBI people actively involved in this discussion on GitHub (#966), we can give a few weeks for follow-up discussion, then the Bioregistry Review team (including @megbalk, @callahantiff, and @lubianat) can make a decision.

cthoyt commented 10 months ago

@rleaman can you help us identify a responsible individual for PubMed that can join our public discussions on GitHub about how to best reference PubMed identifiers?

rleaman commented 10 months ago

The person at NCBI who could most authoritatively comment on the preferred prefix / CURIE for PubMed would probably be in engineering. I'll figure out who that would be (I am in research) and follow up.

My opinion, for what it's worth: this seems like a case of a reasonable standard (e.g. "ID" shouldn't be part of the prefix) conflicting with a case ("PMID") that is probably both (1) better known than the standard and (2) predates the standard (e.g. https://pubmed.ncbi.nlm.nih.gov/15048644/). But I don't think that "pubmed" is unclear, and I don't have a good sense for how many people are using each one overall.

While the literature isn't the best use case for CURIEs, we can use it to try to get a sense of what's actually used: my best guess is that "pmid" is over 10x more popular than "pubmed." [Specifically: the number of times that "pmid" appears followed by a colon or a 7- or 8-digit number is 27,372. The number of times that "pubmed" appears followed by a colon or a 7- or 8-digit number is 2,037. Data is for case-insensitive bigram counts of PubMed and the PMC text mining subset, through early 2020.]
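The counting rule described above can be approximated with a regular expression. This is only a sketch under assumptions: the original counts came from tokenized bigrams over PubMed and the PMC text mining subset, whereas this version scans raw text, and the sample sentence and function name are made up for illustration.

```python
import re
from collections import Counter

# "pmid" or "pubmed" as a whole word, followed by a colon or a 7- or 8-digit number.
PREFIX_USAGE = re.compile(r"\b(pmid|pubmed)\b\s*(?::|\d{7,8}\b)", re.IGNORECASE)


def count_prefix_usage(text: str) -> Counter:
    """Count case-insensitive occurrences of each prefix used in an identifier-like context."""
    return Counter(match.group(1).lower() for match in PREFIX_USAGE.finditer(text))


sample = (
    "See PMID: 15048644 and pubmed:35189623, also PMID 1234567. "
    "PubMed is a database, and pmid 123 is too short to count."
)
print(count_prefix_usage(sample))
```

Plain mentions such as "PubMed is a database" are skipped because nothing identifier-like follows the word.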

jeffbeckncbi commented 10 months ago

I've made some inquiries here at NCBI. For PubMed, we would prefer pubmed:123456 rather than using the more obscure "PMID". This is consistent with the pmc:PMC5678910 that we already discussed where the resource is the prefix.

The difference between pubmed and PMC (and most other ncbi databases) is that the PubMed ID is just an integer. They do not define the Accession ID structure like we have in PMC (PMC999999.9). So pubmed:45678910 would be the best option.
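To make the distinction concrete, here is a minimal validation sketch. The two patterns are assumptions inferred from the examples in this comment (a bare integer for PubMed, and PMC followed by digits with an optional version suffix for PMC); they are not official NCBI or Bioregistry patterns.

```python
import re

# Assumed identifier patterns, inferred from pubmed:45678910, pmc:PMC5678910, and PMC999999.9.
PATTERNS = {
    "pubmed": re.compile(r"\d+"),
    "pmc": re.compile(r"PMC\d+(\.\d+)?"),
}


def is_valid_curie(curie: str) -> bool:
    """Check that the local identifier matches the assumed pattern for its prefix."""
    prefix, sep, identifier = curie.partition(":")
    pattern = PATTERNS.get(prefix)
    return bool(sep and pattern and pattern.fullmatch(identifier))


for example in ["pubmed:45678910", "pmc:PMC5678910", "pmc:PMC999999.9", "pubmed:PMC5678910"]:
    print(example, is_valid_curie(example))
```

In both cases the prefix names the resource; the PMC accession simply repeats "PMC" inside the local identifier.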

matentzn commented 10 months ago

@jeffbeckncbi thank you for inquiring. As this is an extremely consequential and costly decision I would really like to know who is "we" in "we would prefer" and what steps NCBI is taking to replace their own usage of PMID in all their websites and resources with pubmed:123.

Is there a concrete plan to deprecate use of PMID across the organisation?

jeffbeckncbi commented 10 months ago

@matentzn I answered this question about the prefix for a PubMed CURIE as a followup to my response about PMC CURIEs (https://github.com/biopragmatics/bioregistry/issues/965)

There is no intention to change the label on the pubmed identifier on the pubmed site to use CURIEs, but if you are trying to write CURIEs for both the pubmed and pmc resources, identify the resource in the prefix and don't just use the abbreviation for pubmed id.

I am the Program Head for Literature at NCBI, the group that runs PubMed and PMC at the US National Library of Medicine. I consulted with NCBI leadership on the question of the CURIE prefix for these resources.

matentzn commented 10 months ago

Thank you for the clarification, I didn't see that discussion - followed up now. I will come back to you soon!

cthoyt commented 10 months ago

@jeffbeckncbi thank you, having an authoritative voice on this is incredibly valuable.

@sierra-moxon It's still the Bioregistry Review Team policy to weigh all arguments, even those contrary to the Identifier Space Owner (ISO). If you are still willing to write up a more detailed argument (I mentioned in https://github.com/biopragmatics/bioregistry/issues/323#issuecomment-1772996218 that you had already agreed to do this), then the Bioregistry Review Team can consider this. If you're still interested in doing that, do you think you could do it by the end of this week?

sierra-moxon commented 10 months ago

I think @rleaman's simple search for the prefix in the corpus of publications before 2020 in this thread paints a good picture of the usage, and I imagine others on this thread would be better than I am at justifying the change. To clarify again, my ask on this ticket was to simply add a Bioregistry preferred annotation to PMID (or otherwise distinguish PMID from the other namespace/prefix synonyms).

Here are several more resources (besides the Gene Ontology) that use PMID as a namespace in pubmed identifiers: