biopragmatics / bioregistry

📮 An integrative registry of biological databases, ontologies, and nomenclatures.
https://bioregistry.io
MIT License

Add prefix AGRKB #359

Closed sierra-moxon closed 2 years ago

sierra-moxon commented 2 years ago

Prefix

agrkb

Name

Alliance of Genome Resources Knowledge Base

Homepage

https://www.alliancegenome.org

Description

The Alliance of Genome Resources creates identifiers for several biological entity types including genes, other sequence features, constructs, morpholinos, TALENs, CRISPRs, variants, alleles, genotypes, strains, environments and experiments, phenotype annotations, expression annotations, disease annotations, interactions, and variant annotations.

The Alliance of Genome Resources was founded by the following Model Organism databases and the Gene Ontology Consortium and distributes high-quality, curated knowledge about several model organisms in a single, unified location to support the model organism research communities and for the benefit of human health and medicine:

Contributing Knowledgebases:

Alliance-supported species: Saccharomyces cerevisiae (budding yeast), Caenorhabditis elegans (nematode), Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), Xenopus laevis (African clawed frog), Xenopus tropicalis (Western clawed frog), Mus musculus (mouse), Rattus norvegicus (rat)

Example Local Unique Identifier

100000000000001

Regular Expression Pattern for Local Unique Identifier

^[1-9][0-9]{14}$

Redundant Prefix in Regular Expression Pattern

URI Format String

https://www.alliancegenome.org/accession/100000000000001

Contributor Name

Sierra Moxon

Contributor ORCiD

0000-0002-8719-7760

Additional Comments

cthoyt commented 2 years ago

@sierra-moxon can you provide an example URL that shows some information about this gene?

sierra-moxon commented 2 years ago

Not yet. Alliance reuses identifiers from other resources to link in their pages, e.g. https://www.alliancegenome.org/gene/HGNC:8616

But they are in the process of minting new identifiers, and I want to make sure to reserve the identifier prefix so that it's available when they need it.

cthoyt commented 2 years ago

@sierra-moxon AGR is already taken by https://bioregistry.io/registry/agricola. Please choose another prefix.

sierra-moxon commented 2 years ago

d'oh! @cthoyt :)

I see in identifiers.org, agricola has a prefix of 'agricola' -- is it possible to change the AGR agricola prefix in bioregistry to "agricola"?

(trying to get all the options here before taking it back to Alliance of Genome Resources to pick a new prefix).

cmungall commented 2 years ago

@cthoyt when you say "taken" do you mean that is their primary prefix? Or an alias?

How did these aliases get into bioregistry? Via identifiers.org? Did agricola explicitly request this?

I will use this issue to open a discussion with the agricola folks to see if they would be willing to relinquish this alternate prefix.

however more broadly this is something bioregistry needs to think about - if I am registering a new prefix do I just get to claim as many alternate prefixes as I want? and what if a prefix has been in use by a different community, what is the SOP for resolving this?

cmungall commented 2 years ago

aside: it looks like the agricola IDs don't even resolve?

I think the IDs should resolve to URLs like this https://agricola.nal.usda.gov/vwebv/holdingsInfo?bibId=1065631

But I don't see anything on the agricola site that indicates they refer to themselves as AGR!

Also nothing on the googles:

https://www.google.com/search?q=site%3Ausda.gov+agr+agricola

cthoyt commented 2 years ago

> however more broadly this is something bioregistry needs to think about - if I am registering a new prefix do I just get to claim as many alternate prefixes as I want? and what if a prefix has been in use by a different community, what is the SOP for resolving this?

No, nobody gets to claim synonyms when they register prefixes (that would be total nonsense). One of the original goals of the Bioregistry was to provide a comprehensive index of all of the prefixes used throughout OBO Foundry ontologies and other resources consumed by PyOBO. This means I personally curated hundreds of synonyms and lexical variants of prefixes for different resources as I found them used in various resources and mapped them back to an internal standard (in addition to mapping to external registries (MIRIAM, Prefix Commons, etc.) which also had lots of variation).

I didn't keep a full manifest of which resource uses which synonyms, but one of them uses AGR as a synonym for agricola, and that's why it's curated as a synonym.

As it stands, the Bioregistry has zero conflicts between prefixes and synonyms. There is a technical CI test in place to ensure this so it doesn't happen by accident. This is the first request that would create one.
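To illustrate the kind of check described here, the following is a minimal Python sketch. The `REGISTRY` dictionary and the `find_conflicts` helper are hypothetical and are not the Bioregistry's actual test code; the case-insensitive comparison is also an assumption made for the example.

```python
from collections import defaultdict

# Hypothetical, heavily simplified registry: canonical prefix -> synonyms.
# Only the agricola/AGR pairing comes from this thread; the rest is illustrative.
REGISTRY = {
    "agricola": ["AGR"],
    "chebi": ["ChEBI"],
}


def find_conflicts(registry):
    """Return every lowercased name that is claimed by more than one canonical prefix."""
    owners = defaultdict(set)
    for prefix, synonyms in registry.items():
        for name in [prefix, *synonyms]:
            owners[name.lower()].add(prefix)
    return {name: sorted(claimed) for name, claimed in owners.items() if len(claimed) > 1}


# The registry as sketched has no collisions.
print(find_conflicts(REGISTRY))

# Adding "agr" as a new canonical prefix collides with the existing agricola synonym.
proposed = {**REGISTRY, "agr": []}
print(find_conflicts(proposed))
```

A check of this shape fails as soon as a new prefix shadows an existing synonym, which is the situation the AGR request would have created.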

Since Alliance of Genome Resources does not already have their own prefix claimed, it wouldn't be fair for them to be able to say that other people's uses of AGR are invalid (through the scope of the bioregistry), especially so because this is a request to "park" a prefix that does not provide a working endpoint for resolving them.

Here are two options going forwards:

  1. We can look back into the whole PyOBO stack to figure out which resource had AGR references and try to clean them up (i.e., canonicalize them in the original resource). I've been working on code for ontology quality assurance that warns when non-canonical Bioregistry prefixes are used in each OBO Foundry ontology's xrefs and provenance fields. If this synonym is no longer needed for a full build of the Biolookup service and Inspector Javert's Xref Database (which are built with PyOBO) then we can retire the synonym and you can have it.
  2. Pick a different prefix. You can still use AGR in identifiers.org and the GO prefix registry, but the Bioregistry will need a different prefix that gets mapped to those.
  3. I'm open to suggestions

Note - you don't need to email the agricola people, they did not "claim" this prefix as a synonym. Unlike OBO Foundry, the Bioregistry operates without the consent of the resources themselves (though advice is welcome) and is trying to be a practical and useful description of the reality of prefixes and identifiers.

sierra-moxon commented 2 years ago

@balhoff did a nice SPARQL query on ubergraph (doesn't have all OBO ontologies, but many), and found that CHEBI has a lot of AGR-prefixed links.

ubergraph sparql query

sierra-moxon commented 2 years ago

Stacia and Edith from SGD also noticed that EuropePMC uses AGR = agricola: https://europepmc.org/Help

cthoyt commented 2 years ago

@sierra-moxon thanks for looking back into that. I've had a really hard time petitioning ChEBI for changes, and it seems even less likely to get EuropePMC to make changes. What are your thoughts? Would you consider a different prefix for alliance? how about alliance.gene?

khowe commented 2 years ago

NCBI are already using "AllianceGenome:" as a prefix for us (although informally, given that it has not been registered). See for example https://www.ncbi.nlm.nih.gov/gene/176291 ("See related" in Summary box). This is quite long/bulky, but perhaps this doesn't matter. Would "AllianceGenome" be an acceptable alternative to "AGR"?

cthoyt commented 2 years ago

@khowe this is a bit problematic since Bioregistry requires (for lots of good reasons) only lowercase prefixes, so it would read as alliancegenome. This can be fixed with a dot delimiter to alliance.genome. Additionally, this namespace is about genes and not genomes, so it's misleading.

sierra-moxon commented 2 years ago

@cthoyt - it's an interesting question about genes vs. genomes. Alliance will have all sorts of pages (allele, genotype, gene, variant, etc.).

But resources can manage this redirection internally without the prefix changing for each new "type" of identifier.

khowe commented 2 years ago

@cthoyt the example in the original ticket gives a gene, but I think we are intending for the prefix to be used for many (if not all) entities resolvable by the Alliance of Genome Resources portal. As @sierra-moxon says, we have lots (and will have lots) of different entity types.

sierra-moxon commented 2 years ago

And to be clear for Alliance - do you have a process in place in bioregistry for supporting non-lowercase prefixes (NCBIGene vs. ncbigene) as aliases?

cthoyt commented 2 years ago

@sierra-moxon yes, there's a preferred_prefix field for adding casing, but this is purely cosmetic information. It's a little late in the day for me to write a rant about why casing is bad, so I will save it for later.
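A rough Python sketch of how lowercase canonical prefixes and a cosmetic casing field can coexist is given below. The `RECORDS` structure and both helper functions are hypothetical stand-ins, not the Bioregistry's real data model or API; only the NCBIGene/ncbigene and AGR/agricola pairings come from this thread.

```python
# Hypothetical records keyed by the canonical lowercase prefix.
RECORDS = {
    "ncbigene": {"preferred_prefix": "NCBIGene"},
    "agricola": {"synonyms": ["AGR"]},
}


def canonical_prefix(raw):
    """Map any casing or known synonym to the canonical lowercase prefix, or None."""
    key = raw.strip().lower()
    if key in RECORDS:
        return key
    for prefix, record in RECORDS.items():
        if key in (synonym.lower() for synonym in record.get("synonyms", [])):
            return prefix
    return None


def display_prefix(prefix):
    """Return the cosmetic casing if one is recorded, else the canonical prefix."""
    return RECORDS[prefix].get("preferred_prefix", prefix)


for raw in ["NCBIGene", "ncbigene", "AGR", "unknown"]:
    canonical = canonical_prefix(raw)
    shown = display_prefix(canonical) if canonical else None
    print(raw, "->", canonical, "displayed as", shown)
```

In this sketch lookups always go through the lowercase key, and the stylized form is consulted only when rendering.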

If you want a prefix that can resolve lots of entity types then I'd suggest just alliance. However, I don't suggest doing this since it makes it very hard to reuse content annotated with this kind of identifier. Even worse, I already see that you have entity types inside the identifiers, which is really really problematic as well and shouldn't be there.

sierra-moxon commented 2 years ago

I'm not sure I know all the reasons why entity types shouldn't be encoded in identifiers, but I do have experience trying to handle an object that was typed in its identifier as a Gene and then had to become a Pseudogene typed identifier and it was painful.

cthoyt commented 2 years ago

Further reading in https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.2001414 for how to mint good identifiers.

khowe commented 2 years ago

I take the point that "alliance.genome" might be misconstrued as the "genome" sub-domain of all "alliance" identifiers. "alliancegenome" would be a bit better in that respect.

But as to your other point @cthoyt, Bioregistry already has many entries representing resources that have a single prefix for many entity types, e.g. the ZFIN entry (https://bioregistry.io/registry/zfin), which states in the description that it applies to many entity types.

If your point is rather that "alliance" is too general (since there are lots of "alliances"), then a possible alternative (that doesn't mention genome) is something like "all.gen.res" (and variants)?

sierra-moxon commented 2 years ago

From discussions at Alliance this morning, they say AOGR works for them as well (Alliance OF Genome Resources).

cthoyt commented 2 years ago

Bioregistry inherits a huge amount of baggage from past decisions that we don't have any control over, both on the provider side (e.g., ZFIN) and on the registry building side (e.g., Identifiers.org), so I would be careful about justifying doing things one way or another just because someone else did so before.

Not sure if anyone has ever done a double dot in a prefix but I'm going to executive veto ever doing something like that... way too complicated. I like aogr

sierra-moxon commented 2 years ago

ok, I'll update this issue to request AOGR.

cthoyt commented 2 years ago

@sierra-moxon it appears that the generation of AOGR identifiers is still in flux, so I think it would be a good time to mention that you should also remove the redundant prefix in the local unique identifier. Further, it's currently the case that the requested example identifier and the regular expression don't match.

sierra-moxon commented 2 years ago

From talking with the Alliance, they came to this identifier paradigm as a consensus between 6 large id-minting organizations and want to stick with it at the moment. (I fixed the regex, I believe).

cthoyt commented 2 years ago

@sierra-moxon can you please provide other examples of other identifiers that do not have gene inside them? I also don't think that using a \w in the pattern ^AOGR\w+$ will do justice to potential users - this should be more specific. Same thing about being more specific about the length of this zero-padded number
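The difference between a loose and a specific pattern can be sketched in Python as follows. The loose pattern is the one quoted in the comment above; the strict pattern and the example identifier are the ones listed in the ticket body. The other candidate strings are made up for illustration.

```python
import re

LOOSE = re.compile(r"^AOGR\w+$")          # the pattern questioned in the comment above
STRICT = re.compile(r"^[1-9][0-9]{14}$")  # the pattern listed in the ticket body

candidates = [
    "AOGR_not_really_an_id",  # arbitrary word characters after the prefix
    "100000000000001",        # the ticket's example local identifier
    "1000",                   # too short
    "0" * 14 + "1",           # right length, but starts with a zero
]

for candidate in candidates:
    loose_ok = bool(LOOSE.match(candidate))
    strict_ok = bool(STRICT.match(candidate))
    print(f"{candidate!r:26} loose={loose_ok!s:5} strict={strict_ok}")
```

Because `\w+` accepts any run of letters, digits, and underscores, the loose pattern tells a consumer almost nothing about what a well-formed identifier looks like, while the fixed-length digit pattern rejects malformed input and has no redundant prefix to strip.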

jdepons commented 2 years ago

@cthoyt I was asked by our PI group if A, AL, or ALL would be available as a prefix. I do not see them in the registry, but want to make sure these are not synonyms as AGR was.

jdepons commented 2 years ago

@cthoyt Apologies, they asked about AR as well.

cthoyt commented 2 years ago

@jdepons is this related to the AOGR prefix request or just a general inquiry? The Bioregistry won't accept requests for 1- or 2-letter prefixes and I'd probably suggest not using ALL since it's a word.

jdepons commented 2 years ago

@cthoyt Yes, this is in regard to AOGR. Members of our PI group do not like that prefix so we are discussing other options.

sierra-moxon commented 2 years ago

Hi Charlie - after PI discussion, I've updated this request accordingly. :) AGRKB is the new requested prefix, with a base URL of www.alliancegenome.org/accession/

cthoyt commented 2 years ago

Thanks for making these updates, I think this prefix is fine. I assume the KB means knowledge base, so we can update the title accordingly. Similarly, the description field should not describe the organization, but the semantic space. What kind of things are in it? Who should use it? Etc. However, if those questions are prominently answered then there’s no issue with also including information about agr itself too.

Looks like alliancegenome.org/accession/100000000000001 gets a 404, too. Can you double check this page is working and also update the uri format string to use the appropriate subdomains and either https or http please? Thanks!

sierra-moxon commented 2 years ago

I updated accordingly - note this is still a prefix parker - Alliance does not currently support AGRKB for publicly available pages, but has unified on the prefix and CURIE expansion listed in this ticket.
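For reference, the CURIE expansion described in this ticket can be sketched in Python as below. The prefix, base URL, and pattern are taken from the ticket; the function names, the case-insensitive prefix match, and the error handling are assumptions made for the example, and the resulting URLs are not expected to resolve while the prefix is only parked.

```python
import re

# Values taken from this ticket; whether the URLs resolve is a separate matter.
PREFIX = "agrkb"
URI_PREFIX = "https://www.alliancegenome.org/accession/"
LOCAL_ID_PATTERN = re.compile(r"^[1-9][0-9]{14}$")


def expand_curie(curie: str) -> str:
    """Turn 'agrkb:<local id>' into a URI, validating the local identifier first."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix.lower() != PREFIX:
        raise ValueError(f"not an {PREFIX} CURIE: {curie!r}")
    if not LOCAL_ID_PATTERN.match(local_id):
        raise ValueError(f"local identifier does not match the pattern: {local_id!r}")
    return URI_PREFIX + local_id


def contract_uri(uri: str) -> str:
    """Inverse of expand_curie: turn a URI under URI_PREFIX back into a CURIE."""
    if not uri.startswith(URI_PREFIX):
        raise ValueError(f"URI is outside the {PREFIX} namespace: {uri!r}")
    return f"{PREFIX}:{uri[len(URI_PREFIX):]}"


print(expand_curie("AGRKB:100000000000001"))
print(contract_uri("https://www.alliancegenome.org/accession/100000000000001"))
```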

cthoyt commented 2 years ago

@sierra-moxon those improvements look great! last question before we finish this is who is the primary contact person? I need their ORCID/email/github handle

sierra-moxon commented 2 years ago

One last update above in the about section for your review. Would it make sense to use the "helpdesk" email address for this contact person? (that way, as people migrate, we are not left with a stale contact). The helpdesk is not ever going to go away. help@alliancegenome.org

cthoyt commented 2 years ago

Most definitely not. This needs to be exactly one main responsible person. Ideally this would be an email address that won't go stale even if they're not responsible anymore, so in case we need to get in touch they can mediate updating the metadata.

sierra-moxon commented 2 years ago

@cmungall volunteered his email address cjmungall@lbl.gov 0000-0002-6601-2165 for this.

cthoyt commented 2 years ago

@sierra-moxon thanks for bearing with me through all of this discussion, I'm quite happy with the result and your prefix is now merged in. It'll appear on the website with the nightly build at the end of the day