biopragmatics / bioregistry

📮 An integrative registry of biological databases, ontologies, and nomenclatures.
https://bioregistry.io
MIT License

Use dashes for partial EC Curies #681

Open cmungall opened 1 year ago

cmungall commented 1 year ago

there are three ways of encoding a partial enzyme code (EC) as an identifier, i.e. a non-leaf node:

  1. EC:1.-.-.- (e.g., as used in GO)
  2. EC:1 (e.g., as used in IntEnz)
  3. EC:1.4.* (e.g., as used in ChEBI)

I prefer the second, which is currently enshrined in the Bioregistry.
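The three spellings carry the same information, so converting between them is mechanical. A minimal sketch, assuming only numeric levels and at most four of them; the function names are made up for illustration and are not part of the Bioregistry API:

```python
def ec_levels(curie: str) -> list[str]:
    """Return the numeric levels of an EC CURIE, dropping '-' and '*' placeholders."""
    local = curie.strip()
    if local.upper().startswith("EC:"):
        local = local[3:]
    parts = [part.strip() for part in local.split(".")]
    levels: list[str] = []
    seen_placeholder = False
    for part in parts:
        if part in ("-", "*"):
            seen_placeholder = True
        elif part.isdigit() and not seen_placeholder:
            levels.append(part)
        else:
            # Empty parts, or digits after a placeholder, are rejected.
            raise ValueError(f"not a recognizable EC identifier: {curie!r}")
    if not levels or len(parts) > 4:
        raise ValueError(f"not a recognizable EC identifier: {curie!r}")
    return levels


def to_dashes(curie: str) -> str:
    """Form 1: pad missing levels with dashes, e.g. EC:1.-.-.-"""
    levels = ec_levels(curie)
    return "EC:" + ".".join(levels + ["-"] * (4 - len(levels)))


def to_bare(curie: str) -> str:
    """Form 2: keep only the known levels, e.g. EC:1"""
    return "EC:" + ".".join(ec_levels(curie))


def to_asterisk(curie: str) -> str:
    """Form 3: one trailing asterisk after the known levels, e.g. EC:1.4.*"""
    levels = ec_levels(curie)
    if len(levels) < 4:
        levels.append("*")
    return "EC:" + ".".join(levels)


for spelling in ("EC:1.-.-.-", "EC:1", "EC:1.*"):
    print(spelling, "->", to_dashes(spelling), to_bare(spelling), to_asterisk(spelling))
```

Whichever form is chosen as canonical, a converter like this could sit at the boundary so that the other two spellings are only ever accepted as input.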

however a lot of resources use the first form, including GO, uniprot, many model organism databases (MODs)

There is some background in this (confusing) issue here https://github.com/geneontology/go-ontology/issues/17563, where in GO we decided to go with dashes

most resolvers seem to accept either

how should we resolve this? I don’t think the Bioregistry should be pluralistic. There should be one way to write an ID.

I suggest using this issue to solicit an authoritative answer from the source

partial summary of discussion so far, since this thread is getting long

Landscape of webpages, and what they resolve:

| Name | Example Specific Enzyme | Example Enzyme Class | Resolves with Dashes | Resolves without Dashes | Resolves with Asterisks |
| --- | --- | --- | --- | --- | --- |
| Expasy | https://enzyme.expasy.org/EC/1.1.1.1 | https://enzyme.expasy.org/EC/1.1.1.- | Yes | Yes | Yes |
| ExplorEnz | https://www.enzyme-database.org/query.php?ec=1.1.1.1 | https://www.enzyme-database.org/query.php?ec=1.1.1.- | Yes | No | Yes |
| BRENDA | https://www.brenda-enzymes.org/enzyme.php?ecno=1.1.1.1 | https://www.brenda-enzymes.org/enzyme.php?ecno=1.1.1.- | Yes | No | No |
| KEGG | https://www.genome.jp/dbget-bin/www_bget?ec:1.1.1.1 | N/A | No | No | No |
| BioCyc | https://biocyc.org/META/NEW-IMAGE?type=EC-NUMBER&object=EC-1.1.1.1 | https://biocyc.org/META/NEW-IMAGE?type=EC-NUMBER&object=EC-1.1.1 | No | Yes | No |
| IUBMB Site at University of London | https://iubmb.qmul.ac.uk/enzyme/EC1/1/1/1.html | https://iubmb.qmul.ac.uk/enzyme/EC1/1/1/ | No | No | No |

cthoyt commented 1 year ago

I’m personally partial towards not including extra dashes and dots.

Given the bioregistry doesn’t currently list a contact person for EC, we should first identify an authoritative person associated with this resource and include them in discussion on this issue.

cthoyt commented 1 year ago

@AmosBairoch given you were the author of the 2000 paper "The ENZYME database in 2000", do you know who the current responsible person is for EC codes?

Update: Amos retired in 2024, so he probably isn't the right person to keep messaging

pgaudet commented 1 year ago

The issue for GO at least was how to make links to resources such as

https://enzyme.expasy.org/EC/1.14.19.-

If we don't have dashes, the links don't work.

hdrabkin commented 1 year ago

As I remember, when we removed the dash, we got complaints because when loaded into AmiGO, the link would not resolve at EC (i.e., EC:1.1.1 would return nothing, but EC:1.1.1.- would).

cmungall commented 1 year ago

It seems IntEnz is happy to resolve either, and it is the default for bioregistry and n2t. Expasy needs the dashes. (Kegg resolves neither)
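As a sketch of what a link builder has to do while the resolvers disagree, the URL patterns and Yes/No columns from the summary table in the first comment can be turned into a lookup. The dictionary keys, the function name, and the choice of dashes wherever a site accepts more than one spelling are assumptions made for illustration; the IUBMB site is left out because it uses a path-style URL:

```python
from typing import Optional

# URL template and the partial-EC spelling each site is reported to accept,
# transcribed from the summary table. None means no class-level page is known.
RESOLVERS = {
    "expasy": ("https://enzyme.expasy.org/EC/{ec}", "dashes"),
    "explorenz": ("https://www.enzyme-database.org/query.php?ec={ec}", "dashes"),
    "brenda": ("https://www.brenda-enzymes.org/enzyme.php?ecno={ec}", "dashes"),
    "biocyc": ("https://biocyc.org/META/NEW-IMAGE?type=EC-NUMBER&object=EC-{ec}", "bare"),
    "kegg": ("https://www.genome.jp/dbget-bin/www_bget?ec:{ec}", None),
}


def resolver_url(resolver: str, ec: str) -> Optional[str]:
    """Build a link for a full or partial EC number, or return None if unsupported."""
    template, partial_style = RESOLVERS[resolver]
    local = ec.strip().removeprefix("EC:")
    # Keep only the known levels, whatever placeholder style the input used.
    levels = [p for p in (part.strip() for part in local.split(".")) if p not in ("-", "*")]
    if len(levels) < 4:
        if partial_style is None:
            return None
        if partial_style == "dashes":
            levels = levels + ["-"] * (4 - len(levels))
    return template.format(ec=".".join(levels))


print(resolver_url("expasy", "EC:1.14.19"))
print(resolver_url("biocyc", "EC:1.1.1.-"))
print(resolver_url("kegg", "EC:1.1.1.-"))
```

The point of the sketch is that the per-site spelling is a property of the link template, not of the identifier itself.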

hdrabkin commented 1 year ago

Hmm. I'm always using Expasy myself; I guess I always assumed that was official (enzyme.expasy.org).

kaxelsen commented 1 year ago

> Given the bioregistry doesn’t currently list a contact person for EC, we should first identify an authoritative person associated with this resource and include them in discussion on this issue.

@cthoyt I have taken over the task of maintaining EC data at SIB. I'm also working with Rhea creating reactions, and I'm a member of the IUBMB nomenclature committee (https://iubmb.org/about/committees/nomenclature-committee/) that is responsible for maintaining the EC system.

I don't think the nomenclature committee has an opinion about how to write partial EC numbers, so I will below describe how we do it at SIB.

In UniProtKB we use dashes when we refer to single enzymes because no dashes describe classes (EC 2), subclasses (EC 2.1) and sub-subclasses (EC 2.1.1). In UniProt dashes in an EC number can either signify the exact enzyme activity is not known, but that the enzyme class or (sub-)subclass is (often based on sequence similarity), or it can signify that the activity is known, but that no suitable EC number exists for this activity.

The official database for EC numbers is ExplorEnz: https://www.enzyme-database.org/ Expasy hosts the SIB version of the EC nomenclature. The SIB version has the advantage of using Rhea reactions to describe the reactions wherever possible, but this is currently not apparent from the Expasy site. A major update of the web site is in the pipeline. IntEnz is no longer maintained, but is occasionally updated with data provided by SIB.

hdrabkin commented 1 year ago

I just tried searching at ExplorEnz, using EC:1.1.1.1, EC:1.1.1.-, and EC:1.1.1, but none of them return anything if I use the "look up EC number" option; I also tried not using the EC: prefix, but again, no results. Maybe I'm using it wrong.

kaxelsen commented 1 year ago

Try 1.1. (I used 7.1.) [image]

pgaudet commented 1 year ago

I see, e.g. https://www.enzyme-database.org/query.php?ec=1.14.*

This seems to list the enzymes that correspond to these 2 digits.

Is there any way to get the info we get in Expasy?

ENZYME class: 1.14. Oxidoreductases. Acting on paired donors, with incorporation or reduction of molecular oxygen. The oxygen incorporated need not be derived from O2.

This is very useful to GO editors.

Thanks, Pascale

kaxelsen commented 1 year ago

There is a tab called "Enzymes by class". See the image above.

pgaudet commented 1 year ago

But is there a way to

  1. Make a direct link from GO? The URL is https://www.enzyme-database.org/class.php?c=1&sc=1&ssc=*, so would we need a different syntax for partial ECs?
  2. I could not get beyond 2 digits using this tool.
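
To make the "different syntax" question concrete, here is a sketch that maps a partial EC number onto the `c`, `sc` and `ssc` parameters seen in the URL above. Only the `c=1&sc=1&ssc=*` combination is taken from this thread; using `*` for a missing subclass and passing a number as `ssc` are untested assumptions, and the function name is made up:

```python
def explorenz_class_url(ec: str) -> str:
    """Build an ExplorEnz 'Enzymes by class' link from a class-level EC number."""
    local = ec.strip().removeprefix("EC:")
    # Drop '-' and '*' placeholders so all three spellings are accepted as input.
    levels = [p for p in (part.strip() for part in local.split(".")) if p not in ("-", "*")]
    if not 1 <= len(levels) <= 3:
        raise ValueError(f"expected a class, subclass or sub-subclass, got {ec!r}")
    # Assumption: unknown levels are written as '*', as in the one URL quoted above.
    c, sc, ssc = (levels + ["*", "*"])[:3]
    return f"https://www.enzyme-database.org/class.php?c={c}&sc={sc}&ssc={ssc}"


print(explorenz_class_url("EC:1.1.-.-"))
```

If something like this works, GO would need a second URL template for partial ECs rather than a different identifier syntax.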
kaxelsen commented 1 year ago

I suggest you write to Dr Andrew McDonald (amcdonld@tcd.ie; https://orcid.org/0000-0003-2727-176X) who is responsible for the ExplorEnz database. He might be able to help you find a way.

cmungall commented 1 year ago

@kaxelsen:

> In UniProtKB we use dashes when we refer to single enzymes because no dashes describe classes (EC 2), subclasses (EC 2.1) and sub-subclasses (EC 2.1.1). In UniProt dashes in an EC number can either signify the exact enzyme activity is not known, but that the enzyme class or (sub-)subclass is (often based on sequence similarity), or it can signify that the activity is known, but that no suitable EC number exists for this activity.

I see, so you use the dashes as a proxy for an "unknown" or "unspecified" subclass. IMO this is unnecessary: just as you can annotate to a more general GO term if you don't know a specific subclass, you should be able to do the same with EC.

But let's assume we are trying to all use the IDs in the same way and be consistent with uniprot semantics.

Right now in GO, when we map grouping EC classes, we use dash nomenclature, for example:

id: GO:0016491
name: oxidoreductase activity
namespace: molecular_function
def: "Catalysis of an oxidation-reduction (redox) reaction, a reversible chemical reaction in which the oxidation state of an atom or atoms within a molecule is altered. One substrate acts as a hydrogen or electron donor and becomes oxidized, while the other acts as hydrogen or electron acceptor and becomes reduced." [GOC:go_curators]
synonym: "oxidoreductase activity, acting on other substrates" NARROW []
synonym: "redox activity" EXACT []
xref: EC:1.-.-.-
is_a: GO:0003824 ! catalytic activity

I don't think this is quite consistent with your semantics, since it sounds like we are saying "GO:0016491 is mapped to some subtype of EC:1, we just aren't specifying which one". But in fact we want to say that GO:0016491 and EC:1 are at the same level.

My own preference is that we deprecate all uses of dashes. They don't serve any purpose, and pose an interoperation barrier. If you want to talk about a general class or an unspecified subclass just use an ID such as "EC:1". The meaning comes from the context. If a uniprot ID is annotated to EC:1 then we know that this just means it has some kind of oxidoreductase activity, it's just unknown or unspecified, just the same way we interpret a uniprot ID annotated to GO:0016491.

However, if we as a community don't like this and we feel the dashes serve a purpose, then we all need to use them with consistent semantics. If we feel that there is a difference between EC:1.-.-.- and EC:1

Once we as a community agree (and of course if our agreement is consistent with the intended semantics of EC), then we should make the database portals work in the correct way.

What we should not be doing is making different decisions based on the behavior of different database search interfaces because as we have seen this breaks interoperability. We should not be confusing search parameters like * with actual portions of identifiers.

kaxelsen commented 1 year ago

I think the decision of whether or not to use dashes in EC numbers is not for me to decide. The use of dashes in EC numbers in UniProt entries (and in the ENZYME database) has a long history (it predates the start of GO) so a change will need higher level decision.

Regarding ENZYME, if you look at enzclass.txt the families, subfamilies, and sub-subfamilies are listed like this:

    1. -. -.-   Oxidoreductases.
    1.1. -.-    Acting on the CH-OH group of donors.
    1.1.1.-     With NAD(+) or NADP(+) as acceptor.
    1.1.2.-     With a cytochrome as acceptor.
    1.1.3.-     With oxygen as acceptor.
    1.1.4.-     With a disulfide as acceptor.

So, of course starting to use no dashes for classes would help to separate "unknown" (no dashes: we do not know more) from "unspecified" (dashes: we do know the activity, but the appropriate EC number does not exist) but as stated above, it is not a change I can decide should be made.

cthoyt commented 1 year ago

@kaxelsen could you please enumerate the exact list of individuals who could be responsible for that decision (wrt the "higher level" you mentioned) / some context about why they have that authority?

AmosBairoch commented 1 year ago

The IUBMB Enzyme Nomenclature Committee (https://web.archive.org/web/20241009101716/https://iubmb.qmul.ac.uk/enzyme/). And the authority stems from the fact that this is an IUPAC-IUBMB committee. And historically IUPAC inherited the responsibility of naming enzymes back in the late 1950s.

You can find quite a number of articles on the history of the enzyme nomenclature

Best Amos


kaxelsen commented 1 year ago

If you wish to have changes made to the way the UniProtKB database depicts partial EC numbers, I suggest you contact Alan Bridge who is the Director of the Swiss-Prot group.

As any potential change to the way we present data needs a thorough discussion and a lot of preparation (if we decide to do the change), Alan would be involved and he would know who else should be involved in the decision.

redaschi commented 1 year ago

The only realistic/fast way forward is to contact the sites to which you want to resolve and ask them to accept all existing syntax variations. We can certainly add Apache redirects to the enzyme.expasy.org config to make EC classes without the dashes resolve.

UniProtKB is not likely to change the EC format: there is no obvious gain for most users and the cost of removing these dashes is high (both for us and our users), so it'll never make it up the priority list. TrEMBL gets part of its partial ECs from the INSDC, e.g. https://www.ebi.ac.uk/ena/browser/api/embl/QNO53797.1?lineLimit=1000, so you'd have to convince them, too, plus I don't know how many other databases have historically added these dashes.

cmungall commented 1 year ago

OK, based on the above, I propose that we adopt dashes as the standard, and we recommend that whenever referring to a level<4 EC number, dashes are always included.

cmungall commented 10 months ago

I think we are agreed that levels 1-3 should use dashes?

cmungall commented 8 months ago

@cthoyt I think we have agreement here, dashes are the standard for levels 1-3?

cmungall commented 2 weeks ago

@cthoyt @bgyori - I think we are agreed that levels 1-3 should use dashes?

cthoyt commented 2 weeks ago

Sorry @cmungall but I don’t think we’ve yet engaged a group of authoritative voices who can help guide us towards an “official” decision. I think the best way forwards is to get one or more people who are willing to actively participate in this discussion, demonstrate that they’re an authority, and ultimately take responsibility for the correctness of the Bioregistry record.

I wouldn’t say I’ve agreed to anything for now. I think given the large disagreement here and heterogeneity in how this is done in practice, I’m not sure it would be wise to make your proposed change in the Bioregistry yet.

bgyori commented 2 weeks ago

My read of the conversation above is that setting aside stylistic preferences (it sounds from the above like @cthoyt @cmungall and I all agree not having dashes is more appealing), the "reality on the ground" is that dashes are used in key resources referencing these IDs and therefore considering the use of dashes to be valid is the pragmatic way forward.

JervenBolleman commented 2 weeks ago

@redaschi asked me to have a look if I could find why we at UniProt use the dash approach. I can answer: that is how it has been since the beginning of our RCS, CVS and Git repositories, i.e. introduced sometime before 1995. This continued into the UniProt RDF since its first code/data check-in.

cthoyt commented 2 weeks ago

Based on the above comment from Amos that the IUBMB Enzyme Nomenclature Committee is the authority on this topic and Kristian's suggestion to reach out to Andrew McDonald, I did some internet sleuthing to identify some members of that organization and reached out to them by email (I don't believe any of them are on GitHub).

I'll report when I hear back from them and try to get them to join this public discussion on GitHub.

kaxelsen commented 2 weeks ago

I am a member of the IUBMB nomenclature committee (https://iupac.qmul.ac.uk/jcbn/membr.html), and I don't think that IUBMB can help you make SIB or other sites change the way they write partial EC numbers in their data. I suggest you note what Nicole Redaschi wrote above. She is Head of Software Development in the Swiss-Prot group.

ialarmedalien commented 2 weeks ago

Just for another perspective (because everyone wants to hear more opinions on this issue!): back when I was curating GO, EC:1.15 would be used for the term that was equivalent to the EC grouping term (e.g. "oxidoreductase, acting on superoxide as acceptor"), and EC:1.15.-.- would be used for an enzyme activity that could be classified to that level but for which further classification was not possible. The most common case would be where the last part was a - as EC didn't yet have an entry for the enzyme but a paper or some other resource (e.g. MetaCyc) had provided a partial classification.

Looks like GO backfilled the partial ECs representing the EC categories so all that beautiful nuance has been lost...

cmungall commented 6 days ago

@ialarmedalien - in fact GO is using SSSOM/skos predicates to capture whether the relation is exact/broad/narrow, so no need for a convention with the IDs!
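
As a rough illustration of the idea, the exact/broad distinction moves out of the identifier and into the `predicate_id` column of an SSSOM table. The rows below are hypothetical examples, not GO's actual mappings (`GO:9999999` is a placeholder ID), and a real SSSOM file would also carry a metadata header that is omitted here:

```python
import csv
import io

# The four columns SSSOM requires for every mapping row.
COLUMNS = ["subject_id", "predicate_id", "object_id", "mapping_justification"]

# Hypothetical rows for illustration only.
MAPPINGS = [
    {
        # A GO grouping term that sits at the same level as the EC class.
        "subject_id": "GO:0016491",
        "predicate_id": "skos:exactMatch",
        "object_id": "EC:1.-.-.-",
        "mapping_justification": "semapv:ManualMappingCuration",
    },
    {
        # A specific activity that can only be placed under an EC subclass.
        "subject_id": "GO:9999999",  # placeholder, not a real GO term
        "predicate_id": "skos:broadMatch",
        "object_id": "EC:1.15.-.-",
        "mapping_justification": "semapv:ManualMappingCuration",
    },
]


def to_sssom_tsv(rows) -> str:
    """Serialize mapping rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_sssom_tsv(MAPPINGS))
```

With the predicate carrying the nuance, both rows can use the same spelling of the EC identifier.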