geneontology / go-ontology

Source ontology files for the Gene Ontology
http://geneontology.org/page/download-ontology
Creative Commons Attribution 4.0 International

Coordinate with other databases on migrating mappings to obsoleted ECs #27997

Open cmungall opened 1 month ago

cmungall commented 1 month ago

We recently created procedures to ensure that we map to the most up to date ECs. This is in principle good, but in practice if we are too ahead of other databases, this is detrimental.

For example, 2 years ago in

we noticed EC:3.6.4.12 (DNA helicase) has been transferred to:

This is reflected in enzyme-database.org: https://www.enzyme-database.org/query.php?ec=3.6.4.12

However, expasy.org gives no indication that this EC is retired, even though at least 2 years have passed: https://enzyme.expasy.org/EC/3.6.4.12

Furthermore, UniProtKB still has a quarter of a million annotations to this EC:

https://www.uniprot.org/uniprotkb?query=ec:3.6.4.12

This is perhaps not surprising: just as it is expensive to re-annotate a do-not-annotate GO term to a choice of two more specific terms, this would be the same for EC.

Note that this means that since 2022 we have been missing propagations for 277k genes. In many cases there is no equivalent annotation at the same granularity.

Proposal:

  1. We retain mappings to obsoleted ECs if other databases have not migrated. We can of course still tag the mapping as pointing to an obsolete entry (we do this in mondo all the time for analogous situations).
    • exceptions can of course be made if we feel the annotations that would be propagated will be redundant, too high level, etc
  2. We coordinate with uniprot, expasy.org, and even EC themselves on migration policies, timing, announcements
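
A minimal sketch of how rule 1 could be checked mechanically. Everything here is hypothetical and for illustration only: the `EcMapping` class, the `keep_mapping` function, and the shape of the usage table are not part of any existing GO tooling, and the count is just the rough figure quoted above.

```python
from dataclasses import dataclass


@dataclass
class EcMapping:
    """One GO-to-EC cross-reference (hypothetical structure)."""
    go_id: str
    ec: str
    ec_obsolete: bool = False  # tag kept on the mapping when the EC is retired upstream


def keep_mapping(mapping, downstream_usage):
    """Return True if the mapping should stay in the ontology.

    downstream_usage maps an EC to {database name: annotation count}.
    A mapping to a live EC is always kept. A mapping to an obsolete EC
    is kept only while at least one database still annotates to it.
    """
    if not mapping.ec_obsolete:
        return True
    counts = downstream_usage.get(mapping.ec, {})
    return any(count > 0 for count in counts.values())


# Illustrative numbers only (assumption based on the figure in this comment).
usage = {"EC:3.6.4.12": {"UniProtKB": 277000}}
helicase = EcMapping("GO:0003678", "EC:3.6.4.12", ec_obsolete=True)
print(keep_mapping(helicase, usage))
```

The exceptions in the sub-bullet (redundant or too high-level propagations) would still need a manual override; this sketch does not model them.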

cc @sjm41

pgaudet commented 1 month ago

Hi @kaxelsen

Do you know why UniProt and expasy do not yet reflect these changes? The EC was updated 2 years ago (see https://github.com/geneontology/go-ontology/issues/23533)

Thanks, Pascale

kaxelsen commented 1 month ago

The reason is that EC 3.6.4.12 was transferred to two different new EC numbers and was associated with 2700 Swiss-Prot entries. As there are two new EC numbers, updating the Swiss-Prot entries is not a simple search and replace, but a job that requires updating every single family of helicases. This work has been ongoing since June 2022 and STILL is!!!

The two new EC numbers are already available and I suggest that you already now start citing those and stop using EC 3.6.4.12. I should of course have asked you to do that 2 years ago, but it is never too late.

cmungall commented 1 month ago

The two new EC numbers are already available and I suggest that you already now start citing those and stop using EC 3.6.4.12. I should of course have asked you to do that 2 years ago, but it is never too late.

@kaxelsen in fact we did this 2 years ago:

The mappings to the child terms are fine. But the removal of the mapping to the parent is precisely the problem I am pointing out in this issue. We did this ahead of other databases and now we are lacking propagations. (It turns out that in this particular case the propagations would be incorrect, as there are many misannotations of uniprot IDs to EC:3.6.4.12, but we would not expect this to be true more generally.)

As a general rule we should retain mappings to obsoletes (marking them as such) until all contributing databases have caught up.

sjm41 commented 1 month ago

So, in this case, I guess we have the option of adding back the EC:3.6.4.12 mapping to the parent term GO:0003678 DNA helicase activity, and thus have all three mappings in the GO (for now):

GO:0003678 ! DNA helicase activity [ EC:3.6.4.12 ]
    |_GO:0043138 ! 3'-5' DNA helicase activity [ EC:5.6.2.4 ]
    |_GO:0043139 ! 5'-3' DNA helicase activity [ EC:5.6.2.3 ]

That solution would allow EC2GO mappings for proteins in UP annotated to either the old EC:3 term or the new EC:5 terms (I see there is a mixture right now).

Would that be the 'correct' solution here, for the time-being?
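
The three-way mapping above can be sketched as a small lookup, to show what restoring the parent mapping changes for EC2GO propagation. This is only an illustration: the `propagate` function and the protein identifiers `P_OLD` and `P_NEW` are made up, and the real EC2GO pipeline may work differently.

```python
# EC-to-GO mappings as laid out in the tree above.
EC2GO = {
    "EC:3.6.4.12": "GO:0003678",  # DNA helicase activity (old, transferred EC)
    "EC:5.6.2.4": "GO:0043138",   # 3'-5' DNA helicase activity
    "EC:5.6.2.3": "GO:0043139",   # 5'-3' DNA helicase activity
}


def propagate(protein_ecs, ec2go):
    """Turn per-protein EC annotations into GO terms; unmapped ECs are dropped."""
    result = {}
    for protein, ecs in protein_ecs.items():
        terms = sorted({ec2go[ec] for ec in ecs if ec in ec2go})
        if terms:
            result[protein] = terms
    return result


# Hypothetical proteins: one still on the old EC, one already migrated.
annotations = {"P_OLD": ["EC:3.6.4.12"], "P_NEW": ["EC:5.6.2.4"]}

with_parent = propagate(annotations, EC2GO)
without_parent = propagate(
    annotations, {ec: go for ec, go in EC2GO.items() if ec != "EC:3.6.4.12"}
)
print(with_parent)
print(without_parent)
```

With the parent mapping in place both proteins receive a GO term; without it, the protein still annotated to the old EC receives nothing, which is the missing propagation described in this issue.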

A more general point is whether GO should use ExplorEnz or EXPASY as its 'source of truth' for EC (as discussed before). In this case, ExplorEnz tells us that EC:3.6.4.12 is 'deleted/transferred' (since 2021), whereas EXPASY tells us it's still valid (as Chris said). I think this is an unusual case though.

pgaudet commented 1 month ago

From @kaxelsen

The helicases are a very special one-of-a-kind case of the reasons I told you about. If there had been a one-to-one transfer of the EC number, it would have been easy just to make a search and replace, but it is not possible in this case, as there are two new EC numbers. Part 2 is that there are exceptionally many cross-references to Swiss-Prot and many of the entries there have not been updated for a very long time, so that is why it has taken so long to find out which of the two new EC numbers they should be annotated with.

So, rest assured, this is a one-off, provided you count the RNA helicases as part of it, since they will result in the same messy situation.

You bet there is a check. That is why we have not yet deleted EC 3.6.4.12 from ENZYME. Don’t forget this EC number is not obsolete in our system. If I had deleted it, the result would have been fatal errors of the kind that have previously caused complete crashes of nightly Swiss-Prot updates. To clarify, as long as EC 3.6.4.12 is used in UniProtKB it will still be active in ENZYME (and NOT obsolete). I am very much looking forward to the day I can delete EC 3.6.4.12. Cheers, Kristian

@cmungall Do you want to add the old EC:3.6.4.12 to the parent DNA helicase term as Steven suggests? Or would that create too many false positive annotations?

cmungall commented 1 month ago

Do you want to add the old EC:3.6.4.12 to the parent DNA helicase term as Steven suggests? Or would that create too many false positive annotations?

The main thing I want out of this ticket is an agreed-on SOP for GO EC mapping management. I believe we have that:

While it seems expasy vs enzyme-database differences are rare I think it's still good to have a firm SOP here.

Should we take the DNA helicase specific case over to

I think the best solution here is to fix the uniprot-EC and uniprot-KW annotations at source and then restore the mapping.

But I am tending towards restoring the mappings anyway because it's not like it brings in new false positives as we are already getting misannotations from the KW!