geneontology / go-annotation

This repository hosts the tracker for issues pertaining to GO annotations.
BSD 3-Clause "New" or "Revised" License

Unmaintained annotation sets - potential issues #1724

Closed tonysawfordebi closed 4 years ago

tonysawfordebi commented 6 years ago

I've been taking a look at the annotation sets that we currently import into the GOA database and that have been identified as being unmaintained, namely those from JCVI/TIGR and PAMGO_Mgrisea.

I've run them through our syntax checker to identify any issues that might prevent us from taking ownership of them. That has thrown up a few things that probably need some discussion, and some decisions about what action (remedial or otherwise) to take.

I've attached a bunch of files to this ticket, two for each of the annotation sets; one is the raw output of the syntax check, and the other is a slightly more detailed analysis of the problems encountered.

analysis-JCVI.log analysis-PMGG.log analysis-TIGR.log syntax_check-JCVI.log syntax_check-PMGG.log syntax_check-TIGR.log

@pgaudet @ggeorghiou

tonysawfordebi commented 6 years ago

OK, I've done a preliminary analysis...

There are a total of 24,375 ISS annotations that we could potentially integrate (i.e., that are to JCVI/TIGR identifiers that we can map to a UniProt accession).

We have annotations from InterPro with exactly the same accession & GO ID combination for 13,074 of these.

This leaves 11,301 for which we might have annotations from InterPro to a less-specific (or - less probably - more-specific) term.
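The exact-match comparison above amounts to a set intersection on (accession, GO ID) pairs. Here is a minimal sketch in Python, assuming both annotation sets have already been reduced to such pairs. The function name is hypothetical, and the accessions and GO IDs are made-up placeholders, not data from the attached files.

```python
def split_by_exact_match(iss_pairs, interpro_pairs):
    """Split ISS (accession, GO ID) pairs into exact InterPro matches and the rest."""
    iss = set(iss_pairs)
    interpro = set(interpro_pairs)
    matched = iss & interpro      # same accession and same GO ID in both sets
    unmatched = iss - interpro    # candidates for a less- or more-specific match
    return matched, unmatched


# Placeholder data for illustration only (not real annotations).
iss_pairs = [
    ("ACC0001", "GO:0000001"),
    ("ACC0002", "GO:0000002"),
    ("ACC0003", "GO:0000003"),
]
interpro_pairs = [
    ("ACC0001", "GO:0000001"),
    ("ACC0003", "GO:0000009"),
]

matched, unmatched = split_by_exact_match(iss_pairs, interpro_pairs)
print(len(matched), "exact matches;", len(unmatched), "left to examine")
```

With the real counts, the same split gives 24,375 - 13,074 = 11,301 annotations left to examine.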

tonysawfordebi commented 6 years ago

Of the 11,301 annotations for which InterPro doesn't have an exact match, it does predict a less-specific term for 6,421 of them; the attached file contains a comparison of the manually-asserted and predicted terms, along with a count of the number of annotations to each term.

That now leaves us with 4,880 annotations unaccounted for.
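The less-specific / more-specific distinction used above can be checked by walking the ontology's parent links. The sketch below assumes a toy `is_a` graph held in a plain dictionary. The term names, the `parents` mapping and both function names are illustrative assumptions, not the GOA pipeline's actual code.

```python
def ancestors(term, parents):
    """Return every ancestor of `term` in a toy is_a graph (term -> list of parents)."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def classify(manual_term, predicted_terms, parents):
    """Describe how the predicted terms relate to a manually asserted term."""
    predicted = set(predicted_terms)
    if manual_term in predicted:
        return "exact"
    if ancestors(manual_term, parents) & predicted:
        return "less_specific"   # a prediction is an ancestor of the manual term
    if any(manual_term in ancestors(p, parents) for p in predicted):
        return "more_specific"   # a prediction is a descendant of the manual term
    return "unaccounted"


# Toy ontology fragment for illustration only.
parents = {
    "cytosolic ribosome": ["ribosome"],
    "ribosome": ["organelle"],
}

print(classify("cytosolic ribosome", {"ribosome"}, parents))
```

Applied to the real data, the "less_specific" bucket corresponds to the 6,421 annotations, leaving 11,301 - 6,421 = 4,880 that are neither exact nor less-specific.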

term_comparsion.xlsx

ValWood commented 6 years ago

Hi Tony.

Could you provide a spreadsheet with the TIGR family included (and a similar one for the less-specific annotations)?

We could spot-check some; for many, perhaps the corresponding InterPro entry could have the mapping, and we could ask @almitchell / @asangrador / InterPro to take a look.

For others, I worry that we are preserving many of these, https://github.com/geneontology/go-annotation/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aclosed++InterPro

which could have been made at the time by ISS, but the corresponding mappings have since been removed or otherwise updated.

Val

ValWood commented 6 years ago

OK, the attached file seems to be the ones where the ISS mapping is more specific. These seem reasonable. They seem to be cases where the mapping cannot be specific (i.e., one can infer cytosolic ribosome, but the family includes proteins which are not cytosolic ribosome, for example). Others concern the specificity of enzymes or transporters. However, in many of these cases the protein family isn't really the 'evidence' for the assignment. The evidence is a specific ortholog.

ggeorghiou commented 6 years ago

@ValWood so what are you suggesting? That we keep these annotations or remove them?

ValWood commented 6 years ago

I think we need to look deeper and think of a better way to retain them... the best solution for the "more specific" set would be an InterPro mapping on the TIGR family, if this is possible...

pgaudet commented 5 years ago

Hello,

Looking at some identifiers we cannot match:

I can find at least these 2 by searching for the gene name and the species in UniProt. Could we try doing some mapping like this? @cmungall @alexsign @vanaukenk?

Thanks, Pascale
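The gene-name-plus-species lookup suggested above could be sketched as follows. This assumes a local table of (accession, gene name, taxon ID) records has already been exported from UniProt. The record layout, the function names and every value below are hypothetical placeholders. Ambiguous hits are deliberately left unmapped rather than guessed.

```python
from collections import defaultdict


def build_index(records):
    """Index accessions by (lower-cased gene name, taxon ID)."""
    index = defaultdict(set)
    for accession, gene_name, taxon_id in records:
        index[(gene_name.lower(), taxon_id)].add(accession)
    return index


def map_identifier(gene_name, taxon_id, index):
    """Return the accession only when the gene name + species hit is unique."""
    hits = index.get((gene_name.lower(), taxon_id), set())
    return next(iter(hits)) if len(hits) == 1 else None


# Placeholder records for illustration only (not real UniProt entries).
records = [
    ("ACC0001", "abcA", 11111),
    ("ACC0002", "xyzB", 11111),
    ("ACC0003", "xyzB", 11111),  # duplicate gene name in the same species
]
index = build_index(records)

print(map_identifier("ABCA", 11111, index))  # unique hit
print(map_identifier("xyzB", 11111, index))  # ambiguous, so left unmapped
```

Returning `None` for ambiguous or missing hits keeps those identifiers available for manual review.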

pgaudet commented 4 years ago

The JCVI data is now fully loaded from GOA. The JCVI files have been removed from the datasets.yaml