geneontology / go-annotation

This repository hosts the tracker for issues pertaining to GO annotations.
BSD 3-Clause "New" or "Revised" License

Conflicting Gene Ontology terms applied to UniProt entry P16870 #1939

Closed ndrawlings closed 6 years ago

ndrawlings commented 6 years ago

UniProt entry P16870 includes human carboxypeptidase E, a metallopeptidase from MEROPS family M14. It has been incorrectly assigned the GO term GO:0004185 for serine-type carboxypeptidase activity. It is correctly assigned the GO term GO:0004181 for metallocarboxypeptidase activity.

ggeorghiou commented 6 years ago

Hi everyone, just to weigh in quickly - the annotation in question is one from PAINT and I've labeled the issue as such

pgaudet commented 6 years ago

Thanks, @ggeorghiou. Next time you can also assign me.

Thanks, Pascale

pgaudet commented 6 years ago

Looks like the primary annotations were removed. The omitted annotations report in PAINT states:

PTN001355152 annotated to GO:0004185 with evidence code IBA for annotation id 425143069 has paint evidence code but not PAINT evidence type GO_REF

So we cannot remove it manually. @huaiyumi Should we be able to remove these directly in PAINT so that they are cleaned from the DB?

The annotation is still in QuickGO and AmiGO.

pgaudet commented 6 years ago

Now I find the annotation in QuickGO but not in AmiGO, and not in PAINT.
@tonysawfordebi is QuickGO up to date?

tonysawfordebi commented 6 years ago

@pgaudet Yes, and the annotation is still in ftp://ftp.geneontology.org/pub/go/gene-associations/submission/paint/pre-submission/gene_association.paint_goa_human.gz

Have the PAINT files moved? Should we be picking them up from somewhere else?
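For anyone who wants to repeat this check against a downloaded copy of either file, here is a minimal Python sketch. It assumes the standard tab-separated GAF column order (column 2 = DB Object ID, column 5 = GO ID, column 7 = evidence code). The names `find_annotations` and `find_in_gzip` are hypothetical helpers, and the sample rows are made up for illustration rather than copied from the real PAINT file.

```python
import gzip
from typing import Iterable, Iterator, List


def find_annotations(lines: Iterable[str], object_id: str, go_id: str) -> Iterator[List[str]]:
    """Yield the columns of every GAF row that annotates object_id to go_id."""
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 15:
            continue  # not a complete GAF row
        if cols[1] == object_id and cols[4] == go_id:
            yield cols


def find_in_gzip(path: str, object_id: str, go_id: str) -> List[List[str]]:
    """Run the same check on a locally downloaded .gz file (path is up to the caller)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return list(find_annotations(handle, object_id, go_id))


# Made-up sample rows (assumption: illustrative values only).
sample = [
    "!gaf-version: 2.1",
    "\t".join(["UniProtKB", "P16870", "CPE", "", "GO:0004185", "PMID:21873635", "IBA",
               "PANTHER:PTN001355152", "F", "", "", "protein", "taxon:9606",
               "20170228", "GO_Central", "", ""]),
    "\t".join(["UniProtKB", "P16870", "CPE", "", "GO:0004181", "PMID:21873635", "IBA",
               "PANTHER:PTN001355152", "F", "", "", "protein", "taxon:9606",
               "20170228", "GO_Central", "", ""]),
]

hits = list(find_annotations(sample, "P16870", "GO:0004185"))
for cols in hits:
    print(cols[1], cols[4], cols[6], cols[14])
```

Running `find_in_gzip` on both the old and the new file would show which of the two still carries the GO:0004185 annotation.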

pgaudet commented 6 years ago

Hi @tonysawfordebi

Looks like you are picking up the old files. The current human file is at: ftp://ftp.pantherdb.org/downloads/paint/presubmission/gene_association.paint_human.gaf.gz, in accordance with the info in the yaml file: https://github.com/geneontology/go-site/blob/master/metadata/datasets/paint.yaml

(The site you mention, ftp://ftp.geneontology.org/pub/go/gene-associations/submission/paint/pre-submission/gene_association.paint_goa_human.gz, doesn't seem to have been updated since 2017, judging by the dates of the files here: ftp://ftp.geneontology.org/pub/go/gene-associations/submission/paint).

@dougli1sqrd Can you confirm the information I gave is correct?

Thanks, Pascale

tonysawfordebi commented 6 years ago

In that case, can I suggest that the files be removed from the old location and replaced with a README pointing people to the correct location?

tonysawfordebi commented 6 years ago

PS: was this change of location announced anywhere?

pgaudet commented 6 years ago

At the GOC meeting, and it's here (in some form): http://wiki.geneontology.org/index.php/Release_Pipeline

I don't think it was very clear, though, what was done and what was only planned. We are trying to sort out the documentation, but at least for PAINT we could announce something.

Sorry about that.

tonysawfordebi commented 6 years ago

OK, I'll modify our import pipeline accordingly.

With regard to the new location, I see that the files are named like "gene_association.paint_something.gaf.gz" - isn't that a bit redundant? "paint_something.gaf.gz" would be sufficient.

Also, I do recall a discussion at the NY meeting about publishing the files in GPAD format, thus allowing more information to be captured - is that likely to be happening any time soon?

pgaudet commented 6 years ago

Hi @tonysawfordebi

  1. Please wait for @cmungall @kltm or @dougli1sqrd to comment before changing the link. The preferred link might be: http://snapshot.geneontology.org/products/annotations/paint_mgi.gaf.gz (which contains the PAINT files from the link I sent, reprocessed somehow and re-exported by the pipeline).

  2. @huaiyumi can comment on file names, but I think these were always constructed sort of like this, and it may be a hassle for scripts if they change. I don't have strong feelings.

  3. WRT GPAD, my recollection is that we had asked @huaiyumi to do that, but if the GO pipeline processes and exports the files, perhaps it makes more sense for the GPAD files to be exported at that point? @cmungall what do you think?

Thanks, Pascale

tonysawfordebi commented 6 years ago

OK

tonysawfordebi commented 6 years ago

Just for fun, I grabbed all of the GAFs from ftp://ftp.pantherdb.org/downloads/paint/presubmission and ran them through our checker, and this is the summary of what it found (I won't post the whole log here, as it's > 250MB):

    SUMMARY

    Number of lines processed: 2206254
    Total number of annotations: 2206202
    Number of annotations assigned by GO_Central: 2206202
    Total number of problems detected: 584925
    Number of annotations with error "Obsolete GO ID": 931
    Number of annotations with error "Restricted GO term: gocheck_do_not_annotate": 29
    Number of annotations with error "Restricted GO term: gocheck_do_not_manually_annotate": 18
    Number of annotations with error "Secondary GO ID": 196
    Number of annotations with error "Unsupported qualifier": 45275
    Number of annotations with error "With/from contains one or more invalid components": 538476
    Total number of warnings: 0
    Number of annotations with no errors: 1625116

    ANALYSIS

    Number of annotations with invalid with/from components: 538476
      ECO:0000318 (IBA) - valid entity types: CHEBI:33697 (ribonucleic acid) or NCIT:C20130 (protein family) or PR:000000001 (protein) or SO:0000704 (gene)
        CGD [SO:0000704 (gene)]: 15263
        EcoGene [entity type not known]: 142123
        TAIR [BET:0000000 (communication) or SO:0000185 (primary transcript) or SO:0000704 (gene)]: 380658
        WB [PR:000000001 (protein) or SO:0000704 (gene) or VariO:0001 (variation)]: 432

    Number of annotations that refer to a secondary GO ID: 196
      GO:0005329 (replaced by GO:0005330): 20
      GO:0005333 (replaced by GO:0005334): 24
      GO:0005605 (replaced by GO:0005604): 13
      GO:0015222 (replaced by GO:0005335): 130
      GO:0070283 (replaced by GO:1904047): 9

    Number of annotations that refer to an obsolete GO ID: 931
      GO:0000989 (no replacement term defined): 115
      GO:0000991 (no replacement term defined): 186
      GO:0001076 (no replacement term defined): 176
      GO:0001129 (no replacement term defined): 265
      GO:0001190 (no replacement term defined): 11
      GO:0001191 (no replacement term defined): 178

    Number of annotations with an unknown or unsupported qualifier: 45275
      COLOCALIZES_WITH: 10328
      CONTRIBUTES_TO: 34947

    Number of annotations to restricted GO terms: 47
      gocheck_do_not_annotate GO:0040007 (growth): 29
      gocheck_do_not_manually_annotate GO:0006950 (response to stress): 18

As you can see, the largest single class of error is from IBA annotations that refer to a TAIR ID in their with/from, for example:

    Line 20: ERROR With/from contains one or more invalid components [[ECO:0000318 (IBA)] [TAIR:locus:2130864]]
    20> UniProtKB Q9HC62 SENP2 GO:0016926 PMID:21873635 IBA PANTHER:PTN000288424|UniProtKB:Q9HC62|SGD:S000005941|UniProtKB:Q9P0U3|MGI:MGI:2445054|WB:WBGene00006737|SGD:S000001293|UniProtKB:A0A1D8PSK4|PomBase:SPBC19G7.09|FB:FBgn0027603|TAIR:locus:2130864|MGI:MGI:1923076|TAIR:locus:2077632|UniProtKB:Q5B9U1|WB:WBGene00006736|UniProtKB:A0A1D8PIW0 P Sentrin-specific protease 2 UniProtKB:Q9HC62|PTN002489016 protein taxon:9606 2017-02-28 GO_Central

According to https://github.com/geneontology/go-site/blob/master/metadata/db-xrefs.yaml TAIR:locus IDs are of type SO:0000185 (primary transcript), but https://github.com/geneontology/go-site/blob/master/metadata/eco-usage-constraints.yaml states that the with/from for IBA (ECO:0000318) annotations can consist of entities of type gene, protein, protein family, and rna.

Is some adjustment required somewhere?
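To make the mismatch concrete, here is a minimal Python sketch of a type check on the with/from column. It is not the checker's actual logic: the two tables are hand-copied from the entity types quoted in this comment rather than loaded from db-xrefs.yaml or eco-usage-constraints.yaml, they cover only three prefixes, and TAIR is given the single type SO:0000185 as described for TAIR:locus IDs. The with/from string is a shortened version of the example line, and `invalid_components` is a hypothetical name.

```python
from typing import Dict, List, Set

# Entity types allowed in with/from for IBA (ECO:0000318), as quoted in the log.
IBA_ALLOWED: Set[str] = {"CHEBI:33697", "NCIT:C20130", "PR:000000001", "SO:0000704"}

# Hand-written subset of ID-space entity types (assumption: simplified for illustration).
ENTITY_TYPES: Dict[str, Set[str]] = {
    "TAIR": {"SO:0000185"},  # primary transcript, per the TAIR:locus description
    "WB": {"PR:000000001", "SO:0000704", "VariO:0001"},
    "CGD": {"SO:0000704"},
}


def invalid_components(with_from: str) -> List[str]:
    """Return components whose known entity types never overlap IBA_ALLOWED.

    Prefixes missing from ENTITY_TYPES are skipped, because this sketch
    has no information with which to judge them.
    """
    bad = []
    for component in with_from.split("|"):
        prefix = component.split(":", 1)[0]  # "TAIR:locus:2130864" -> "TAIR"
        types = ENTITY_TYPES.get(prefix)
        if types is not None and not (types & IBA_ALLOWED):
            bad.append(component)
    return bad


# Shortened with/from value from the example line.
with_from = ("PANTHER:PTN000288424|UniProtKB:Q9HC62|SGD:S000005941|MGI:MGI:2445054|"
             "WB:WBGene00006737|TAIR:locus:2130864|TAIR:locus:2077632")
print(invalid_components(with_from))
```

Under these assumptions only the two TAIR:locus components are reported. The real log also counts CGD and WB components as invalid, which a pure type-overlap check like this one would not flag, so the checker evidently applies further rules.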

pgaudet commented 6 years ago

@tonysawfordebi I opened a new ticket.

kltm commented 6 years ago

To clarify, for internal people, the main annotation products of the GO are a single merged file that has gone through filtering, QC, etc--including annotations direct-from-MOD, PAINT, and (in the future) Noctua:

http://snapshot.geneontology.org/annotations/mgi.gaf.gz

Not really publicly advertised, but for GO "internal" use, we have several intermediate files that are available; for example, just the QCed PAINT GAF:

http://snapshot.geneontology.org/products/annotations/paint_mgi.gaf.gz

Note that these are all coming off of the snapshot server, which is intended for internal use. For the monthly public "releases", we use either http://current.geneontology.org or http://release.geneontology.org (same thing, but versioned), which comes out monthly(ish); see: http://wiki.geneontology.org/index.php/Release_Pipeline

Okay, as part of that monthly public release, we (for the time being) push files back into the legacy location at the GO SVN, which also seems to be exposed here: ftp://ftp.geneontology.org/pub/go/gene-associations/submission/paint . This is only there so as not to break people's pipelines at the moment--in general, people should be using either snapshot (internal/daily) or current (public/monthly).
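As a rough illustration of the layout described in this comment, the Python sketch below builds the URL for a merged file or its PAINT-only intermediate. Only the two snapshot URLs quoted above are confirmed. `annotation_url` is a hypothetical helper, and the assumption that current.geneontology.org uses the same path layout is untested.

```python
BASES = {
    "snapshot": "http://snapshot.geneontology.org",  # internal, daily
    "current": "http://current.geneontology.org",    # public, monthly(ish)
}


def annotation_url(group: str, server: str = "snapshot", paint_only: bool = False) -> str:
    """Build a GAF URL for a group such as "mgi".

    Assumption: both servers share the /annotations and /products/annotations
    layout; only the snapshot paths appear in this thread.
    """
    base = BASES[server]
    if paint_only:
        return f"{base}/products/annotations/paint_{group}.gaf.gz"
    return f"{base}/annotations/{group}.gaf.gz"


print(annotation_url("mgi"))
print(annotation_url("mgi", paint_only=True))
```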

tonysawfordebi commented 6 years ago

Because of the way that our import pipelines work, it is easier for us to process all of the PAINT annotations separately, rather than getting them from individual MODs' files.

Looking at http://snapshot.geneontology.org/products/annotations I see that there are several PAINT-related files for each MOD. For example, for MGI we have:

paint_mgi-prediction.gaf
paint_mgi-src.gaf.gz
paint_mgi.gaf.gz
paint_mgi.gpad.gz
paint_mgi.gpi.gz
paint_mgi_noiea.gaf.gz
paint_mgi_valid.gaf.gz

What's the difference between them all? Would I be right to assume that the ones I'm interested in are paint_mgi.[gaf|gpad].gz?
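For reference, the selection I have in mind can be written as a pattern over the file names. This is a minimal Python sketch using the MGI listing above. `main_paint_files` is a hypothetical helper, and whether these really are the right files to import is exactly the open question.

```python
import re
from typing import List

# File names as listed for MGI on the snapshot server.
listing = [
    "paint_mgi-prediction.gaf",
    "paint_mgi-src.gaf.gz",
    "paint_mgi.gaf.gz",
    "paint_mgi.gpad.gz",
    "paint_mgi.gpi.gz",
    "paint_mgi_noiea.gaf.gz",
    "paint_mgi_valid.gaf.gz",
]


def main_paint_files(names: List[str], group: str) -> List[str]:
    """Keep only paint_<group>.gaf.gz and paint_<group>.gpad.gz.

    The group is passed explicitly so that suffixed variants such as
    -src, _noiea and _valid are excluded by the exact match.
    """
    pattern = re.compile(rf"^paint_{re.escape(group)}\.(gaf|gpad)\.gz$")
    return [name for name in names if pattern.match(name)]


print(main_paint_files(listing, "mgi"))
```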