Closed ndrawlings closed 6 years ago
Hi everyone, just to weigh in quickly - the annotation in question is one from PAINT and I've labeled the issue as such
Thanks @ggeorghiou Next time you can also assign me.
Thanks, Pascale
Looks like the primary annotations were removed; the omitted annotations report in PAINT states:
PTN001355152 annotated to GO:0004185 with evidence code IBA for annotation id 425143069 has paint evidence code but not PAINT evidence type GO_REF
So we cannot remove it manually. @huaiyumi Should we be able to remove these directly in PAINT so that they are cleaned from the DB?
The annotation is still in QuickGO and AmiGO.
Now I find the annotation in QuickGO but not in AmiGO, and not in PAINT.
@tonysawfordebi is QuickGO up to date?
@pgaudet Yes, and the annotation is still in ftp://ftp.geneontology.org/pub/go/gene-associations/submission/paint/pre-submission/gene_association.paint_goa_human.gz
Have the PAINT files moved? Should we be picking them up from somewhere else?
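For anyone who wants to check a given PAINT file for the annotation in question, here is a minimal sketch. It assumes a tab-separated GAF 2.x file with the DB object ID in column 2 and the GO ID in column 5; the function names are mine and not part of any GO tooling.

```python
import gzip


def find_annotations(gaf_lines, object_id, go_id):
    """Yield GAF rows (as lists of columns) that annotate object_id to go_id."""
    for line in gaf_lines:
        if line.startswith("!"):  # header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 15:  # GAF rows carry at least 15 columns
            continue
        # Columns are 1-based in the GAF spec: 2 = DB object ID, 5 = GO ID.
        if cols[1] == object_id and cols[4] == go_id:
            yield cols


def find_in_gzipped_gaf(path, object_id, go_id):
    """Scan a local .gaf.gz file and return all matching rows."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return list(find_annotations(handle, object_id, go_id))
```

After downloading a file locally, something like `find_in_gzipped_gaf("gene_association.paint_human.gaf.gz", "P16870", "GO:0004185")` should return an empty list once the annotation has really been removed from that file.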
Hi @tonysawfordebi
Looks like you are picking up the old files. The current human file is at: ftp://ftp.pantherdb.org/downloads/paint/presubmission/gene_association.paint_human.gaf.gz, in accordance with the info in the yaml file: https://github.com/geneontology/go-site/blob/master/metadata/datasets/paint.yaml
(The site you mention: ftp://ftp.geneontology.org/pub/go/gene-associations/submission/paint/pre-submission/gene_association.paint_goa_human.gz doesn't seem to have been updated since 2017, by looking at the dates of the files here: ftp://ftp.geneontology.org/pub/go/gene-associations/submission/paint).
@dougli1sqrd Can you confirm the information I gave is correct ?
Thanks, Pascale
In that case, can I suggest that the files be removed from the old location and replaced with a README pointing people to the correct location?
PS: was this change of location announced anywhere?
At the GOC meeting, and it's here (in some form): http://wiki.geneontology.org/index.php/Release_Pipeline
I don't think it was very clear, though, what was done and what was only planned. We are trying to sort out the documentation, but at least for PAINT we could announce something.
Sorry about that.
OK, I'll modify our import pipeline accordingly.
With regard to the new location, I see that the files are named like "gene_association.paint_something.gaf.gz" - isn't that a bit redundant? "paint_something.gaf.gz" would be sufficient.
Also, I do recall a discussion at the NY meeting about publishing the files in GPAD format, thus allowing more information to be captured - is that likely to be happening any time soon?
Hi @tonysawfordebi
Please wait for @cmungall @kltm or @dougli1sqrd to comment before changing the link. The preferred link might be: http://snapshot.geneontology.org/products/annotations/paint_mgi.gaf.gz (which is PAINT files from the link I sent, reprocessed somehow and reexported by the pipeline)
@huaiyumi can comment on file names, but I think these were always constructed sort of like this; maybe it's a hassle for any script to change it. I don't have strong feelings.
WRT GPAD, my recollection is that we had asked @huaiyumi to do that, but if the GO pipeline processes and exports the files, perhaps it makes more sense that the GPAD files would be exported at that point ? @cmungall what do you think ?
Thanks, Pascale
OK
Just for fun, I grabbed all of the GAFs from ftp://ftp.pantherdb.org/downloads/paint/presubmission and ran them through our checker, and this is the summary of what it found (I won't post the whole log here, as it's > 250MB):
Number of lines processed: 2206254
Total number of annotations: 2206202
Number of annotations assigned by GO_Central: 2206202
Total number of problems detected: 584925
Number of annotations with error "Obsolete GO ID": 931
Number of annotations with error "Restricted GO term: gocheck_do_not_annotate": 29
Number of annotations with error "Restricted GO term: gocheck_do_not_manually_annotate": 18
Number of annotations with error "Secondary GO ID": 196
Number of annotations with error "Unsupported qualifier": 45275
Number of annotations with error "With/from contains one or more invalid components": 538476
Total number of warnings: 0
Number of annotations with no errors: 1625116

Number of annotations with invalid with/from components: 538476
ECO:0000318 (IBA) - valid entity types: CHEBI:33697 (ribonucleic acid) or NCIT:C20130 (protein family) or PR:000000001 (protein) or SO:0000704 (gene)
    CGD [SO:0000704 (gene)]: 15263
    EcoGene [entity type not known]: 142123
    TAIR [BET:0000000 (communication) or SO:0000185 (primary transcript) or SO:0000704 (gene)]: 380658
    WB [PR:000000001 (protein) or SO:0000704 (gene) or VariO:0001 (variation)]: 432

Number of annotations that refer to a secondary GO ID: 196
    GO:0005329 (replaced by GO:0005330): 20
    GO:0005333 (replaced by GO:0005334): 24
    GO:0005605 (replaced by GO:0005604): 13
    GO:0015222 (replaced by GO:0005335): 130
    GO:0070283 (replaced by GO:1904047): 9

Number of annotations that refer to an obsolete GO ID: 931
    GO:0000989 (no replacement term defined): 115
    GO:0000991 (no replacement term defined): 186
    GO:0001076 (no replacement term defined): 176
    GO:0001129 (no replacement term defined): 265
    GO:0001190 (no replacement term defined): 11
    GO:0001191 (no replacement term defined): 178

Number of annotations with an unknown or unsupported qualifier: 45275
    COLOCALIZES_WITH: 10328
    CONTRIBUTES_TO: 34947

Number of annotations to restricted GO terms: 47
gocheck_do_not_annotate
    GO:0040007 (growth): 29
gocheck_do_not_manually_annotate
    GO:0006950 (response to stress): 18

As you can see, the largest single class of error is from IBA annotations that refer to a TAIR ID in their with/from, for example:
Line 20: ERROR With/from contains one or more invalid components [[ECO:0000318 (IBA)] [TAIR:locus:2130864]]
20> UniProtKB Q9HC62 SENP2 GO:0016926 PMID:21873635 IBA PANTHER:PTN000288424|UniProtKB:Q9HC62|SGD:S000005941|UniProtKB:Q9P0U3|MGI:MGI:2445054|WB:WBGene00006737|SGD:S000001293|UniProtKB:A0A1D8PSK4|PomBase:SPBC19G7.09|FB:FBgn0027603|TAIR:locus:2130864|MGI:MGI:1923076|TAIR:locus:2077632|UniProtKB:Q5B9U1|WB:WBGene00006736|UniProtKB:A0A1D8PIW0 P Sentrin-specific protease 2 UniProtKB:Q9HC62|PTN002489016 protein taxon:9606 2017-02-28 GO_Central
According to https://github.com/geneontology/go-site/blob/master/metadata/db-xrefs.yaml TAIR:locus IDs are of type SO:0000185 (primary transcript), but https://github.com/geneontology/go-site/blob/master/metadata/eco-usage-constraints.yaml states that the with/from for IBA (ECO:0000318) annotations can consist of entities of type gene, protein, protein family, and rna.
Is some adjustment required somewhere?
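To make the mismatch concrete, here is a rough sketch of the kind of check involved. This is not the actual checker code. The allowed types for ECO:0000318 and the TAIR:locus type are taken from the text above; the UniProtKB row and all function and table names are assumptions added for illustration, and IDs the small table does not recognise are simply skipped.

```python
import re

# Entity types allowed in with/from for IBA, as listed in the summary above.
ALLOWED_WITH_TYPES = {
    "ECO:0000318": {"CHEBI:33697", "NCIT:C20130", "PR:000000001", "SO:0000704"},
}

# Illustrative ID-pattern -> entity-type table. Only the TAIR:locus row comes
# from this thread; the UniProtKB row is an assumption added for contrast.
ID_TYPE_PATTERNS = [
    (re.compile(r"^TAIR:locus:\d+$"), "SO:0000185"),         # primary transcript
    (re.compile(r"^UniProtKB:[A-Z0-9]+$"), "PR:000000001"),  # protein (assumed)
]


def entity_type(curie):
    """Return the entity type for an ID, or None if the table does not know it."""
    for pattern, type_id in ID_TYPE_PATTERNS:
        if pattern.match(curie):
            return type_id
    return None


def invalid_with_components(eco_id, with_from):
    """List with/from IDs whose known entity type is not allowed for eco_id."""
    allowed = ALLOWED_WITH_TYPES.get(eco_id)
    if allowed is None:  # no constraint recorded for this evidence code
        return []
    bad = []
    # GAF with/from entries are separated by "|" or ",".
    for curie in re.split(r"[|,]", with_from):
        curie = curie.strip()
        type_id = entity_type(curie) if curie else None
        if type_id is not None and type_id not in allowed:
            bad.append(curie)
    return bad
```

With these tables, any `TAIR:locus:...` entry in an IBA with/from is reported, because primary transcript is not among the four allowed types. Adding SO:0000185 to the allowed set, or typing TAIR:locus IDs as genes, would each make the error go away, which is the adjustment being asked about.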
@tonysawfordebi I opened a new ticket.
To clarify, for internal people, the main annotation products of the GO are a single merged file that has gone through filtering, QC, etc--including annotations direct-from-MOD, PAINT, and (in the future) Noctua:
http://snapshot.geneontology.org/annotations/mgi.gaf.gz
Not really publicly advertised, but for GO "internal" use, we have several intermediate files that are available; for example, just the QCed PAINT GAF:
http://snapshot.geneontology.org/products/annotations/paint_mgi.gaf.gz
Note that these are all coming off of the snapshot server, which is intended for internal use. For the monthly public "releases", we use either http://current.geneontology.org or http://release.geneontology.org (same thing, but versioned), which comes out monthly(ish); see: http://wiki.geneontology.org/index.php/Release_Pipeline
Okay, as part of that monthly public release, we (for the time being) push files back into the legacy location at the GO SVN, which also appears to be exposed here: ftp://ftp.geneontology.org/pub/go/gene-associations/submission/paint . This is only there so that we do not break people's pipelines at the moment. In general, people should be using either snapshot (internal/daily) or current (public/monthly).
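As a rough illustration of the layout described above, an import pipeline could build its download URLs like this. The two snapshot paths are the ones quoted in this thread; that the current server uses the same path layout is an assumption, and the function and parameter names are made up for the example.

```python
# Hosts named in this thread.
BASE_URLS = {
    "snapshot": "http://snapshot.geneontology.org",  # internal, daily
    "current": "http://current.geneontology.org",    # public, monthly
}


def annotation_url(channel, dataset, paint_only=False, fmt="gaf"):
    """Build a download URL for a dataset such as "mgi".

    paint_only=True points at the intermediate QCed PAINT file instead of
    the merged product. Assumes both hosts share the same path layout.
    """
    base = BASE_URLS[channel]
    if paint_only:
        return f"{base}/products/annotations/paint_{dataset}.{fmt}.gz"
    return f"{base}/annotations/{dataset}.{fmt}.gz"
```

Keeping the host in one table means switching a pipeline from snapshot to current is a one-word change rather than an edit to every hard-coded link.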
Because of the way that our import pipelines work, it is easier for us to process all of the PAINT annotations separately, rather than getting them from individual MODs' files.
Looking at http://snapshot.geneontology.org/products/annotations I see that there are several PAINT-related files for each MOD. For example, for MGI we have:
paint_mgi-prediction.gaf
paint_mgi-src.gaf.gz
paint_mgi.gaf.gz
paint_mgi.gpad.gz
paint_mgi.gpi.gz
paint_mgi_noiea.gaf.gz
paint_mgi_valid.gaf.gz
What's the difference between them all? Would I be right to assume that the ones I'm interested in are paint_mgi.[gaf|gpad].gz?
UniProt entry P16870 includes human carboxypeptidase E, a metallopeptidase from MEROPS family M14. It has been incorrectly assigned the GO term GO:0004185 for serine-type carboxypeptidase activity. It is correctly assigned the GO term GO:0004181 for metallocarboxypeptidase activity.