Closed pgaudet closed 5 months ago
@tonysawfordebi I have no idea how that's implemented, but isn't SO:0000185 (primary transcript) an RNA?
In all honesty, I don't know. I'd need to check what it says in go-upper.obo.
Just looked at http://snapshot.geneontology.org/ontology/extensions/go-upper.obo and there's no mention of SO:0000185
Maybe that's what's missing?
there's no mention of rna either! @cmungall @kltm
Another thing I've just found is that the checker is spitting out about 15K annotations that have CGD IDs in their with/from that don't conform to the id_syntax regexp in db_xrefs.yaml.
db_xrefs says that CGD IDs should match (CAL|CAF)[0-9]{7}, but what we have in the PAINT GAFs appears to match (CAL|CAF)[0-9]{10}
For example, CGD:CAL0000179664
What is the correct form of a CGD ID?
@dougli1sqrd where is that information already?
from db_xrefs.yaml: https://github.com/geneontology/go-site/blob/master/metadata/db-xrefs.yaml#L350
entity_types:
  - type_name: gene
    type_id: SO:0000704
    id_syntax: (CAL|CAF)[0-9]{7}
    url_syntax: http://www.candidagenome.org/cgi-bin/locus.pl?dbid=[example_id]
    example_id: CGD:CAL0005516
    example_url: http://www.candidagenome.org/cgi-bin/locus.pl?dbid=CAL0005516
My understanding is whatever is in db_xrefs is the standard.
It looks like db_xrefs.yaml is out of date - I've checked the CGD GAF, and all of the identifiers conform to (CAL|CAF)[0-9]{10} (as do all the UniProt CGD xrefs).
I'll update the CGD entry.
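The mismatch can be reproduced with a few lines of Python. This is only a sketch: the two patterns and the example ID are the ones quoted in this thread, while the helper name `local_id_matches` and the choice of a full (anchored) match are assumptions about how a checker might apply `id_syntax`, not a description of the actual checker.

```python
import re

# Patterns quoted in this thread; the constant names are illustrative only.
OLD_CGD = re.compile(r"(CAL|CAF)[0-9]{7}")   # id_syntax in db-xrefs.yaml before the update
NEW_CGD = re.compile(r"(CAL|CAF)[0-9]{10}")  # form seen in the CGD GAF and the PAINT GAFs


def local_id_matches(pattern, curie, prefix="CGD"):
    """Return True if the part after 'PREFIX:' matches the pattern in full.

    Assumption: the id_syntax regexp must match the whole local identifier.
    """
    db, _, local_id = curie.partition(":")
    return db == prefix and pattern.fullmatch(local_id) is not None


print(local_id_matches(OLD_CGD, "CGD:CAL0000179664"))  # False
print(local_id_matches(NEW_CGD, "CGD:CAL0000179664"))  # True
```

Under the same assumption, the 7-digit `example_id` in the entry above (CGD:CAL0005516) passes the old pattern but not the new one.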
Here's an updated report, now that db_xrefs.yaml has been updated:
Number of lines processed: 2203905
Total number of annotations: 2203891
Number of annotations assigned by GO_Central: 2203891
Total number of problems detected: 737524
Number of annotations with error "Obsolete GO ID": 940
Number of annotations with error "Restricted GO term: gocheck_do_not_annotate": 29
Number of annotations with error "Restricted GO term: gocheck_do_not_manually_annotate": 18
Number of annotations with error "Secondary GO ID": 196
Number of annotations with error "Unsupported / unmapped identifier": 309321
Number of annotations with error "Unsupported qualifier": 45647
Number of annotations with error "With/from contains one or more invalid components": 381373
Total number of warnings: 0
Number of annotations with no errors: 1517743

Number of annotations with invalid with/from components: 381373
  ECO:0000318 (IBA) - valid entity types: CHEBI:33697 (ribonucleic acid) or NCIT:C20130 (protein family) or PR:000000001 (protein) or SO:0000704 (gene)
    TAIR [BET:0000000 (communication) or SO:0000185 (primary transcript) or SO:0000704 (gene)]: 380720
    WB [PR:000000001 (protein) or SO:0000704 (gene) or VariO:0001 (variation)]: 653

Number of annotations with unmapped identifiers: 309321
Number of unmapped identifiers: 72040

Number of annotations that refer to a secondary GO ID: 196
  GO:0005329 (replaced by GO:0005330): 20
  GO:0005333 (replaced by GO:0005334): 24
  GO:0005605 (replaced by GO:0005604): 13
  GO:0015222 (replaced by GO:0005335): 130
  GO:0070283 (replaced by GO:1904047): 9

Number of annotations that refer to an obsolete GO ID: 940
  GO:0000989 (no replacement term defined): 115
  GO:0000991 (no replacement term defined): 188
  GO:0001076 (no replacement term defined): 178
  GO:0001129 (no replacement term defined): 269
  GO:0001190 (no replacement term defined): 11
  GO:0001191 (no replacement term defined): 179

Number of annotations with an unknown or unsupported qualifier: 45647
  COLOCALIZES_WITH: 10371
  CONTRIBUTES_TO: 35276

Number of annotations to restricted GO terms: 47
  gocheck_do_not_annotate
    GO:0040007 (growth): 29
  gocheck_do_not_manually_annotate
    GO:0006950 (response to stress): 18
@pgaudet @huaiyumi @dustine32 As requested on the annotation call last week, here are the checks that PAINT previously did but that have been dropped (based on Tony's very useful report, which mirrors the checks that used to be in place). These should be in both PAINT itself, for real-time feedback to the curators, and in the export function (touchup), to catch annotations that have gone stale. (Note: these are not in any particular order, just a list.)
I think this covers PAINT and the GAF export: the union of Tony's checks (the ID checks weren't in PAINT before) and these checks (Tony's don't have 4&5 or 6&7, and couldn't, because he doesn't have the tree info).
As I've mentioned before, the validation module for PAINT annotations should be a standalone service that both PAINT itself and the exporter can use. I called that service touchup.
Here's a final (for now, at least) report, now that I've eliminated most of the "unmapped identifier" errors.
Number of lines processed: 2203905
Total number of annotations: 2203891
Number of annotations assigned by GO_Central: 2203891
Total number of problems detected: 431268
Number of annotations with error "Obsolete GO ID": 940
Number of annotations with error "Restricted GO term: gocheck_do_not_annotate": 29
Number of annotations with error "Restricted GO term: gocheck_do_not_manually_annotate": 18
Number of annotations with error "Secondary GO ID": 196
Number of annotations with error "Unsupported / unmapped identifier": 3065
Number of annotations with error "Unsupported qualifier": 45647
Number of annotations with error "With/from contains one or more invalid components": 381373
Total number of warnings: 0
Number of annotations with no errors: 1775724

Number of annotations with invalid with/from components: 381373
  ECO:0000318 (IBA) - valid entity types: CHEBI:33697 (ribonucleic acid) or NCIT:C20130 (protein family) or PR:000000001 (protein) or SO:0000704 (gene)
    TAIR [BET:0000000 (communication) or SO:0000185 (primary transcript) or SO:0000704 (gene)]: 380720
    WB [PR:000000001 (protein) or SO:0000704 (gene) or VariO:0001 (variation)]: 653

Number of annotations with unmapped identifiers: 3065
Number of unmapped identifiers: 846

Number of annotations that refer to a secondary GO ID: 196
  GO:0005329 (replaced by GO:0005330): 20
  GO:0005333 (replaced by GO:0005334): 24
  GO:0005605 (replaced by GO:0005604): 13
  GO:0015222 (replaced by GO:0005335): 130
  GO:0070283 (replaced by GO:1904047): 9

Number of annotations that refer to an obsolete GO ID: 940
  GO:0000989 (no replacement term defined): 115
  GO:0000991 (no replacement term defined): 188
  GO:0001076 (no replacement term defined): 178
  GO:0001129 (no replacement term defined): 269
  GO:0001190 (no replacement term defined): 11
  GO:0001191 (no replacement term defined): 179

Number of annotations with an unknown or unsupported qualifier: 45647
  COLOCALIZES_WITH: 10371
  CONTRIBUTES_TO: 35276

Number of annotations to restricted GO terms: 47
  gocheck_do_not_annotate
    GO:0040007 (growth): 29
  gocheck_do_not_manually_annotate
    GO:0006950 (response to stress): 18
Thanks @tonysawfordebi Can you give an example of annotations with invalid with/from components?
Here are a couple:
Line 3: ERROR With/from contains one or more invalid components [[ECO:0000318 (IBA)] [TAIR:locus:2087168]]
3> UniProtKB Q9H310 enables GO:0008519 PMID:21873635 ECO:0000318 MGI:MGI:1888517|MGI:MGI:1927379|PANTHER:PTN000198157|RGD:727859|TAIR:locus:2087168|TAIR:locus:2087173|TAIR:locus:2117758|TAIR:locus:2140877|UniProtKB:Q02094|UniProtKB:Q4VUI0|UniProtKB:Q9H310|UniProtKB:Q9UBD6 20170228 GO_Central

Line 17: ERROR With/from contains one or more invalid components [[ECO:0000318 (IBA)] [TAIR:locus:2077632]]
17> UniProtKB Q9HC62 involved_in GO:0016926 PMID:21873635 ECO:0000318 FB:FBgn0027603|MGI:MGI:1923076|MGI:MGI:2445054|PANTHER:PTN000288424|PomBase:SPBC19G7.09|SGD:S000001293|SGD:S000005941|TAIR:locus:2077632|TAIR:locus:2130864|UniProtKB:A0A1D8PIW0|UniProtKB:A0A1D8PSK4|UniProtKB:Q5B9U1|UniProtKB:Q9HC62|UniProtKB:Q9P0U3|WB:WBGene00006736|WB:WBGene00006737 20170228 GO_Central
Fixing comment for @selewis: @kltm @huaiyumi @dougli1sqrd @dustine32 would you look and see if these errors are present in the export? Just want to figure out where these odd 'with' values are arising.
Sorry about that, of course I wouldn't do that intentionally (not that dumb) but it wasn't at all obvious that gmail was doing this behind the scenes.
It may be that the problem is not with the with/from values per se, but with the metadata rules that drive the validation process.
Here's what I wrote in an off-line conversation with @pgaudet about this:
The rules for what constitutes a valid with/from for any evidence code are defined in https://github.com/geneontology/go-site/blob/master/metadata/eco-usage-constraints.yaml
The entry for ECO:0000318 (IBA) looks like:
- eco_id: ECO:0000318
  go_evidence: IBA
  with_presence: mandatory
  with_structure: simple
  with_entities:
    - entity_type: *gene
    - entity_type: *protein
    - entity_type: *protein_family
    - entity_type: *rna

This says that IBA annotations must have a with/from, and that it must consist of components that identify genes, proteins, protein families or RNAs.
These entity types are defined at the top of the file, thus:
- &protein
  id: PR:000000001
  name: protein
- &gene
  id: SO:0000704
  name: gene
- &genotype
  id: SO:0001027
  name: genotype
- &protein_family
  id: NCIT:C20130
  name: protein family
- &rna
  id: CHEBI:33697
  name: ribonucleic acid
We now need to turn our attention to https://github.com/geneontology/go-site/blob/master/metadata/db-xrefs.yaml, which specifies - for any database that mints identifiers - what type of entity those identifiers refer to, and what form they take (i.e., what the syntax of the identifier is).
The entry for TAIR looks like:
- database: TAIR
  name: The Arabidopsis Information Resource
  rdf_uri_prefix: http://identifiers.org/tair.locus/
  generic_urls:
    - http://www.arabidopsis.org/
  entity_types:
    - type_name: gene
      type_id: SO:0000704
      id_syntax: gene:[0-9]{7,12}
      url_syntax: http://arabidopsis.org/servlets/TairObject?accession=[example_id]
      example_id: TAIR:gene:2062713
      example_url: http://arabidopsis.org/servlets/TairObject?accession=gene:2062713
    - type_name: communication
      type_id: BET:0000000
      id_syntax: Communication:[0-9]{7,12}
      url_syntax: http://arabidopsis.org/servlets/TairObject?type=communication&id=[example_id]
      example_id: TAIR:Communication:1345790
      example_url: http://arabidopsis.org/servlets/TairObject?type=communication&id=1345790
    - type_name: primary transcript
      type_id: SO:0000185
      id_syntax: locus:[0-9]{7}
      url_syntax: http://arabidopsis.org/servlets/TairObject?accession=[example_id]
      example_id: TAIR:locus:2146653
      example_url: http://arabidopsis.org/servlets/TairObject?accession=locus:2146653
This says that TAIR mints identifiers for the entity types gene, communication, and primary transcript, and those identifiers have the form gene:[0-9]{7,12}, Communication:[0-9]{7,12}, and locus:[0-9]{7}, respectively.
So, an identifier like TAIR:locus:2087168 refers to a primary transcript, which is not one of the entity types that is considered valid in an IBA with/from.
There is one final metadata file that is used in the validation process, namely http://purl.obolibrary.org/obo/go/snapshot/extensions/go-upper.obo. This file contains a classification of entity types, so if, for example, primary transcript is_a rna, then this is the place where that should be stated, but it isn't.
I don't know what the correct solution to all this is, whether it's to extend the IBA entry in eco-usage-constraints.yaml or to include some additional classification in go-upper.obo. What I can say for sure is that, given the current state of the metadata files, TAIR:locus IDs are not considered valid components for an IBA with/from.
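The chain of lookups described above can be sketched in Python. This is a hedged illustration, not the checker's code: the dictionaries are hand-copied subsets of the metadata quoted in this comment (a real checker would load the YAML and OBO files), and all function and variable names are hypothetical.

```python
import re

# Valid entity types for ECO:0000318 (IBA), from eco-usage-constraints.yaml as quoted above.
IBA_VALID_TYPES = {"SO:0000704", "PR:000000001", "NCIT:C20130", "CHEBI:33697"}

# (type_id, id_syntax) pairs for TAIR, from db-xrefs.yaml as quoted above.
DB_ENTITY_TYPES = {
    "TAIR": [
        ("SO:0000704", r"gene:[0-9]{7,12}"),
        ("BET:0000000", r"Communication:[0-9]{7,12}"),
        ("SO:0000185", r"locus:[0-9]{7}"),
    ],
}

# Child -> parent is_a links. Empty here, mirroring the observation that
# go-upper.obo says nothing about SO:0000185.
IS_A = {}


def entity_type_of(curie):
    """Return the entity type whose id_syntax matches the local ID, or None."""
    db, _, local_id = curie.partition(":")
    for type_id, syntax in DB_ENTITY_TYPES.get(db, []):
        if re.fullmatch(syntax, local_id):
            return type_id
    return None


def self_and_ancestors(type_id, is_a):
    """Collect a type and everything reachable from it via is_a links."""
    seen = set()
    while type_id is not None and type_id not in seen:
        seen.add(type_id)
        type_id = is_a.get(type_id)
    return seen


def is_valid_with(curie, valid_types, is_a):
    """A with/from component is valid if its type, or an ancestor, is allowed."""
    type_id = entity_type_of(curie)
    return type_id is not None and bool(self_and_ancestors(type_id, is_a) & valid_types)


print(is_valid_with("TAIR:locus:2087168", IBA_VALID_TYPES, IS_A))  # False
```

Either of the two candidate fixes makes the same ID pass in this sketch: adding a hypothetical link such as `{"SO:0000185": "CHEBI:33697"}` to the classification, or adding SO:0000185 to the set of valid types for IBA.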
Thanks @tonysawfordebi for the fantastic explanation! I'm looking into who can/should update the metadata files for the TAIR IDs.
For the 653 invalid WB with/from values, do you have a few examples?
@dustine32
GO:0015222 will be fixed on next update due to change in https://github.com/pantherdb/fullgo_paint_update/issues/12. The other obsoleted terms previously mentioned (e.g. GO:0000989, GO:0000991) were obsoleted too recently (7-29-18) for this update. Should get corrected on next update.
Need to check at the next update
@selewis From the rules you mention above, I think this may lead to confusion:
Use the current, most up-to-date ontology version (not an out-of-date copy, and the ontology is updated on a daily basis).
Now that we do monthly releases in AmiGO, using the most up to date ontology for generating PAINT GAFs may result in differences (for example, an obsolete GO term that is not obsolete yet in AmiGO). Ideally the same versions should be used. What do you think ?
@dustine32 Apologies for the delayed reply - I was away from the office last week
With regard to the invalid WB IDs, I've taken a look and it seems that in the current set of PAINT files there are only two distinct WB IDs that don't conform to the db-xrefs.yaml specifications, and they are:
WB:F09E5.15c
WB:F52H2.2b
Thanks @tonysawfordebi ! Will try to track down what's up with these IDs.
Now that the issue with the TAIR:locus IDs has been fixed (*), I ran the current set of PAINT files through our checker again, and this is the current state of play:
Number of lines processed: 2203905
Total number of annotations: 2203891
Number of annotations assigned by GO_Central: 2203891
Total number of problems detected: 50548
Number of annotations with error "Obsolete GO ID": 940
Number of annotations with error "Restricted GO term: gocheck_do_not_annotate": 29
Number of annotations with error "Restricted GO term: gocheck_do_not_manually_annotate": 18
Number of annotations with error "Secondary GO ID": 196
Number of annotations with error "Unsupported / unmapped identifier": 3065
Number of annotations with error "Unsupported qualifier": 45647
Number of annotations with error "With/from contains one or more invalid components": 653
Total number of warnings: 0
Number of annotations with no errors: 2153425

Number of annotations with invalid with/from components: 653
  ECO:0000318 (IBA) - valid entity types: CHEBI:33697 (ribonucleic acid) or NCIT:C20130 (protein family) or PR:000000001 (protein) or SO:0000185 (primary transcript) or SO:0000704 (gene)
    WB [PR:000000001 (protein) or SO:0000704 (gene) or VariO:0001 (variation)]: 653

Number of annotations with unmapped identifiers: 3065
Number of unmapped identifiers: 846

Number of annotations that refer to a secondary GO ID: 196
  GO:0005329 (replaced by GO:0005330): 20
  GO:0005333 (replaced by GO:0005334): 24
  GO:0005605 (replaced by GO:0005604): 13
  GO:0015222 (replaced by GO:0005335): 130
  GO:0070283 (replaced by GO:1904047): 9

Number of annotations that refer to an obsolete GO ID: 940
  GO:0000989 (no replacement term defined): 115
  GO:0000991 (no replacement term defined): 188
  GO:0001076 (no replacement term defined): 178
  GO:0001129 (no replacement term defined): 269
  GO:0001190 (no replacement term defined): 11
  GO:0001191 (no replacement term defined): 179

Number of annotations with an unknown or unsupported qualifier: 45647
  COLOCALIZES_WITH: 10371
  CONTRIBUTES_TO: 35276

Number of annotations to restricted GO terms: 47
  gocheck_do_not_annotate
    GO:0040007 (growth): 29
  gocheck_do_not_manually_annotate
    GO:0006950 (response to stress): 18
The biggest single problem now is the bad qualifiers, which just need to be converted to lower case.
(*) I had to update the id_syntax for TAIR:locus IDs in db_xrefs.yaml from locus:[0-9]{7} to locus:[0-9]{7,12}
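For the qualifier fix mentioned above, a minimal sketch of the lower-casing step might look like this. It assumes the qualifier field can hold several pipe-separated values (for example a `NOT` prefix that should stay as it is) and that only the two upper-case qualifiers seen in the report need converting; the function name is hypothetical and this is not the script used in the PAINT pipeline.

```python
# Upper-case qualifiers reported as unsupported in the summary above.
KNOWN_UPPERCASE = {"COLOCALIZES_WITH", "CONTRIBUTES_TO"}


def normalize_qualifier(field):
    """Lower-case the known upper-case qualifiers; leave everything else untouched.

    Assumption: multiple qualifiers in one field are separated by '|'.
    """
    parts = field.split("|")
    return "|".join(p.lower() if p in KNOWN_UPPERCASE else p for p in parts)


print(normalize_qualifier("CONTRIBUTES_TO"))        # contributes_to
print(normalize_qualifier("NOT|COLOCALIZES_WITH"))  # NOT|colocalizes_with
```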
@huaiyumi I thought the qualifier case was fixed?
@tonysawfordebi Which PAINT file are you looking at ?
I'm grabbing paint_*.gpad.gz from http://snapshot.geneontology.org/products/annotations
@pgaudet sorry, I hadn't rerun @huaiyumi 's script changes to fix the qualifier case yet. I was planning on just running it on this month's update, which I'm starting today.
@huaiyumi @dustine32 I'm surprised to see any of these errors "Obsolete GO ID", "gocheck_do_not_annotate", "gocheck_do_not_manually_annotate" "Secondary GO ID"
The exporter should be catching all of these (actually PAINT used to, so that's a regression: something that was working is now gone). For obsoletes and secondary IDs, the correct ID should be swapped in. For the others, the annotations should be removed (and it should be fixed in PAINT). Are these being fixed (along with the qualifiers, etc.)?
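The policy described above could be sketched as a small decision function. The ID tables are copied from the checker reports in this thread; the function name, the action labels, and the choice to flag obsolete terms without a replacement for curator review are assumptions for illustration, not the behaviour of PAINT or the exporter.

```python
# Secondary -> primary GO IDs, as listed in the checker reports in this thread.
SECONDARY_TO_PRIMARY = {
    "GO:0005329": "GO:0005330",
    "GO:0005333": "GO:0005334",
    "GO:0005605": "GO:0005604",
    "GO:0015222": "GO:0005335",
    "GO:0070283": "GO:1904047",
}

# Obsolete terms for which the reports list no replacement term.
OBSOLETE_NO_REPLACEMENT = {
    "GO:0000989", "GO:0000991", "GO:0001076",
    "GO:0001129", "GO:0001190", "GO:0001191",
}

# Restricted terms flagged in the reports (growth, response to stress).
RESTRICTED = {"GO:0040007", "GO:0006950"}


def touch_up(go_id):
    """Return an (action, go_id) pair for one annotation's GO term."""
    if go_id in SECONDARY_TO_PRIMARY:
        return ("replace", SECONDARY_TO_PRIMARY[go_id])
    if go_id in OBSOLETE_NO_REPLACEMENT:
        # Nothing to swap in automatically; assumed to need curator attention.
        return ("review", go_id)
    if go_id in RESTRICTED:
        return ("remove", go_id)
    return ("keep", go_id)


print(touch_up("GO:0015222"))  # ('replace', 'GO:0005335')
print(touch_up("GO:0040007"))  # ('remove', 'GO:0040007')
```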
I've just run the checker over the latest set of PAINT files, and things are looking a lot healthier now.
Here's the latest summary:
Number of lines processed: 2218253
Total number of annotations: 2218239
Number of annotations assigned by GO_Central: 2218239
Total number of problems detected: 4207
Number of annotations with error "Qualifier not appropriate for GO term": 338
Number of annotations with error "Restricted GO term: gocheck_do_not_annotate": 29
Number of annotations with error "Restricted GO term: gocheck_do_not_manually_annotate": 18
Number of annotations with error "Unsupported / unmapped identifier": 3169
Number of annotations with error "With/from contains one or more invalid components": 653
Total number of warnings: 0
Number of annotations with no errors: 2214035

Number of annotations with invalid with/from components: 653
  ECO:0000318 (IBA) - valid entity types: CHEBI:33697 (ribonucleic acid) or NCIT:C20130 (protein family) or PR:000000001 (protein) or SO:0000185 (primary transcript) or SO:0000704 (gene)
    WB [PR:000000001 (protein) or SO:0000704 (gene) or VariO:0001 (variation)]: 653

Number of annotations with unmapped identifiers: 3169
Number of unmapped identifiers: 858

Number of annotations with inappropriate qualifier: 338
  contributes_to
    GO:0006470: 300
    GO:0016567: 24
    GO:0070102: 14

Number of annotations to restricted GO terms: 47
  gocheck_do_not_annotate
    GO:0040007 (growth): 29
  gocheck_do_not_manually_annotate
    GO:0006950 (response to stress): 18
@alexsign Out of curiosity - would you please rerun the P2GO QC checks on the PAINT set ?
@pgaudet here it is. Keep in mind that taxon constraint violations, of which there are about 6k, are excluded from the report below. I changed our pipeline to remove them on our side. It will run in production for the first time this weekend.

Number of lines processed: 3607071
Total number of annotations: 3607057
Number of annotations assigned by GO_Central: 3607057
Total number of annotations excluded: 20779
Number of annotations with error "Restricted GO term: gocheck_do_not_annotate": 6876
Number of annotations with error "Restricted GO term: gocheck_do_not_manually_annotate": 8629
Number of annotations with error "Unsupported / unmapped identifier": 3699
Number of annotations with error "With/from contains one or more invalid components": 1575
Total number of warnings: 0
Number of annotations with no errors: 3586278
Number of annotations output: 3817661
@pgaudet might be able to close?
Migrated from #1939
Citing @tonysawfordebi
Just for fun, I grabbed all of the GAFs from ftp://ftp.pantherdb.org/downloads/paint/presubmission and ran them through our checker, and this is the summary of what it found (I won't post the whole log here, as it's > 250MB): `SUMMARY
Number of lines processed: 2206254
Total number of annotations: 2206202
Number of annotations assigned by GO_Central: 2206202
Total number of problems detected: 584925
Number of annotations with error "Obsolete GO ID": 931
Number of annotations with error "Restricted GO term: gocheck_do_not_annotate": 29
Number of annotations with error "Restricted GO term: gocheck_do_not_manually_annotate": 18
Number of annotations with error "Secondary GO ID": 196
Number of annotations with error "Unsupported qualifier": 45275
Number of annotations with error "With/from contains one or more invalid components": 538476
Total number of warnings: 0
Number of annotations with no errors: 1625116

ANALYSIS

Number of annotations with invalid with/from components: 538476
  ECO:0000318 (IBA) - valid entity types: CHEBI:33697 (ribonucleic acid) or NCIT:C20130 (protein family) or PR:000000001 (protein) or SO:0000704 (gene)
    CGD [SO:0000704 (gene)]: 15263
    EcoGene [entity type not known]: 142123
    TAIR [BET:0000000 (communication) or SO:0000185 (primary transcript) or SO:0000704 (gene)]: 380658
    WB [PR:000000001 (protein) or SO:0000704 (gene) or VariO:0001 (variation)]: 432

Number of annotations that refer to a secondary GO ID: 196
  GO:0005329 (replaced by GO:0005330): 20
  GO:0005333 (replaced by GO:0005334): 24
  GO:0005605 (replaced by GO:0005604): 13
  GO:0015222 (replaced by GO:0005335): 130
  GO:0070283 (replaced by GO:1904047): 9

Number of annotations that refer to an obsolete GO ID: 931
  GO:0000989 (no replacement term defined): 115
  GO:0000991 (no replacement term defined): 186
  GO:0001076 (no replacement term defined): 176
  GO:0001129 (no replacement term defined): 265
  GO:0001190 (no replacement term defined): 11
  GO:0001191 (no replacement term defined): 178

Number of annotations with an unknown or unsupported qualifier: 45275
  COLOCALIZES_WITH: 10328
  CONTRIBUTES_TO: 34947

Number of annotations to restricted GO terms: 47
  gocheck_do_not_annotate
    GO:0040007 (growth): 29
  gocheck_do_not_manually_annotate
    GO:0006950 (response to stress): 18
`

As you can see, the largest single class of error is from IBA annotations that refer to a TAIR ID in their with/from, for example:
Line 20: ERROR With/from contains one or more invalid components [[ECO:0000318 (IBA)] [TAIR:locus:2130864]]
20> UniProtKB Q9HC62 SENP2 GO:0016926 PMID:21873635 IBA PANTHER:PTN000288424|UniProtKB:Q9HC62|SGD:S000005941|UniProtKB:Q9P0U3|MGI:MGI:2445054|WB:WBGene00006737|SGD:S000001293|UniProtKB:A0A1D8PSK4|PomBase:SPBC19G7.09|FB:FBgn0027603|TAIR:locus:2130864|MGI:MGI:1923076|TAIR:locus:2077632|UniProtKB:Q5B9U1|WB:WBGene00006736|UniProtKB:A0A1D8PIW0 P Sentrin-specific protease 2 UniProtKB:Q9HC62|PTN002489016 protein taxon:9606 2017-02-28 GO_Central

According to https://github.com/geneontology/go-site/blob/master/metadata/db-xrefs.yaml, TAIR:locus IDs are of type SO:0000185 (primary transcript), but https://github.com/geneontology/go-site/blob/master/metadata/eco-usage-constraints.yaml states that the with/from for IBA (ECO:0000318) annotations can consist of entities of type gene, protein, protein family, and rna.
Is some adjustment required somewhere?