Closed pgaudet closed 11 months ago
@dustine32 SPAC1039.06 is an example of a pombe IBA annotation missing in the release candidate, and present in the current release https://amigo.geneontology.org/amigo/gene_product/PomBase:SPAC1039.06 https://amigo-staging.geneontology.io/amigo/gene_product/PomBase:SPAC1039.06
So, I don't know where it gets filtered or dropped.
@pgaudet Thanks for this example! It helped me realize that this is the "no symbol in col 3" GO rule check causing this drop for at least 902 pombe annotations. Here are the two SPAC1039.06 IBA lines in pombase.gaf current (pre-GO rule symbol check):
PomBase SPAC1039.06 involved_in GO:0036088 PMID:21873635 IBA PANTHER:PTN001997216|dictyBase:DDB_G0279015 P protein_coding_gene taxon:284812 20181123 GO_Central
PomBase SPAC1039.06 enables GO:0008721 PMID:21873635 IBA PANTHER:PTN001997216|SGD:S000003164|dictyBase:DDB_G0279015 F protein_coding_gene taxon:284812 20181123 GO_Central
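To make the "no symbol in col 3" condition concrete, here is a minimal sketch of such a check. It is not the ontobio implementation; `has_blank_symbol` is a hypothetical helper, and the sample line is a reconstruction of the first line above that assumes the flattened whitespace hid an empty symbol column.

```python
def has_blank_symbol(gaf_line: str) -> bool:
    """Return True if a tab-separated GAF line has an empty column 3."""
    cols = gaf_line.rstrip("\n").split("\t")
    # Column 3 (index 2) is DB_Object_Symbol in GAF 2.x.
    return len(cols) < 3 or cols[2].strip() == ""


# Assumed reconstruction of the first IBA line, with column 3 left empty.
line = "\t".join([
    "PomBase", "SPAC1039.06", "", "involved_in", "GO:0036088",
    "PMID:21873635", "IBA", "PANTHER:PTN001997216|dictyBase:DDB_G0279015",
    "P", "", "", "protein_coding_gene", "taxon:284812", "20181123",
    "GO_Central", "", "",
])

print(has_blank_symbol(line))
```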
This exposes two issues:
paint_pombase-report.html or pombase-report.html.
I can work on point 1, obviously. Maybe getting a fixed set of PAINT GAFs out is our quick bandaid for this?
Tagging @mugitty (once she gets back) on point 2 to figure out why these messages aren't getting into the reports.
Looking into this further, fixing this in PAINT may be easier. The issue with the "no symbol in col 3" check is that it is part of gorule-0000001, which encompasses MANY different things, so we need to decide which items are ERRORs and which ones are WARNINGs.
I'll discuss this with Anushya next time we talk.
Following the ontobio validate.py GAF parse/write code in the debugger (probably for the eighth time), I think I've extracted the basic flow of how the PAINT IBA GAF symbol column is becoming blank:
1. pombase.gaf is parsed and a pombase.gpi is derived from it.
2. pombase.gpi (from MOD, different from pombase.gpi generated in step 1) is parsed and merged into the list of bioentities derived from upstream GAF. This upstream GPI line for SPAC1039.06 having blank symbol '' clobbers/overwrites the symbol value (SPAC1039.06) extracted from the derived GPI.
3. SPAC1039.06 overwrites the field values in the PAINT GAF.
4. pombase.gaf (pombase_valid.gaf) to produce the final annotations/pombase.gaf product file.

@pgaudet So, one thing here is that the upstream PomBase GPI should probably have some value in the symbol column, even if it's just repeating the ID column. The other is what ontobio should do when reading a GPI and either symbol or name columns are blank - I'm thinking we should just toss out these GPI lines to prevent carrying forward the bad entity data?
The GPI symbol-checking fix could go somewhere in this function: https://github.com/biolink/ontobio/blob/f86d367fa5c4c85ea6ce8743166ac072a6d66115/ontobio/io/entityparser.py#L285
@dustine32 I'd be sympathetic to the argument that your code is doing the right thing here: it's correctly transmitting the GPI information, which is generally regarded as the source of truth for gene product information. One could add a guardrail so that, when merging the GAF-derived-GPI info into the canonical GPI-derived info, any blanks are filled from the GAF-derived-GPI info, but the fact remains that the canonical information is "wrong-ish". The other approach would be to not merge and just use the GAF-derived-GPI.
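The guardrail idea can be sketched roughly as follows. This is an illustration only, not ontobio's merge code: the dict-based entity shape and the `merge_entity` name are assumptions made for the example.

```python
from typing import Dict

# Hypothetical entity shape for illustration: field name -> string value.
Entity = Dict[str, str]


def merge_entity(derived: Entity, canonical: Entity) -> Entity:
    """Prefer canonical (upstream GPI) values, but keep the GAF-derived
    value wherever the canonical value is blank."""
    merged = dict(derived)
    for field, value in canonical.items():
        if value.strip():
            merged[field] = value
    return merged


derived = {"id": "PomBase:SPAC1039.06", "symbol": "SPAC1039.06"}
canonical = {"id": "PomBase:SPAC1039.06", "symbol": ""}

# A plain overwrite lets the blank canonical symbol clobber the derived one.
naive = {**derived, **canonical}
# The guarded merge keeps the derived symbol because the canonical one is blank.
guarded = merge_entity(derived, canonical)

print(repr(naive["symbol"]), repr(guarded["symbol"]))
```

The trade-off is the one raised above: the guarded merge hides the gap in the canonical data instead of surfacing it.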
@kltm Right, thinking again about chucking out canonical GPI entities just cuz they're missing symbol, it'll probably have some unintended consequences after we implemented https://github.com/geneontology/go-site/issues/2066. I kind of don't want to touch this part now.
Instead, we could just repair any blank symbol to its CURIE "identity" (The "SPAC1039.06" part of "PomBase:SPAC1039.06") when writing a GAF out?
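A minimal sketch of that repair, assuming the entity ID is a CURIE of the form prefix:local_id. The function name `symbol_or_local_id` is hypothetical and is not part of ontobio.

```python
def symbol_or_local_id(curie: str, symbol: str) -> str:
    """Return the symbol, or the local part of the CURIE if the symbol is blank."""
    if symbol.strip():
        return symbol
    # Split on the first colon only; a local ID may itself contain colons.
    _prefix, _sep, local_id = curie.partition(":")
    # Fall back to the whole string if there was no colon at all.
    return local_id or curie


print(symbol_or_local_id("PomBase:SPAC1039.06", ""))
```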
@dustine32 I think part of the issue is a slight dissonance in the data flow. There are supposed to be two formats: GAF and GPAD/GPI. Ideally, for any data source, we're dealing with one or the other and that is the canonical source for information about that source. Easy peasy. In this case, however, we're essentially trying to blend two things and neither is considered the canonical SoT, so a little weird. I kinda feel like the "correct" answer is to get the GPI fixed. In this case, I might advocate fixing the GPI ourselves for this release and then try to work with the upstream to get these changes incorporated. Besides the obvious problems, is there much of a difference between the two?
Talking to @dustine32, he'll look at fixing the merge function in ontobio.
Is there a problem with our GAF file?
@ValWood I think there is something wrong with your GPI file, see https://github.com/geneontology/go-releases/issues/50#issuecomment-1745482235
Because not all rows have a symbol in column 3, some complicated script generates another file. Maybe it would be cleaner if the GPI file did not have to be reconstructed?
Yes sure, we did not know about it, or we would fix it.
Tagging @kimrutherford
@ValWood @kimrutherford Noting that the current PomBase GPI has 8175 entities with a blank symbol field:
$ curl -L https://www.pombase.org/data/annotations/Gene_ontology/pombase.gpi.gz | gunzip | grep -v -e "^\!" | awk 'BEGIN {FS="\t"};$2==""' | wc -l
8175
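For anyone who prefers to run the same count from Python, here is a rough equivalent of the grep/awk part of the pipeline. `count_blank_column` is a hypothetical helper, and the sample rows are made up for the demonstration; they are not taken from the PomBase file.

```python
from typing import Iterable


def count_blank_column(lines: Iterable[str], column: int = 2) -> int:
    """Count non-comment rows whose 1-based tab-separated column is empty,
    mirroring grep -v "^!" followed by awk '$2==""'."""
    count = 0
    for line in lines:
        if line.startswith("!"):
            continue  # skip header/comment lines
        cols = line.rstrip("\n").split("\t")
        value = cols[column - 1] if len(cols) >= column else ""
        if value == "":
            count += 1
    return count


# Made-up sample rows: one comment, one blank second column, one filled.
sample = [
    "!made-up header line\n",
    "PomBase:SPAC1039.06\t\tsome name\n",
    "PomBase:SPAC1039.06\tSPAC1039.06\tsome name\n",
]

print(count_blank_column(sample))
```

On a downloaded copy of the file, the same function could be fed a handle opened with `gzip.open("pombase.gpi.gz", "rt")`.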
My current (mostly unfounded) suspicion is that QuickGO has a GPI QA step that removes these no-symbol entities from annotation with/from fields.
@pgaudet With https://github.com/biolink/ontobio/pull/648, the GO pipeline is now much more forgiving of no-symbol GPI lines and just fills them in with the object ID to prevent loss of annotations due to the "no symbol in col 3" GO rule check.
> Noting that the current PomBase GPI has 8175 entities with a blank symbol field:
Thanks for spotting that. I've fixed our code so the PomBase GPI file should be OK in the morning (UK time).
-2254 ISOs >> ok expected
PAINT: