Changes in S. pombe annotations

pgaudet commented 1 year ago

-2254 ISOs >> ok expected

PAINT:

+52 IBA annotations to a new taxon: taxon_subset_closure_label: Schizosaccharomyces pombe 972h-
-955 IBA annotations >> weird because other groups did not change so much

pgaudet commented 1 year ago

@dustine32 SPAC1039.06 is an example of a pombe IBA annotation missing in the release candidate, and present in the current release https://amigo.geneontology.org/amigo/gene_product/PomBase:SPAC1039.06 https://amigo-staging.geneontology.io/amigo/gene_product/PomBase:SPAC1039.06

I checked in PAINT, and the annotation is present there.

I checked the PAINT error report, and I don't find SPAC1039.06.
I opened the PAINT pombase source file, and I find the annotations in there as well.
However, I don't find the PAINT annotations in the products.

So I don't know where it gets filtered or dropped.

dustine32 commented 1 year ago

@pgaudet Thanks for this example! It helped me realize that this is the "no symbol in col 3" GO rule check causing this drop for at least 902 pombe annotations. Here are the two SPAC1039.06 IBA lines in pombase.gaf current (pre-GO rule symbol check):

PomBase SPAC1039.06     involved_in GO:0036088  PMID:21873635   IBA PANTHER:PTN001997216|dictyBase:DDB_G0279015 P           protein_coding_gene taxon:284812    20181123    GO_Central
PomBase SPAC1039.06     enables GO:0008721  PMID:21873635   IBA PANTHER:PTN001997216|SGD:S000003164|dictyBase:DDB_G0279015  F           protein_coding_gene taxon:284812    20181123    GO_Central

This exposes two issues:

1. The PAINT IBA GAF generation script needs to fix the empty symbol col 3 problem by duplicating the ID (if no symbol exists), similar to what upstream PomBase is doing with their GAF.
2. These "ERROR: no symbol in col 3" messages should be in the GO pipeline reports, either in paint_pombase-report.html or pombase-report.html.

I can work on point 1, obviously. Maybe getting a fixed set of PAINT GAFs out is our quick bandaid for this?
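For illustration, here is a minimal Python sketch of the point 1 fix: when the symbol column of a GAF line is empty, copy the object ID into it. This is not the actual PAINT GAF generation script; the function name `fill_blank_symbol` is hypothetical, and it only assumes the standard tab-separated GAF layout (DB in column 1, object ID in column 2, symbol in column 3).

```python
def fill_blank_symbol(gaf_line: str) -> str:
    """Return the GAF line with an empty symbol (col 3) replaced by the object ID (col 2).

    Hypothetical helper for illustration only; not part of PAINT or ontobio.
    """
    # Header/comment lines start with "!" and are passed through untouched.
    if gaf_line.startswith("!"):
        return gaf_line

    has_newline = gaf_line.endswith("\n")
    cols = gaf_line.rstrip("\n").split("\t")

    # Lines too short to have a symbol column are left alone.
    if len(cols) < 3:
        return gaf_line

    # Duplicate the ID into the symbol column only when the symbol is blank.
    if not cols[2].strip():
        cols[2] = cols[1]

    return "\t".join(cols) + ("\n" if has_newline else "")


if __name__ == "__main__":
    line = "PomBase\tSPAC1039.06\t\tinvolved_in\tGO:0036088\tPMID:21873635\tIBA"
    print(fill_blank_symbol(line).split("\t")[2])
```

Lines that already carry a symbol are returned unchanged, so the function could be run over a whole file without disturbing good rows.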

Tagging @mugitty (once she gets back) on point 2 to figure out why these messages aren't getting into the reports.

pgaudet commented 1 year ago

Looking into this further, fixing this in PAINT may be easier. The issue with the "no symbol in col 3" check is that it is part of gorule-0000001, which encompasses MANY different things, so we need to decide which items are ERRORs and which ones are WARNINGs.

I'll discuss this with Anushya next time we talk.

dustine32 commented 1 year ago

Following the ontobio validate.py GAF parse/write code in the debugger (probably for the eighth time), I think I've extracted the basic flow of how the PAINT IBA GAF symbol column is becoming blank:

1. Upstream pombase.gaf is parsed and a pombase.gpi is derived from it.
2. Upstream pombase.gpi (from MOD, different from pombase.gpi generated in step 1) is parsed and merged into the list of bioentities derived from upstream GAF. This upstream GPI line for SPAC1039.06 having blank symbol '' clobbers/overwrites the symbol value (SPAC1039.06) extracted from the derived GPI.
3. This merged list of bioentities is queried for symbol/name data when parsing the PAINT pombase IBA GAF. The returned blank symbol/name data for SPAC1039.06 overwrites the field values in the PAINT GAF.
4. These "corrected" PAINT annotations are then appended to the validated upstream pombase.gaf (pombase_valid.gaf) to produce the final annotations/pombase.gaf product file.
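The clobbering in the merge step can be reproduced with a toy model. The sketch below is not ontobio code: the `Entity` class and `naive_merge` function are hypothetical stand-ins, assuming only that the merge lets the upstream GPI record replace the GAF-derived record for the same ID.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Toy stand-in for a bioentity record (hypothetical, not ontobio's class)."""
    id: str
    symbol: str


def naive_merge(gaf_derived: dict, upstream_gpi: dict) -> dict:
    """Merge two ID -> Entity maps, letting the upstream GPI always win."""
    merged = dict(gaf_derived)
    # The upstream record replaces the derived one even when its symbol is blank.
    merged.update(upstream_gpi)
    return merged


# Entity as derived from the upstream GAF: symbol is present.
gaf_derived = {"PomBase:SPAC1039.06": Entity("PomBase:SPAC1039.06", "SPAC1039.06")}
# Entity as read from the upstream GPI: symbol is blank.
upstream_gpi = {"PomBase:SPAC1039.06": Entity("PomBase:SPAC1039.06", "")}

merged = naive_merge(gaf_derived, upstream_gpi)
print(repr(merged["PomBase:SPAC1039.06"].symbol))  # prints ''
```

Any later lookup against `merged` returns the blank symbol, which is how it ends up written into the PAINT GAF rows.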

@pgaudet So, one thing here is that the upstream PomBase GPI should probably have some value in the symbol column, even if it's just repeating the ID column. The other is what ontobio should do when reading a GPI and either symbol or name columns are blank - I'm thinking we should just toss out these GPI lines to prevent carrying forward the bad entity data?

The GPI symbol-checking fix could go somewhere in this function: https://github.com/biolink/ontobio/blob/f86d367fa5c4c85ea6ce8743166ac072a6d66115/ontobio/io/entityparser.py#L285

kltm commented 1 year ago

@dustine32 I'd be sympathetic to the argument that your code is doing the right thing here: it's correctly transmitting the GPI information, which is generally regarded as the source of truth for gene product information. One could add a guardrail so that, when merging the GAF-derived-GPI info into the canonical GPI-derived info, any blank canonical fields fall back to the GAF-derived-GPI values, but the fact remains that the canonical information is "wrong-ish". The other approach would be to not merge and just use the GAF-derived-GPI.
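A rough sketch of the guardrail option, assuming entities are held as plain dictionaries of field values keyed by entity ID. The function name `merge_preferring_nonblank` and the data layout are hypothetical and do not reflect ontobio's internals.

```python
def merge_preferring_nonblank(gaf_derived: dict, canonical: dict) -> dict:
    """Merge canonical GPI fields over GAF-derived ones, skipping blank canonical values.

    Both arguments map entity ID -> {field name: value}. Hypothetical layout,
    used here only to illustrate the guardrail idea.
    """
    # Copy the GAF-derived records so the inputs are not mutated.
    merged = {entity_id: dict(fields) for entity_id, fields in gaf_derived.items()}

    for entity_id, fields in canonical.items():
        target = merged.setdefault(entity_id, {})
        for key, value in fields.items():
            # Canonical wins when it has a value; a blank only fills a missing key.
            if value or key not in target:
                target[key] = value
    return merged


if __name__ == "__main__":
    derived = {"PomBase:SPAC1039.06": {"symbol": "SPAC1039.06"}}
    canonical = {"PomBase:SPAC1039.06": {"symbol": ""}}
    print(merge_preferring_nonblank(derived, canonical))
```

The canonical GPI still takes precedence wherever it has a value, so the source-of-truth ordering is preserved and only blanks are patched.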

dustine32 commented 1 year ago

@kltm Right, thinking again about chucking out canonical GPI entities just cuz they're missing symbol, it'll probably have some unintended consequences after we implemented https://github.com/geneontology/go-site/issues/2066. I kind of don't want to touch this part now.

Instead, we could just repair any blank symbol to its CURIE "identity" (the "SPAC1039.06" part of "PomBase:SPAC1039.06") when writing a GAF out?
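As a sketch of that write-time repair, the hypothetical helper below returns the existing symbol when there is one and otherwise falls back to the local part of the CURIE. It is not taken from ontobio; splitting at the first colon is an assumption about how the "identity" would be extracted.

```python
def symbol_for_output(curie: str, symbol: str) -> str:
    """Return the symbol to write, falling back to the CURIE's local ID when blank.

    Hypothetical helper for illustration; not an ontobio function.
    """
    if symbol and symbol.strip():
        return symbol
    # Split "PomBase:SPAC1039.06" at the first colon into prefix and local ID.
    _prefix, _sep, local_id = curie.partition(":")
    # If there is no colon (or nothing after it), fall back to the whole string.
    return local_id or curie


if __name__ == "__main__":
    print(symbol_for_output("PomBase:SPAC1039.06", ""))
```

Because the repair happens only at output time, the merged entity data itself is left untouched.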

kltm commented 1 year ago

@dustine32 I think part of the issue is a slight dissonance in the data flow. There are supposed to be two formats: GAF and GPAD/GPI. Ideally, for any data source, we're dealing with one or the other and that is the canonical source for information about that source. Easy peasy. In this case, however, we're essentially trying to blend two things and neither is considered the canonical SoT, so it's a little weird. I kinda feel like the "correct" answer is to get the GPI fixed. In this case, I might advocate fixing the GPI ourselves for this release and then try to work with upstream to get these changes incorporated. Besides the obvious problems, is there much of a difference between the two?

kltm commented 1 year ago

After talking to @dustine32: he'll look at fixing the merge function in ontobio.

ValWood commented 11 months ago

Is there a problem with our GAF file?

pgaudet commented 11 months ago

@ValWood I think there is something wrong with your GPI file, see https://github.com/geneontology/go-releases/issues/50#issuecomment-1745482235

Because not all rows have a symbol in column 3, some complicated script generates another file. Maybe it would be cleaner if the GPI file did not have to be reconstructed?

ValWood commented 11 months ago

Yes, sure. We did not know about it, or we would have fixed it.

Tagging @kimrutherford

dustine32 commented 11 months ago

@ValWood @kimrutherford Noting that the current PomBase GPI has 8175 entities with a blank symbol field:

$ curl -L https://www.pombase.org/data/annotations/Gene_ontology/pombase.gpi.gz | gunzip | grep -v -e "^\!" | awk 'BEGIN {FS="\t"};$2==""' | wc -l
    8175

My current (mostly unfounded) suspicion is that QuickGO has a GPI QA step that removes these no-symbol entities from annotation with/from fields.

@pgaudet With https://github.com/biolink/ontobio/pull/648, the GO pipeline is now much more forgiving of no-symbol GPI lines and just fills them in with the object ID to prevent loss of annotations due to the "no symbol in col 3" GO rule check.

kimrutherford commented 11 months ago

> Noting that the current PomBase GPI has 8175 entities with a blank symbol field:

Thanks for spotting that. I've fixed our code so the PomBase GPI file should be OK in the morning (UK time).
