geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License
5 stars 5 forks source link

When converting emitting GAF (from Noctua model sources), some ECO codes do not seem to map up correctly #283

Closed kltm closed 2 years ago

kltm commented 2 years ago

Noticed when looking at Xenbase test run data, when converting emitting GAF (from Noctua model sources), some ECO codes do not seem to map up correctly. We'd like to track down the source of this issue (i.e. either GPAD -> GAF conversion issue or direct GAF emission) and fix it.

Given what this is, in either base it seems likely to be related to ontobio (so looping in @dustine32 as well).

Initially reported by @malcolmfisher103

Tagging @vanaukenk to provide some more detail.

kltm commented 2 years ago

Possible dupe with https://github.com/geneontology/pipeline/issues/240?

kltm commented 2 years ago

Depending on our tests, we may forward dupe this with https://github.com/geneontology/go-site/issues/1847#issuecomment-1116743052

pgaudet commented 2 years ago

Hi @kltm

Can you list the problematic ECOs?

Thanks, Pascale

kltm commented 2 years ago

We'll be able to get exhaustive listings (on the other ticket, https://github.com/geneontology/go-site/issues/1847) once we have the fixes from @dustine32 run.

dustine32 commented 2 years ago

@pgaudet @kltm Grepping noctua_mgi-report.md from the 2022-05-08 snapshot run, these are the ECO codes that currently do not directly map to a GAF evidence code (according to gaf-eco-mapping.txt): ECO:0000164 ECO:0000257 ECO:0000279 ECO:0000282 ECO:0001120 ECO:0001152 ECO:0001171 ECO:0001179 ECO:0001183 ECO:0001198 ECO:0001220 ECO:0001225 ECO:0001230 ECO:0001232 ECO:0001242 ECO:0001255 ECO:0005579 ECO:0005581 ECO:0005589 ECO:0005592 ECO:0005593 ECO:0005601 ECO:0005640 ECO:0005644 ECO:0005648 ECO:0005661 ECO:0005801 ECO:0005804 ECO:0005805 ECO:0006003 ECO:0006004 ECO:0006005 ECO:0006006 ECO:0006013 ECO:0006030 ECO:0006039 ECO:0006042 ECO:0006051 ECO:0006052 ECO:0006054 ECO:0006062 ECO:0006063 ECO:0006064 ECO:0006065 ECO:0006067 ECO:0006091 ECO:0007016 ECO:0007054 ECO:0007089 ECO:0007707

However, most of these do indirectly map "up" to a GAF evidence code via gaf-eco-mapping-derived.txt with the following exceptions: ECO:0000164 ECO:0001120 ECO:0005593 ECO:0006065

We could consider using this gaf-eco-mapping-derived.txt in ontobio to fetch GAF evidence codes for most of these.

kltm commented 2 years ago

Okay, so at least the first concrete step would be to get those four exceptions added to the mapping file by submitting an item to https://github.com/evidenceontology/evidenceontology/issues (or otherwise create guidance or a rule to purge these items).

The second step I think is to remember why we don't already use the derived file. We had no small amount of conversation on and around Nov 11th (DMing with @dustine32 and @sierra-moxon on Slack), but final notes are a bit lacking. Reading through that again, it seems that 1) the loss of the GO_REFs and 2) that it is maybe not automatically generated were considered problems. Do we not already have the ECO closure available in ontobio?

pgaudet commented 2 years ago

In the SynGO data, these evidence codes should be replaced so that they map up to IDA:

dustine32 commented 2 years ago

@kltm ontobio currently just accesses the gaf-eco-mapping.txt as a lookup file as the way to handle ECO. We could re-plumb this EcoMap class to handle eco.owl like a real ontology, which would include the GO_REFs (stored as termoboInOwl:hasDbXref) for the full closure set. Downside is it'll likely increase compute time for stuff in ontobio that uses ECO, but maybe not by that much.

kltm commented 2 years ago

@dustine32 Hm. I briefly touched bases with Chris and he also remembers that we had something a little off with the derived file, but doesn't remember the details either.

If replumbing EcoMap is "easy", that is a good way forward. That said, I'm a little worried that it might take a bit of time to get it right and we're hoping to get a release out soon. Another way forward would be to use the derived mapping as a fallback to the non-derived file (or as a replacement for this one use). IIRC, we switched away from using the derived file at some point (no GO_REFs possibly, which we needed?), but I haven't been able to find the documentation for why.

I'm not sure which way would be more expedient, but I think we're leaning towards expedience now and having a ticket opened to give a better or more thorough long-term solution. What are your thoughts on this?

dustine32 commented 2 years ago

(or as a replacement for this one use)

Yeah, I pretty much already coded this yesterday in order to get those four ECOs not in the derived file. Just added a derived flag like this:

ecomap.ecoclass_to_coderef(str(assoc.evidence.type), derived=True)

This change would be incorporated into the gpadparser.py changes made in https://github.com/biolink/ontobio/pull/621.

To repair the downstream GAF lines I can use this derived=True as a fallback in the GafWriter. So, any ECO->GAF mapping already in the non-derived file will be sourced the same (GO_REFs returned) and the derived file will only be used/checked if no mapping found.

I can open a ticket in ontobio for this.

dustine32 commented 2 years ago

@kltm Also, I'll make a separate ticket for the longer-term solution (possibly "ontologizing" ECO in EcoMap).

kltm commented 2 years ago

@dustine32 That sounds like a good plan--thank you!

kltm commented 2 years ago

This should test on the snapshot starting tonight.

kltm commented 2 years ago

Looking at products now; see https://github.com/geneontology/go-site/issues/1847#issuecomment-1126429739, this specific issue may be fixed.

kltm commented 2 years ago

@pgaudet We may have a fix in place for everything except

ECO:0000164
ECO:0001120
ECO:0005593
ECO:0006065

These four may be fixed with https://github.com/geneontology/syngo-go-cams/pull/1, but they are still being tested on noctua-dev (i.e. https://github.com/geneontology/syngo2lego_data_conversion/issues/5)

So my question is whether to

  1. go ahead and attempt a release where we are now, losing the annotations for these four codes from syngo
  2. wait until we can do an update of noctua-models (currently scheduled ~two weeks out) with these syngo updates and then attempt a release
pgaudet commented 2 years ago

@kltm Do you know how many annotations we'd loose by doing the release now, compared to the previous release? or were these already being excluded?

Thanks, Pascale

kltm commented 2 years ago

@pgaudet Honestly, it's probably easiest to just try a release run and check the numbers. I'll go ahead and get that started.

pgaudet commented 2 years ago

We have actually gained 278 SynGO annotations, so it looks like this is resolved?