geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License

Pipeline data failure on goa_human_complex file size reduction (GORULE:0000016 violations) #273

Closed kltm closed 2 years ago

kltm commented 2 years ago

Pipeline failure with

18:48:45  ERROR:sanity:Severe reduction of product for: goa_human_complex

Looks like an over 50% reduction in goa_human_complex:

 sjcarbon@moiraine:/tmp$:) wc -l goa_human_complex*
   5561 goa_human_complex-src.gaf
   2413 goa_human_complex_valid.gaf
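
As a rough illustration of the check that tripped here, the sketch below computes the fractional reduction from the two line counts above. The function name and the 50% limit are assumptions made for illustration; they are not the pipeline's actual code or its configured limit. Note that `wc -l` also counts any header lines, so the figure is approximate.

```python
def reduction_fraction(src_lines: int, valid_lines: int) -> float:
    """Fraction of source lines lost between the -src and _valid files."""
    if src_lines <= 0:
        raise ValueError("src_lines must be positive")
    return (src_lines - valid_lines) / src_lines


# Line counts taken from the wc -l output above.
SRC_LINES = 5561
VALID_LINES = 2413

# Hypothetical failure limit; the real pipeline's limit may differ.
ASSUMED_LIMIT = 0.50

lost = reduction_fraction(SRC_LINES, VALID_LINES)
print(f"goa_human_complex lost {lost:.1%} of its lines")
if lost > ASSUMED_LIMIT:
    print("would be flagged as a severe reduction under the assumed limit")
```

Under these assumptions, option 1 in the discussion below ("raise the failure limit") would amount to raising `ASSUMED_LIMIT` for this one file.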

Looking at the report (currently http://skyhook.berkeleybop.org/snapshot/reports/goa_human_complex-report.html#gorule-0000016, but it will be reset when snapshot tries again tonight), there were 3151 violations of GO Rule 16 (IC with/from requirements). E.g.

ERROR - Violates GO Rule: GORULE:0000016: All IC annotations should include a GO ID in the "With/From" column -- ComplexPortal CPX-1001 calcineurin-calmodulin-gamma1_human enables GO:0033192 GO_REF:0000114 IC ComplexPortal:CPX-1009 F Calcineurin-Calmodulin complex, gamma-R1 variant CALM:PPP3CC:PPP3R1|PP2BC-CANB1-CALM complex|Protein phosphatase 3 complex|Protein phosphatase 2B complex|PP2B complex|CnA-CnB-Calm complex|CALM1:PPP3CC:PPP3R1|CALM2:PPP3CC:PPP3R1|CALM3:PPP3CC:PPP3R1 protein_complex taxon:9606 20200218 ComplexPortal
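
For reference, a minimal sketch of what a check like this looks like, assuming tab-separated GAF 2.x input where column 7 is the evidence code and column 8 is With/From. The function name is hypothetical and this is not the actual rule implementation; it only mirrors the rule text quoted above. The sample line is an abbreviated, tab-joined reconstruction of the failing line, which carries a ComplexPortal ID rather than a GO ID in With/From.

```python
import re

# GAF 2.x columns are 1-indexed in the spec; these are 0-indexed list offsets.
EVIDENCE_COL = 6   # column 7: evidence code
WITH_FROM_COL = 7  # column 8: With/From

GO_ID = re.compile(r"^GO:\d{7}$")


def violates_rule_16(gaf_line: str) -> bool:
    """True if an IC annotation has no GO ID in its With/From column."""
    fields = gaf_line.rstrip("\n").split("\t")
    if len(fields) <= WITH_FROM_COL:
        raise ValueError("line has too few columns to be a GAF annotation")
    if fields[EVIDENCE_COL] != "IC":
        return False  # the rule only applies to IC annotations
    # With/From entries may be separated by '|' or ','.
    entries = [e.strip() for e in re.split(r"[|,]", fields[WITH_FROM_COL])]
    return not any(GO_ID.match(e) for e in entries)


# Abbreviated version of the failing line above (first nine columns only).
failing = "\t".join([
    "ComplexPortal", "CPX-1001", "calcineurin-calmodulin-gamma1_human",
    "enables", "GO:0033192", "GO_REF:0000114", "IC",
    "ComplexPortal:CPX-1009", "F",
])
print(violates_rule_16(failing))  # True
```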

kltm commented 2 years ago

Tagging @pgaudet and @vanaukenk

kltm commented 2 years ago

The choices are:

  1. this is normal, raise the failure limit for this file
  2. this is not normal, get upstream to fix or filter
  3. the rule needs to be updated so that this no longer fails

I'm assuming that 1 or 2 here is what we're currently looking at.

kltm commented 2 years ago

Of course, there is "4.", using a previous release.

pgaudet commented 2 years ago

@kltm This is very useful. Can you share the error report? Looks like I don't have access to http://skyhook.berkeleybop.org/snapshot

Maybe the ICs are referring to obsolete terms?

vanaukenk commented 2 years ago

Hi - I think this might be an evidence code issue: more granular ECO codes like ECO:0005547 map up to IC, but those more granular codes don't require a value in the With/From field.

hdrabkin commented 2 years ago

We got an error report of 1506 annotations loaded with NO IC. Sample line:

MGI:104579 Il12rb1 GO:0042022 None Not in the database

(translate: MGI id, gene id, GO term used for annotation, IC_id, reason)

The 'None' refers to no value. If the term were obsolete, the GO id for the obsolete term would be in the 'None' field and 'Not in the database' would be replaced by 'obsolete'. All of the Complex Portal annotations we loaded have a blank in our interface.

pgaudet commented 2 years ago

@vanaukenk I think you are right, ECO:0005547 must be causing the problem.

I didn't realize these were new annotations, Birgit did that before leaving. I think we need to sort this out with ECO. Meanwhile we should load the previous file.

Pascale

hdrabkin commented 2 years ago

For what it's worth, the annotations span a broad range of GO ids. Here are all of them: GO_InvalidInferredFrom.txt.

vanaukenk commented 2 years ago

I think we thought we had sorted this out with ECO:

https://github.com/evidenceontology/evidenceontology/issues/262

but didn't allow for the consequences of having IC annotations with nothing in the With/From field.

We should discuss what we really want at GO before asking ECO to make more changes.

kltm commented 2 years ago

@pgaudet I did not preserve the report; I'll try and catch it next time around (if we get there). You should have access to skyhook and the reports, but keep in mind that they may not always exist as it gets reset every time there is an attempted run. Assuming that we didn't reset, the report would be available later (my) today.

That said, it looks like the current way forward is to use a previous version ("4", decided https://github.com/geneontology/pipeline/issues/273#issuecomment-1040392170).

kltm commented 2 years ago

The metadata has been updated to the last good upstream source we had for goa_human_complex for a release (see PR above). I'll let it run naturally tonight and try and capture the report for today's soon-to-fail run for reference (if it fits into a gist).

kltm commented 2 years ago

@pgaudet Gist of the goa_human_complex report on snapshot on 2022-02-15 https://gist.github.com/kltm/4df75ce4832e0653219ba7c858582fe0

pgaudet commented 2 years ago

Thanks @kltm

@vanaukenk diagnosed this correctly - the evidence used by ComplexPortal is considered an IC but is missing 'with'.

There are only about 10 other annotations that fail this rule; I suggest relaxing that rule to a WARNING for now until we figure out what to do about the evidence code. I don't think it's worth stopping the release for this, or excluding all that data.

Thanks, Pascale

pgaudet commented 2 years ago

This was discussed yesterday on the GO managers' call. @suzialeksander and SGD looked at these errors, and in fact the annotations are experiments done in exogenous systems (more like ISS), but the original source species (what would be in the "with") is not captured. This is against GO rules.

@suzialeksander @vanaukenk Maybe it's OK that these are filtered out?

Thanks, Pascale

vanaukenk commented 2 years ago

@pgaudet @suzialeksander

I would actually prefer we relax the rule to a WARNING and then investigate further.

There is a comment associated with the parent term of the ECO codes used by ComplexPortal that says: "The components in the experimental evidence can come from the same species or a mix of species."

If that is the case for any of the S. cerevisiae complex annotations, then technically they are okay wrt ECO codes.

suzialeksander commented 2 years ago

I'm ok with a warning for now; SGD is discussing if we'll make a hard filter internally or anything today.
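
The practical difference between the two options discussed above (letting the violating lines be filtered versus relaxing the rule to a WARNING) can be sketched as follows. This is a toy model under stated assumptions: the `Severity` enum, `apply_rule`, and the sample data are all hypothetical and do not reflect how the rule engine is actually configured.

```python
from enum import Enum


class Severity(Enum):
    ERROR = "error"      # violating lines are reported and dropped
    WARNING = "warning"  # violating lines are reported but kept


def apply_rule(lines, violates, severity):
    """Split lines into (kept, reported) for one rule at a given severity."""
    kept, reported = [], []
    for line in lines:
        if violates(line):
            reported.append(line)
            if severity is Severity.ERROR:
                continue  # filtered out of the valid product
        kept.append(line)
    return kept, reported


def is_bad(line):
    """Stand-in for a real rule check such as GORULE:0000016."""
    return line.startswith("bad")


# Toy data: hypothetical stand-ins for annotation lines, not real GAF rows.
annotations = ["ok-1", "bad-1", "ok-2", "bad-2", "bad-3"]

kept_err, rep_err = apply_rule(annotations, is_bad, Severity.ERROR)
kept_warn, rep_warn = apply_rule(annotations, is_bad, Severity.WARNING)
print(len(kept_err), len(kept_warn))  # 2 5
```

In both cases the same violations are reported; only under ERROR does the valid product shrink, which is what triggers the size sanity check.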

pgaudet commented 2 years ago

Can we close this ticket, since the work was done in https://github.com/geneontology/go-site/issues/1794?

kltm commented 2 years ago

Work now concentrated on https://github.com/geneontology/go-site/issues/1794; closing