geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License

MGI import pipeline fails during report generation #242

Closed kltm closed 3 years ago

kltm commented 3 years ago

The MGI import pipeline fails during the report generation. This is odd since this does not seem to occur for any of the other resources attempting imports right now (i.e. ZFIN, Xenbase, WB). Hoping that this is an MGI-specific data anomaly. Will do a little inspection to see what turns up.

The error is:

01:20:39  4c - SAVING GO-ANNOTATION-CHANGES...
01:20:39  
01:20:39  added references:      68
01:20:39  removed references:    206296
01:20:39  added pmids:           15
01:20:39  removed pmids:         131109
01:20:39  Traceback (most recent call last):
01:20:39    File "/tmp/go_reports.py", line 293, in <module>
01:20:39      main(sys.argv[1:])
01:20:39    File "/tmp/go_reports.py", line 214, in main
01:20:39      tsv_annot_changes = go_annotation_changes.create_text_report(json_annot_changes)
01:20:39    File "/tmp/go_annotation_changes.py", line 221, in create_text_report
01:20:39      text_report += "\nannotations by aspect " + key + ":\t" + str(val - json_changes["summary"]["previous"]["annotations"]["by_aspect"][key])
01:20:39  KeyError: ''
[Pipeline] }

Tagging @ukemi @vanaukenk

kltm commented 3 years ago
    for key, val in json_changes["summary"]["current"]["annotations"]["by_aspect"].items():
        text_report += "\nannotations by aspect " + key + ":\t" + str(val - json_changes["summary"]["previous"]["annotations"]["by_aspect"][key])
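
For illustration only, here is a minimal sketch of how that loop could tolerate an aspect key that exists in the current summary but not in the previous one (here, the empty string). The function name `aspect_deltas`, the default of 0, and the example counts are assumptions, not the actual code or fix in go_annotation_changes.py.

```python
def aspect_deltas(json_changes):
    """Return {aspect: current - previous}, treating a missing previous count as 0."""
    current = json_changes["summary"]["current"]["annotations"]["by_aspect"]
    previous = json_changes["summary"]["previous"]["annotations"]["by_aspect"]
    # .get(key, 0) avoids the KeyError when an aspect only appears in "current".
    return {key: val - previous.get(key, 0) for key, val in current.items()}


# Hypothetical counts; the empty-string key mimics annotations with no aspect.
example = {
    "summary": {
        "current": {"annotations": {"by_aspect": {"C": 10, "F": 7, "": 13}}},
        "previous": {"annotations": {"by_aspect": {"C": 8, "F": 7}}},
    }
}
print(aspect_deltas(example))
```

A guard like this would only hide the symptom in the report; annotations with an empty aspect would still need to be fixed upstream.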

Ah:

sjcarbon@moiraine:/tmp/foo$:) cat mgi.gaf | grep -v ^! | cut -f 9 | sort | uniq -c
     13 
 143979 C
 127893 F
 201351 P

That's 13 lines without aspect...
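
The same tally can be sketched in Python, assuming a tab-separated GAF where the aspect is the ninth column and header lines start with `!`. The function name `count_aspects` and the sample rows are made up for illustration.

```python
from collections import Counter


def count_aspects(lines):
    """Count values of GAF column 9 (aspect), skipping '!' headers and blank lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        # A short row is counted as having an empty aspect.
        aspect = cols[8] if len(cols) > 8 else ""
        counts[aspect] += 1
    return counts


# Hypothetical rows: one with aspect "P", one with the aspect column left empty.
sample = [
    "!gaf-version: 2.2\n",
    "\t".join(["MGI", "MGI:102674", "Umod", "acts_upstream_of_or_within",
               "GO:0032602", "PMID:28785050", "IMP", "", "P", "uromodulin"]) + "\n",
    "\t".join(["MGI", "MGI:102674", "Umod", "acts_upstream_of_or_within",
               "GO:0032602", "PMID:28785050", "IMP", "", "", "uromodulin"]) + "\n",
]
print(count_aspects(sample))
```

With a real file, something like `count_aspects(open("mgi.gaf"))` should give the same breakdown as the `cut | sort | uniq -c` pipeline.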

sjcarbon@moiraine:/tmp/foo$:) cat mgi.gaf | grep -v ^! | grep -v [[:space:]]C[[:space:]] | grep -v [[:space:]]P[[:space:]] | grep -v [[:space:]]F[[:space:]]
MGI MGI:1919103 Pdia6   enables GO:0015035  PMID:24508390   IMP protein disulfide isomerase associated 6    1700015E05Rik|CaBP5|P5|Txndc7   protein taxon:10090  20180727    WB  part_of(GO:1903895),directly_negatively_regulates(GO:0004521)   
MGI MGI:101893  Pou5f1  enables GO:0000976  PMID:25901318   IDA POU domain, class 5, transcription factor 1 Pou5f1|Oct-3|Oct-4|Otf-3|Otf3   gene    taxon:10090 20190517    MGI     
MGI MGI:105128  Rad23b  enables GO:0000976  PMID:25901318   IDA RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B    protein taxon:10090 20190517    MGI     
MGI MGI:98364   Sox2    enables GO:0000976  PMID:25901318   IDA Transcription factor SOX-2  Sox2|Sox-2  protein taxon:10090 20190517    MGI     
MGI MGI:102701  Lilrb4a acts_upstream_of_or_within  GO:0001816  PMID:15827966   IMP MGI:2653049     leukocyte immunoglobulin-like receptor, subfamily B, member 4A  CD85K|Gp49b|HM18|ILT3|Lilrb4    protein taxon:10090 20200121    MGI     
MGI MGI:102701  Lilrb4a acts_upstream_of_or_within  GO:0032602  PMID:15827966   IMP MGI:2653049     leukocyte immunoglobulin-like receptor, subfamily B, member 4A  CD85K|Gp49b|HM18|ILT3|Lilrb4    protein taxon:10090 20200121    MGI     
MGI MGI:3651956 Ttc39aos1   enables GO:0000976  PMID:27315481   IDA         Ttc39a opposite strand RNA 1    Gm12750 gene    taxon:10090 20190710    MGI     
MGI MGI:3718576 Mir883b acts_upstream_of_or_within  GO:0001816  PMID:23015294   IDA         microRNA 883b   Mirn883b|mmu-mir-883b   gene    taxon:10090 20190716    MGI     
MGI MGI:3718576 Mir883b acts_upstream_of_or_within  GO:0001816  PMID:23015294   IMP         microRNA 883b   Mirn883b|mmu-mir-883b   gene    taxon:10090 20190716    MGI     
MGI MGI:109484  Ywhaz   acts_upstream_of_or_within  GO:0000977  PMID:29118970   IMP         14-3-3 protein zeta/delta   Ywhaz   protein taxon:10090 20190904    MGI     
MGI MGI:1333879 Ap3b1   acts_upstream_of_positive_effect    GO:0032607  PMID:20847273   IMP         adaptor-related protein complex 3, beta 1 subunit   AP-3|beta3A|Hps2|recombination induced mutation 2|rim2  protein taxon:10090 20200213    MGI     
MGI MGI:107734  Ap3d1   acts_upstream_of_positive_effect    GO:0032607  PMID:20847273   IMP         adaptor-related protein complex 3, delta 1 subunit  Bolvr|mBLVR1    protein taxon:10090 20200213    MGI     
MGI MGI:102674  Umod    acts_upstream_of_or_within  GO:0032602  PMID:28785050   IMP         uromodulin  Umod    gene    taxon:10090 20200305    MGI     

Looking at that last line, as an example, it seems to come from both mgi_valid.gaf:

MGI MGI:102674  Umod    acts_upstream_of_or_within  GO:0032602  MGI:MGI:6109043|PMID:28785050   IMP     P   uromodulin  Tamm-Horsfall glycoprotein|Urehd1|urehr4|uromucoid  protein taxon:10090 20200305    MGI     

and the noctua_mgi.gpad:

MGI MGI:102674  acts_upstream_of_or_within  GO:0032602  PMID:28785050   ECO:0000315         20200305    MGI     contributor=http://orcid.org/0000-0003-3394-9805|model-state=production|noctua-model-id=gomodel:5df932e000003401

Since this does not occur with just mgi_valid.gaf (i.e. the regular release pipeline), it would seem to be an issue with noctua_mgi.gpad and/or the merge code (and possibly a bit MGI-specific?). Any thoughts on why this might cause the aspect to get knocked out on output? A bad internal model merge? There is a flavor of https://github.com/geneontology/pipeline/issues/240 and https://github.com/geneontology/pipeline/issues/239 here.

Any thoughts @dustine32 @sierra-moxon ?

dustine32 commented 3 years ago

@kltm Yeah, pretty weird considering that other Noctua GPAD-sourced annotations apparently have their aspects filled in and written out to GAF.

For now, I can at least reproduce this locally.

kltm commented 3 years ago

@dustine32 Do you have a single command (or command set) for reproducing? I had thought we had worked that out somewhere, but I cannot find it in any of these three tickets...hrm.

dustine32 commented 3 years ago

I'm running this locally from ontobio/master (commit https://github.com/biolink/ontobio/commit/2066457c5c800e44616440f47f2c499719c87574) pointing to local go-site/issue-pipeline-237-mgi-test-pipeline metadata.

validate.py -v produce mgi --gpad -m ../go-site/metadata/ --target target/ \
--ontology resources/go-lego.json -x goa_uniprot_all -x goa_uniprot_gcrp  -x goa_pdb --skip-existing-files \
--gaferencer-file ../go-site/pipeline/target/groups/mgi/mgi.gaferences.json \
--base-download-url http://skyhook.berkeleybop.org/issue-237-mgi-test-pipeline/

kltm commented 3 years ago

@dustine32 Through the magic of staring at a screen, I understand a little more about what's happening.

It looks like the merged noctua_mgi.gpad annotations are the ones that are losing the aspect, rather than the mgi_valid.gaf ones. Moreover, noctua_mgi_valid.gaf also has the missing aspects and the union of mgi_valid.gaf and noctua_mgi_valid.gaf seems to exactly equal mgi.gaf. For a subset of all the files I was taking a look at:

sjcarbon@moiraine:/tmp/foo$:) reset && cat noctua_mgi.gpad | grep enables | grep 25901318 | grep GO:000097
MGI MGI:101893  enables GO:0000976  PMID:25901318   ECO:0000314         20190517    MGI     contributor=http://orcid.org/0000-0002-9796-7693|model-state=production|noctua-model-id=gomodel:5c4605cc00004132
MGI MGI:105128  enables GO:0000978  PMID:25901318   ECO:0000314         20190517    MGI     contributor=http://orcid.org/0000-0002-9796-7693|model-state=production|noctua-model-id=gomodel:5c4605cc00004132
MGI MGI:105128  enables GO:0000976  PMID:25901318   ECO:0000314         20190517    MGI     contributor=http://orcid.org/0000-0002-9796-7693|model-state=production|noctua-model-id=gomodel:5c4605cc00004132
MGI MGI:98364   enables GO:0000976  PMID:25901318   ECO:0000314         20190517    MGI     contributor=http://orcid.org/0000-0002-9796-7693|model-state=production|noctua-model-id=gomodel:5c4605cc00004132
sjcarbon@moiraine:/tmp/foo$:) reset && cat noctua_mgi_valid.gaf | grep enables | grep 25901318 | grep GO:000097
MGI MGI:101893  Pou5f1  enables GO:0000976  PMID:25901318   IDA         POU domain, class 5, transcription factor 1 Pou5f1|Oct-3|Oct-4|Otf-3|Otf3   gene    taxon:10090 20190517    MGI     
MGI MGI:105128  Rad23b  enables GO:0000978  PMID:25901318   IDA     F   RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B    protein taxon:10090 20190517    MGI     
MGI MGI:105128  Rad23b  enables GO:0000976  PMID:25901318   IDA         RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B    protein taxon:10090 20190517    MGI     
MGI MGI:98364   Sox2    enables GO:0000976  PMID:25901318   IDA         Transcription factor SOX-2  Sox2|Sox-2  protein taxon:10090 20190517    MGI     
sjcarbon@moiraine:/tmp/foo$:) reset && cat mgi_valid.gaf | grep enables | grep 25901318 | grep GO:000097
MGI MGI:101893  Pou5f1  enables GO:0000976  MGI:MGI:5638941|PMID:25901318   IDA     F   POU domain, class 5, transcription factor 1 Oct-3|Oct-3/4|Oct3/4|Oct4|Oct-4|Otf3|Otf-3|Otf3g|Otf3-rs7|Otf4|Otf-4    protein taxon:10090 20190517    MGI     
MGI MGI:105128  Rad23b  enables GO:0000976  MGI:MGI:5638941|PMID:25901318   IDA     F   RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B    protein taxon:10090 20190517    MGI     
MGI MGI:105128  Rad23b  enables GO:0000978  MGI:MGI:5638941|PMID:25901318   IDA     F   RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B    protein taxon:10090 20190517    MGI     
MGI MGI:98364   Sox2    enables GO:0000976  MGI:MGI:5638941|PMID:25901318   IDA     F   SRY (sex determining region Y)-box 2    lcc|Sox-2|ysb   protein taxon:10090 20190517    MGI     
sjcarbon@moiraine:/tmp/foo$:) reset && cat mgi.gaf | grep enables | grep 25901318 | grep GO:000097
MGI MGI:101893  Pou5f1  enables GO:0000976  MGI:MGI:5638941|PMID:25901318   IDA     F   POU domain, class 5, transcription factor 1 Oct-3|Oct-3/4|Oct3/4|Oct4|Oct-4|Otf3|Otf-3|Otf3g|Otf3-rs7|Otf4|Otf-4    protein taxon:10090 20190517    MGI     
MGI MGI:105128  Rad23b  enables GO:0000976  MGI:MGI:5638941|PMID:25901318   IDA     F   RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B    protein taxon:10090 20190517    MGI     
MGI MGI:105128  Rad23b  enables GO:0000978  MGI:MGI:5638941|PMID:25901318   IDA     F   RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B    protein taxon:10090 20190517    MGI     
MGI MGI:98364   Sox2    enables GO:0000976  MGI:MGI:5638941|PMID:25901318   IDA     F   SRY (sex determining region Y)-box 2    lcc|Sox-2|ysb   protein taxon:10090 20190517    MGI     
MGI MGI:101893  Pou5f1  enables GO:0000976  PMID:25901318   IDA         POU domain, class 5, transcription factor 1 Pou5f1|Oct-3|Oct-4|Otf-3|Otf3   gene    taxon:10090 20190517    MGI     
MGI MGI:105128  Rad23b  enables GO:0000978  PMID:25901318   IDA     F   RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B    protein taxon:10090 20190517    MGI     
MGI MGI:105128  Rad23b  enables GO:0000976  PMID:25901318   IDA         RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B    protein taxon:10090 20190517    MGI     
MGI MGI:98364   Sox2    enables GO:0000976  PMID:25901318   IDA         Transcription factor SOX-2  Sox2|Sox-2  protein taxon:10090 20190517    MGI     

This means, I believe, that the issue is happening with the conversion, not necessarily with the merge.
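
A rough way to check the "union equals mgi.gaf" observation without eyeballing grep output is to compare the files as multisets of annotation lines. This is only a sketch: the helper names `annotation_lines` and `is_union` are hypothetical, and it assumes header lines start with `!` and that line order does not matter.

```python
from collections import Counter


def annotation_lines(lines):
    """Multiset of non-header, non-blank lines with trailing newlines removed."""
    return Counter(
        line.rstrip("\n")
        for line in lines
        if line.strip() and not line.startswith("!")
    )


def is_union(merged, *parts):
    """True if merged holds exactly the annotation lines of all parts combined."""
    total = Counter()
    for part in parts:
        total += annotation_lines(part)
    return annotation_lines(merged) == total


# Toy stand-ins for mgi.gaf, mgi_valid.gaf and noctua_mgi_valid.gaf.
print(is_union(["a\n", "b\n"], ["a\n"], ["!header\n", "b\n"]))
```

If this holds for the real files, the merge step is just concatenating its inputs, and the empty aspects must already be present in noctua_mgi_valid.gaf.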

kltm commented 3 years ago

@dustine32 Ugh, I was trying to get an example working using the least tooling possible, and I got this (in the pipeline docker image):

cd /tmp
git clone https://github.com/geneontology/go-site.git
cd go-site/pipeline/
pip3 install -r requirements.txt 
cd ..
wget http://skyhook.berkeleybop.org/issue-237-mgi-test-pipeline/products/annotations/noctua_mgi.gpad.gz
gunzip noctua_mgi.gpad.gz 
ontobio-parse-assocs.py -f noctua_mgi.gpad --format GPAD -o new_2_1.gaf --report-md report.md validate

(ignoring for the moment that the AFAICT proper ontobio-parse-assocs.py -f noctua_mgi.gpad --format gpad --to gaf --format-version 2.2 -o new_2_2.gaf validate does not work as advertised)

With that, I get a GAF 2.1 (grrr) output file that is bereft of:

I'm kinda wondering how this conversion subsystem is supposed to work at all w/o ontology information (and whether it's possible that this is all down to ontology issues)?

dustine32 commented 3 years ago

Oh right, ya gotta pass an ontology file into ontobio-parse-assocs.py with -r otherwise it'll skip the tests that need the ontology.

kltm commented 3 years ago

Okay, now with:

root@d83c3ebc0015:/tmp# ontobio-parse-assocs.py -f noctua_mgi.gpad --format gpad -o new_2_2.gaf -r go-plus.json validate

It does indeed take longer, but still no:

for any annotation line. Hm.

dustine32 commented 3 years ago

I'm like 99% certain this "no-aspect" problem is due to the GPAD export annotations pointing to obsoleted terms (e.g. GO:0044212), which then get swapped for their replacement terms (e.g. GO:0000976).

From watching the debugger, I've pieced together this journey of a Noctua annotation to GO:0044212 originally in noctua_mgi-src.gpad:

MGI     MGI:105128      enables GO:0044212      PMID:25901318   ECO:0000314                     20190517        MGI             contributor=http://orcid.org/0000-0002-9796-7693|model-state=production|noctua-model-id=gomodel:5c4605cc00004132

This goes through the ontobio GPAD parser and has its aspect filled as part of the test for GO rule 28. But then afterwards the term is caught by GO rule 20 as obsolete and repaired to the correct GO term GO:0000976. Somehow due to this repair, the aspect on the annotation is blanked out again. I'm currently trying to track down where exactly.

dustine32 commented 3 years ago

@kltm You can also try this cmd, which includes a mgi.gpi:

ontobio-parse-assocs.py --file noctua_mgi.gpad --format GPAD --gpi mgi.gpi -o new_2_2.gaf --report-md noctua_mgi.report.md -r resources/go.json -l "all" convert --to GAF --format-version 2.2

dustine32 commented 3 years ago

Oh actually the aspect is never filled by GO rule 28 for term GO:0044212 because the ontology doesn't have the hasOBONamespace property value required to extract aspect through GO rule 28. Lacking this property is prob common for obsolete terms?

Anyhow, I think we want to aim to repair the obsoleted term (via GO rule 20) before attempting to extract the aspect in GO rule 28. This way the correct term (GO:0000976) is in place for aspect extraction. We can experiment with the code change to see if it breaks anything.
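
To make the ordering problem concrete, here is a toy sketch. It is not ontobio code: the `ONTOLOGY` dict, the `repair_obsolete` and `fill_aspect` helpers, and the annotation shape are all assumptions chosen to mimic an obsolete term that lacks a namespace but has a replacement.

```python
# Toy ontology: the obsolete term has no namespace, only a replacement pointer.
ONTOLOGY = {
    "GO:0044212": {"obsolete": True, "replaced_by": "GO:0000976"},
    "GO:0000976": {"namespace": "molecular_function"},
}
ASPECT = {
    "molecular_function": "F",
    "biological_process": "P",
    "cellular_component": "C",
}


def repair_obsolete(annotation):
    """Stand-in for the obsolete-term repair: swap in the replacement term."""
    term = ONTOLOGY.get(annotation["term"], {})
    if term.get("obsolete") and "replaced_by" in term:
        return {**annotation, "term": term["replaced_by"]}
    return annotation


def fill_aspect(annotation):
    """Stand-in for aspect extraction: derive the aspect from the term's namespace."""
    namespace = ONTOLOGY.get(annotation["term"], {}).get("namespace")
    return {**annotation, "aspect": ASPECT.get(namespace, "")}


annotation = {"subject": "MGI:105128", "term": "GO:0044212", "aspect": ""}

# Aspect first, then repair: the lookup runs against the obsolete term and finds nothing.
aspect_first = repair_obsolete(fill_aspect(annotation))
# Repair first, then aspect: the lookup runs against the replacement term.
repair_first = fill_aspect(repair_obsolete(annotation))

print(repr(aspect_first["aspect"]), repr(repair_first["aspect"]))
```

In this toy setup only the repair-first order yields an aspect, which is the behavior the proposed reordering is aiming for.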

kltm commented 3 years ago

@dustine32 Ooo--that's some nifty work there. Cheers!

ukemi commented 3 years ago

Quick check on these: (all are FROM NOCTUA=a model made using the Noctua interface)

FROM NOCTUA- MGI    MGI:1919103 Pdia6   enables GO:0015035  PMID:24508390   IMP protein disulfide isomerase associated 6    1700015E05Rik|CaBP5|P5|Txndc7   protein taxon:10090  20180727    WB  part_of(GO:1903895),directly_negatively_regulates(GO:0004521)   
FROM NOCTUA- MGI    MGI:101893  Pou5f1  enables GO:0000976  PMID:25901318   IDA POU domain, class 5, transcription factor 1 Pou5f1|Oct-3|Oct-4|Otf-3|Otf3   gene    taxon:10090 20190517    MGI     
FROM NOCTUA- MGI    MGI:105128  Rad23b  enables GO:0000976  PMID:25901318   IDA RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B    protein taxon:10090 20190517    MGI     
FROM NOCTUA- MGI    MGI:98364   Sox2    enables GO:0000976  PMID:25901318   IDA Transcription factor SOX-2  Sox2|Sox-2  protein taxon:10090 20190517    MGI     
FROM NOCTUA- MGI    MGI:102701  Lilrb4a acts_upstream_of_or_within  GO:0001816  PMID:15827966   IMP MGI:2653049     leukocyte immunoglobulin-like receptor, subfamily B, member 4A  CD85K|Gp49b|HM18|ILT3|Lilrb4    protein taxon:10090 20200121    MGI     
FROM NOCTUA- MGI    MGI:102701  Lilrb4a acts_upstream_of_or_within  GO:0032602  PMID:15827966   IMP MGI:2653049     leukocyte immunoglobulin-like receptor, subfamily B, member 4A  CD85K|Gp49b|HM18|ILT3|Lilrb4    protein taxon:10090 20200121    MGI     
FROM NOCTUA- MGI    MGI:3651956 Ttc39aos1   enables GO:0000976  PMID:27315481   IDA         Ttc39a opposite strand RNA 1    Gm12750 gene    taxon:10090 20190710    MGI     
FROM NOCTUA- MGI    MGI:3718576 Mir883b acts_upstream_of_or_within  GO:0001816  PMID:23015294   IDA         microRNA 883b   Mirn883b|mmu-mir-883b   gene    taxon:10090 20190716    MGI     
FROM NOCTUA- MGI    MGI:3718576 Mir883b acts_upstream_of_or_within  GO:0001816  PMID:23015294   IMP         microRNA 883b   Mirn883b|mmu-mir-883b   gene    taxon:10090 20190716    MGI     
FROM NOCTUA- MGI    MGI:109484  Ywhaz   acts_upstream_of_or_within  GO:0000977  PMID:29118970   IMP         14-3-3 protein zeta/delta   Ywhaz   protein taxon:10090 20190904    MGI     
FROM NOCTUA- MGI    MGI:1333879 Ap3b1   acts_upstream_of_positive_effect    GO:0032607  PMID:20847273   IMP         adaptor-related protein complex 3, beta 1 subunit   AP-3|beta3A|Hps2|recombination induced mutation 2|rim2  protein taxon:10090 20200213    MGI     
FROM NOCTUA- MGI    MGI:107734  Ap3d1   acts_upstream_of_positive_effect    GO:0032607  PMID:20847273   IMP         adaptor-related protein complex 3, delta 1 subunit  Bolvr|mBLVR1    protein taxon:10090 20200213    MGI     
FROM NOCTUA- MGI    MGI:102674  Umod    acts_upstream_of_or_within  GO:0032602  PMID:28785050   IMP 

ukemi commented 3 years ago

I went ahead and fixed all these at source.

dustine32 commented 3 years ago

@ukemi Thanks!

@kltm I just made a new ontobio release 2.7.6 with the fix in it. We can edit the go-site reqs.txt for branch issue-pipeline-237-mgi-test-pipeline to try it out.

kltm commented 3 years ago

@dustine32 I updated go-site master with 2.7.6. Test branches for go-site can now update by merging master.

ukemi commented 3 years ago

@kltm and @dustine32 is this solved?

kltm commented 3 years ago

ontobio for go-site branch issue-pipeline-237-mgi-test-pipeline is now the "correct" 2.7.6. The MGI pipeline branch apparently had a successful run after that date (June 21st), so this is likely completed.