Closed kltm closed 3 years ago
for key, val in json_changes["summary"]["current"]["annotations"]["by_aspect"].items():
    text_report += "\nannotations by aspect " + key + ":\t" + str(val - json_changes["summary"]["previous"]["annotations"]["by_aspect"][key])
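The loop above indexes the "previous" counts with every key found in the "current" counts, so it would raise a KeyError if an aspect key exists only on one side (for example, an empty aspect). The traceback is not shown in this thread, so treating that as the failure mode is an assumption. A minimal sketch of a more tolerant delta, using a hypothetical helper `aspect_deltas` and made-up counts:

```python
def aspect_deltas(current, previous):
    """Return {aspect: current - previous}, counting a missing aspect as 0."""
    keys = sorted(set(current) | set(previous))
    return {k: current.get(k, 0) - previous.get(k, 0) for k in keys}

# Hypothetical counts; the "" key stands in for annotations with no aspect.
current = {"C": 10, "F": 7, "P": 12, "": 13}
previous = {"C": 9, "F": 7, "P": 15}

text_report = ""
for key, delta in aspect_deltas(current, previous).items():
    label = key if key else "(none)"  # make an empty aspect visible in the report
    text_report += "\nannotations by aspect " + label + ":\t" + str(delta)
print(text_report)
```

This only keeps the report from crashing; it does not explain why an aspect would be empty in the first place.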
Ah:
sjcarbon@moiraine:/tmp/foo$:) cat mgi.gaf | grep -v ^! | cut -f 9 | sort | uniq -c
13
143979 C
127893 F
201351 P
That's 13 lines without aspect...
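For anyone who wants the same tally without the shell pipeline, here is a rough Python equivalent of the `cut -f 9 | sort | uniq -c` check. It assumes tab-separated GAF lines with the aspect in column 9; `count_aspects` is a hypothetical helper and the sample rows are made up, not real MGI data:

```python
from collections import Counter

def count_aspects(gaf_lines):
    """Tally column 9 (aspect) of tab-separated GAF lines, skipping '!' headers."""
    counts = Counter()
    for line in gaf_lines:
        if line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        # A short line is counted as having no aspect at all.
        aspect = cols[8] if len(cols) > 8 else ""
        counts[aspect] += 1
    return counts

# Made-up sample: the second annotation has an empty aspect column.
sample = [
    "!gaf-version: 2.1\n",
    "\t".join(["MGI", "MGI:1", "A", "enables", "GO:1", "PMID:1", "IDA", "", "F"]) + "\n",
    "\t".join(["MGI", "MGI:2", "B", "enables", "GO:2", "PMID:2", "IDA", "", ""]) + "\n",
]
print(count_aspects(sample))
```

On a real file this would be called as `count_aspects(open("mgi.gaf"))`; the count under the empty-string key corresponds to the "13" above.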
sjcarbon@moiraine:/tmp/foo$:) cat mgi.gaf | grep -v ^! | grep -v [[:space:]]C[[:space:]] | grep -v [[:space:]]P[[:space:]] | grep -v [[:space:]]F[[:space:]]
MGI MGI:1919103 Pdia6 enables GO:0015035 PMID:24508390 IMP protein disulfide isomerase associated 6 1700015E05Rik|CaBP5|P5|Txndc7 proteintaxon:10090 20180727 WB part_of(GO:1903895),directly_negatively_regulates(GO:0004521)
MGI MGI:101893 Pou5f1 enables GO:0000976 PMID:25901318 IDA POU domain, class 5, transcription factor 1 Pou5f1|Oct-3|Oct-4|Otf-3|Otf3 gene taxon:10090 20190517 MGI
MGI MGI:105128 Rad23b enables GO:0000976 PMID:25901318 IDA RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B protein taxon:10090 20190517 MGI
MGI MGI:98364 Sox2 enables GO:0000976 PMID:25901318 IDA Transcription factor SOX-2 Sox2|Sox-2 protein taxon:10090 20190517MGI
MGI MGI:102701 Lilrb4a acts_upstream_of_or_within GO:0001816 PMID:15827966 IMP MGI:2653049 leukocyte immunoglobulin-like receptor, subfamily B, member 4A CD85K|Gp49b|HM18|ILT3|Lilrb4 protein taxon:10090 20200121 MGI
MGI MGI:102701 Lilrb4a acts_upstream_of_or_within GO:0032602 PMID:15827966 IMP MGI:2653049 leukocyte immunoglobulin-like receptor, subfamily B, member 4A CD85K|Gp49b|HM18|ILT3|Lilrb4 protein taxon:10090 20200121 MGI
MGI MGI:3651956 Ttc39aos1 enables GO:0000976 PMID:27315481 IDA Ttc39a opposite strand RNA 1 Gm12750 gene taxon:10090 20190710 MGI
MGI MGI:3718576 Mir883b acts_upstream_of_or_within GO:0001816 PMID:23015294 IDA microRNA 883b Mirn883b|mmu-mir-883b gene taxon:10090 20190716 MGI
MGI MGI:3718576 Mir883b acts_upstream_of_or_within GO:0001816 PMID:23015294 IMP microRNA 883b Mirn883b|mmu-mir-883b gene taxon:10090 20190716 MGI
MGI MGI:109484 Ywhaz acts_upstream_of_or_within GO:0000977 PMID:29118970 IMP 14-3-3 protein zeta/delta Ywhaz protein taxon:10090 20190904 MGI
MGI MGI:1333879 Ap3b1 acts_upstream_of_positive_effect GO:0032607 PMID:20847273 IMP adaptor-related protein complex 3, beta 1 subunit AP-3|beta3A|Hps2|recombination induced mutation 2|rim2 protein taxon:10090 20200213 MGI
MGI MGI:107734 Ap3d1 acts_upstream_of_positive_effect GO:0032607 PMID:20847273 IMP adaptor-related protein complex 3, delta 1 subunit Bolvr|mBLVR1 protein taxon:10090 20200213 MGI
MGI MGI:102674 Umod acts_upstream_of_or_within GO:0032602 PMID:28785050 IMP uromodulin Umod gene taxon:10090 20200305 MGI
Looking at that last line, as an example, it seems to come from both mgi_valid.gaf:
MGI MGI:102674 Umod acts_upstream_of_or_within GO:0032602 MGI:MGI:6109043|PMID:28785050 IMP P uromodulin Tamm-Horsfall glycoprotein|Urehd1|urehr4|uromucoid protein taxon:10090 20200305 MGI
and the noctua_mgi.gpad:
MGI MGI:102674 acts_upstream_of_or_within GO:0032602 PMID:28785050 ECO:0000315 20200305 MGI contributor=http://orcid.org/0000-0003-3394-9805|model-state=production|noctua-model-id=gomodel:5df932e000003401
Since this does not occur with just mgi_valid.gaf (i.e. the regular release pipeline), it would seem to be an issue with noctua_mgi.gpad and/or the merge code (and possibly a bit MGI-specific?). Any thoughts on why this might cause the aspect to get knocked out on output? A bad internal model merge? There is a flavor of https://github.com/geneontology/pipeline/issues/240 and https://github.com/geneontology/pipeline/issues/239 here.
Any thoughts @dustine32 @sierra-moxon ?
@kltm Yeah, pretty weird considering that other Noctua GPAD-sourced annotations apparently have their aspects filled in and written out to GAF.
For now, I can at least reproduce this locally.
@dustine32 Do you have a single command (or command set) for reproducing? I had thought we had worked that out somewhere, but I cannot find it in any of these three tickets...hrm.
I'm running this locally from ontobio/master (commit https://github.com/biolink/ontobio/commit/2066457c5c800e44616440f47f2c499719c87574) pointing to local go-site/issue-pipeline-237-mgi-test-pipeline metadata.
validate.py -v produce mgi --gpad -m ../go-site/metadata/ --target target/ \
--ontology resources/go-lego.json -x goa_uniprot_all -x goa_uniprot_gcrp -x goa_pdb --skip-existing-files \
--gaferencer-file ../go-site/pipeline/target/groups/mgi/mgi.gaferences.json \
--base-download-url http://skyhook.berkeleybop.org/issue-237-mgi-test-pipeline/
@dustine32 Through the magic of staring at a screen, I understand a little more about what's happening.
It looks like the merged noctua_mgi.gpad annotations are the ones that are losing the aspect, rather than the mgi_valid.gaf ones. Moreover, noctua_mgi_valid.gaf also has the missing aspects and the union of mgi_valid.gaf and noctua_mgi_valid.gaf seems to exactly equal mgi.gaf. For a subset of all the files I was taking a look at:
sjcarbon@moiraine:/tmp/foo$:) reset && cat noctua_mgi.gpad | grep enables | grep 25901318 | grep GO:000097
MGI MGI:101893 enables GO:0000976 PMID:25901318 ECO:0000314 20190517 MGI contributor=http://orcid.org/0000-0002-9796-7693|model-state=production|noctua-model-id=gomodel:5c4605cc00004132
MGI MGI:105128 enables GO:0000978 PMID:25901318 ECO:0000314 20190517 MGI contributor=http://orcid.org/0000-0002-9796-7693|model-state=production|noctua-model-id=gomodel:5c4605cc00004132
MGI MGI:105128 enables GO:0000976 PMID:25901318 ECO:0000314 20190517 MGI contributor=http://orcid.org/0000-0002-9796-7693|model-state=production|noctua-model-id=gomodel:5c4605cc00004132
MGI MGI:98364 enables GO:0000976 PMID:25901318 ECO:0000314 20190517 MGI contributor=http://orcid.org/0000-0002-9796-7693|model-state=production|noctua-model-id=gomodel:5c4605cc00004132
sjcarbon@moiraine:/tmp/foo$:) reset && cat noctua_mgi_valid.gaf | grep enables | grep 25901318 | grep GO:000097
MGI MGI:101893 Pou5f1 enables GO:0000976 PMID:25901318 IDA POU domain, class 5, transcription factor 1 Pou5f1|Oct-3|Oct-4|Otf-3|Otf3 gene taxon:10090 20190517 MGI
MGI MGI:105128 Rad23b enables GO:0000978 PMID:25901318 IDA F RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B protein taxon:10090 20190517 MGI
MGI MGI:105128 Rad23b enables GO:0000976 PMID:25901318 IDA RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B protein taxon:10090 20190517 MGI
MGI MGI:98364 Sox2 enables GO:0000976 PMID:25901318 IDA Transcription factor SOX-2 Sox2|Sox-2 protein taxon:10090 20190517 MGI
sjcarbon@moiraine:/tmp/foo$:) reset && cat mgi_valid.gaf | grep enables | grep 25901318 | grep GO:000097
MGI MGI:101893 Pou5f1 enables GO:0000976 MGI:MGI:5638941|PMID:25901318 IDA F POU domain, class 5, transcription factor 1 Oct-3|Oct-3/4|Oct3/4|Oct4|Oct-4|Otf3|Otf-3|Otf3g|Otf3-rs7|Otf4|Otf-4 protein taxon:10090 20190517 MGI
MGI MGI:105128 Rad23b enables GO:0000976 MGI:MGI:5638941|PMID:25901318 IDA F RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B protein taxon:10090 20190517 MGI
MGI MGI:105128 Rad23b enables GO:0000978 MGI:MGI:5638941|PMID:25901318 IDA F RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B protein taxon:10090 20190517 MGI
MGI MGI:98364 Sox2 enables GO:0000976 MGI:MGI:5638941|PMID:25901318 IDA F SRY (sex determining region Y)-box 2 lcc|Sox-2|ysb protein taxon:10090 20190517 MGI
sjcarbon@moiraine:/tmp/foo$:) reset && cat mgi.gaf | grep enables | grep 25901318 | grep GO:000097
MGI MGI:101893 Pou5f1 enables GO:0000976 MGI:MGI:5638941|PMID:25901318 IDA F POU domain, class 5, transcription factor 1 Oct-3|Oct-3/4|Oct3/4|Oct4|Oct-4|Otf3|Otf-3|Otf3g|Otf3-rs7|Otf4|Otf-4 protein taxon:10090 20190517 MGI
MGI MGI:105128 Rad23b enables GO:0000976 MGI:MGI:5638941|PMID:25901318 IDA F RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B protein taxon:10090 20190517 MGI
MGI MGI:105128 Rad23b enables GO:0000978 MGI:MGI:5638941|PMID:25901318 IDA F RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B protein taxon:10090 20190517 MGI
MGI MGI:98364 Sox2 enables GO:0000976 MGI:MGI:5638941|PMID:25901318 IDA F SRY (sex determining region Y)-box 2 lcc|Sox-2|ysb protein taxon:10090 20190517 MGI
MGI MGI:101893 Pou5f1 enables GO:0000976 PMID:25901318 IDA POU domain, class 5, transcription factor 1 Pou5f1|Oct-3|Oct-4|Otf-3|Otf3 gene taxon:10090 20190517 MGI
MGI MGI:105128 Rad23b enables GO:0000978 PMID:25901318 IDA F RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B protein taxon:10090 20190517 MGI
MGI MGI:105128 Rad23b enables GO:0000976 PMID:25901318 IDA RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B protein taxon:10090 20190517 MGI
MGI MGI:98364 Sox2 enables GO:0000976 PMID:25901318 IDA Transcription factor SOX-2 Sox2|Sox-2 protein taxon:10090 20190517 MGI
This means, I believe, that the issue is happening with the conversion, not necessarily with the merge.
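The "union equals merged file" observation can be checked mechanically. A small sketch, assuming a plain line-level comparison is good enough (it ignores ordering and duplicates); both function names are hypothetical:

```python
def annotation_lines(lines):
    """Return the set of non-blank, non-header lines with newlines stripped."""
    return {ln.rstrip("\n") for ln in lines if ln.strip() and not ln.startswith("!")}

def merge_is_plain_union(merged, *sources):
    """True if the merged file holds exactly the lines of its sources, unchanged."""
    union = set()
    for src in sources:
        union |= annotation_lines(src)
    return annotation_lines(merged) == union

# Made-up stand-ins for the three files.
valid = ["!header\n", "a\tF\n", "b\tP\n"]
noctua_valid = ["c\t\n"]  # already missing its aspect before the merge
merged = ["!header\n", "a\tF\n", "b\tP\n", "c\t\n"]
print(merge_is_plain_union(merged, valid, noctua_valid))
```

With real files the call would look like `merge_is_plain_union(open("mgi.gaf"), open("mgi_valid.gaf"), open("noctua_mgi_valid.gaf"))`. A True result would support the reading that the merge copies lines through as-is and the aspect is lost earlier, at conversion.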
@dustine32 Ugh, I was trying to get an example working using the least tooling possible, and I got this (in the pipeline docker image):
cd /tmp
git clone https://github.com/geneontology/go-site.git
cd go-site/pipeline/
pip3 install -r requirements.txt
cd ..
wget http://skyhook.berkeleybop.org/issue-237-mgi-test-pipeline/products/annotations/noctua_mgi.gpad.gz
gunzip noctua_mgi.gpad.gz
ontobio-parse-assocs.py -f noctua_mgi.gpad --format GPAD -o new_2_1.gaf --report-md report.md validate
(ignoring for the moment that what is, AFAICT, the proper invocation, ontobio-parse-assocs.py -f noctua_mgi.gpad --format gpad --to gaf --format-version 2.2 -o new_2_2.gaf validate, does not work as advertised)
With that, I get a GAF 2.1 (grrr) output file that is bereft of:
I'm kinda wondering how this conversion subsystem is supposed to work at all w/o ontology information (and whether it's possible that this is all down to ontology issues)?
Oh right, ya gotta pass an ontology file into ontobio-parse-assocs.py with -r, otherwise it'll skip the tests that need the ontology.
Okay, now with:
root@d83c3ebc0015:/tmp# ontobio-parse-assocs.py -f noctua_mgi.gpad --format gpad -o new_2_2.gaf -r go-plus.json validate
It does indeed take longer, but still no:
for any annotation line. Hm.
I'm like 99% certain this "no-aspect" problem is due to the GPAD export annotations being to obsoleted terms (e.g. GO:0044212) and then getting replaced by the replacement term (e.g. GO:0000976).
From watching the debugger, I've pieced together this journey of a Noctua annotation to GO:0044212 originally in noctua_mgi-src.gpad:
MGI MGI:105128 enables GO:0044212 PMID:25901318 ECO:0000314 20190517 MGI contributor=http://orcid.org/0000-0002-9796-7693|model-state=production|noctua-model-id=gomodel:5c4605cc00004132
This goes through the ontobio GPAD parser and has its aspect filled as part of the test for GO rule 28. But then afterwards the term is caught by GO rule 20 as obsolete and repaired to the correct GO term GO:0000976. Somehow due to this repair, the aspect on the annotation is blanked out again. I'm currently trying to track down where exactly.
@kltm You can also try this cmd, which includes a mgi.gpi:
ontobio-parse-assocs.py --file noctua_mgi.gpad --format GPAD --gpi mgi.gpi -o new_2_2.gaf --report-md noctua_mgi.report.md -r resources/go.json -l "all" convert --to GAF --format-version 2.2
Oh actually the aspect is never filled by GO rule 28 for term GO:0044212 because the ontology doesn't have the hasOBONamespace property value required to extract aspect through GO rule 28. Lacking this property is prob common for obsolete terms?
Anyhow, I think we want to aim to repair the obsoleted term (via GO rule 20) before attempting to extract the aspect in GO rule 28. This way the correct term (GO:0000976) is in place for aspect extraction. We can experiment with the code change to see if it breaks anything.
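To make the ordering argument concrete, here is a toy sketch. It is not ontobio's code or API: `TERMS`, `repair_obsolete`, and `aspect_of` are hypothetical stand-ins for the ontology lookup, GO rule 20, and GO rule 28 respectively, and the obsolete term is modeled as having no namespace, as described above.

```python
# Toy stand-in for the ontology; NOT ontobio's data model.
TERMS = {
    "GO:0044212": {"obsolete": True, "replaced_by": "GO:0000976", "namespace": None},
    "GO:0000976": {"obsolete": False, "replaced_by": None, "namespace": "molecular_function"},
}
ASPECT = {"molecular_function": "F", "biological_process": "P", "cellular_component": "C"}

def repair_obsolete(term_id):
    """Follow replaced_by when the term is obsolete (the role of GO rule 20)."""
    term = TERMS[term_id]
    if term["obsolete"] and term["replaced_by"]:
        return term["replaced_by"]
    return term_id

def aspect_of(term_id):
    """Map the term's namespace to an aspect letter (the role of GO rule 28)."""
    return ASPECT.get(TERMS[term_id]["namespace"], "")

def aspect_then_repair(term_id):
    """Problem order: aspect is looked up on the obsolete term."""
    aspect = aspect_of(term_id)
    return repair_obsolete(term_id), aspect

def repair_then_aspect(term_id):
    """Proposed order: repair first, then look up the aspect."""
    repaired = repair_obsolete(term_id)
    return repaired, aspect_of(repaired)

print(aspect_then_repair("GO:0044212"))
print(repair_then_aspect("GO:0044212"))
```

In this toy model both orders end up with GO:0000976, but only the repair-first order carries an aspect, which matches the symptom of a replaced term written out with an empty aspect column.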
@dustine32 Ooo--that's some nifty work there. Cheers!
Quick check on these: (all are FROM NOCTUA=a model made using the Noctua interface)
FROM NOCTUA- MGI MGI:1919103 Pdia6 enables GO:0015035 PMID:24508390 IMP protein disulfide isomerase associated 6 1700015E05Rik|CaBP5|P5|Txndc7 proteintaxon:10090 20180727 WB part_of(GO:1903895),directly_negatively_regulates(GO:0004521)
FROM NOCTUA- MGI MGI:101893 Pou5f1 enables GO:0000976 PMID:25901318 IDA POU domain, class 5, transcription factor 1 Pou5f1|Oct-3|Oct-4|Otf-3|Otf3 gene taxon:10090 20190517 MGI
FROM NOCTUA- MGI MGI:105128 Rad23b enables GO:0000976 PMID:25901318 IDA RAD23 homolog B, nucleotide excision repair protein 0610007D13Rik|mHR23B protein taxon:10090 20190517 MGI
FROM NOCTUA- MGI MGI:98364 Sox2 enables GO:0000976 PMID:25901318 IDA Transcription factor SOX-2 Sox2|Sox-2 protein taxon:10090 20190517MGI
FROM NOCTUA- MGI MGI:102701 Lilrb4a acts_upstream_of_or_within GO:0001816 PMID:15827966 IMP MGI:2653049 leukocyte immunoglobulin-like receptor, subfamily B, member 4A CD85K|Gp49b|HM18|ILT3|Lilrb4 protein taxon:10090 20200121 MGI
FROM NOCTUA- MGI MGI:102701 Lilrb4a acts_upstream_of_or_within GO:0032602 PMID:15827966 IMP MGI:2653049 leukocyte immunoglobulin-like receptor, subfamily B, member 4A CD85K|Gp49b|HM18|ILT3|Lilrb4 protein taxon:10090 20200121 MGI
FROM NOCTUA- MGI MGI:3651956 Ttc39aos1 enables GO:0000976 PMID:27315481 IDA Ttc39a opposite strand RNA 1 Gm12750 gene taxon:10090 20190710 MGI
FROM NOCTUA- MGI MGI:3718576 Mir883b acts_upstream_of_or_within GO:0001816 PMID:23015294 IDA microRNA 883b Mirn883b|mmu-mir-883b gene taxon:10090 20190716 MGI
FROM NOCTUA- MGI MGI:3718576 Mir883b acts_upstream_of_or_within GO:0001816 PMID:23015294 IMP microRNA 883b Mirn883b|mmu-mir-883b gene taxon:10090 20190716 MGI
FROM NOCTUA- MGI MGI:109484 Ywhaz acts_upstream_of_or_within GO:0000977 PMID:29118970 IMP 14-3-3 protein zeta/delta Ywhaz protein taxon:10090 20190904 MGI
FROM NOCTUA- MGI MGI:1333879 Ap3b1 acts_upstream_of_positive_effect GO:0032607 PMID:20847273 IMP adaptor-related protein complex 3, beta 1 subunit AP-3|beta3A|Hps2|recombination induced mutation 2|rim2 protein taxon:10090 20200213 MGI
FROM NOCTUA- MGI MGI:107734 Ap3d1 acts_upstream_of_positive_effect GO:0032607 PMID:20847273 IMP adaptor-related protein complex 3, delta 1 subunit Bolvr|mBLVR1 protein taxon:10090 20200213 MGI
FROM NOCTUA- MGI MGI:102674 Umod acts_upstream_of_or_within GO:0032602 PMID:28785050 IMP
I went ahead and fixed all these at source.
@ukemi Thanks!
@kltm I just made a new ontobio release 2.7.6 with the fix in it. We can edit the go-site reqs.txt for branch issue-pipeline-237-mgi-test-pipeline to try it out.
@dustine32 I updated go-site master with 2.7.6. Test branches for go-site can now update by merging master.
@kltm and @dustine32 is this solved?
ontobio for go-site branch issue-pipeline-237-mgi-test-pipeline is now the "correct" 2.7.6. The MGI pipeline branch apparently had a successful run after that date (June 21st), so this is likely completed.
The MGI import pipeline fails during the report generation. This is odd since this does not seem to occur for any of the other resources attempting imports right now (i.e. ZFIN, Xenbase, WB). Hoping that this is an MGI-specific data anomaly. Will do a little inspection to see what turns up.
The error is:
Tagging @ukemi @vanaukenk