Closed sierra-moxon closed 7 months ago
locally with ontobio/bin/validate.py:
SMoxon@SMoxon-M82 mgi % grep -v '^!' mgi.gaf | cut -f15 | sort | uniq -c
70068 GO_Central
172971 MGI
10310 SynGO
427 UniProt
5 WB
SMoxon@SMoxon-M82 mgi % grep -v '^!' mgi.gpad | cut -f10 | sort | uniq -c
70068 GO_Central
172990 MGI
10325 SynGO
427 UniProt
5 WB
SMoxon@SMoxon-M82 mgi % grep -v '^!' mgi.gpad | wc -l
253815
SMoxon@SMoxon-M82 mgi % grep -v '^!' mgi.gaf | wc -l
253781
unique to GPAD:
MGI:MGI:1924903 MGI:MGI:2141225 MGI:MGI:3611235 MGI:MGI:3711264 PR:000037777 PR:000037778 PR:P35545 PR:Q6DR99-1
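A minimal sketch of how a "unique to GPAD" list like the one above could be reproduced. It assumes both files are tab-separated with the database in column 1 and the object ID in column 2 (as in the GAF and GPAD 1.x rows shown in this thread), and that header lines start with `!`. The function names are hypothetical and not part of ontobio.

```python
from typing import Iterable, List, Set


def subject_ids(lines: Iterable[str]) -> Set[str]:
    """Collect 'DB:ID' subjects from GAF or GPAD 1.x lines, skipping '!' headers."""
    ids: Set[str] = set()
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue  # malformed row, ignore in this sketch
        ids.add(f"{cols[0]}:{cols[1]}")
    return ids


def unique_to_gpad(gaf_lines: Iterable[str], gpad_lines: Iterable[str]) -> List[str]:
    """Return the sorted subjects that occur in the GPAD but not in the GAF."""
    return sorted(subject_ids(gpad_lines) - subject_ids(gaf_lines))
```

Open file handles can be passed directly, for example `unique_to_gpad(open("mgi.gaf"), open("mgi.gpad"))`.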
less noctua_mgi_report.json:
{
"level": "ERROR",
"line": "PR\t000037777\tacts_upstream_of_or_within\tGO:0061642\tPMID:18466743\tECO:0000315\t\t\t20141016\tMGI\toccurs_in(CL:0000678),occurs_in(EMAPA:16039)\tcontributor=https://orcid.org/0000-0001-5501-853X|noctua-model-id=gomodel:MGI_MGI_1343102|model-state=production",
"type": "Invalid taxon",
"message": "Taxon is invalid",
"obj": "None or 0",
"taxon": "",
"rule": 1
},
{
"level": "ERROR",
"line": "PR\t000037778\tacts_upstream_of_or_within\tGO:0061643\tPMID:18466743\tECO:0000315\t\t\t20141016\tMGI\toccurs_in(CL:0000678),occurs_in(EMAPA:16039)\tcontributor=https://orcid.org/0000-0001-5501-853X|noctua-model-id=gomodel:MGI_MGI_1343102|model-state=production",
"type": "Invalid taxon",
"message": "Taxon is invalid",
"obj": "None or 0",
"taxon": "",
"rule": 1
},
{
"level": "ERROR",
"line": "MGI\tMGI:1924903\tis_active_in\tGO:0005575\tGO_REF:0000015\tECO:0000307\t\t\t20100209\tMGI\t\tnoctua-model-id=gomodel:MGI_MGI_1924903|model-state=production|contributor=https://orcid.org/0000-0003-3394-9805",
"type": "Invalid taxon",
"message": "Taxon is invalid",
"obj": "None or 0",
"taxon": "",
"rule": 1
},
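To see at a glance which entities each kind of report message affects, the entries can be grouped by level, type, and rule. This is only a sketch: the field names (`level`, `type`, `rule`, `line`) are taken from the excerpt above, but the top-level layout of `noctua_mgi_report.json` is not shown here, so the hypothetical function below expects the message objects to have been extracted into a list already.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def subjects_by_error(messages: Iterable[dict]) -> Dict[Tuple, List[str]]:
    """Group 'DB:ID' subjects by (level, type, rule) of the report message."""
    grouped = defaultdict(list)
    for msg in messages:
        # The first two tab-separated columns of the offending line are DB and ID.
        cols = msg.get("line", "").split("\t")
        subject = ":".join(cols[:2])
        key = (msg.get("level"), msg.get("type"), msg.get("rule"))
        grouped[key].append(subject)
    return dict(grouped)
```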
Also noting that MGI:1924903, which appears in the diff, is actually a secondary ID for a gene that has been merged:
Mir503hg(Mus musculus) Gene Name: Mir503 Mir531 and Mir322 host gene Synonyms: C430049B03Rik, RIKEN cDNA 9430052C07 gene, predicted gene 28730, RIKEN cDNA C430049B03 gene, LncSync, RIKEN cDNA 2700063P19 gene, 9430052C07Rik, Hrtlincrx, Gm28730, 2700063P19Rik Source: MGI:5579436 Biotype: lncRNA gene Secondary ID: MGI:1924903 Allele/Variant (2)[Model (1)]
after pipeline finished late Friday:
SMoxon@SMoxon-M82 ontobio % grep -v '^!' mgi_022624.gaf | cut -f15 | sort | uniq -c
2115 ARUK-UCL
330 AgBase
123 Alzheimers_University_of_Toronto
8512 BHF-UCL
664 CACAO
454 CAFA
6357 ComplexPortal
310 DFLAT
61 DisProt
26156 Ensembl
27 FlyBase
70054 GO_Central
93 HGNC
946 HGNC-UCL
6532 IntAct
5219 InterPro
313284 MGI
768 NTNU_SB
12 PINC
1598 ParkinsonsUK-UCL
1195 RHEA
4825 Reactome
14 Roslin_Institute
15274 SynGO
54 SynGO-UCL
3122 TreeGrafter
91457 UniProt
42 WB
151 YuBioLab
1 dictyBase
SMoxon@SMoxon-M82 ontobio % grep -v '^!' mgi_022624.gpad | cut -f10 | sort | uniq -c
2115 ARUK-UCL
330 AgBase
123 Alzheimers_University_of_Toronto
8512 BHF-UCL
664 CACAO
454 CAFA
6357 ComplexPortal
310 DFLAT
61 DisProt
26156 Ensembl
27 FlyBase
70056 GO_Central
93 HGNC
946 HGNC-UCL
6532 IntAct
5219 InterPro
313303 MGI
768 NTNU_SB
12 PINC
1598 ParkinsonsUK-UCL
1195 RHEA
4825 Reactome
14 Roslin_Institute
15289 SynGO
54 SynGO-UCL
3122 TreeGrafter
91457 UniProt
42 WB
151 YuBioLab
1 dictyBase
SMoxon@SMoxon-M82 ontobio % grep -v '^!' mgi_022624.gaf | wc -l
559750
SMoxon@SMoxon-M82 ontobio % grep -v '^!' mgi_022624.gpad | wc -l
559786
GPAD | GAF
70056 GO_Central | 70054 GO_Central
313303 MGI | 313284 MGI
15289 SynGO | 15274 SynGO
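The side-by-side comparison above was assembled by eye from the two `uniq -c` listings. A rough sketch of doing the same comparison programmatically follows. It assumes the assigned-by value sits in column 15 of the GAF and column 10 of the GPAD, matching the `cut -f15` and `cut -f10` calls used in this thread. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def tally(lines: Iterable[str], column: int) -> Counter:
    """Count values in a 1-based tab-separated column, skipping '!' header lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= column:
            counts[cols[column - 1]] += 1
    return counts


def differing_groups(gaf_lines: Iterable[str], gpad_lines: Iterable[str],
                     gaf_col: int = 15, gpad_col: int = 10) -> Dict[str, Tuple[int, int]]:
    """Map each group whose counts differ to a (GPAD count, GAF count) pair."""
    gaf = tally(gaf_lines, gaf_col)
    gpad = tally(gpad_lines, gpad_col)
    groups = sorted(set(gaf) | set(gpad))
    return {g: (gpad[g], gaf[g]) for g in groups if gaf[g] != gpad[g]}
```

Only groups with a mismatch are returned, so an empty dict means the per-group counts agree.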
SMoxon@SMoxon-M82 mgi % grep -v '^!' mgi-human-ortho-temp.gaf | cut -f6 | sort | uniq -c
105964 GO_REF:0000119
SMoxon@SMoxon-M82 mgi % grep -v '^!' mgi-rgd-ortho-temp.gaf | cut -f6 | sort | uniq -c
33999 GO_REF:0000096
SMoxon@SMoxon-M82 mgi % grep -v '^!' mgi-rgd-ortho-temp.gaf | cut -f15 | sort | uniq -c
33999 GO_Central
SMoxon@SMoxon-M82 mgi % grep -v '^!' mgi-human-ortho-temp.gaf | cut -f15 | sort | uniq -c
105964 GO_Central
Trying to use gpadparser.parse instead of gpadparse.generate_annotations, because it looks like parse passes GPAD annotations through the validation rules...
Noticed a difference in output between GPAD and GAF again.
For example, these are in the GPAD but not in the GAF:
MGI:MGI:88501 RO:0002327 GO:0008270 PMID:7999070 ECO:0000250 2000-08-24 MGI contributor=https://orcid.org/0000-0001-7476-6306|noctua-model-id=gomodel:MGI_MGI_88501|model-state=production
MGI:MGI:88501 RO:0002327 GO:0008270 PMID:9126610 ECO:0000250 UniProtKB:P50238 2024-03-19 UniProt
MGI:MGI:88501 RO:0002327 GO:0042277 PMID:18670594 ECO:0000250 UniProtKB:P50238 2024-03-19 UniProt
MGI:MGI:88501 RO:0002331 GO:0008630 PMID:20415737 ECO:0000250 UniProtKB:P50238 2024-03-19 UniProt
MGI:MGI:88501 RO:0002331 GO:0010043 PMID:7999070 ECO:0000250 UniProtKB:P50238 2024-03-19 UniProt
MGI:MGI:88501 RO:0002331 GO:0071236 PMID:20415737 ECO:0000250 UniProtKB:P50238 2024-03-19 UniProt
MGI:MGI:88501 RO:0002331 GO:0071493 PMID:20415737 ECO:0000250 UniProtKB:P50238 2024-03-19 UniProt
MGI:MGI:88501 RO:0001025 GO:0005737 PMID:20415737 ECO:0000250 UniProtKB:P50238 2024-03-19 UniProt
MGI:MGI:88501 RO:0002327 GO:0008270 GO_REF:0000119 ECO:0000266 UniProtKB:P50238 2024-03-19 GO_Central
MGI:MGI:88501 RO:0002327 GO:0042277 GO_REF:0000119 ECO:0000266 UniProtKB:P50238 2024-03-19 GO_Central
MGI:MGI:88501 RO:0002331 GO:0008630 GO_REF:0000119 ECO:0000266 UniProtKB:P50238 2024-03-19 GO_Central
MGI:MGI:88501 RO:0002331 GO:0010043 GO_REF:0000119 ECO:0000266 UniProtKB:P50238 2024-03-19 GO_Central
MGI:MGI:88501 RO:0002331 GO:0071236 GO_REF:0000119 ECO:0000266 UniProtKB:P50238 2024-03-19 GO_Central
MGI:MGI:88501 RO:0002331 GO:0071493 GO_REF:0000119 ECO:0000266 UniProtKB:P50238 2024-03-19 GO_Central
MGI:MGI:88501 RO:0001025 GO:0005737 GO_REF:0000119 ECO:0000266 UniProtKB:P50238 2024-03-19 GO_Central
MGI:MGI:88501 RO:0002327 GO:0008270 GO_REF:0000033 ECO:0000318 PANTHER:PTN002918232|UniProtKB:P50238 2019-04-09 GO_Central
MGI:MGI:88501 RO:0002331 GO:0010468 GO_REF:0000033 ECO:0000318 PANTHER:PTN002918232|UniProtKB:P50238 2019-04-09 GO_Central
MGI:MGI:88501 RO:0002331 GO:0008630 GO_REF:0000033 ECO:0000318 PANTHER:PTN002918232|UniProtKB:P50238 2019-04-09 GO_Central
from the GPI file:
MGI:MGI:88501 Crip1 cysteine-rich protein 1 Crip|CRP1 SO:0001217 NCBITaxon:10090 UniProtKB:P63254
PR:P63254 mCRIP1 cysteine-rich protein 1 (mouse) mCRIP1|CRIP (mouse)|cysteine-rich intestinal protein (mouse) PR:000000001 NCBITaxon:10090 MGI:MGI:88501 UniProtKB:P63254
PR:P63254-1 mCRIP1/iso:1 cysteine-rich protein 1 isoform 1 (mouse) mCRIP1/iso:1 PR:000000001 NCBITaxon:10090 MGI:MGI:88501 UniProtKB:P63254-1
EMBL:BG085110 BG085110 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
EMBL:BC031922 BC031922 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
EMBL:BC064074 BC064074 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
EMBL:AK008269 AK008269 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
EMBL:AK003075 AK003075 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
EMBL:AI323004 AI323004 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
EMBL:BX511593 BX511593 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
EMBL:AK168305 AK168305 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
EMBL:BG072276 BG072276 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
EMBL:AK088267 AK088267 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
EMBL:M13018 M13018 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
EMBL:AA266159 AA266159 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
EMBL:BC058606 BC058606 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
EMBL:BC030406 BC030406 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
EMBL:DT908028 DT908028 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
EMBL:AK012068 AK012068 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
RefSeq:NM_007763 NM_007763 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
ENSEMBL:ENSMUST00000198909 ENSMUST00000198909 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
ENSEMBL:ENSMUST00000006523 ENSMUST00000006523 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
ENSEMBL:ENSMUST00000199089 ENSMUST00000199089 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
ENSEMBL:ENSMUST00000198597 ENSMUST00000198597 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
ENSEMBL:ENSMUST00000196932 ENSMUST00000196932 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
ENSEMBL:ENSMUST00000200553 ENSMUST00000200553 SO:0000234 NCBITaxon:10090 MGI:MGI:88501
@sierra-moxon Let me know if you need help troubleshooting. At first glance these annotations look OK. I thought the GAF was the input, so I am not clear how you can have fewer annotations in the GAF relative to the GPAD, but maybe you do something like GOA-GAF >> GOCentral pipeline >> GOC-GAF + GOC-GPAD, and now you're looking at that second GAF?
Yes, that is exactly right:
Orthology GAFs from human and rat + mgi GOA-GAF >> GO preprocessing pipeline >> GOCentral pipeline >> GOC-GAF + GOC-noctua-GPAD + GOC-paint-GAF >> ontobio >> final GAF and final GPAD for mgi
In the temporary post filter GOCentral pipeline step, both the GAF and GPAD are passed through all the GORules. I was bypassing that step in my "test pipeline" yesterday while trying to debug another issue (which ended up being a red herring and was not an issue after all). This means that the GPAD file was not passed through all the GORules but the GAF file was. To be extremely sure that the GPAD and GAF both pass through the same rules, I added a step in ontobio yesterday that does the rule check in the megamake step (in validate.py) as well. We need to run the test pipeline again and confirm that this fixed the problem.
I need to also update my ontobio branch with the changes to rule 63 from master today.
Ran again and have new files where GPAD and GAF are again the same output: http://skyhook.berkeleybop.org/full-issue-325-gopreprocess/ (sent to Li and Lori for testing)
In particular, both should have the same number of lines sans header differences.
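The "same number of lines sans headers" check can be automated along these lines. This is a sketch that mirrors the `grep -v '^!' | wc -l` calls used earlier in the thread. The function names are hypothetical and this is not the check that validate.py performs.

```python
from typing import Iterable, Tuple


def count_annotations(lines: Iterable[str]) -> int:
    """Count lines that are not '!' header lines (like grep -v '^!' | wc -l)."""
    return sum(1 for line in lines if not line.startswith("!"))


def same_annotation_count(gaf_lines: Iterable[str],
                          gpad_lines: Iterable[str]) -> Tuple[bool, int, int]:
    """Return (counts match, GAF count, GPAD count)."""
    gaf_n = count_annotations(gaf_lines)
    gpad_n = count_annotations(gpad_lines)
    return gaf_n == gpad_n, gaf_n, gpad_n
```

Equal counts are a necessary condition only. Two files can have the same number of lines and still contain different annotations, so a per-group or per-subject comparison is still worth running.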