geneontology / gopreprocess

ensure mgi.gpad and mgi.gaf produced by the pipeline have equivalent output #32

Closed sierra-moxon closed 7 months ago

sierra-moxon commented 8 months ago

In particular, both should have the same number of lines sans header differences.
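A minimal sketch of that check in Python, assuming header lines start with `!` (as in GAF and GPAD files) and every other non-empty line is one annotation. The function name `count_data_lines` and the sample rows are illustrative and not part of the pipeline.

```python
from typing import Iterable


def count_data_lines(lines: Iterable[str]) -> int:
    """Count annotation lines, skipping '!' header lines and blank lines."""
    return sum(1 for line in lines if line.strip() and not line.startswith("!"))


# Tiny in-memory stand-ins for mgi.gaf and mgi.gpad (made-up rows).
gaf_lines = ["!gaf-version: 2.2\n", "MGI\tMGI:1\n", "MGI\tMGI:2\n"]
gpad_lines = [
    "!gpa-version: 1.2\n",
    "!generated-by: example\n",
    "MGI\tMGI:1\n",
    "MGI\tMGI:2\n",
]

# The headers differ in number, but the annotation counts should match.
same = count_data_lines(gaf_lines) == count_data_lines(gpad_lines)
print("equivalent line counts:", same)
```

With real files, an open file handle can be passed in place of the lists. Unlike `grep -v '^!'`, this sketch also skips blank lines.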

sierra-moxon commented 8 months ago

locally with ontobio/bin/validate.py:

SMoxon@SMoxon-M82 mgi % grep -v '^!' mgi.gaf | cut -f15 | sort | uniq -c
70068 GO_Central
172971 MGI
10310 SynGO
 427 UniProt
   5 WB
SMoxon@SMoxon-M82 mgi % grep -v '^!' mgi.gpad | cut -f10 | sort | uniq -c
70068 GO_Central
172990 MGI
10325 SynGO
 427 UniProt
   5 WB
sierra-moxon commented 8 months ago

SMoxon@SMoxon-M82 mgi % grep -v '^!' mgi.gpad | wc -l
253815
SMoxon@SMoxon-M82 mgi % grep -v '^!' mgi.gaf | wc -l
253781

sierra-moxon commented 8 months ago

unique to GPAD:

MGI:MGI:1924903 MGI:MGI:2141225 MGI:MGI:3611235 MGI:MGI:3711264 PR:000037777 PR:000037778 PR:P35545 PR:Q6DR99-1
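One way to derive such a list is a set difference on subject identifiers. The sketch below assumes the first two tab-separated columns hold the database prefix and the object ID, as in GAF and in the GPAD 1.x lines quoted in this thread. `subject_ids` is a hypothetical helper and the rows are trimmed samples.

```python
from typing import Iterable, Set


def subject_ids(lines: Iterable[str]) -> Set[str]:
    """Collect 'DB:ID' subjects from the first two columns of data lines."""
    ids = set()
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip headers and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            ids.add(f"{cols[0]}:{cols[1]}")
    return ids


# Trimmed sample rows; real files have many more columns per line.
gaf_rows = ["!gaf-version: 2.2\n", "MGI\tMGI:88501\tCrip1\n"]
gpad_rows = [
    "MGI\tMGI:88501\tenables\n",
    "MGI\tMGI:1924903\tis_active_in\n",
    "PR\t000037777\tacts_upstream_of_or_within\n",
]

only_in_gpad = sorted(subject_ids(gpad_rows) - subject_ids(gaf_rows))
print(only_in_gpad)
```

A file whose first column already holds the full `DB:ID` identifier would need a different key function.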

sierra-moxon commented 8 months ago

less noctua_mgi_report.json:

{
                "level": "ERROR",
                "line": "PR\t000037777\tacts_upstream_of_or_within\tGO:0061642\tPMID:18466743\tECO:0000315\t\t\t20141016\tMGI\toccurs_in(CL:0000678),occurs_in(EMAPA:16039)\tcontributor=https://orcid.org/0000-0001-5501-853X|noctua-model-id=gomodel:MGI_MGI_1343102|model-state=production",
                "type": "Invalid taxon",
                "message": "Taxon is invalid",
                "obj": "None or 0",
                "taxon": "",
                "rule": 1
            },
            {
                "level": "ERROR",
                "line": "PR\t000037778\tacts_upstream_of_or_within\tGO:0061643\tPMID:18466743\tECO:0000315\t\t\t20141016\tMGI\toccurs_in(CL:0000678),occurs_in(EMAPA:16039)\tcontributor=https://orcid.org/0000-0001-5501-853X|noctua-model-id=gomodel:MGI_MGI_1343102|model-state=production",
                "type": "Invalid taxon",
                "message": "Taxon is invalid",
                "obj": "None or 0",
                "taxon": "",
                "rule": 1
            },
            {
                "level": "ERROR",
                "line": "MGI\tMGI:1924903\tis_active_in\tGO:0005575\tGO_REF:0000015\tECO:0000307\t\t\t20100209\tMGI\t\tnoctua-model-id=gomodel:MGI_MGI_1924903|model-state=production|contributor=https://orcid.org/0000-0003-3394-9805",
                "type": "Invalid taxon",
                "message": "Taxon is invalid",
                "obj": "None or 0",
                "taxon": "",
                "rule": 1
            },
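To cross-check report entries like these against the subjects unique to the GPAD, the affected subjects can be pulled out of the `line` field. This sketch assumes the messages are already available as a list of dictionaries shaped like the entries above. How they are nested inside noctua_mgi_report.json is not shown in the excerpt, so the loading step is left out. `subjects_with_error` is a hypothetical helper.

```python
from typing import Dict, Iterable, Set


def subjects_with_error(messages: Iterable[Dict], error_type: str) -> Set[str]:
    """Return 'DB:ID' subjects of report messages whose 'type' matches."""
    subjects = set()
    for msg in messages:
        if msg.get("type") != error_type:
            continue
        cols = msg.get("line", "").split("\t")
        if len(cols) >= 2:
            subjects.add(f"{cols[0]}:{cols[1]}")
    return subjects


# Entries abbreviated from the report excerpt (lines cut to three columns).
messages = [
    {"level": "ERROR", "type": "Invalid taxon",
     "line": "PR\t000037777\tacts_upstream_of_or_within"},
    {"level": "ERROR", "type": "Invalid taxon",
     "line": "PR\t000037778\tacts_upstream_of_or_within"},
    {"level": "ERROR", "type": "Invalid taxon",
     "line": "MGI\tMGI:1924903\tis_active_in"},
]

invalid_taxon_subjects = subjects_with_error(messages, "Invalid taxon")
print(sorted(invalid_taxon_subjects))
```

All three subjects from the excerpt also appear in the "unique to GPAD" list.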
sierra-moxon commented 8 months ago

Also noting that MGI:1924903 in the diff is actually a gene that has been merged:

Mir503hg (Mus musculus)
Gene Name: Mir503 Mir531 and Mir322 host gene
Synonyms: C430049B03Rik, RIKEN cDNA 9430052C07 gene, predicted gene 28730, RIKEN cDNA C430049B03 gene, LncSync, RIKEN cDNA 2700063P19 gene, 9430052C07Rik, Hrtlincrx, Gm28730, 2700063P19Rik
Source: MGI:5579436
Biotype: lncRNA gene
Secondary ID: MGI:1924903
Allele/Variant (2) [Model (1)]

sierra-moxon commented 8 months ago

after pipeline finished late Friday:

SMoxon@SMoxon-M82 ontobio % grep -v '^!' mgi_022624.gaf | cut -f15 | sort | uniq -c
2115 ARUK-UCL
330 AgBase
123 Alzheimers_University_of_Toronto
8512 BHF-UCL
664 CACAO
454 CAFA
6357 ComplexPortal
310 DFLAT
61 DisProt
26156 Ensembl
27 FlyBase
70054 GO_Central
93 HGNC
946 HGNC-UCL
6532 IntAct
5219 InterPro
313284 MGI
768 NTNU_SB
12 PINC
1598 ParkinsonsUK-UCL
1195 RHEA
4825 Reactome
14 Roslin_Institute
15274 SynGO
54 SynGO-UCL
3122 TreeGrafter
91457 UniProt
42 WB
151 YuBioLab
1 dictyBase

SMoxon@SMoxon-M82 ontobio % grep -v '^!' mgi_022624.gpad | cut -f10 | sort | uniq -c
2115 ARUK-UCL
330 AgBase
123 Alzheimers_University_of_Toronto
8512 BHF-UCL
664 CACAO
454 CAFA
6357 ComplexPortal
310 DFLAT
61 DisProt
26156 Ensembl
27 FlyBase
70056 GO_Central
93 HGNC
946 HGNC-UCL
6532 IntAct
5219 InterPro
313303 MGI
768 NTNU_SB
12 PINC
1598 ParkinsonsUK-UCL
1195 RHEA
4825 Reactome
14 Roslin_Institute
15289 SynGO
54 SynGO-UCL
3122 TreeGrafter
91457 UniProt
42 WB
151 YuBioLab
1 dictyBase

sierra-moxon commented 8 months ago

SMoxon@SMoxon-M82 ontobio % grep -v '^!' mgi_022624.gaf | wc -l
559750

SMoxon@SMoxon-M82 ontobio % grep -v '^!' mgi_022624.gpad | wc -l
559786

sierra-moxon commented 8 months ago

GPAD             | GAF
70056 GO_Central | 70054 GO_Central
313303 MGI       | 313284 MGI
15289 SynGO      | 15274 SynGO
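As a quick arithmetic check, the per-source differences should add up to the gap between the two `wc -l` totals. The sketch below uses only the counts quoted in this thread for the three sources that differ. The variable names are illustrative.

```python
from collections import Counter

# Per-source counts copied from the mgi_022624 output (only sources that differ).
gpad_counts = Counter({"GO_Central": 70056, "MGI": 313303, "SynGO": 15289})
gaf_counts = Counter({"GO_Central": 70054, "MGI": 313284, "SynGO": 15274})

# Counter subtraction keeps only positive differences: extra GPAD lines per source.
extra_in_gpad = gpad_counts - gaf_counts
print(dict(extra_in_gpad))

# The sum should equal the difference in total line counts (559786 - 559750).
total_extra = sum(extra_in_gpad.values())
print(total_extra, total_extra == 559786 - 559750)
```

The differences (2, 19 and 15) sum to 36, which accounts for the whole gap between the GPAD and GAF totals.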

sierra-moxon commented 8 months ago

SMoxon@SMoxon-M82 mgi % grep -v '^!' mgi-human-ortho-temp.gaf | cut -f6 | sort | uniq -c
105964 GO_REF:0000119
SMoxon@SMoxon-M82 mgi % grep -v '^!' mgi-rgd-ortho-temp.gaf | cut -f6 | sort | uniq -c
33999 GO_REF:0000096

sierra-moxon commented 8 months ago

SMoxon@SMoxon-M82 mgi % grep -v '^!' mgi-rgd-ortho-temp.gaf | cut -f15 | sort | uniq -c
33999 GO_Central
SMoxon@SMoxon-M82 mgi % grep -v '^!' mgi-human-ortho-temp.gaf | cut -f15 | sort | uniq -c
105964 GO_Central

sierra-moxon commented 7 months ago

Trying to use gpadparser.parse instead of gpadparser.generate_annotations, because it looks like parse passes GPAD annotations through the validation rules...

sierra-moxon commented 7 months ago

Noticed a difference in output in GPAD vs. GAF again.

for example, these are in the GPAD but not in the GAF.

MGI:MGI:88501       RO:0002327  GO:0008270  PMID:7999070    ECO:0000250         2000-08-24  MGI     contributor=https://orcid.org/0000-0001-7476-6306|noctua-model-id=gomodel:MGI_MGI_88501|model-state=production
MGI:MGI:88501       RO:0002327  GO:0008270  PMID:9126610    ECO:0000250 UniProtKB:P50238        2024-03-19  UniProt     
MGI:MGI:88501       RO:0002327  GO:0042277  PMID:18670594   ECO:0000250 UniProtKB:P50238        2024-03-19  UniProt     
MGI:MGI:88501       RO:0002331  GO:0008630  PMID:20415737   ECO:0000250 UniProtKB:P50238        2024-03-19  UniProt     
MGI:MGI:88501       RO:0002331  GO:0010043  PMID:7999070    ECO:0000250 UniProtKB:P50238        2024-03-19  UniProt     
MGI:MGI:88501       RO:0002331  GO:0071236  PMID:20415737   ECO:0000250 UniProtKB:P50238        2024-03-19  UniProt     
MGI:MGI:88501       RO:0002331  GO:0071493  PMID:20415737   ECO:0000250 UniProtKB:P50238        2024-03-19  UniProt     
MGI:MGI:88501       RO:0001025  GO:0005737  PMID:20415737   ECO:0000250 UniProtKB:P50238        2024-03-19  UniProt     
MGI:MGI:88501       RO:0002327  GO:0008270  GO_REF:0000119  ECO:0000266 UniProtKB:P50238        2024-03-19  GO_Central      
MGI:MGI:88501       RO:0002327  GO:0042277  GO_REF:0000119  ECO:0000266 UniProtKB:P50238        2024-03-19  GO_Central      
MGI:MGI:88501       RO:0002331  GO:0008630  GO_REF:0000119  ECO:0000266 UniProtKB:P50238        2024-03-19  GO_Central      
MGI:MGI:88501       RO:0002331  GO:0010043  GO_REF:0000119  ECO:0000266 UniProtKB:P50238        2024-03-19  GO_Central      
MGI:MGI:88501       RO:0002331  GO:0071236  GO_REF:0000119  ECO:0000266 UniProtKB:P50238        2024-03-19  GO_Central      
MGI:MGI:88501       RO:0002331  GO:0071493  GO_REF:0000119  ECO:0000266 UniProtKB:P50238        2024-03-19  GO_Central      
MGI:MGI:88501       RO:0001025  GO:0005737  GO_REF:0000119  ECO:0000266 UniProtKB:P50238        2024-03-19  GO_Central      
MGI:MGI:88501       RO:0002327  GO:0008270  GO_REF:0000033  ECO:0000318 PANTHER:PTN002918232|UniProtKB:P50238       2019-04-09  GO_Central      
MGI:MGI:88501       RO:0002331  GO:0010468  GO_REF:0000033  ECO:0000318 PANTHER:PTN002918232|UniProtKB:P50238       2019-04-09  GO_Central      
MGI:MGI:88501       RO:0002331  GO:0008630  GO_REF:0000033  ECO:0000318 PANTHER:PTN002918232|UniProtKB:P50238       2019-04-09  GO_Central
sierra-moxon commented 7 months ago

from the GPI file:

MGI:MGI:88501   Crip1   cysteine-rich protein 1 Crip|CRP1   SO:0001217  NCBITaxon:10090             UniProtKB:P63254    
PR:P63254   mCRIP1  cysteine-rich protein 1 (mouse) mCRIP1|CRIP (mouse)|cysteine-rich intestinal protein (mouse)    PR:000000001    NCBITaxon:10090 MGI:MGI:88501       UniProtKB:P63254    
PR:P63254-1 mCRIP1/iso:1    cysteine-rich protein 1 isoform 1 (mouse)   mCRIP1/iso:1    PR:000000001    NCBITaxon:10090 MGI:MGI:88501           UniProtKB:P63254-1  
EMBL:BG085110   BG085110            SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
EMBL:BC031922   BC031922            SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
EMBL:BC064074   BC064074            SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
EMBL:AK008269   AK008269            SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
EMBL:AK003075   AK003075            SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
EMBL:AI323004   AI323004            SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
EMBL:BX511593   BX511593            SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
EMBL:AK168305   AK168305            SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
EMBL:BG072276   BG072276            SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
EMBL:AK088267   AK088267            SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
EMBL:M13018 M13018          SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
EMBL:AA266159   AA266159            SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
EMBL:BC058606   BC058606            SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
EMBL:BC030406   BC030406            SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
EMBL:DT908028   DT908028            SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
EMBL:AK012068   AK012068            SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
RefSeq:NM_007763    NM_007763           SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
ENSEMBL:ENSMUST00000198909  ENSMUST00000198909          SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
ENSEMBL:ENSMUST00000006523  ENSMUST00000006523          SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
ENSEMBL:ENSMUST00000199089  ENSMUST00000199089          SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
ENSEMBL:ENSMUST00000198597  ENSMUST00000198597          SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
ENSEMBL:ENSMUST00000196932  ENSMUST00000196932          SO:0000234  NCBITaxon:10090 MGI:MGI:88501               
ENSEMBL:ENSMUST00000200553  ENSMUST00000200553          SO:0000234  NCBITaxon:10090 MGI:MGI:88501                  
pgaudet commented 7 months ago

@sierra-moxon Let me know if you need help troubleshooting. At first glance these annotations look OK. I thought the GAF was the input, so I am not clear how you can have fewer annotations in the GAF relative to the GPAD, but maybe you do something like GOA-GAF >> GOCentral pipeline >> GOC-GAF + GOC-GPAD, and now you're looking at that second GAF?

sierra-moxon commented 7 months ago

Yes, that is exactly right:

Orthology GAFs from human and rat + mgi GOA-GAF >> GO preprocessing pipeline >> GOCentral pipeline >> GOC-GAF + GOC-noctua-GPAD + GOC-paint-GAF >> ontobio >> final GAF and final GPAD for mgi

In the temporary post-filter GOCentral pipeline step, both the GAF and GPAD are passed through all the GORules. I was bypassing that step in my "test pipeline" yesterday while trying to debug another issue (which ended up being a red herring and was not an issue after all). This means that the GPAD file was not passed through all the GORules but the GAF file was. To be extremely sure that the GPAD and GAF both pass through the same rules, I added a step in ontobio yesterday that does the rule check in the megamake step (in validate.py) as well. We need to run the test pipeline again and confirm that this fixed the problem.

I need to also update my ontobio branch with the changes to rule 63 from master today.
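The mismatch described above can be reproduced in miniature: if one file is filtered by the rules and the other is not, their counts diverge. The sketch below is illustrative only. The record layout, the `has_taxon` check and the `apply_rules` helper are hypothetical stand-ins, not ontobio's GORules implementation.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Rule = Callable[[Record], bool]  # returns True when the record passes


def has_taxon(record: Record) -> bool:
    """Hypothetical check: reject records that carry no taxon."""
    return bool(record.get("taxon"))


def apply_rules(records: Iterable[Record], rules: List[Rule]) -> List[Record]:
    """Keep only the records that pass every rule."""
    return [r for r in records if all(rule(r) for rule in rules)]


# Made-up records; the second one has an empty taxon.
annotations = [
    {"subject": "MGI:MGI:88501", "taxon": "NCBITaxon:10090"},
    {"subject": "PR:000037777", "taxon": ""},
]
rules = [has_taxon]

gaf_out = apply_rules(annotations, rules)       # rules applied
gpad_unchecked = list(annotations)              # rules bypassed
gpad_checked = apply_rules(annotations, rules)  # same rules as the GAF

print(len(gaf_out), len(gpad_unchecked), len(gpad_checked))
```

Running both outputs through the same rule list is what brings the counts back in line in this toy case.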

sierra-moxon commented 7 months ago

Ran again and have new files where GPAD and GAF are again the same output: http://skyhook.berkeleybop.org/full-issue-325-gopreprocess/ (sent to Li and Lori for testing)