sierra-moxon closed 5 months ago
When I run the validate.py produce command in ontobio locally, I do not see the duplicate noctua annotations being generated in the GPAD file:
SMoxon@SMoxon-M82 mgi % grep "MGI:MGI:2685011" mgi.gpad | grep "GO:0009653" | grep "PMID:26258302" mgi.gpad
MGI:MGI:2685011 RO:0002331 GO:0098609 PMID:26258302 ECO:0000315 MGI:MGI:4867020 2016-07-29 MGI BFO:0000050(GO:0003183),BFO:0000050(GO:0007389),BFO:0000050(GO:0009653),BFO:0000050(GO:0016477),BFO:0000066(UBERON:0007151) contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:56aac7ad00000038
MGI:MGI:2685011 RO:0002331 GO:0007389 PMID:26258302 ECO:0000315 MGI:MGI:4867020 2016-07-29 MGI BFO:0000050(GO:0003183),BFO:0000050(GO:0009653),BFO:0000066(UBERON:0007151) contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:56aac7ad00000038
MGI:MGI:2685011 RO:0002331 GO:0003183 PMID:26258302 ECO:0000315 MGI:MGI:4867020 2016-07-29 MGI contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:56aac7ad00000038
MGI:MGI:2685011 RO:0002331 GO:0072659 PMID:26258302 ECO:0000315 MGI:MGI:4867020 2016-07-29 MGI BFO:0000066(UBERON:0007151),RO:0002233(MGI:MGI:88355) contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:56aac7ad00000038
MGI:MGI:2685011 RO:0002331 GO:0009653 PMID:26258302 ECO:0000315 MGI:MGI:4867020 2016-07-29 MGI BFO:0000050(GO:0003183),RO:0002298(UBERON:0007151) contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:56aac7ad00000038
MGI:MGI:2685011 RO:0002331 GO:0016477 PMID:26258302 ECO:0000315 MGI:MGI:4867020 2016-07-29 MGI BFO:0000050(GO:0003183),BFO:0000050(GO:0007389),BFO:0000050(GO:0009653),BFO:0000066(UBERON:0007151) contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:56aac7ad00000038
MGI:MGI:2685011 RO:0002264 GO:0098609 PMID:26258302 ECO:0000315 MGI:MGI:4867020 2016-02-11 MGI BFO:0000066(EMAPA:18628) contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:MGI_MGI_2685011
MGI:MGI:2685011 RO:0002264 GO:0016477 PMID:26258302 ECO:0000315 MGI:MGI:4867020 2016-02-11 MGI BFO:0000066(EMAPA:18628) contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:MGI_MGI_2685011
MGI:MGI:2685011 RO:0002264 GO:0007389 PMID:26258302 ECO:0000315 MGI:MGI:4867020 2016-02-11 MGI BFO:0000066(EMAPA:18628) contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:MGI_MGI_2685011
MGI:MGI:2685011 RO:0002264 GO:0003183 PMID:26258302 ECO:0000315 MGI:MGI:4867020 2016-02-11 MGI contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:MGI_MGI_2685011
MGI:MGI:2685011 RO:0002264 GO:0072659 PMID:26258302 ECO:0000315 MGI:MGI:4867020 2016-02-11 MGI BFO:0000066(EMAPA:18628) contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:MGI_MGI_2685011
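Note that the last `grep` in the command above is given `mgi.gpad` as a file argument, so it reads the file directly instead of the piped input; that is why the output includes rows that do not mention GO:0009653. A minimal Python sketch of the intended "all three identifiers on the same row" filter follows. The function name `rows_matching_all` is hypothetical and not part of ontobio.

```python
def rows_matching_all(lines, terms):
    """Return the lines that contain every one of the given substrings."""
    return [line for line in lines if all(term in line for term in terms)]


# Illustrative usage against a local file (the path is an assumption):
# with open("mgi.gpad") as handle:
#     hits = rows_matching_all(
#         handle, ["MGI:MGI:2685011", "GO:0009653", "PMID:26258302"]
#     )
#     print("".join(hits))
```

The same effect can be had in the shell by dropping the file name from the final `grep` so that it reads from the pipe.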
there must be another step in the pipeline...
reran this command locally today:
poetry run validate produce -m ../go-site/metadata --gpad -t . -o go-basic.json --base-download-url "http://skyhook.berkeleybop.org/snapshot/" --only-dataset mgi MGI
and no dups are generated in the final GPAD file.
I removed the temporary post filter step from my pipeline to see if it might be generating the dups.
Removing the temporary post filter step from my pipeline resulted in a GPAD file that had noctua annotations and no duplicates. I passed it on to Li to confirm. If this is the case, then some part of the temporary post filter step is creating duplicates here.
@LiNiMGI - this is the issue about noctua duplication in the latest file from 3/18/2024.
I sent you the 3/19/2024 files this afternoon with the temporary post filter step removed from the pipeline, and would like confirmation that you are also no longer seeing duplicates.
This evening, I made a second branch of the full preprocess pipeline here: https://build.geneontology.org/job/geneontology/job/pipeline/job/full-issue-325-ci-gopreprocess/ with the temporary post filter step re-enabled and here is the resulting GPAD file:
I can't find noctua dups - can you please give me a couple of examples, @LiNiMGI? I know they are there, but my sampling doesn't reveal any in either the file above (sent via email without the temporary post filter step running) or this file: http://skyhook.berkeleybop.org/full-issue-325-ci-gopreprocess/annotations/mgi.gpad.gz (sent with the temporary post filter running).
Maybe the dups you were seeing were from the rat ortho load instead of noctua?
@sierra-moxon I don't see any noctua duplication in the file from 3/19/2024, nor in this file: http://skyhook.berkeleybop.org/full-issue-325-ci-gopreprocess/annotations/mgi.gpad.gz
I only see noctua duplication in the file from 3/18/2024:
MGI:MGI:101757 | RO:0001025 | GO:0005737 | PMID:15548599 | ECO:0000314 | | 11/11/05 | MGI | | noctua-model-id=gomodel:MGI_MGI_101757|model-state=production|contributor=https://orcid.org/0000-0002-9796-7693
MGI:MGI:101757 | RO:0001025 | GO:0005737 | PMID:15548599 | ECO:0000314 | | 3/18/24 | MGI | |
MGI:MGI:101757 | RO:0001025 | GO:0005911 | PMID:23793062 | ECO:0000314 | | 3/7/14 | MGI | | noctua-model-id=gomodel:MGI_MGI_101757|model-state=production|contributor=https://orcid.org/0000-0003-3394-9805
MGI:MGI:101757 | RO:0001025 | GO:0005911 | PMID:23793062 | ECO:0000314 | | 3/18/24 | MGI | |
MGI:MGI:101757 | RO:0001025 | GO:0005925 | PMID:29162887 | ECO:0000266 | UniProtKB:P23528 | 1/17/19 | MGI | | noctua-model-id=gomodel:MGI_MGI_101757|model-state=production|contributor=https://orcid.org/0000-0002-9796-7693
MGI:MGI:101757 | RO:0001025 | GO:0005925 | PMID:29162887 | ECO:0000266 | UniProtKB:P23528 | 3/18/24 | MGI | |
@sierra-moxon OK, I thought about it more. I think those Noctua duplicates might be coming from the GOA mouse file, since the annotation dates were also changed to 3/18/24. After you tightened the constraint to ignore any annotation from protein to GO provided_by MGI in the import/conversion, the Noctua duplication issue should be fixed? Thanks!
That makes sense to me.
Noctua metadata GPAD emission issue: I see noctua duplicates: one line with model ID/contributor/model state and the annotation date, and one line without model ID/contributor/model state but with the date 2024-03-18.
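The duplicate pattern described above (same annotation, one row carrying Noctua metadata and one row without it and with a different date) can be checked for mechanically. Below is a rough Python sketch, not the ontobio implementation. It assumes a tab-separated, 12-column GPAD 2.0 layout with the date in column 9 and the annotation properties in column 12; the function name and the column constants are hypothetical, so adjust them if the file uses a different layout.

```python
from collections import defaultdict

# Assumed 0-based column positions for a 12-column, tab-separated GPAD 2.0 row.
DATE_COL = 8
PROPS_COL = 11


def find_noctua_duplicates(lines, date_col=DATE_COL, props_col=PROPS_COL):
    """Group rows that agree on everything except date and properties, and
    return the groups that mix rows with and without a noctua-model-id."""
    groups = defaultdict(list)
    for line in lines:
        # Skip blank lines and '!' header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Pad short rows so the assumed column indices are always valid.
        cols += [""] * (props_col + 1 - len(cols))
        key = tuple(
            col for i, col in enumerate(cols) if i not in (date_col, props_col)
        )
        groups[key].append((cols[date_col], cols[props_col]))

    duplicates = {}
    for key, members in groups.items():
        has_model = any("noctua-model-id=" in props for _, props in members)
        lacks_model = any("noctua-model-id=" not in props for _, props in members)
        if has_model and lacks_model:
            duplicates[key] = members
    return duplicates


# Illustrative usage (the path is an assumption):
# with open("mgi.gpad") as handle:
#     for key, members in find_noctua_duplicates(handle).items():
#         print(key, members)
```

Keying on every column except date and properties is a deliberate simplification: it flags exactly the case reported here, but it would miss duplicates whose extensions or with/from values also differ.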