geneontology / gopreprocess


noctua annotations being duplicated #59

Closed sierra-moxon closed 5 months ago

sierra-moxon commented 6 months ago

Noctua metadata GPAD emission issue: I see Noctua annotations duplicated in the GPAD output. One line has the model ID, contributor, model state, and the original annotation date; the other line has no model ID, contributor, or model state, and carries the date 2024-03-18.

sierra-moxon commented 6 months ago

When I run the validate.py produce command in ontobio locally, I do not see the duplicate noctua annotations being generated in the GPAD file:

SMoxon@SMoxon-M82 mgi % grep "MGI:MGI:2685011" mgi.gpad | grep "GO:0009653" | grep "PMID:26258302" mgi.gpad
MGI:MGI:2685011     RO:0002331  GO:0098609  PMID:26258302   ECO:0000315 MGI:MGI:4867020     2016-07-29  MGI BFO:0000050(GO:0003183),BFO:0000050(GO:0007389),BFO:0000050(GO:0009653),BFO:0000050(GO:0016477),BFO:0000066(UBERON:0007151) contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:56aac7ad00000038
MGI:MGI:2685011     RO:0002331  GO:0007389  PMID:26258302   ECO:0000315 MGI:MGI:4867020     2016-07-29  MGI BFO:0000050(GO:0003183),BFO:0000050(GO:0009653),BFO:0000066(UBERON:0007151) contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:56aac7ad00000038
MGI:MGI:2685011     RO:0002331  GO:0003183  PMID:26258302   ECO:0000315 MGI:MGI:4867020     2016-07-29  MGI     contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:56aac7ad00000038
MGI:MGI:2685011     RO:0002331  GO:0072659  PMID:26258302   ECO:0000315 MGI:MGI:4867020     2016-07-29  MGI BFO:0000066(UBERON:0007151),RO:0002233(MGI:MGI:88355)   contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:56aac7ad00000038
MGI:MGI:2685011     RO:0002331  GO:0009653  PMID:26258302   ECO:0000315 MGI:MGI:4867020     2016-07-29  MGI BFO:0000050(GO:0003183),RO:0002298(UBERON:0007151)  contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:56aac7ad00000038
MGI:MGI:2685011     RO:0002331  GO:0016477  PMID:26258302   ECO:0000315 MGI:MGI:4867020     2016-07-29  MGI BFO:0000050(GO:0003183),BFO:0000050(GO:0007389),BFO:0000050(GO:0009653),BFO:0000066(UBERON:0007151) contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:56aac7ad00000038
MGI:MGI:2685011     RO:0002264  GO:0098609  PMID:26258302   ECO:0000315 MGI:MGI:4867020     2016-02-11  MGI BFO:0000066(EMAPA:18628)    contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:MGI_MGI_2685011
MGI:MGI:2685011     RO:0002264  GO:0016477  PMID:26258302   ECO:0000315 MGI:MGI:4867020     2016-02-11  MGI BFO:0000066(EMAPA:18628)    contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:MGI_MGI_2685011
MGI:MGI:2685011     RO:0002264  GO:0007389  PMID:26258302   ECO:0000315 MGI:MGI:4867020     2016-02-11  MGI BFO:0000066(EMAPA:18628)    contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:MGI_MGI_2685011
MGI:MGI:2685011     RO:0002264  GO:0003183  PMID:26258302   ECO:0000315 MGI:MGI:4867020     2016-02-11  MGI     contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:MGI_MGI_2685011
MGI:MGI:2685011     RO:0002264  GO:0072659  PMID:26258302   ECO:0000315 MGI:MGI:4867020     2016-02-11  MGI BFO:0000066(EMAPA:18628)    contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:MGI_MGI_2685011

There must be another step in the pipeline...

sierra-moxon commented 6 months ago

Reran this command locally today:

poetry run validate produce -m ../go-site/metadata --gpad -t . -o go-basic.json --base-download-url "http://skyhook.berkeleybop.org/snapshot/" --only-dataset mgi MGI

and no dups are generated in the final GPAD file.

I removed the temporary post filter step from my pipeline to see if it might be generating the dups.

sierra-moxon commented 6 months ago

Removing the temporary post filter step from my pipeline resulted in a GPAD file that had Noctua annotations and no duplicates. I passed it on to Li to confirm. If Li confirms, then some part of the temporary post filter step is creating the duplicates.

sierra-moxon commented 6 months ago

@LiNiMGI - this is the issue about Noctua duplication in the latest file from 3/18/2024.
I sent you the 3/19/2024 files this afternoon with the temporary post filter step removed from the pipeline, and would like confirmation that you are also no longer seeing duplicates.

sierra-moxon commented 6 months ago

This evening, I made a second branch of the full preprocess pipeline here: https://build.geneontology.org/job/geneontology/job/pipeline/job/full-issue-325-ci-gopreprocess/ with the temporary post filter step re-enabled and here is the resulting GPAD file:

I can't find Noctua dups. Can you please give me a couple of examples, @LiNiMGI? I know they are there, but my sampling doesn't reveal any in either the file above (sent via email without the temporary post filter step running) or this file: http://skyhook.berkeleybop.org/full-issue-325-ci-gopreprocess/annotations/mgi.gpad.gz (sent with the temporary post filter running).

Maybe the dups you were seeing were from the rat ortho load instead of noctua?

LiNiMGI commented 6 months ago

@sierra-moxon I don't see any Noctua duplication in the file from 3/19/2024, or in this file: http://skyhook.berkeleybop.org/full-issue-325-ci-gopreprocess/annotations/mgi.gpad.gz

I only see Noctua duplication in the file from 3/18/2024:

MGI:MGI:101757 | RO:0001025 | GO:0005737 | PMID:15548599 | ECO:0000314 |   | 11/11/05 | MGI |   | noctua-model-id=gomodel:MGI_MGI_101757|model-state=production|contributor=https://orcid.org/0000-0002-9796-7693
MGI:MGI:101757 | RO:0001025 | GO:0005737 | PMID:15548599 | ECO:0000314 |   | 3/18/24 | MGI |

MGI:MGI:101757 RO:0001025 GO:0005911 PMID:23793062 ECO:0000314 3/7/14 MGI noctua-model-id=gomodel:MGI_MGI_101757|model-state=production|contributor=https://orcid.org/0000-0003-3394-9805
MGI:MGI:101757 RO:0001025 GO:0005911 PMID:23793062 ECO:0000314 3/18/24 MGI

MGI:MGI:101757 RO:0001025 GO:0005925 PMID:29162887 ECO:0000266 UniProtKB:P23528 1/17/19 MGI noctua-model-id=gomodel:MGI_MGI_101757|model-state=production|contributor=https://orcid.org/0000-0002-9796-7693
MGI:MGI:101757 RO:0001025 GO:0005925 PMID:29162887 ECO:0000266 UniProtKB:P23528 3/18/24 MGI
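
The pattern in these examples (the same annotation appearing once with Noctua metadata and once without it, under a different date) can be checked for mechanically rather than by sampling. Below is a minimal sketch, not part of gopreprocess or ontobio. It assumes tab-separated rows in the simplified 10-column layout of the examples (subject, relation, term, reference, evidence, with/from, date, assigned by, extensions, properties). The function name `find_noctua_shadow_rows` and the default column indices are hypothetical and would need adjusting for a real GPAD file.

```python
from collections import defaultdict


def find_noctua_shadow_rows(lines, ncols=10, date_col=6, props_col=9):
    """Group rows that are identical except for date and properties, and
    return the groups where some rows carry a noctua-model-id and some do not.

    Column positions are assumptions taken from the examples in this thread,
    not from the GPAD specification.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and header/comment lines (which start with "!").
        if not line.strip() or line.startswith("!"):
            continue
        fields = [f.strip() for f in line.split("\t")]
        # Rows without trailing columns are padded with empty strings.
        fields += [""] * (ncols - len(fields))
        key = tuple(
            f for i, f in enumerate(fields) if i not in (date_col, props_col)
        )
        groups[key].append(fields)

    flagged = []
    for rows in groups.values():
        has_model = ["noctua-model-id=" in r[props_col] for r in rows]
        # Flag only mixed groups: at least one row with and one without metadata.
        if any(has_model) and not all(has_model):
            flagged.append(rows)
    return flagged
```

Running this over the 3/18/2024 file (after setting the column indices for its actual layout) should list each annotation that has both a metadata-bearing row and a bare row, which gives concrete examples to compare against later builds.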

LiNiMGI commented 6 months ago

@sierra-moxon OK, I thought about it more. I think the Noctua duplicates might be coming from the GOA mouse file, since the annotation dates were also changed to 3/18/24. After you tightened the constraint to ignore any protein-to-GO annotation with provided_by MGI in the import/conversion, the Noctua duplication issue should be fixed? Thanks!
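
To make the constraint described here concrete: the idea is that, when importing the GOA mouse file, any annotation whose subject is a protein and whose provider is MGI is skipped, because it originated at MGI and would otherwise come back as a second copy with a new date. The following is only an illustrative sketch of that rule, not the actual gopreprocess code. The `Annotation` class, the `keep_for_import` and `filter_goa_import` functions, and the use of the `UniProtKB:` prefix to recognise protein subjects are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical minimal annotation record used only for this sketch."""
    subject_id: str
    term_id: str
    provided_by: str


def keep_for_import(ann, protein_prefixes=("UniProtKB:",), excluded_provider="MGI"):
    """Return False for protein-to-GO annotations provided by MGI.

    Treating a 'UniProtKB:' prefix as the marker of a protein subject is an
    assumption; the real import may identify proteins differently.
    """
    is_protein = ann.subject_id.startswith(protein_prefixes)
    return not (is_protein and ann.provided_by == excluded_provider)


def filter_goa_import(annotations):
    """Drop annotations that would round-trip MGI's own data back into the file."""
    return [a for a in annotations if keep_for_import(a)]
```

For example, a `UniProtKB:P23528` annotation with `provided_by="MGI"` would be dropped, while the same protein annotated by another provider, or an `MGI:MGI:` subject provided by MGI, would be kept.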

sierra-moxon commented 6 months ago

That makes sense to me.