geneontology / gopreprocess

MIT License

RGD annotations being imported via orthology file correctly, but duplicated by the protein to GO load (and consequently these duplicates are provided_by "MGI") #58

Closed sierra-moxon closed 5 months ago

sierra-moxon commented 6 months ago

Thanks!!! @sierra-moxon I did a quick check:

GO_REF:0000096 (J:155856, rat to mouse): date fixed, but the assigned_by is still not fixed (change MGI to GO_Central). GO_REF:0000119.

sierra-moxon commented 6 months ago

Noting that the human and rat orthology loads use the same code, which sets the provided_by to "GO_Central" in the preprocessing pipeline. When I look at the GAF files for the human and rat outputs of the preprocessing pipeline here: http://skyhook.berkeleybop.org/silver-issue-325-gopreprocess/products/upstream_and_raw_data/preprocessed_GAF_output/mgi-human-ortho.gaf and http://skyhook.berkeleybop.org/silver-issue-325-gopreprocess/products/upstream_and_raw_data/preprocessed_GAF_output/mgi-rgd-ortho.gaf

they both show ONLY GO_Central as the provider (as expected).
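A quick way to run this kind of provider check is to tally the Assigned_By column (column 15 in GAF) of a file. This is only a sketch, not code from the gopreprocess repository; the function name `count_providers` is hypothetical, and it assumes tab-separated GAF lines with `!` header lines.

```python
from collections import Counter

# Assumption: GAF column 15 (Assigned_By) is at zero-based index 14.
ASSIGNED_BY_COL = 14


def count_providers(gaf_lines):
    """Tally Assigned_By values across GAF annotation lines (hypothetical helper)."""
    counts = Counter()
    for line in gaf_lines:
        # Skip blank lines and "!" header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) > ASSIGNED_BY_COL:
            counts[cols[ASSIGNED_BY_COL]] += 1
    return counts
```

Passing an open file handle for a downloaded copy of mgi-rgd-ortho.gaf should then return a counter with a single key if only one provider is present.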

sierra-moxon commented 6 months ago

SMoxon@SMoxon-M82 gopreprocess % grep "MGI:2179523" mgi-merged.gaf.2 | grep "GO_REF:0000096"
MGI MGI:2179523 Fcgr4 involved_in GO:0071222 GO_REF:0000096 ISO RGD:1303067 P Low affinity immunoglobulin gamma Fc region receptor III-A protein taxon:10090 20240318 MGI
MGI MGI:2179523 Fcgr4 located_in GO:0009986 GO_REF:0000096 ISO RGD:1303067 C Low affinity immunoglobulin gamma Fc region receptor III-A protein taxon:10090 20240318 MGI
MGI MGI:2179523 Fcgr4 involved_in GO:0071222 GO_REF:0000096 ISO RGD:1303067 P Fc receptor, IgG, low affinity IV gene_product taxon:10090 20240318 GO_Central
MGI MGI:2179523 Fcgr4 located_in GO:0009986 GO_REF:0000096 ISO RGD:1303067 C Fc receptor, IgG, low affinity IV gene_product taxon:10090 20240318 GO_Central
SMoxon@SMoxon-M82 gopreprocess %

sierra-moxon commented 6 months ago

this seems to be the issue:

SMoxon@SMoxon-M82 GAF_OUTPUT % grep "MGI:2179523" *.gaf | grep "GO_REF:0000096" 
mgi-p2g-converted.gaf:MGI   MGI:2179523 Fcgr4   involved_in GO:0071222  GO_REF:0000096  ISO RGD:1303067 P   Low affinity immunoglobulin gamma Fc region receptor III-A      protein taxon:10090 20240318    MGI     
mgi-p2g-converted.gaf:MGI   MGI:2179523 Fcgr4   located_in  GO:0009986  GO_REF:0000096  ISO RGD:1303067 C   Low affinity immunoglobulin gamma Fc region receptor III-A      protein taxon:10090 20240318    MGI     
mgi-rgd-ortho.gaf:MGI   MGI:2179523 Fcgr4   involved_in GO:0071222  GO_REF:0000096  ISO RGD:1303067 P   Fc receptor, IgG, low affinity IV       gene_product    taxon:10090 20240318    GO_Central      
mgi-rgd-ortho.gaf:MGI   MGI:2179523 Fcgr4   located_in  GO:0009986  GO_REF:0000096  ISO RGD:1303067 C   Fc receptor, IgG, low affinity IV       gene_product    taxon:10090 20240318    GO_Central

In the rat ortho load, we get the annotations with the correct provided_by. In the protein-to-GO load, we get the same annotations, but the requirements for that load are to keep the provided_by the same as what came in via Protein to GO.
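The duplication described here can be detected by grouping annotations on the fields the two loads share and flagging groups with more than one provider. The sketch below is an illustration, not the pipeline's own code: `find_provider_conflicts` is a hypothetical name, and the choice of key columns (object ID, qualifier, GO ID, reference, evidence code, with/from, aspect) is an assumption based on the rows shown above, which differ only in name, type, and provider.

```python
from collections import defaultdict

# Assumption: zero-based GAF columns that identify "the same" annotation.
# 1=object ID, 3=qualifier, 4=GO ID, 5=reference, 6=evidence, 7=with/from, 8=aspect
KEY_COLS = (1, 3, 4, 5, 6, 7, 8)
ASSIGNED_BY_COL = 14  # GAF column 15 (Assigned_By)


def find_provider_conflicts(gaf_lines):
    """Return {annotation key: set of providers} for keys with >1 provider."""
    providers = defaultdict(set)
    for line in gaf_lines:
        # Skip blank lines and "!" header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= ASSIGNED_BY_COL:
            continue  # malformed or truncated row
        key = tuple(cols[i] for i in KEY_COLS)
        providers[key].add(cols[ASSIGNED_BY_COL])
    return {key: found for key, found in providers.items() if len(found) > 1}
```

Run over the merged GAF, the two Fcgr4 annotations above would each be reported with the provider set containing both MGI and GO_Central.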

sierra-moxon commented 6 months ago

here are the two RGD annotations in the goa_mouse file:

UniProtKB   A0A0B4J1G0  Fcgr4   involved_in GO:0071222  GO_REF:0000096  ISO RGD:1303067 P   Low affinity immunoglobulin gamma Fc region receptor III-A  Fcgr4|Fcgr3a    protein taxon:10090 20111011    MGI     
UniProtKB   A0A0B4J1G0  Fcgr4   located_in  GO:0009986  GO_REF:0000096  ISO RGD:1303067 C   Low affinity immunoglobulin gamma Fc region receptor III-A  Fcgr4|Fcgr3a    protein taxon:10090 20111011    MGI
sierra-moxon commented 6 months ago

I convert the date, and I swap the identifiers per the requirements in the GOA load, but I don't change the provided_by.
Should these two Protein to GO annotations be removed via some constraint from the final GAF file? @LiNiMGI

I imagine that these are coming in now, with the loosened constraint on the Protein to GO load, where we wanted things annotated by MGI, GO_Central, or GOC as long as they didn't have this reference: "GO_REF:0000033"

Did I misinterpret that new constraint? Should I always exclude MGI provided annotations in the protein to GO conversion and only keep those provided by GO_Central or GOC when they don't have this reference: "GO_REF:0000033" ?

sierra-moxon commented 6 months ago

https://github.com/geneontology/gopreprocess/pull/60 <-- I tightened the constraint to ignore any annotation from protein to GO provided_by MGI in the import/conversion, but if the annotation is from GO_Central or GOC, then check for the "GO_REF:0000033" and only bring in those from GOC or GO_Central that do not have this reference.

LiNiMGI commented 6 months ago

@sierra-moxon only bring in those "assign by" GO_Central that do not have the "GO_REF:0000033" reference. MGI should still be excluded as before. Thanks!
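The rule as stated here can be written as a small predicate. This is a minimal sketch under stated assumptions, not the code from the pull request: the function name `keep_protein2go_annotation` and its parameters are hypothetical, and the reference field is assumed to be pipe-separated as in GAF column 6. The allowed providers are left configurable because the earlier comment also mentions GOC.

```python
EXCLUDED_REFERENCE = "GO_REF:0000033"
# Assumption: only GO_Central is allowed by default, per the comment above.
DEFAULT_ALLOWED_PROVIDERS = frozenset({"GO_Central"})


def keep_protein2go_annotation(assigned_by, reference_field,
                               allowed_providers=DEFAULT_ALLOWED_PROVIDERS):
    """Decide whether a Protein to GO annotation should be imported (hypothetical helper)."""
    # Annotations from MGI (or any provider not explicitly allowed) are dropped.
    if assigned_by not in allowed_providers:
        return False
    # GAF references may hold several pipe-separated IDs; drop if any is excluded.
    references = reference_field.split("|")
    return EXCLUDED_REFERENCE not in references
```

With this rule, the two goa_mouse rows for Fcgr4 shown earlier (assigned by MGI) would be skipped, so only the orthology-load copies with GO_Central would remain.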

sierra-moxon commented 6 months ago

Li has confirmed that these are fixed in the latest round of tests.