geneontology / gopreprocess

MIT License
Should the GPAD association writer in ontobio use GPI files and isoform protein identifiers in associations to modify the subject of annotation in GPAD output? #36

Closed (sierra-moxon closed 8 months ago)

sierra-moxon commented 9 months ago

From Li's comments here: https://github.com/geneontology/go-site/issues/2043

It looks like we should add code to ontobio so that we can produce GPADs with protein subject identifiers when GAF annotations have isoform identifiers that match IDs in the associated GPI file. This is a medium-sized change to the GPAD association writer, and it would result in GPAD and GAF annotation files with different subjects.
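A minimal sketch of the subject selection described above, not the ontobio implementation. The function name `choose_gpad_subject` and the idea of passing the GPI identifiers in as a plain set of CURIEs are assumptions made for illustration; the column positions follow the GAF layout of the example lines in this ticket (column 17 holds the isoform ID).

```python
def choose_gpad_subject(gaf_cols, gpi_ids):
    """Return the subject CURIE to write in GPAD column 1.

    gaf_cols: list of GAF columns as strings (blank columns are "").
    gpi_ids:  set of CURIEs found in the associated GPI file
              (hypothetical input; how the GPI is loaded is not shown).
    """
    # GAF column 17 (index 16) carries the isoform / gene product form ID.
    form_id = gaf_cols[16].strip() if len(gaf_cols) > 16 else ""
    if form_id and form_id in gpi_ids:
        return form_id
    # Fall back to the GAF subject: column 1 (DB) + column 2 (DB Object ID).
    return f"{gaf_cols[0]}:{gaf_cols[1]}"


# One of the GAF lines from this ticket, split into its 17 columns.
EXAMPLE_GAF = [
    "MGI", "MGI:87961", "Agrn", "enables", "GO:0005201", "PMID:22159717",
    "RCA", "", "F", "Agrin", "", "protein", "taxon:10090", "20180725",
    "BHF-UCL", "occurs_in(UBERON:0002048)", "PR:A2ASQ1-2",
]
```

With `PR:A2ASQ1-2` present in the GPI set the subject becomes the isoform; with an empty set it stays `MGI:MGI:87961`, which is the case where the GPAD and GAF subjects agree.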

tagging @kltm

snipped from the other ticket for ease of understanding:

In the GAF file I produce:

SMoxon@SMoxon-M82 ontobio % grep "MGI:87961" mgi_022624.gaf | grep "A2ASQ1-2"
MGI MGI:87961   Agrn    enables GO:0005201  PMID:22159717   RCA     F   Agrin       protein taxon:10090 20180725    BHF-UCL occurs_in(UBERON:0002048)   PR:A2ASQ1-2
MGI MGI:87961   Agrn    located_in  GO:0062023  PMID:22159717   HDA     C   Agrin       protein taxon:10090 20180725    BHF-UCL part_of(UBERON:0002048) PR:A2ASQ1-2

Per David above:

Hi @sierra-moxon The only issue that I see with these gaf lines is the last column. If you switch to:

MGI MGI:87961 Agrn enables GO:0005201 PMID:22159717 RCA F Agrin protein taxon:10090 20180725 BHF-UCL occurs_in(UBERON:0002048) PR:A2ASQ1-2
MGI MGI:87961 Agrn located_in GO:0062023 PMID:22159717 HDA C Agrin protein taxon:10090 20180725 BHF-UCL part_of(UBERON:0002048) PR:A2ASQ1-2

I think this will work. We can look together at 3/noon.

this is what I think it looks like in the final GPAD:

MGI:MGI:87961       RO:0002327  GO:0005201  PMID:22159717   ECO:0000245         2018-07-25  BHF-UCL BFO:0000066(UBERON:0002048)

Thanks Sierra @sierra-moxon The GAF file looks good! My understanding is that when there is isoform information in the GAF (last column), we will use the isoform PR:ID as the DB Object ID in the first column of the GPAD. Am I right, @ukemi? So the final GPAD will look like:

PR:A2ASQ1-2 RO:0002327 GO:0005201 PMID:22159717 ECO:0000245

kltm commented 9 months ago

@kltm is wondering whether this use case can be covered by an extension or property. To summarize: the isoform is found in the GAF but not in the GPAD. This is a total loss of information in the GPAD, because the GPI file has the mapping but not the reverse mapping for any given annotation (i.e. it is many-to-one).
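To illustrate the many-to-one point, here is a small sketch. It assumes the GPI records, for each isoform, the gene it belongs to. `PR:A2ASQ1-2` and `MGI:MGI:87961` come from this ticket; `PR:A2ASQ1-1` is an invented sibling isoform used only to make the ambiguity visible.

```python
from collections import defaultdict

# Hypothetical GPI-style lookup: isoform CURIE -> parent gene CURIE.
ISOFORM_TO_GENE = {
    "PR:A2ASQ1-1": "MGI:MGI:87961",  # invented for illustration
    "PR:A2ASQ1-2": "MGI:MGI:87961",  # from the GAF lines in this ticket
}


def invert(mapping):
    """Invert a many-to-one mapping into gene -> list of isoforms."""
    inverted = defaultdict(list)
    for isoform, gene in mapping.items():
        inverted[gene].append(isoform)
    return dict(inverted)


GENE_TO_ISOFORMS = invert(ISOFORM_TO_GENE)
```

Going from isoform to gene is a single lookup, but a GPAD row whose subject is only `MGI:MGI:87961` maps back to every isoform of that gene, so the specific isoform that was annotated cannot be recovered from the GPI alone.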

sierra-moxon commented 9 months ago

From the managers call: action: pull the column 17 isoform ID into the subject ID field of the GPAD (for all species, yes).
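A rough sketch of what that action means for a single row, assuming GAF input with 17 columns and a 12-column GPAD 2.0-style output. This is not the ontobio writer: `gaf_to_gpad` is a hypothetical helper, and the lookup tables contain only the relation, evidence, and extension terms that appear in this ticket.

```python
import re

# Term mappings limited to the values seen in this ticket.
RELATIONS = {"enables": "RO:0002327", "located_in": "RO:0001025"}
EVIDENCE = {"RCA": "ECO:0000245", "HDA": "ECO:0007005"}
EXT_RELATIONS = {"occurs_in": "BFO:0000066", "part_of": "BFO:0000050"}


def gaf_to_gpad(gaf_cols):
    """Convert one GAF row (list of 17 strings) to a GPAD-style row."""
    db, obj_id = gaf_cols[0], gaf_cols[1]
    qualifiers = gaf_cols[3].split("|")
    form_id = gaf_cols[16].strip()

    # The action item: column 17, when present, becomes the GPAD subject.
    subject = form_id if form_id else f"{db}:{obj_id}"

    negation = "NOT" if "NOT" in qualifiers else ""
    relation = RELATIONS.get(qualifiers[-1], qualifiers[-1])
    evidence = EVIDENCE.get(gaf_cols[6], gaf_cols[6])

    # 20180725 -> 2018-07-25
    d = gaf_cols[13]
    date = f"{d[:4]}-{d[4:6]}-{d[6:]}"

    # occurs_in(UBERON:0002048) -> BFO:0000066(UBERON:0002048)
    extension = re.sub(
        r"([A-Za-z_]+)\(",
        lambda m: EXT_RELATIONS.get(m.group(1), m.group(1)) + "(",
        gaf_cols[15],
    )

    # Subject, negation, relation, GO term, reference, evidence, with/from,
    # interacting taxon, date, assigned by, extensions, properties.
    return [subject, negation, relation, gaf_cols[4], gaf_cols[5], evidence,
            gaf_cols[7], "", date, gaf_cols[14], extension, ""]


GAF_ROW = [
    "MGI", "MGI:87961", "Agrn", "enables", "GO:0005201", "PMID:22159717",
    "RCA", "", "F", "Agrin", "", "protein", "taxon:10090", "20180725",
    "BHF-UCL", "occurs_in(UBERON:0002048)", "PR:A2ASQ1-2",
]
GPAD_ROW = gaf_to_gpad(GAF_ROW)
```

Unlike a GPI-checked version, this sketch takes column 17 unconditionally; whether the writer should also confirm the ID exists in the GPI is the question raised at the top of this ticket.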

discussion: original protein2GO annotation:

SMoxon@SMoxon-M82 GOA_taxon_10090_ISOFORM % grep "A2ASQ1-2" goa_mouse_isoform.gaf 
UniProtKB   A2ASQ1  Agrn    enables GO:0005201  PMID:22159717   RCA     F   Agrin   Agrn|Agrin  protein taxon:10090 20180725    BHF-UCL occurs_in(UBERON:0002048)   UniProtKB:A2ASQ1-2
UniProtKB   A2ASQ1  Agrn    located_in  GO:0062023  PMID:22159717   HDA     C   Agrin   Agrn|Agrin  protein taxon:10090 20180725    BHF-UCL part_of(UBERON:0002048) UniProtKB:A2ASQ1-2
SMoxon@SMoxon-M82 GOA_taxon_10090_ISOFORM % 

Sierra is already checking the GPI in the original conversion from UniProt -> MGI + PR.

Some discussion of the utility of taking TrEMBL annotations that can't be mapped to GCRP (MODs handle protein -> GCRP mapping).

sierra-moxon commented 9 months ago

Examples from Lori (PAINT still have UniProt as the subject). Sierra does not validate PAINT annotations against the GPI.

UniProtKB:P03985
UniProtKB:P18530

This should be a separate issue somewhere else, not in this "upstream remainders" project. @LiNiMGI will handle this :)

pgaudet commented 8 months ago

@sierra-moxon What is the action here?

sierra-moxon commented 8 months ago

Hi @pgaudet - this was the action for this ticket, and the fix is in the works in my branch of ontobio used for this project:

From the managers call: action: pull the column 17 isoform ID into the subject ID field of the GPAD (for all species, yes).

The UniProt comment "(PAINT still have UniProt as the subject)" was another topic that came up tangentially while we were talking about this ticket, so I captured it as an aside. I do not know where this is going to be handled, but Li will hopefully have filed it as an issue elsewhere.

LiNiMGI commented 8 months ago

At the moment MGI filters those PAINT annotations out (PAINT still have UniProt as the subject). @pgaudet we can talk more tomorrow and see what we can do about it. According to Dustin, the conversion from UniProt to MGI for PAINT annotations is done on the PANTHER side and is tied to data (Reference proteome/QfO releases) that is likely older than what is current in MGI.

sierra-moxon commented 8 months ago

New file generated with fixes: http://skyhook.berkeleybop.org/full-issue-325-gopreprocess/annotations/mgi.gpad.gz

SMoxon@SMoxon-M82 pipeline % grep "A2ASQ1-2" ~/Downloads/mgi_0318_24.gpad
PR:A2ASQ1-2 RO:0002327 GO:0005201 PMID:22159717 ECO:0000245 2024-03-18 BHF-UCL BFO:0000066(UBERON:0002048)
PR:A2ASQ1-2 RO:0001025 GO:0062023 PMID:22159717 ECO:0007005 2024-03-18 BHF-UCL BFO:0000050(UBERON:0002048)