Closed sierra-moxon closed 8 months ago
@kltm is wondering whether this use case can be covered by an extension or a property. To summarize: the isoform is found in the GAF but not in the GPAD. This is a total loss of information in the GPAD, because the GPI file has the mapping but not the reverse mapping for any given annotation (i.e. many-to-one).
from managers call: action: pull in the column 17 isoform id, into the subject id field of the GPAD. (for all species. yes.)
discussion: original protein2GO annotation:
SMoxon@SMoxon-M82 GOA_taxon_10090_ISOFORM % grep "A2ASQ1-2" goa_mouse_isoform.gaf
UniProtKB A2ASQ1 Agrn enables GO:0005201 PMID:22159717 RCA F Agrin Agrn|Agrin protein taxon:10090 20180725 BHF-UCL occurs_in(UBERON:0002048) UniProtKB:A2ASQ1-2
UniProtKB A2ASQ1 Agrn located_in GO:0062023 PMID:22159717 HDA C Agrin Agrn|Agrin protein taxon:10090 20180725 BHF-UCL part_of(UBERON:0002048) UniProtKB:A2ASQ1-2
SMoxon@SMoxon-M82 GOA_taxon_10090_ISOFORM %
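A minimal sketch of the action from the managers call, applied to GAF lines like the two above: if column 17 carries an isoform ID, use it as the GPAD subject, otherwise fall back to columns 1 and 2. The function name `gpad_subject_from_gaf` is hypothetical and this is not ontobio code. It assumes tab-separated GAF 2.x input and does no prefix rewriting (for example UniProtKB: to PR:).

```python
def gpad_subject_from_gaf(gaf_line: str) -> str:
    """Return the GPAD subject ID for one tab-separated GAF 2.x annotation line.

    Hypothetical helper for illustration only.
    """
    cols = gaf_line.rstrip("\n").split("\t")
    # Column 17 (index 16) holds the isoform / gene product form ID.
    # It may be missing entirely or present but empty.
    isoform_id = cols[16].strip() if len(cols) > 16 else ""
    if isoform_id:
        return isoform_id
    # No isoform: build the subject from DB (col 1) and DB Object ID (col 2).
    return f"{cols[0]}:{cols[1]}"


example = "\t".join([
    "UniProtKB", "A2ASQ1", "Agrn", "enables", "GO:0005201", "PMID:22159717",
    "RCA", "", "F", "Agrin", "Agrn|Agrin", "protein", "taxon:10090",
    "20180725", "BHF-UCL", "occurs_in(UBERON:0002048)", "UniProtKB:A2ASQ1-2",
])
print(gpad_subject_from_gaf(example))
```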
Sierra is already checking the GPI in the original conversion from UniProt -> MGI + PR. There was some discussion of the utility of taking TrEMBL annotations that can't be mapped to GCRP (MODs handle the protein -> GCRP mapping).
examples from Lori (PAINT annotations still have UniProt as the subject) - Sierra does not handle PAINT validation against the GPI.
UniProtKB:P03985
UniProtKB:P18530
this should be another issue somewhere else, not in this "upstream remainders" project. - @LiNiMGI will handle this :)
@sierra-moxon What is the action here?
Hi @pgaudet - this was the action for this ticket, and the fix is in the works in my branch of ontobio used for this project:
from managers call: action: pull in the column 17 isoform id, into the subject id field of the GPAD. (for all species. yes.)
The UniProt comment "(PAINT still have UniProt as the subject)" was another topic that came up tangentially while we were talking about this ticket, so I captured it as an aside. I do not know where this is going to be handled, but Li will hopefully have filed it as an issue elsewhere.
At the moment MGI filters out those PAINT annotations (the ones that still have UniProt as the subject). @pgaudet we can talk more tomorrow and see what we can do about it. According to Dustin, the conversion from UniProt to MGI for PAINT annotations is done on the PANTHER side and is likely tied to older data (Reference proteome/QfO releases) than what is current in MGI.
new file generated with fixes: http://skyhook.berkeleybop.org/full-issue-325-gopreprocess/annotations/mgi.gpad.gz
SMoxon@SMoxon-M82 pipeline % grep "A2ASQ1-2" ~/Downloads/mgi_0318_24.gpad
PR:A2ASQ1-2 RO:0002327 GO:0005201 PMID:22159717 ECO:0000245 2024-03-18 BHF-UCL BFO:0000066(UBERON:0002048)
PR:A2ASQ1-2 RO:0001025 GO:0062023 PMID:22159717 ECO:0007005 2024-03-18 BHF-UCL BFO:0000050(UBERON:0002048)
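For readers comparing this GPAD output with the GAF lines quoted earlier in the thread, the label-to-identifier correspondences can be read off the two pairs of rows. The sketch below records only the pairs visible in this thread. The dictionary and function names are hypothetical, and a real converter may consult more than the evidence code when choosing an ECO term.

```python
import re

# Only the correspondences visible in the GAF/GPAD rows of this thread.
QUALIFIER_TO_RELATION = {"enables": "RO:0002327", "located_in": "RO:0001025"}
EVIDENCE_TO_ECO = {"RCA": "ECO:0000245", "HDA": "ECO:0007005"}
EXTENSION_TO_RELATION = {"occurs_in": "BFO:0000066", "part_of": "BFO:0000050"}

# Matches a single extension such as "occurs_in(UBERON:0002048)".
_EXTENSION = re.compile(r"^(\w+)\((.+)\)$")


def convert_extension(extension: str) -> str:
    """Rewrite a GAF-style extension label to its relation ID (hypothetical helper)."""
    match = _EXTENSION.match(extension)
    if match is None:
        raise ValueError(f"unrecognised extension: {extension!r}")
    label, target = match.groups()
    return f"{EXTENSION_TO_RELATION[label]}({target})"


print(convert_extension("occurs_in(UBERON:0002048)"))
```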
From Li's comments here: https://github.com/geneontology/go-site/issues/2043
It looks like we should add code to ontobio so that we can produce GPADs with protein subject identifiers when GAF annotations have isoform identifiers that match ids in the associated GPI file. This is a medium-ish change to the GPAD association writer and would result in GPAD and GAF annotation files with different subjects.
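A rough sketch of that rule, kept separate from ontobio's actual writer: use the isoform ID as the subject only when it appears in the associated GPI file, otherwise keep the gene-level subject. The function names are hypothetical, the IDs in the example are made up, and the sketch assumes a GPI layout whose first tab-separated column is the full CURIE and that isoform IDs are already in the same prefix form as the GPI entries.

```python
from typing import Iterable, Set


def load_gpi_ids(gpi_lines: Iterable[str]) -> Set[str]:
    """Collect entity IDs from GPI lines (assumes the first column is a CURIE)."""
    ids = set()
    for line in gpi_lines:
        # Skip blank lines and "!" header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        ids.add(line.split("\t", 1)[0].strip())
    return ids


def choose_subject(gene_id: str, isoform_id: str, gpi_ids: Set[str]) -> str:
    """Prefer the isoform ID as the GPAD subject only if the GPI knows it."""
    if isoform_id and isoform_id in gpi_ids:
        return isoform_id
    return gene_id


# Made-up IDs, for illustration only.
gpi_ids = load_gpi_ids([
    "!gpi-version: 2.0",
    "EX:gene1\tsymbol1",
    "EX:gene1-2\tsymbol1 isoform 2",
])
print(choose_subject("EX:gene1", "EX:gene1-2", gpi_ids))
print(choose_subject("EX:gene1", "EX:gene1-9", gpi_ids))
```

With this rule the GPAD and GAF files can legitimately carry different subjects for the same annotation, which is the behaviour change flagged above.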
tagging @kltm
snipped from the other ticket for ease of understanding:
========================================== in the GAF file I produce:
Per David above:
this is what I think it looks like in the final GPAD:
Thanks Sierra @sierra-moxon, the GAF file looks good! My understanding is that when there is isoform information in the GAF (last column of the GAF), we will use the isoform PR:ID as the DB Object ID in the first column of the GPAD. Am I right, @ukemi? So the final GPAD will look like: PR:A2ASQ1-2 RO:0002327 GO:0005201 PMID:22159717 ECO:0000245