geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

MGI GAF includes PRO isoforms as the main annotatable object #2291

Closed cmungall closed 4 months ago

cmungall commented 7 months ago
✗ getgaf mgi | egrep  '\tprotein\t'  | tail
PR  Q9Z2D6-2    mMECP2/iso:2    located_in  GO:0005634  PMID:18334558   IDA     C   methyl-CpG-binding protein 2 isoform 2 (mouse)  mMECP2/iso:2|MECP2b (mouse)|MECP2e1 (mouse) protein taxon:10090 20120126    MGI
PR  Q9Z2D6-1    mMECP2/iso:1    located_in  GO:0005634  PMID:18334558   IDA     C   methyl-CpG-binding protein 2 isoform 1 (mouse)  mMECP2/iso:1|MECP2a (mouse)|MECP2e2 (mouse) protein taxon:10090 20101014    MGI
PR  Q9Z2D6-2    mMECP2/iso:2    located_in  GO:0005634  PMID:15034150   IDA     C   methyl-CpG-binding protein 2 isoform 2 (mouse)  mMECP2/iso:2|MECP2b (mouse)|MECP2e1 (mouse) protein taxon:10090 20120126    MGI
PR  Q9Z2D6-2    mMECP2/iso:2    acts_upstream_of_or_within  GO:0006641  PMID:30137367   IMP MGI:MGI:5584016 P   methyl-CpG-binding protein 2 isoform 2 (mouse)  mMECP2/iso:2|MECP2b (mouse)|MECP2e1 (mouse) protein taxon:10090 20200304    MGI
PR  Q9Z2D6-1    mMECP2/iso:1    part_of GO:0000792  PMID:18334558   IDA     C   methyl-CpG-binding protein 2 isoform 1 (mouse)  mMECP2/iso:1|MECP2a (mouse)|MECP2e2 (mouse) protein taxon:10090 20201009    MGI
PR  Q9Z2D6-2    mMECP2/iso:2    enables GO:0003682  PMID:18334558   IDA     F   methyl-CpG-binding protein 2 isoform 2 (mouse)  mMECP2/iso:2|MECP2b (mouse)|MECP2e1 (mouse) protein taxon:10090 20120126    MGI
PR  Q08460-4    mKCNMA1/iso:4   enables GO:0015269  PMID:16081418   IDA     F   calcium-activated potassium channel subunit alpha-1 isoform 4 (mouse)   mKCNMA1/iso:4|calcium-activated potassium channel subunit alpha-1 isoform STREX-1 (mouse)   protein taxon:10090 20080813    MGI
PR  Q08460-1    mKCNMA1/iso:1   enables GO:0015269  PMID:16081418   IDA     F   calcium-activated potassium channel subunit alpha-1 isoform 1 (mouse)   mKCNMA1/iso:1   protein taxon:10090 20140331    MGI
PR  Q08460-1    mKCNMA1/iso:1   enables GO:0005249  PMID:16081418   IDA     F   calcium-activated potassium channel subunit alpha-1 isoform 1 (mouse)   mKCNMA1/iso:1   protein taxon:10090 20140331    MGI
PR  Q08460-4    mKCNMA1/iso:4   enables GO:0005249  PMID:16081418   IDA     F   calcium-activated potassium channel subunit alpha-1 isoform 4 (mouse)   mKCNMA1/iso:4|calcium-activated potassium channel subunit alpha-1 isoform STREX-1 (mouse)   protein taxon:10090 20080811    MGI

These should not be here. Instead, the annotation should be rolled up to the gene (e.g. Kcnma1 in the case of Q08460), and the isoform should go in column 17.

Here is an example of how it should be done:

MGI MGI:1926176 Gas2l1  located_in  GO:0005737  MGI:MGI:3052497|PMID:12584248   IDA     C   growth arrest-specific 2 like 1 4930500E24Rik|D0Jmb1|GAR22|TU-71.1  protein_coding_gene taxon:10090 20120921    UniProt part_of(CL:0000586)|part_of(CL:0000017) UniProtKB:Q8JZP9-2

Note the behavior is correct for all UniProt-sourced annotations and incorrect for MGI-sourced ones (which use PRO).

I assume that this is a matter of the roll-up code needing to deal with both PRO isoforms and UniProt isoforms. The situation is inherently confusing because in many cases the local IDs are the same (e.g. Q08460-4), yet the actual prefixed IDs are arbitrarily different (PR vs UniProtKB).
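To make the intended roll-up concrete, here is a minimal sketch of rewriting a single tab-separated GAF row. It is not the pipeline's actual code: `Gene`, `roll_up`, and the `parents` mapping are hypothetical names, and whether column 17 should carry the PR CURIE as-is or a UniProtKB equivalent is left open here (the sketch simply keeps the original prefix).

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """Parent gene fields needed to rewrite a GAF row (hypothetical helper)."""
    db: str           # column 1, e.g. "MGI"
    object_id: str    # column 2, e.g. "MGI:1926176"
    symbol: str       # column 3
    name: str         # column 10
    synonyms: str     # column 11, pipe-separated
    object_type: str  # column 12, e.g. "protein_coding_gene"


def roll_up(gaf_line: str, parents: dict[str, Gene]) -> str:
    """Move an isoform from columns 1-2 to column 17 and put its parent gene
    in the main annotatable-object columns. Rows whose isoform is not in
    `parents` are returned unchanged."""
    line = gaf_line.rstrip("\n")
    cols = line.split("\t")
    cols += [""] * (17 - len(cols))      # pad to the 17 GAF columns
    isoform = f"{cols[0]}:{cols[1]}"     # e.g. "PR:Q08460-4"
    gene = parents.get(isoform)
    if gene is None:
        return line
    cols[0], cols[1], cols[2] = gene.db, gene.object_id, gene.symbol
    cols[9], cols[10], cols[11] = gene.name, gene.synonyms, gene.object_type
    cols[16] = isoform                   # column 17: gene product form ID
    return "\t".join(cols)
```

The qualifier, GO term, reference, evidence, and date columns are left untouched, so only the annotated object changes.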

sierra-moxon commented 7 months ago

I think this ticket has the history of these annotations: https://github.com/geneontology/gopreprocess/issues/36 @LiNiMGI @ukemi, can you please comment here as well?

sierra-moxon commented 7 months ago

Spoke with Chris and Seth: this appears in the final GAF file in the main pipeline (current.geneontology.org), but it can be fixed in the MGI remainders pipeline, to be deployed with the similar fix in ontobio for distributing Noctua extensions in the GPAD.

LiNiMGI commented 7 months ago

@sierra-moxon I thought we moved the PRO isoforms to column 1 in the GPAD file, as they are the actual gene products for those annotations.

kltm commented 7 months ago

Talking to @pgaudet we believe these are the current options:

  1. Fix the pipeline so that it uses MGI's GPI to look up 'parent' proteins to inform Column 1&2 in GAF and correctly create the GAF (we will need Sierra’s fixed code for this)
    Pros: GAF-compliant
    Cons: More work in the pipeline
  2. Drop any line from GAF that doesn't start with MGI during noctua GAF validation
    Pros: GAF-compliant, definitely easiest
    Cons: lose annotations; “temporary”
  3. Wait for the database to be available
    Pros: cleanest/easiest solution
    Cons: looonger

sierra-moxon commented 7 months ago

@kltm - for 1, this would mean the GAF would be less specific than the GPAD (GPAD would have isoform rows, GAF would have gene rows, and probably fewer rows as a result). Is this ok? Also, I wanted to be sure to note that these annotations are from Noctua.

LiNiMGI commented 7 months ago

All the example annotations in Chris's comments are from Noctua. @sierra-moxon

For 1, since the isoform information will go in column 17 of GAF, GAF will not be less specific than the GPAD, right? @sierra-moxon

kltm commented 7 months ago

In discussion with @pgaudet, the GAF version, as it's derived from the GPAD sans GPI, is not quite legal. Doing 1 requires ontobio changes (from Sierra's branch) and pipeline changes. As we don't necessarily want to hold things up, but still want to give legal output, 2 may be the interim solution.
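For reference, option 2 amounts to a one-pass filter over the GAF. A rough sketch, assuming tab-separated rows and `!`-prefixed header lines; `keep_mgi_rows` is a made-up name, not an existing validation hook:

```python
from typing import Iterable, Iterator


def keep_mgi_rows(lines: Iterable[str], db: str = "MGI") -> Iterator[str]:
    """Yield GAF header lines plus rows whose column 1 equals `db`.

    Everything else (e.g. rows starting with PR) is dropped, which is
    exactly the annotation loss listed under the cons of option 2.
    """
    for line in lines:
        if line.startswith("!") or line.split("\t", 1)[0] == db:
            yield line
```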

ukemi commented 6 months ago

My 2 cents.

Talking it over with @LiNiMGI, I think that, if at all possible, 1 is the best solution. This will allow the GAF to contain the proteoform-specific annotations that are normalized in the GPI and GPAD files.

Solution 2 is a possibility, but how long is temporary? During that time, proteoform-level annotations will not be available to our users. Questions to ask wrt whether this is viable:

  1. How long is temporary?
  2. How many users consume/care about proteoform-level annotations?
  3. How many proteoform-level annotations are represented by gene-level annotations for the same GO information (gene-term-evidence type)?

If these values are all small, then 2 might be ok. Solution 3 doesn't seem feasible unless the database gets really high priority.
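Question 3 could be estimated directly from the current GAF. A minimal sketch, assuming an isoform-to-gene mapping is already available; `count_covered` and `isoform_to_gene` are hypothetical names, and the mapping values must use the same column 1 + column 2 form as the gene rows (e.g. `MGI:MGI:1926176`):

```python
from typing import Iterable


def count_covered(gaf_lines: Iterable[str],
                  isoform_to_gene: dict[str, str]) -> tuple[int, int]:
    """Return (covered, total) for isoform-level annotations.

    An isoform row counts as covered when a gene-level row exists for its
    parent gene with the same GO term (column 5) and evidence (column 7).
    """
    gene_keys = set()
    isoform_keys = []
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        obj = f"{cols[0]}:{cols[1]}"
        term, evidence = cols[4], cols[6]
        if obj in isoform_to_gene:
            isoform_keys.append((isoform_to_gene[obj], term, evidence))
        else:
            gene_keys.add((obj, term, evidence))
    covered = sum(1 for key in isoform_keys if key in gene_keys)
    return covered, len(isoform_keys)
```

The difference between the two numbers is the count of annotations that would be lost outright under solution 2.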

cmungall commented 6 months ago

Here is a standalone script that takes a GAF and a GPI as input and replaces isoforms:

https://chat.openai.com/share/e559b20c-180c-4212-a84a-e73ed6e955da
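Independent of the linked script, the GPI side of such an approach could look roughly like the sketch below. It assumes a GPI 1.2 layout, where column 8 is Parent_Object_ID; a GPI 2.0 file has a different column order and would need different indices. `load_parent_map` is a hypothetical name.

```python
from typing import Iterable


def load_parent_map(gpi_lines: Iterable[str]) -> dict[str, str]:
    """Map each entity that has a parent to that parent's ID.

    Assumes GPI 1.2 columns: DB (1), DB_Object_ID (2), ..., Parent_Object_ID (8).
    Keys are built as "DB:DB_Object_ID", e.g. "PR:Q08460-4".
    """
    parents = {}
    for line in gpi_lines:
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8 or not cols[7]:
            continue  # no parent recorded (e.g. the gene rows themselves)
        parents[f"{cols[0]}:{cols[1]}"] = cols[7]
    return parents
```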

pgaudet commented 6 months ago

@cmungall

Are you suggesting a 4th option in which we'd manually 'repair' the GAF after the pipeline products are generated?

pgaudet commented 4 months ago

@sierra-moxon @kltm has there been any movement on this?

pgaudet commented 4 months ago

@kltm confirms that this is not fixed.

kltm commented 4 months ago

After discussion, @sierra-moxon will explore the "ontobio" route and we'll try for the fix there. If that proves difficult, we'll do it as a post-process.