Closed cmungall closed 4 months ago
Here's the history of these annotations in ticket form, I think: https://github.com/geneontology/gopreprocess/issues/36 @LiNiMGI @ukemi can you please comment here as well?
Spoke with Chris and Seth: this appears in the final GAF file in the main pipeline (current.geneontology.org), but it can be fixed in the MGI remainders pipeline and deployed together with the similar fix in ontobio for distributing Noctua extensions in the GPAD.
@sierra-moxon I thought we moved the PRO isoform to column 1 in the GPAD file, as the isoforms are the actual gene products for those annotations.
Talking to @pgaudet we believe these are the current options:
@kltm - for 1, this would mean the GAF would be less specific than the GPAD (GPAD would have isoform rows, GAF would have Gene rows - and probably fewer rows as a result - is this ok?). Also, I wanted to be sure to note that these annotations are from Noctua.
All the example annotations in Chris's comments are from Noctua. @sierra-moxon
For 1, since the isoform information will go in column 17 of GAF, GAF will not be less specific than the GPAD, right? @sierra-moxon
In discussion with @pgaudet, the GAF version, as it's derived from the GPAD sans GPI, is not quite legal. Doing 1 requires ontobio changes (from Sierra's branch) and pipeline changes. As we don't necessarily want to hold things up, but still want to give legal output, 2 may be the interim solution.
My 2 cents.
Talking it over with @LiNiMGI. I think if at all possible 1 is the best solution. This will allow the GAF to contain the proteoform-specific annotations that are normalized in the GPI and GPAD files.
Solution 2 is a possibility, but how long is temporary? During that time, proteoform-level annotations will not be available to our users. Questions to ask wrt whether this is viable:
If these values are all small, then 2 might be ok. Solution 3 doesn't seem feasible unless the database gets really high priority.
Here is a standalone script that takes a GAF and a GPI as input and replaces the isoforms:
https://chat.openai.com/share/e559b20c-180c-4212-a84a-e73ed6e955da
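For anyone who can't open the link, here is a minimal sketch of what a standalone pass like that could look like. It is not the linked script, only an illustration under assumptions: the GPI is GPI 1.2 (DB in column 1, ID in column 2, symbol in column 3, parent object ID in column 8 as a full `DB:ID` string), the GAF is GAF 2.x, and the function names `build_index` and `roll_up` are made up for this example. It only rewrites columns 1, 2, 3 and 17; name, synonym and type columns are left untouched, which a real fix would probably also need to handle.

```python
from typing import Dict, Iterable, Tuple


def build_index(gpi_lines: Iterable[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Read GPI 1.2 rows (assumed layout) into two lookups keyed by 'DB:ID'."""
    parent_of: Dict[str, str] = {}   # isoform CURIE -> parent gene CURIE
    symbol_of: Dict[str, str] = {}   # any CURIE -> symbol
    for line in gpi_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\n").split("\t")
        curie = f"{cols[0]}:{cols[1]}"
        symbol_of[curie] = cols[2]
        if len(cols) > 7 and cols[7]:
            parent_of[curie] = cols[7]  # column 8: Parent_Object_ID
    return parent_of, symbol_of


def roll_up(gaf_line: str, parent_of: Dict[str, str], symbol_of: Dict[str, str]) -> str:
    """Move an isoform in columns 1-2 to column 17 and put its parent gene in columns 1-3."""
    if gaf_line.startswith("!"):
        return gaf_line
    cols = gaf_line.rstrip("\n").split("\t")
    cols += [""] * (17 - len(cols))          # pad short rows out to 17 columns
    curie = f"{cols[0]}:{cols[1]}"
    parent = parent_of.get(curie)
    if parent is None:
        return gaf_line                       # not a known isoform: leave the row alone
    db, _, local_id = parent.partition(":")   # 'MGI:MGI:1' -> ('MGI', 'MGI:1')
    cols[0], cols[1] = db, local_id
    cols[2] = symbol_of.get(parent, cols[2])
    cols[16] = curie                          # column 17: Gene Product Form ID
    return "\t".join(cols) + ("\n" if gaf_line.endswith("\n") else "")
```

Usage would be roughly `parent_of, symbol_of = build_index(open(gpi_path))` followed by writing `roll_up(line, parent_of, symbol_of)` for each GAF line. Rows whose column 1-2 ID has no parent in the GPI pass through unchanged, so the pass is safe to run over gene-level rows too.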
@cmungall
Are you suggesting a 4th option in which we'd manually 'repair' the GAF after the pipeline products are generated?
@sierra-moxon @kltm has there been any movement on this?
@kltm confirms that this is not fixed.
After discussion, @sierra-moxon will explore the "ontobio" route and we'll try for the fix there. If that proves difficult, we'll do it as a post-processing step.
These should not be here. Instead, the annotation should be rolled up to the gene (e.g. Kcnma1 in the case of Q08460), and the isoform should go in column 17.
Here is an example of how it should be done:
Note the behavior is correct for all UniProt-sourced annotations and incorrect for MGI-sourced ones (which use PRO).
I assume that this is a matter of the roll-up code needing to deal with both PRO isoforms and UniProt isoforms. The situation is inherently confusing because in many cases the local IDs are the same (e.g. Q08460-4) yet the actual prefixed ID is arbitrarily different.
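To make the prefix point concrete, here is a small sketch of how roll-up code might recognise the same isoform under either prefix. It assumes the two prefixes in play are `PR:` and `UniProtKB:`, and the helper name `isoform_key` is hypothetical, not something in ontobio. The accession pattern is the one UniProt publishes for its accessions, followed by a `-N` isoform suffix.

```python
import re
from typing import Optional, Tuple

# UniProt accession pattern plus a '-<number>' isoform suffix,
# accepted under either the PR or the UniProtKB prefix (assumed prefixes).
_ISOFORM = re.compile(
    r"^(?:PR|UniProtKB):"
    r"([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
    r"-(\d+)$"
)


def isoform_key(curie: str) -> Optional[Tuple[str, int]]:
    """Return (base accession, isoform number), or None if not an isoform ID."""
    m = _ISOFORM.match(curie)
    if m is None:
        return None
    return m.group(1), int(m.group(2))
```

With this, `isoform_key("PR:Q08460-4")` and `isoform_key("UniProtKB:Q08460-4")` give the same key, while numeric PRO IDs and plain gene-level accessions return `None` and can be left alone.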