Closed RLovering closed 3 months ago
Pascale: the PAINT annotations have probably already been removed, as they are not on the human UniProt record (P0DPQ6, which is correct), but they are on the mouse record (A0A2R8VHR8) in Protein2GO. Thanks, Ruth
@RLovering I can't find P0DPQ6 nor A0A2R8VHR8 in Panther.
@dustine32 Can you confirm these genes are no longer in any family ?
Thanks, Pascale
@pgaudet Right, neither of those gene IDs are in PANTHER.
Thanks @dustine32
Thanks for confirming, Pascale and Dustin. I assume then that the export from PAINT will remove the existing Protein2GO IBA annotations associated with mouse A0A2R8VHR8; the human P0DPQ6 does not have any IBAs.
Best
Ruth
Yes it should. Let's check at the next release.
@dustine32 I still find IBAs to MGI:109247 (A0A2R8VHR8) in AMiGO - http://amigo.geneontology.org/amigo/gene_product/MGI:MGI:109247
Can you please check why these are still there? I don't find this ID in any family in Pantherdb.
Thanks, Pascale
P35639 is also annotated to DNA-binding transcription factor activity Source: UniProtKB
I see one manual annotation to DNA-binding transcription factor activity via an IMP. All the others to this term appear to come from IBA and ISO loads.
@pgaudet @hdrabkin Sorry, my comment above isn't the full story. We do have MGI:109247 in PANTHER and PAINT for both versions 15.0 and 16.0, but it's mapped only to P35639, and not to A0A2R8VHR8. This is because there's a filter step in the library build that ensures PANTHER only keeps 1:1 MOD->UniProtKB mappings given situations like this (from the mouse UP000000589_10090.idmapping file from Reference Proteome):
P35639 MGI MGI:109247
D3YX14 MGI MGI:109247
A0A2R8VHR8 MGI MGI:109247
When there are multiple mappings like this, we currently just choose the longest sequence, and it looks like P35639 won. PAINT IBAs are then exported to GO using the MOD ID MGI:MGI:109247 as DB object ID:
MGI MGI:109247 Ddit3 enables GO:0001228 PMID:21873635 IBA PANTHER:PTN002703959|UniProtKB:P35638 F DNA damage-inducible transcript 3 protein UniProtKB:P35639|PTN000423842 protein taxon:10090 20190522 GO_Central
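The filter step described above could be sketched roughly as follows. This is a minimal illustration only, not PANTHER's actual library-build code: the function name `pick_one_to_one` and the `seq_lengths` input are hypothetical, and the length given for D3YX14 is a placeholder (the other two lengths are counted from the sequences quoted later in this thread).

```python
from collections import defaultdict


def pick_one_to_one(idmapping_rows, seq_lengths):
    """Keep one UniProtKB accession per MOD ID by choosing the longest sequence.

    idmapping_rows: iterable of (accession, db_name, mod_id) tuples.
    seq_lengths: dict of accession -> sequence length (hypothetical input).
    """
    by_mod = defaultdict(list)
    for accession, db_name, mod_id in idmapping_rows:
        if db_name == "MGI":
            by_mod[mod_id].append(accession)
    # For each MOD ID, the accession with the longest sequence wins.
    return {
        mod_id: max(accessions, key=lambda acc: seq_lengths[acc])
        for mod_id, accessions in by_mod.items()
    }


rows = [
    ("P35639", "MGI", "MGI:109247"),
    ("D3YX14", "MGI", "MGI:109247"),
    ("A0A2R8VHR8", "MGI", "MGI:109247"),
]
# 168 and 34 are counted from the sequences in this thread;
# the D3YX14 value is a placeholder, not a real length.
seq_lengths = {"P35639": 168, "D3YX14": 100, "A0A2R8VHR8": 34}

chosen = pick_one_to_one(rows, seq_lengths)
print(chosen)
```

Under these assumptions the mapping for MGI:109247 collapses to P35639 alone, which matches the behaviour described for PANTHER 15.0 and 16.0.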
My guess is that elsewhere outside of PANTHER/PAINT (like in QuickGO), these one-to-multiple mappings are used to display the IBAs associated to the multiple UniProt IDs. I can't seem to search annotations by MOD ID (MGI:109247) in QuickGO like I can in AmiGO; I have to specify the UniProt ID.
@RLovering Are you observing these IBAs to A0A2R8VHR8 only in QuickGO or elsewhere?
@hdrabkin Is MGI getting the IBA annotations directly from GO central?
@alexsign I see A0A2R8VHR8 annotated to GO:0001228 in P2GO; where is this annotation coming from?
Thanks, Pascale
@pgaudet see below
REFG | A0A2R8VHR8 | Ddit3 | enables | GO:0001228 | DNA-binding transcription activator activity, RNA polymerase II-specific | ECO:0000318(IBA) | ECO:0000318 | (IBA) | | PMID:21873635 | | PANTHER:PTN002703959|UniPro... ECO:0000318 | (IBA)
MGI | A0A2R8VHR8 | Ddit3 | enables | GO:0001228 | DNA-binding transcription activator activity, RNA polymerase II-specific | ECO:0000266(ISO) | ECO:0000266 | (ISO) | | GO_REF:0000096 | | UniProtKB:P35638 | | | | | | ECO:0000266 | (ISO)
I think Alex has provided a better response than I can
I can provide exact URL for files from MGI and PAINT(REFG) downloaded last weekend in GOA if needed.
@pgaudet Our pipeline for PAINT: we download the pipeline mgi file (which has the PAINT annotations from GO Central) and then pull out the PAINT annotations (based on the reference). We started doing that some time ago when the PAINT annotation link was broken. HOWEVER, NOTE: the PAINT annotations are stripped when GO takes our gaf/gpad, and the GO Central ones get inserted. We only do our PAINT load to provide these for display in MGI.
Note when we load the mouse GOA annotations, we do NOT take the PAINT annotations: only experimental.
@alexsign @dustine32 A0A2R8VHR8 is not in PAINT/Panther - I cannot figure out where the discrepancy is coming from.
@pgaudet the issue originates here: http://www.informatics.jax.org/downloads/reports/MRK_SwissProt_TrEMBL.rpt
MGI:109247 Ddit3 O DNA-damage inducible transcript 3 74.5 10 A0A2R8VHR8 P35639 Q3V405 D3YX14
This mapping goes into UniProt, and from there it's used in GOA and Protein2GO. All annotations published for MGI:109247 will be assigned to A0A2R8VHR8, P35639, Q3V405 and D3YX14.
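The fan-out described here can be illustrated with a short sketch. It assumes the report row is tab-delimited with the UniProt accessions space-separated in the last column (the layout is inferred from the row quoted in this thread, not checked against the file), and the helper names `parse_rpt_row` and `fan_out` are hypothetical; this is not GOA's actual pipeline code.

```python
def parse_rpt_row(row):
    """Split one report row into (mgi_id, [accessions]).

    Assumes tab-delimited columns with space-separated accessions last.
    """
    fields = row.rstrip("\n").split("\t")
    return fields[0], fields[-1].split()


def fan_out(annotation, mgi_to_accessions):
    """Copy a gene-level annotation onto every accession mapped to its MGI ID."""
    accessions = mgi_to_accessions[annotation["db_object_id"]]
    return [dict(annotation, db_object_id=acc) for acc in accessions]


row = ("MGI:109247\tDdit3\tO\tDNA-damage inducible transcript 3\t74.5\t10\t"
       "A0A2R8VHR8 P35639 Q3V405 D3YX14")
mgi_id, accessions = parse_rpt_row(row)
mgi_to_accessions = {mgi_id: accessions}

# A gene-level IBA, reduced to the fields needed for the illustration.
iba = {"db_object_id": "MGI:109247", "go_id": "GO:0001228", "evidence": "IBA"}
propagated = fan_out(iba, mgi_to_accessions)
for annotation in propagated:
    print(annotation["db_object_id"], annotation["go_id"], annotation["evidence"])
```

With this kind of mapping, a single annotation on the MGI gene ID ends up on all four accessions, including A0A2R8VHR8.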
Ha !! ok thanks
@hdrabkin Can the TrEMBL entry be removed from the MGI file?
Hi @pgaudet Not sure what you mean about removing the TrEMBL entry. We annotate to the gene. GOA/UniProt then assigns the gene annotation to all associated SwissProt and TrEMBL entities (we wish they only did this for the UniProt entry that is associated with the Ref proteome set). So are we saying that the TrEMBL entry should not be associated with the MGI ID? Since we get them from UniProt, I think the assignment at UniProt needs to be fixed?
MGI | A0A2R8VHR8 | Ddit3 | enables | GO:0001228 | DNA-binding transcription activator activity, RNA polymerase II-specific | ECO:0000266(ISO) | ECO:0000266 | (ISO) | | GO_REF:0000096 | | UniProtKB:P35638 | | | | | | ECO:0000266 | (ISO)
This comes from the ISO load: it says that UniProtKB:P35638 has a manual annotation to GO:0001228 | DNA-binding transcription activator activity. If that went away, so would our ISO annotation.
but you should not remove that annotation to P35638, this one is right
Discussing with @hdrabkin
Actually MGI considers A0A2R8VHR8 and P35639 the same gene, since it's another ORF from the same gene.
GOA loads the MGI annotations to both entries because they have mappings to the same MGI ID.
This leads to propagation of the IBA annotations.
This is a general issue for products encoded by the same gene (as these 2 ORFs) that are considered a single gene in MODs.
This would not be an issue if GOA used the mouse ref proteome set; A0A2R8VHR8 is not in that file. I'll send the link in a moment. here https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000000589/
A0A2R8VHR8 is NOT in the ref proteome file.
Additionally: I cannot find an MGI manual annotation to GO:0001228 | DNA-binding transcription activator activity, RNA polymerase II-specific; I see one IBA and one ISO.
@hdrabkin actually it is part of reference proteome https://www.uniprot.org/uniprot/?query=organism%3A%22Mus+musculus+%28Mouse%29+%5B10090%5D%22+proteome%3Aup000000589+id%3AA0A2R8VHR8&sort=score
Why is it not included in the file https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000000589/UP000000589_10090.fasta.gz ?
it is part of https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000000589/UP000000589_10090_additional.fasta.gz I can ask UniProt team why
here is the message from UniProt: "Protein FASTA files (.fasta and _additional.fasta)
These files, composed of canonical and additional sequences, are non-redundant FASTA sets for the sequences of each reference proteome. The additional set contains isoform/variant sequences for a given gene, and its FASTA header indicates the corresponding canonical sequence ("Isoform of ..."). The FASTA format is the standard UniProtKB format"
From Mary Dolan (@mdolanme) in our group, who helped in establishing the mouse ref proteome
"It is not in the UP000000589_10090.fasta file, which I was told some time ago is the source for the reference proteome.
..
The other file UP000000589_10090.gene2acc lists other id associations. The rule here is that the first one listed is "the" reference proteome id. For this case
MGI:109247 P35639 MGI:109247 <<<<<<<
MGI:109247 A0A2R8VHR8 MGI:109247
MGI:109247 D3YX14 MGI:109247"
So if GOA used the UP000000589_10090.fasta file to decide which IDs should receive annotation transfers, A0A2R8VHR8 would not have gotten the annotations?
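The "first one listed" rule quoted above, and the restriction being proposed, might look something like this. It is a sketch under assumptions: the three-column row layout is taken from the gene2acc lines quoted in this thread, and `reference_accessions` and `restrict_to_reference` are hypothetical names, not part of any existing GOA or MGI tool.

```python
def reference_accessions(gene2acc_rows):
    """Map each gene ID to the first accession listed for it.

    Per the rule quoted above, the first accession listed for a gene
    is treated as "the" reference proteome ID.
    """
    reference = {}
    for gene_id, accession, _mod_id in gene2acc_rows:
        reference.setdefault(gene_id, accession)  # keep only the first one seen
    return reference


def restrict_to_reference(gene_id, candidate_accessions, reference):
    """Keep only the candidate accession that is the gene's reference ID."""
    return [acc for acc in candidate_accessions if acc == reference.get(gene_id)]


gene2acc_rows = [
    ("MGI:109247", "P35639", "MGI:109247"),
    ("MGI:109247", "A0A2R8VHR8", "MGI:109247"),
    ("MGI:109247", "D3YX14", "MGI:109247"),
]
reference = reference_accessions(gene2acc_rows)

# Accessions that the MGI report maps to this gene.
candidates = ["A0A2R8VHR8", "P35639", "Q3V405", "D3YX14"]
targets = restrict_to_reference("MGI:109247", candidates, reference)
print(targets)
```

Restricting transfers this way would leave P35639 as the only target for MGI:109247, at the cost of dropping annotations from every non-reference accession.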
What is confusing here is that @pgaudet says she can't find A0A2R8VHR8 in the PAINT file (or is that not true)?
Maybe I should check in with Maria Martin?
I assume we don't load '*_additional.fasta'? @dustine32 is this right?
@hdrabkin Right, A0A2R8VHR8 isn't in the PAINT file (gene_association.paint_mgi.gaf) since, for mouse, we export IBAs to the MGI ID instead of UniProtKB.
@pgaudet Right, we don't load any '*_additional.fasta' files during the library build.
@pgaudet @hdrabkin I actually don't understand anymore what the issue is and why GOA has to remove anything.
2. P35639 and A0A2R8VHR8 are both part of the Gene Centric Reference Proteome in UniProt, where P35639 is the canonical entry and A0A2R8VHR8 is the isoform.
goa_mouse.gpa/gpi/gaf has P35639; goa_mouse_isoform.gpa/gpi/gaf has A0A2R8VHR8.
I'm open to any reasonable changes to this process if approved by GO and UniProt consortium
I thought the issue was that somehow A0A2R8VHR8 was getting an annotation to DNA-binding transcription activator activity, RNA polymerase II-specific from somewhere (PAINT?) but shouldn't ?
This is what happens:
P35639 has Isoform AltDDIT3 (identifier: A0A2R8VHR8-1) listed as an external entry, because it differs significantly. What would help would be to avoid mapping to external isoforms. I don't know who can do that - can UniProt supply the list of external isoforms, and can these be filtered from the MGI mappings?
Is it possible that the following two can be derived from the same gene transcript:
sp|A0A2R8VHR8|DT3UO_MOUSE DDIT3 upstream open reading frame protein OS=Mus musculus OX=10090 GN=Ddit3 PE=2 SV=1
MLKMSGWQRQSQNNSRNLRRECSRRKCIFIHHHT
sp|P35639|DDIT3_MOUSE DNA damage-inducible transcript 3 protein OS=Mus musculus OX=10090 GN=Ddit3 PE=1 SV=1
MAAESLPFTLETVSSWELEAWYEDLQEVLSSDEIGGTYISSPGNEEEESKTFTTLDPASL
AWLTEEPGPTEVTRTSQSPRSPDSSQSSMAQEEEEEEQGRTRKRKQSGQCPARPGKQRMK
EKEQENERKVAQLAEENERLKQEIERLTREVETTRRALIDRMVSLHQA
alignment looks totally off
well it's an 'upstream orf' - I looked at the paper quickly (PMID:21285359), it seems like that ORF is specifically translated under conditions of stress. So it does come from the same transcript, but the start site is not normally used.
@pgaudet "MGI states that it corresponds to both P35639 and A0A2R8VHR8, although their sequences have nothing in common ; just derived from the same gene/transcript" We do not state anything other than that these are the UniProt IDs that map to this gene object. There are other instances of two unrelated SwissProt IDs that are coded by one gene and do not have similar sequences. Quite often these have totally different functions, but they are still derived from the same gene.
"GOA applies the PAINT annotations to all entries corresponding to MGI:109247, ie both P35639 and A0A2R8VHR" So if GOA used the 'UP000000589_10090.fasta' reference proteome file (again, we were told this is the source for the reference proteome) instead of the MRK_SwissProt_TrEMBL.rpt, then I believe this won't happen?
@hdrabkin this will remove 106716 mouse annotations to the reference proteome proteins, which are vital for the proteomics research community. I think GOC should decide on that, not me.
@RLovering The original annotation seems to come from NTNU UniProt: P35638 + PMID:22065586
Should this be removed?
Thanks, Pascale
Hi Pascale
I think it has all been done. The dbTF annotations are now only associated with P35638 + PMID:22065586 - which is correct. DDIT3 is a dbTF, at least it is listed on our dbTF list as a dbTF. Or are you saying DDIT3 is not a dbTF?
There are annotations by MGI to A0A2R8VHR8 which is the Product of the upstream open reading frame (uORF) of DDIT3/CHOP.
A0A2R8VHR8 should not be annotated as a dbTF - to get this changed you need to contact MGI and you need to see if PAINT is creating any annotations to A0A2R8VHR8 based on DNA seq similarity to P35638 - which should not be present.
Ruth
OK, great !
Hi
I think that this protein is causing problems with the annotations. The transcript appears to encode 2 proteins: one is a dbTF; the other is described in UniProt as: Product of the upstream open reading frame (uORF) of DDIT3/CHOP that is specifically produced in absence of stress, thereby preventing translation of downstream stress effector DDIT3/CHOP
Please could the annotations associated with UniProtKB:A0A2R8VHR8 ensembl:ENSMUSP00000155363 MGI:109247, Ddit3
be reviewed and changed so that these are associated with the dbTF DDIT3 Mouse ID:P35639
(note the human ID:P35638 is DDIT3)
There are too many annotations to list here
Thanks
Ruth