geneontology / go-annotation

This repository hosts the tracker for issues pertaining to GO annotations.
BSD 3-Clause "New" or "Revised" License

Product of the upstream open reading frame (uORF) of DDIT3/CHOP not a dbTF #3282

Closed RLovering closed 3 months ago

RLovering commented 4 years ago

Hi

I think that this protein is causing problems with the annotations. The transcript appears to encode two proteins: one is a dbTF, and the other is described in UniProt as: "Product of the upstream open reading frame (uORF) of DDIT3/CHOP that is specifically produced in absence of stress, thereby preventing translation of downstream stress effector DDIT3/CHOP"

Please could the annotations associated with UniProtKB:A0A2R8VHR8 ensembl:ENSMUSP00000155363 MGI:109247, Ddit3

be reviewed and changed so that these are associated with the dbTF DDIT3 Mouse ID:P35639

(note the human ID:P35638 is DDIT3)

There are too many annotations to list here

Thanks

Ruth

RLovering commented 4 years ago

Pascale: probably the paint annotations have already been removed, as they are not on the human UniProt record (P0DPQ6 which is correct) but they are on the mouse (A0A2R8VHR8) in Protein2GO. Thanks Ruth

pgaudet commented 4 years ago

@RLovering I can't find P0DPQ6 nor A0A2R8VHR8 in Panther.

@dustine32 Can you confirm these genes are no longer in any family ?

Thanks, Pascale

dustine32 commented 4 years ago

@pgaudet Right, neither of those gene IDs is in PANTHER.

pgaudet commented 4 years ago

Thanks @dustine32

RLovering commented 4 years ago

Thanks for confirming Pascale and Dustin. I assume then that the export from PAINT will remove the existing Protein2GO IBA annotations associated with mouse A0A2R8VHR8, the human P0DPQ6 does not have any IBAs

Best

Ruth

pgaudet commented 4 years ago

Yes it should. Let's check at the next release.

pgaudet commented 3 years ago

@dustine32 I still find IBAs to MGI:109247 (A0A2R8VHR8) in AMiGO - http://amigo.geneontology.org/amigo/gene_product/MGI:MGI:109247

Can you please check why these are still there? I don't find this ID in any family in Pantherdb.

Thanks, Pascale

pgaudet commented 3 years ago

P35639 is also annotated to DNA-binding transcription factor activity Source: UniProtKB

hdrabkin commented 3 years ago

I see one manual annotation to DNA-binding transcription factor activity via an IMP. All the others to this term appear to come from IBA and ISO loads.

dustine32 commented 3 years ago

@pgaudet @hdrabkin Sorry, my comment above isn't the full story. We do have MGI:109247 in PANTHER and PAINT for both versions 15.0 and 16.0 but it's mapped only to P35639, and not to A0A2R8VHR8. This is because there's a filter step in the library build that ensures PANTHER only keeps 1:1 MOD->UniProtKB mappings given situations like this (from the mouse UP000000589_10090.idmapping file from Reference Proteome):

P35639  MGI MGI:109247
D3YX14  MGI MGI:109247
A0A2R8VHR8  MGI MGI:109247

When there are multiple mappings like this, we currently just choose the longest sequence, and it looks like P35639 won. PAINT IBAs are then exported to GO using the MOD ID MGI:MGI:109247 as DB object ID:

MGI MGI:109247  Ddit3   enables GO:0001228  PMID:21873635   IBA PANTHER:PTN002703959|UniProtKB:P35638   F   DNA damage-inducible transcript 3 protein   UniProtKB:P35639|PTN000423842   protein taxon:10090 20190522    GO_Central
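The "keep the longest sequence" filter described above can be sketched roughly as follows. This is only an illustration, not PANTHER's actual library-build code: the function name `pick_longest` and the in-memory data layout are assumptions. The lengths are counted from the two sequences quoted further down in this thread; D3YX14 is left out because its sequence is not given here.

```python
from collections import defaultdict


def pick_longest(mappings, seq_lengths):
    """Keep one UniProtKB accession per MOD ID: the one with the longest sequence.

    mappings:    iterable of (uniprot_accession, mod_id) pairs
    seq_lengths: dict of uniprot_accession -> sequence length in residues
    (Both structures are hypothetical stand-ins for the real build inputs.)
    """
    by_mod = defaultdict(list)
    for acc, mod_id in mappings:
        by_mod[mod_id].append(acc)
    # For each MOD ID, choose the accession with the greatest length.
    return {
        mod_id: max(accs, key=lambda a: seq_lengths[a])
        for mod_id, accs in by_mod.items()
    }


mappings = [
    ("P35639", "MGI:109247"),
    ("A0A2R8VHR8", "MGI:109247"),
]
# Residue counts of the two sequences quoted in this thread.
seq_lengths = {"P35639": 168, "A0A2R8VHR8": 34}

chosen = pick_longest(mappings, seq_lengths)
print(chosen)
```

With these inputs the 168-residue P35639 is kept for MGI:109247 and the 34-residue uORF product is dropped, which matches the outcome described above.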

My guess is that elsewhere outside of PANTHER/PAINT (like in QuickGO), these one-to-multiple mappings are used to display the IBAs associated to the multiple UniProt IDs. I can't seem to search annotations by MOD ID (MGI:109247) in QuickGO like I can in AmiGO, I have to specify the UniProt ID.

@RLovering Are you observing these IBAs to A0A2R8VHR8 only in QuickGO or elsewhere?

pgaudet commented 3 years ago

@hdrabkin Is MGI getting the IBA annotations directly from GO central?

pgaudet commented 3 years ago

@alexsign I see A0A2R8VHR8 annotated to GO:0001228 in P2GO; where is this annotation coming from?

Thanks, Pascale

alexsign commented 3 years ago

@pgaudet see below

REFG | A0A2R8VHR8 | Ddit3 | enables | GO:0001228 | DNA-binding transcription activator activity, RNA polymerase II-specific | ECO:0000318(IBA) | ECO:0000318 | (IBA) |   | PMID:21873635 |   | PANTHER:PTN002703959|UniPro... ECO:0000318 | (IBA)

MGI | A0A2R8VHR8 | Ddit3 | enables | GO:0001228 | DNA-binding transcription activator activity, RNA polymerase II-specific | ECO:0000266(ISO) | ECO:0000266 | (ISO) |   | GO_REF:0000096 |   | UniProtKB:P35638 |   |   |   |   |   |   ECO:0000266 | (ISO)

RLovering commented 3 years ago

I think Alex has provided a better response than I can

alexsign commented 3 years ago

I can provide exact URL for files from MGI and PAINT(REFG) downloaded last weekend in GOA if needed.

hdrabkin commented 3 years ago

@pgaudet Our pipeline for PAINT: we download the pipeline MGI file (which has the PAINT annotations from GO Central) and then pull out the PAINT annotations (based on the reference). We started doing that some time ago when the PAINT annotation link was broken. HOWEVER, NOTE: the PAINT annotations are stripped when GO takes our gaf/gpad, and the GO Central ones get inserted. We only do our PAINT load to provide these for display in MGI.

Note when we load the mouse GOA annotations, we do NOT take the PAINT annotations: only experimental.

pgaudet commented 3 years ago

@alexsign @dustine32 A0A2R8VHR8 is not in PAINT/Panther - I cannot figure out where the discrepancy is coming from.

alexsign commented 3 years ago

@pgaudet the issue originates here: http://www.informatics.jax.org/downloads/reports/MRK_SwissProt_TrEMBL.rpt

MGI:109247 Ddit3 O DNA-damage inducible transcript 3 74.5 10 A0A2R8VHR8 P35639 Q3V405 D3YX14

this mapping goes into UniProt and from there it's used in GOA and Protein2GO. all annotations published for MGI:109247 will be assigned to A0A2R8VHR8, P35639, Q3V405 and D3YX14
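To make the fan-out concrete, here is a minimal sketch of how one gene-level annotation ends up on every mapped accession. It is not GOA's actual code. The column layout (tab-separated, with the accessions space-separated in the last column) is an assumption inferred from the row quoted above, and `parse_rpt_row`, `fan_out`, and the dictionary keys are hypothetical names.

```python
def parse_rpt_row(line):
    """Split one report row into (mgi_id, [uniprot accessions]).

    Assumes tab-separated columns, with the accessions space-separated
    in the last column (layout inferred from the quoted row).
    """
    fields = line.rstrip("\n").split("\t")
    return fields[0], fields[-1].split()


def fan_out(annotation, accessions):
    """Copy one gene-level annotation onto every mapped accession."""
    return [{**annotation, "db_object_id": acc} for acc in accessions]


row = (
    "MGI:109247\tDdit3\tO\tDNA-damage inducible transcript 3"
    "\t74.5\t10\tA0A2R8VHR8 P35639 Q3V405 D3YX14"
)
mgi_id, accessions = parse_rpt_row(row)

# A simplified annotation record; real GAF/GPAD rows carry more fields.
annotation = {"db_object_id": mgi_id, "go_id": "GO:0001228", "evidence": "IBA"}
copies = fan_out(annotation, accessions)

for c in copies:
    print(c["db_object_id"], c["go_id"], c["evidence"])
```

The single annotation on MGI:109247 becomes four records, one per accession, including the uORF product A0A2R8VHR8.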

pgaudet commented 3 years ago

Ha !! ok thanks

@hdrabkin Can the TrEMBL entry be removed from the MGI file?

hdrabkin commented 3 years ago

Hi @pgaudet Not sure what you mean about removing the TrEMBL. We annotate to the gene. GOA/UniProt then assigns the gene annotation to all associated SwissProt and TrEMBL entities (we wish they would only do this for the UniProt entry that is associated with the Ref proteome set). So are we saying that the TrEMBL entry should not be associated with the MGI ID? Since we get them from UniProt, I think the assignment at UniProt needs to be fixed?

hdrabkin commented 3 years ago

MGI | A0A2R8VHR8 | Ddit3 | enables | GO:0001228 | DNA-binding transcription activator activity, RNA polymerase II-specific | ECO:0000266(ISO) | ECO:0000266 | (ISO) | | GO_REF:0000096 | | UniProtKB:P35638 | | | | | | ECO:0000266 | (ISO)

This comes from the ISO load: it says that UniProtKB:P35638 has a manual annotation to GO:0001228 | DNA-binding transcription activator activity. If that went away, so would our ISO annotation.

pgaudet commented 3 years ago

but you should not remove that annotation to P35638, this one is right

pgaudet commented 3 years ago

Discussing with @hdrabkin

hdrabkin commented 3 years ago

This would not be an issue if GOA used the mouse ref proteome set; A0A2R8VHR8 is not in that file. I'll send the link in a moment. here https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000000589/

A0A2R8VHR8 is NOT in the ref proteome file.

Additionally: I cannot find an MGI manual annotation to GO:0001228 | DNA-binding transcription activator activity, RNA polymerase II-specific; I see one IBA and one ISO.

alexsign commented 3 years ago

@hdrabkin actually it is part of reference proteome https://www.uniprot.org/uniprot/?query=organism%3A%22Mus+musculus+%28Mouse%29+%5B10090%5D%22+proteome%3Aup000000589+id%3AA0A2R8VHR8&sort=score

hdrabkin commented 3 years ago

Why is it not included in the file https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000000589/UP000000589_10090.fasta.gz ?

alexsign commented 3 years ago

it is part of https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000000589/UP000000589_10090_additional.fasta.gz I can ask UniProt team why

alexsign commented 3 years ago

here is the message from UniProt: "Protein FASTA files (.fasta and _additional.fasta)

These files, composed of canonical and additional sequences, are non-redundant FASTA sets for the sequences of each reference proteome. The additional set contains isoform/variant sequences for a given gene, and its FASTA header indicates the corresponding canonical sequence ("Isoform of ..."). The FASTA format is the standard UniProtKB format"
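A small sketch of how one might check which of the two FASTA sets an accession belongs to. The helper names `fasta_accessions` and `classify` are hypothetical. The headers below are the ones quoted in this thread with a leading `>` added, and the sequences are truncated stand-ins rather than the real file contents.

```python
import io


def fasta_accessions(handle):
    """Collect accessions from UniProt-style FASTA headers (>db|ACCESSION|NAME ...)."""
    accs = set()
    for line in handle:
        if line.startswith(">"):
            parts = line[1:].split("|")
            if len(parts) >= 3:
                accs.add(parts[1])
    return accs


def classify(acc, canonical, additional):
    """Report whether an accession is in the canonical set, the additional set, or neither."""
    if acc in canonical:
        return "canonical"
    if acc in additional:
        return "additional"
    return "absent"


# In-memory stand-ins for the .fasta and _additional.fasta files
# (sequences truncated for brevity).
canonical_fasta = io.StringIO(
    ">sp|P35639|DDIT3_MOUSE DNA damage-inducible transcript 3 protein "
    "OS=Mus musculus OX=10090 GN=Ddit3 PE=1 SV=1\nMAAESLPF\n"
)
additional_fasta = io.StringIO(
    ">sp|A0A2R8VHR8|DT3UO_MOUSE DDIT3 upstream open reading frame protein "
    "OS=Mus musculus OX=10090 GN=Ddit3 PE=2 SV=1\nMLKMSGWQ\n"
)

canonical = fasta_accessions(canonical_fasta)
additional = fasta_accessions(additional_fasta)

for acc in ("P35639", "A0A2R8VHR8", "D3YX14"):
    print(acc, classify(acc, canonical, additional))
```

For the real downloads, the same function should work on a handle opened with `gzip.open(path, "rt")`, since both files are distributed as `.fasta.gz`.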

hdrabkin commented 3 years ago

From Mary Dolan (@mdolanme) in our group, who helped in establishing the mouse ref proteome

"It is not in the UP000000589_10090.fasta file, which I was told some time ago is the source for the reference proteome. .. The other file UP000000589_10090.gene2acc lists other id associations. The rule here is that the first one listed is "the" reference proteome id. For this case

MGI:109247 P35639 MGI:109247 <<<<<<<
MGI:109247 A0A2R8VHR8 MGI:109247
MGI:109247 D3YX14 MGI:109247"

So if GOA used the UP000000589_10090.fasta file to decide which IDs should receive annotation transfers, A0A2R8VHR8 would not have gotten the annotations?
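The "first one listed" rule quoted from Mary Dolan, and the restriction being asked about, could be sketched like this. It is a hypothetical illustration only: the `(gene_id, accession)` pair layout is an assumption about how the gene2acc rows would be read, and `reference_ids` and `restrict_to_reference` are made-up names, not part of any GOA or MGI pipeline.

```python
def reference_ids(rows):
    """Return gene_id -> first accession listed for that gene.

    rows: iterable of (gene_id, accession) pairs in file order.
    """
    first = {}
    for gene_id, acc in rows:
        # setdefault keeps the first accession seen and ignores later ones.
        first.setdefault(gene_id, acc)
    return first


def restrict_to_reference(mapped, ref):
    """Keep only the reference accession among each gene's mapped accessions."""
    return {
        gene: [acc for acc in accs if acc == ref.get(gene)]
        for gene, accs in mapped.items()
    }


# Rows in the order quoted above.
rows = [
    ("MGI:109247", "P35639"),
    ("MGI:109247", "A0A2R8VHR8"),
    ("MGI:109247", "D3YX14"),
]
# Accessions mapped to the gene in MRK_SwissProt_TrEMBL.rpt, as quoted earlier.
mapped = {"MGI:109247": ["A0A2R8VHR8", "P35639", "Q3V405", "D3YX14"]}

ref = reference_ids(rows)
targets = restrict_to_reference(mapped, ref)
print(targets)
```

Under this rule only P35639 would remain a transfer target for MGI:109247; every other mapped accession, including A0A2R8VHR8, would no longer receive the gene-level annotations.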

What is confusing here is that @pgaudet says she can't find A0A2R8VHR8 in the PAINT file (or is that not true)?

Maybe I should check in with Maria Martin?

pgaudet commented 3 years ago

I assume we don't load '*_additional.fasta' ? @dustine32 is this right?

dustine32 commented 3 years ago

@hdrabkin Right, A0A2R8VHR8 isn't in the PAINT file (gene_association.paint_mgi.gaf) since, for mouse, we export IBAs to the MGI ID instead of UniProtKB.
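To check this kind of claim against a GAF file, one can collect the object IDs that carry IBA annotations. The sketch below relies on the standard GAF column order (DB in column 1, DB Object ID in column 2, evidence code in column 7). The function name `iba_object_ids` is hypothetical, and the sample row is the export line quoted earlier in this thread, rebuilt with tab separators.

```python
def iba_object_ids(gaf_lines):
    """Return the set of 'DB:ID' strings that have at least one IBA annotation."""
    ids = set()
    for line in gaf_lines:
        if line.startswith("!"):  # GAF header and comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 15:  # skip malformed or truncated rows
            continue
        if cols[6] == "IBA":
            ids.add(f"{cols[0]}:{cols[1]}")
    return ids


sample = [
    "!gaf-version: 2.2",
    "\t".join([
        "MGI", "MGI:109247", "Ddit3", "enables", "GO:0001228",
        "PMID:21873635", "IBA", "PANTHER:PTN002703959|UniProtKB:P35638",
        "F", "DNA damage-inducible transcript 3 protein",
        "UniProtKB:P35639|PTN000423842", "protein", "taxon:10090",
        "20190522", "GO_Central", "", "",
    ]),
]

ids = iba_object_ids(sample)
print(ids)
print("UniProtKB:A0A2R8VHR8" in ids)
```

Because the mouse IBAs are keyed on the MGI ID, a search for the UniProtKB accession in column 2 finds nothing, even though downstream mappings later attach the annotation to it.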

@pgaudet Right, we don't load any '*_additional.fasta' files during the library build.

alexsign commented 3 years ago

@pgaudet @hdrabkin I actually don't understand anymore what the issue is and why GOA has to remove anything.

1. The MGI ID MGI:109247 is linked to P35639 and A0A2R8VHR8. From the comments above: "Discussing with @hdrabkin Actually MGI considers A0A2R8VHR8 and P35639 the same gene, since it's another ORF from the same gene. GOA loads the MGI annotations to both entries because they have mappings to the same MGI ID."

2. P35639 and A0A2R8VHR8 are both part of the Gene Centric Reference Proteome in UniProt, where P35639 is the Canonical and A0A2R8VHR8 is the Isoform.

3. GOA does make annotations to both and publishes them in separate files:

goa_mouse.gpa/gpi/gaf has P35639
goa_mouse_isoform.gpa/gpi/gaf has A0A2R8VHR8

I'm open to any reasonable changes to this process if approved by GO and UniProt consortium

hdrabkin commented 3 years ago

I thought the issue was that somehow A0A2R8VHR8 was getting an annotation to DNA-binding transcription activator activity, RNA polymerase II-specific from somewhere (PAINT?) but shouldn't ?

pgaudet commented 3 years ago

This is what happens:

P35639 has Isoform AltDDIT3 (identifier: A0A2R8VHR8-1) listed as an external entry, because it differs significantly. What would help would be to avoid mapping to external isoforms. I don't know who can do that - can UniProt supply the list of external isoforms, and these be filtered from MGI mappings?

alexsign commented 3 years ago

Is it possible that the following two can be derived from the same gene transcript?

sp|A0A2R8VHR8|DT3UO_MOUSE DDIT3 upstream open reading frame protein OS=Mus musculus OX=10090 GN=Ddit3 PE=2 SV=1
MLKMSGWQRQSQNNSRNLRRECSRRKCIFIHHHT

sp|P35639|DDIT3_MOUSE DNA damage-inducible transcript 3 protein OS=Mus musculus OX=10090 GN=Ddit3 PE=1 SV=1
MAAESLPFTLETVSSWELEAWYEDLQEVLSSDEIGGTYISSPGNEEEESKTFTTLDPASL
AWLTEEPGPTEVTRTSQSPRSPDSSQSSMAQEEEEEEQGRTRKRKQSGQCPARPGKQRMK
EKEQENERKVAQLAEENERLKQEIERLTREVETTRRALIDRMVSLHQA

alignment looks totally off

pgaudet commented 3 years ago

well it's an 'upstream orf' - I looked at the paper quickly (PMID:21285359), it seems like that ORF is specifically translated under conditions of stress. So it does come from the same transcript, but the start site is not normally used.

hdrabkin commented 3 years ago

@pgaudet "MGI states that it corresponds to both P35639 and A0A2R8VHR8, although their sequences have nothing in common; just derived from the same gene/transcript" We do not state anything other than that these are the UniProt IDs that map to this gene object. There are other instances of two unrelated SwissProt IDs that are coded by one gene and do not have similar sequences. Quite often these have totally different functions, but they are still derived from the same gene.

hdrabkin commented 3 years ago

"GOA applies the PAINT annotations to all entries corresponding to MGI:109247, ie both P35639 and A0A2R8VHR" So if GOA used the 'UP000000589_10090.fasta' reference proteome file (again, we were told this is the source for the reference proteome) instead of MRK_SwissProt_TrEMBL.rpt, then I believe this won't happen?

alexsign commented 3 years ago

@hdrabkin this will remove 106716 mouse annotations to the reference proteome proteins, which are vital for the proteomics research community. I think GOC should decide on that, not me.

pgaudet commented 3 months ago

@RLovering The original annotation seems to come from NTNU UniProt: P35638 + PMID:22065586

Should this be removed?

Thanks, Pascale

RLovering commented 3 months ago

Hi Pascale

I think it has all been done. The dbTF annotations are now only associated with P35638 + PMID:22065586 - which is correct. DDIT3 is a dbTF, at least it is listed on our dbTF list as a dbTF. Or are you saying DDIT3 is not a dbTF?

There are annotations by MGI to A0A2R8VHR8 which is the Product of the upstream open reading frame (uORF) of DDIT3/CHOP.

A0A2R8VHR8 should not be annotated as a dbTF. To get this changed you need to contact MGI, and you need to see if PAINT is creating any annotations to A0A2R8VHR8 based on DNA seq similarity to P35638, which should not be present.

Ruth

pgaudet commented 3 months ago

OK, great !