geneontology / go-annotation

This repository hosts the tracker for issues pertaining to GO annotations.

human entries missing from AMiGO CDKN2A Q8N726 etc SLC35A4, MOCS2 #2082

Closed ValWood closed 5 years ago

ValWood commented 6 years ago

This is a human protein entry in Swiss-Prot https://www.uniprot.org/uniprot/Q8N726 but it is missing from AmiGO.

integrated into UniProtKB/Swiss-Prot: October 11, 2005

ValWood commented 6 years ago

This is in UniProt but with a different ID: http://amigo.geneontology.org/amigo/gene_product/UniProtKB:P42771

@Antonialock this is why I recommended that we use the QforO set. It aligns with AMiGO

https://www.uniprot.org/uniprot/P42771

ValWood commented 6 years ago

Caution: The proteins described here are encoded by the gene CDKN2A, but are completely unrelated in terms of sequence and function to tumor suppressor ARF (AC Q8N726), which is encoded by the same gene. (Curated)

So should both of these entries be in AMiGO?

I think the problem may be caused here because both have the same gene name?

ValWood commented 6 years ago

@Antonialock For this one do not worry that CDKN2A is not present. We have the other isoform. There won't be many cases like this. https://www.uniprot.org/uniprot/Q8N726

Antonialock commented 6 years ago

Proteins in the Uniprot reference proteome set but not in AmiGO

P0C7T4 Q6P5R6 L0R6Q1 P58400 P58401 Q9HDB5 Q13765 Q8N726 O43687

P62861 Q8NDA8 O96033 P30443 P30459 P16188 P16189 P01892 P04439 P10316 Q09160 P13746 P30447 P05534 P18462 P30450 P30512 P10314 P30453 P30455 P30456 P16190 P30457 P30498 Q29718 P01889 P30461 P30466 P03989 Q95365 Q04826 P30480 P30486 P10319 Q29940 P30460 P30462 P18463 P30475 P30479 P30481 P30483 P30484 P30485 P30487 P30490 P30491 P30488 P30495 P18465 Q29836 P30492 Q31610 P30464 P30685 P18464 P30493 Q9TNN7 P30510 P30501 Q29963 P30508 Q07000 Q29865 P30499 P04222 P30504 P10321 P30505 Q29960 Q95IE3 Q9GIY3 P13761 P01912 Q9TQE0 Q5Y7A7 Q29974 P04229 Q30167 P13760 Q30134 P20039
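A comparison like the one behind this list can be sketched as a set difference. This is a minimal illustration, not the script actually used: the function name is hypothetical, only a few accessions from the thread are shown, and which of them AmiGO holds is assumed here for the sake of the example.

```python
def missing_from_amigo(reference_proteome, amigo):
    """Return accessions in the reference set that are absent from AmiGO, sorted."""
    return sorted(set(reference_proteome) - set(amigo))


# Accessions mentioned in this thread; the AmiGO membership below is an assumption.
reference = ["Q8N726", "P42771", "O96033", "O96007", "L0R6Q1"]
amigo = ["P42771", "O96007"]

print(missing_from_amigo(reference, amigo))  # ['L0R6Q1', 'O96033', 'Q8N726']
```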

ValWood commented 6 years ago

P0C7T4 not in the April reference proteome

Q6P5R6 RPL22L1 60S ribosomal protein L22-like 1: in AmiGO http://amigo.geneontology.org/amigo/gene_product/UniProtKB:Q6P5R6 BUT it does not have "cytoplasmic translation" as the UniProt entry does https://www.uniprot.org/uniprot/Q6P5R6 60S ribosomal protein L22-like 1 ONLY has 60S ribosomal protein L22-like 1

P58400 NRXN1 has another entry in Swiss-Prot https://www.uniprot.org/uniprot/Q9ULB1 so one isoform is in AmiGO. Ask UniProt why (there are many isoforms, so it isn't clear why these 2 have separate entries).

P58401 NRXN2 Q9HDB5 NRXN3

Q13765 NACA: AmiGO uses a different entry for NACA http://amigo.geneontology.org/amigo/gene_product/UniProtKB:E9PAV3 but it only has IEA annotation. The manual annotation is on https://www.uniprot.org/uniprot/Q13765 but it is a strange collection of annotations for Nascent polypeptide-associated complex subunit alpha.

Q8N726 CDKN2A

O43687 AKAP7 A-kinase anchor protein 7 isoforms alpha and beta (and gamma?) https://www.uniprot.org/uniprot/O43687 In GO it is represented by http://amigo.geneontology.org/amigo/gene_product/UniProtKB:Q9P0M2 Isoforms of the same protein are often annotated in two different entries if their sequences differ significantly.

P62861 FAU is used in UniProt to describe both proteins synthesized by the ubiquitin/ribosomal fusion, but they are represented as separate UniProt entries, hence one gets 'lost' when the redundant entries are removed from the reference proteome (this is not in the QFO set).

Q8NDA8 O96033 P30443 P30459 P16188 P16189 P01892 P04439

ValWood commented 6 years ago

@Antonialock in all of the cases I looked at there are 2 isoforms or 2 proteins (i.e. tandem fusion) described by a single gene name. I suspect that when the reference set is defined, if duplicate names exist, the longest isoform is retained and the other one is removed. This works OK for isoforms but not in other edge cases (tandem fusions, or the same gene name applied to 2 adjacent but completely independent loci, sometimes opposite strand/nested). For our purposes we can ignore these and say we used the entry in the QFO set. It's only a small number and it isn't really up to us to sort this out.
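The suspected filtering rule (one entry per gene name, longest sequence wins) can be sketched as follows. This only illustrates the behaviour guessed at above, not the actual reference proteome pipeline; the `Entry` type, the function names, and all accessions and lengths are hypothetical.

```python
from collections import namedtuple

Entry = namedtuple("Entry", ["accession", "gene_name", "length"])


def keep_longest_per_gene(entries):
    """Keep one entry per gene name, preferring the longest sequence."""
    best = {}
    for entry in entries:
        current = best.get(entry.gene_name)
        if current is None or entry.length > current.length:
            best[entry.gene_name] = entry
    return list(best.values())


def dropped(entries):
    """Accessions removed by the filter, in input order."""
    kept = {e.accession for e in keep_longest_per_gene(entries)}
    return [e.accession for e in entries if e.accession not in kept]


# Hypothetical records: ACC1/ACC2 are isoforms of one protein,
# ACC3/ACC4 are unrelated proteins that happen to share a gene name.
entries = [
    Entry("ACC1", "GENE_A", 300),
    Entry("ACC2", "GENE_A", 250),
    Entry("ACC3", "GENE_B", 188),
    Entry("ACC4", "GENE_B", 88),
]

print(dropped(entries))  # ['ACC2', 'ACC4']
```

Dropping ACC2 is harmless if it really is an isoform, but ACC4 is lost for the same reason even though it is a different protein, which is the edge case described above.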

ValWood commented 6 years ago

Added to Representing complete proteomes in GO agenda item http://wiki.geneontology.org/index.php?title=2018_Montreal_GOC_Meeting_Agenda&action=edit&section=16

ValWood commented 6 years ago

@Antonialock why isn't O95278 in this list? Is the list complete, or is this a different 'type' of issue?

ValWood commented 6 years ago

O95278 https://www.uniprot.org/uniprot/O95278 describes 8 laforin isoforms and is clearly the canonical laforin entry.

GO uses https://www.uniprot.org/uniprot/B3EWF7#sequences which is described as laforin isoform 9. MGI have annotated this one using https://www.uniprot.org/citations/20453062 but it is not clear to me that this annotation really belongs on this isoform. It seems to be about canonical laforin (the sequences don't really have anything in common although they are the same locus).

Is there any evidence at all that isoform9 is functional? or is it a pathological variant (i.e wrong frame translation)? If so how are these handled?

ValWood commented 6 years ago

anyway with the current situation we lose all of this annotation in GO https://www.uniprot.org/uniprot/O95278#sequences because the non-canonical isoform is represented in the reference proteome

Antonialock commented 6 years ago

Yes, O95278 is the same story. I had just missed tagging it with "not in amigo" (it was only tagged as "already annotated") https://www.ebi.ac.uk/QuickGO/annotations?geneProductId=O95278

ValWood commented 6 years ago

MOCS2 is really 2 separate proteins. Both have the same HGNC name

https://www.uniprot.org/uniprot/O96007 https://www.uniprot.org/uniprot/O96033 This protein is produced by a bicistronic gene which also produces the large subunit (MOCS2B) from an overlapping reading frame. Expression of these 2 proteins are related since a mutation that removes the start codon of the small subunit (MOCS2A) also impairs expression of the large subunit (MOCS2B).

ValWood commented 6 years ago

L0R6Q1: this one has probably been filtered as redundant because it has the gene name SLC35A4, even though it is "SLC35A4 upstream open reading frame protein". This needs renaming since it is an independent protein https://www.ebi.ac.uk/interpro/entry/IPR027854/taxonomy (short transmembrane mitochondrial protein).

sartweedie commented 6 years ago

HGNC names at the gene level rather than the transcript level. While functionally distinct, MOCS2A and MOCS2B are alternative products of the same gene (MOCS2) PMID:16737835 so we wouldn't give them separate HGNC symbols.

The regulatory upstream ORF L0R6Q1 is also annotated as an alternative transcript of the SLC35A4 gene - though I agree the name here doesn't apply to both products. It is also odd that our entry points to the UniProt entry for the upstream ORF rather than SLC itself - I'll raise these points at our next curator meeting.

Maybe you shouldn't be filtering out distinct manually reviewed UniProt entries associated with the same UniProt gene name (what HGNC refer to as gene symbol)? Or could that bring in true redundancy? I notice that the L0R6Q1 entry had the comment "Isoforms of the same protein are often annotated in two different entries if their sequences differ significantly" but that comment wasn't present in either of the MOCS2 entries.

ValWood commented 6 years ago

Maybe you shouldn't be filtering out distinct manually reviewed UniProt entries associated with the same UniProt gene name (what HGNC refer to as gene symbol)

Redundant entries are filtered when the "reference proteome" used by GO is created. Only one entry is retained per HGNC symbol (I think this is the mechanism used to remove true redundancy, but I guess there could be other ways to do this, using Sp not tr for example).

This caused a problem for our analysis because only one entry gets into AmiGO. Therefore when we create a reference dataset from UniProt, many of the entries are not in GO (We are trying to slim them to pull out 'unknowns' but we get a lot of unmapped entries, which have annotation but are not in AmiGO)

I always assumed that if HGNC IDs refer to 2 separate proteins, we could use the HGNC-ID symbol as the unique name.... and the reference proteome is clearly using this assumption too.

It seems that MOCS2 should be a single entry because MVPLCQVEVL YFAKSAEITG VRSETISVPQ EIKALQLWKE IETRHPGLAD VRNQIIFAVR QEYVELGDQL LVLQPGDEIA VIPPISGG is only 88 residues and so clearly doesn't represent the protein described in the entry for MOCS2A, because the structure shown is 188 AAs (it is MOCS2B).

ValWood commented 6 years ago

looking at MOCS2 again it is described as This protein is produced by a bicistronic gene which also produces the large subunit (MOCS2B) from an overlapping reading frame. Expression of these 2 proteins are related since a mutation that removes the start codon of the small subunit (MOCS2A) also impairs expression of the large subunit (MOCS2B).

so, although it is overlapping with, and regulatory, it encodes a completely different protein

https://en.wikipedia.org/wiki/MOCS2 Molybdenum cofactor synthesis protein 2A and molybdenum cofactor synthesis protein 2B are a pair of proteins that in humans are encoded from the same MOCS2 gene.[5][6][7] These two proteins dimerize to form molybdopterin synthase.

Are overlapping genes never allowed to have their own HGNC IDs? That seems a bit limiting.

ValWood commented 6 years ago

couldn't they be MOCS2A and MOCS2B both with the synonym MOCS2?

ValWood commented 6 years ago

i.e it's not just a different isoform from alternative splicing. There are different genes/proteins that just happen to be overlapping. Bizarrely...

RLovering commented 6 years ago

I assume you are aware that many curators are annotating UniProt IDs in Protein2GO. The GO annotations are exported to Ensembl. Ensembl then exports only the annotations associated with the longest transcript ID for the gene, so tools getting their annotations from Ensembl do not get as many (often manual) annotations as people getting their annotations from EMBL-EBI.

But I don't know how the Ensembl list of 'unique' genes relates to the reference gene list used by AmiGO.

ValWood commented 6 years ago

I don't think this particular problem I'm seeing is anything to do with Ensembl. The set of UniProt entries for human in AmiGO/GO seems to match the reference proteome set here: https://www.ebi.ac.uk/reference_proteomes This has duplicate HGNC symbols removed, even if they encode different proteins. The assumption is that they are isoforms, but in some of the examples above they are completely different proteins (the cases I flagged for HGNC). Sometimes they are radically different translations and so UniProt retains both copies (it would be better if all isoforms could be represented by a single entry.... especially if the radically different ones are from missense mutations and are disease variants; I suspected this was the case from the few I looked at).
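One way to list the entries at risk from this kind of filtering is to group accessions by gene symbol and report any symbol that maps to more than one entry. A rough sketch, using accession and symbol pairs mentioned in this thread; the function name is hypothetical, and a shared symbol only flags a candidate, it does not say whether the entries are isoforms or different proteins.

```python
from collections import defaultdict


def shared_gene_symbols(pairs):
    """Map each gene symbol to its accessions, keeping only symbols with more than one."""
    groups = defaultdict(list)
    for accession, symbol in pairs:
        groups[symbol].append(accession)
    return {symbol: accs for symbol, accs in groups.items() if len(accs) > 1}


# (accession, gene symbol) pairs taken from the comments above
pairs = [
    ("P42771", "CDKN2A"),
    ("Q8N726", "CDKN2A"),
    ("O96007", "MOCS2"),
    ("O96033", "MOCS2"),
    ("Q6P5R6", "RPL22L1"),
]

for symbol, accessions in shared_gene_symbols(pairs).items():
    print(symbol, accessions)
```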

I am down such a big rabbit hole here.......I need to get out!

sartweedie commented 6 years ago

Plenty of genes encode different proteins that do different things - these are just the extreme cases. I don't see it as much different from annotating to a generic UniProt and putting the relevant isoform in column 17. Here you are annotating to the gene and need the separate UniProts in column 17. I do take your point about them having no sequence in common and I'll raise it with the group but, broadly speaking, if it is annotated as a single gene and the literature describes it as a single gene then we name it as a single gene - that is certainly the case for MOCS2.
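For readers unfamiliar with the reference to column 17: in the tab-separated GAF 2.x annotation format, column 17 is the optional "Gene Product Form ID". A small sketch of reading it is below. The annotation line is invented for illustration (the GO ID, reference, date and assigning group are placeholders), and recording a separate UniProt entry there is the suggestion made above, not a statement of how MOCS2 is actually annotated.

```python
def gene_product_form(gaf_line):
    """Return (object_id, form_id) from one GAF 2.x line.

    form_id is None when column 17 is empty or missing.
    """
    cols = gaf_line.rstrip("\n").split("\t")
    cols += [""] * (17 - len(cols))  # pad short lines; no effect on full lines
    object_id = f"{cols[0]}:{cols[1]}"
    form_id = cols[16] or None
    return object_id, form_id


# Placeholder annotation line with 17 columns; values are not real annotation data.
line = "\t".join([
    "UniProtKB", "O96007", "MOCS2", "", "GO:0000000", "PMID:0000000", "IDA", "",
    "P", "", "", "protein", "taxon:9606", "20180101", "EXAMPLE", "",
    "UniProtKB:O96033",
])

print(gene_product_form(line))  # ('UniProtKB:O96007', 'UniProtKB:O96033')
```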

ValWood commented 6 years ago

if it is annotated as a single gene and the literature describes it as a single gene then we name it as a single gene

@selewis the reference proteome curators need to be aware of this. At the moment one (the shortest) gets chucked...

ValWood commented 6 years ago

Re MOCS2: adding this, which was part of my response to the UniProt helpdesk ticket.

There is no shared protein sequence here (i.e. no common exons); they are 2 independent gene products. Isn't this two genes because they have independent coordinates/positions? Encoded at a bicistronic locus?

MVPLCQVEVLYFAKSAEITGVRSETISVPQEIKALQLWKEIETRHPGLADVRNQIIFAVR QEYVELGDQLLVLQPGDEIAVIPPISGG

sp|O96007|MOC2B_HUMAN Molybdopterin synthase catalytic subunit OS=Homo sapiens OX=9606 GN=MOCS2 PE=1 SV=1
MSSLEISSSCFSLETKLPLSPPLVEDSAFEPSRKDMDEVEEKSKDVINFTAEKLSVDEVS QLVISPLCGAISLFVGTTRNNFEGKKVISLEYEAYLPMAENEVRKICSDIRQKWPVKHIA VFHRLGLVPVSEASIIIAVSSAHRAASLEAVSYAIDTLKAKVPIWKKEIYEESSTWKGNK ECFWASNS

How would these be treated if they happened to be encoded at the same locus on opposite strands? Would they be given independent gene names? I'm not sure that this is different. The genes will have independent coordinates. I wanted to confirm this by looking in Ensembl, but guess what, Ensembl only has MOCS2B!
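The two sequences quoted above can be compared directly. The sketch below checks their lengths and computes the longest stretch of residues they share; the sequences are copied from this comment, while the function name and the choice of a longest-common-substring check are illustrative assumptions. It compares protein sequences only and says nothing about exon structure, so a short shared stretch is suggestive rather than conclusive.

```python
# Sequences as quoted in the comment above (spaces removed)
MOCS2A = (
    "MVPLCQVEVLYFAKSAEITGVRSETISVPQEIKALQLWKEIETRHPGLADVRNQIIFAVR"
    "QEYVELGDQLLVLQPGDEIAVIPPISGG"
)
MOCS2B = (
    "MSSLEISSSCFSLETKLPLSPPLVEDSAFEPSRKDMDEVEEKSKDVINFTAEKLSVDEVS"
    "QLVISPLCGAISLFVGTTRNNFEGKKVISLEYEAYLPMAENEVRKICSDIRQKWPVKHIA"
    "VFHRLGLVPVSEASIIIAVSSAHRAASLEAVSYAIDTLKAKVPIWKKEIYEESSTWKGNK"
    "ECFWASNS"
)


def longest_common_substring(a, b):
    """Length of the longest run of residues that occurs contiguously in both sequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


print(len(MOCS2A), len(MOCS2B))  # 88 188
print(longest_common_substring(MOCS2A, MOCS2B))
```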

sartweedie commented 6 years ago

I raised this with the group and, contrary to my understanding, we have previously named separate ORFs in bicistronics (e.g. SNRPN and the upstream ORF SNURF). However, Elspeth's memory was that MOCS2 was no longer considered bicistronic - that fits with the NCBI gene comment but I haven't had a chance to follow up on the evidence. The ORF upstream of SLC35A4 could be named like SNURF if the evidence supporting it holds up - it seems to come from one paper and it isn't annotated on NCBI Gene so I may get their annotation input first. Anyway, sorry for the misinformation - I'll follow up on these.

ValWood commented 6 years ago

MOCS2 translations: [image: mocs2]

The yellow CDS is MOCS2B and the magenta one is MOCS2A.

They appear to have independent transcripts, but they share no coding exons. The coding exon that overlaps in the sequence is in a different reading frame.

I haven't seen another example like this. It would be nice if these justified different HGNC names. Otherwise their difference becomes a bit obscured to a non-expert, or to anyone trying to compile a non-redundant protein set.

ValWood commented 5 years ago

Not a GO issue. Documented for UniProt and GO minutes