Closed ValWood closed 5 years ago
FYI @Antonialock
I'm not quite sure what the question here is. For example, "GABARAP" is a symbol coming in from UniProt, MGI, and RGD. Symbols are quite often duplicates, which is why many services use namespaced identifiers or filtering mechanisms to isolate the actual entity they want.
These are duplicate human entries. We should only have one entry per GP in GO.
@ValWood You'd consider the above to be the same entity? https://www.uniprot.org/uniprot/O95166 https://www.uniprot.org/uniprot/H6UMI1
Yes,
https://www.uniprot.org/uniprot/O95166 https://www.uniprot.org/uniprot/H6UMI1 are the same entity. Now that I look more closely, one is unreviewed, so it shouldn't get into GO?
Some might be exact duplicates at different loci (calmodulin, histones, and elongation factors), but these should be distinguished by having different names.
I think it's a question for UniProt...
So MED17, there is only one copy in the human genome, but we have
http://amigo.geneontology.org/amigo/gene_product/UniProtKB:A0A1W2PRB8 and ~http://amigo.geneontology.org/amigo/gene_product/UniProtKB:A0A1W2PRB8~
http://amigo.geneontology.org/amigo/gene_product/UniProtKB:Q9NVC6 in GO
I know it is only a small number (now), it is much improved, but we should find out what the problem is with the ingest that makes this possible. Presumably there is only one entry in reference proteomes. It's really really important to represent the human proteome uniquely and correctly for analysis.
I think you meant this as the 2nd URL
http://amigo.geneontology.org/amigo/gene_product/UniProtKB:Q9NVC6
(which is the correct one)
This is how we get it from GOA:
curl -s -L ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human.gaf.gz | gzip -dc | grep MED17
e.g.
UniProtKB A0A1W2PRB8 MED17 GO:0003712 GO_REF:0000002 IEA InterPro:IPR019313 F Mediator of RNA polymerase II transcription subunit 17 MED17|MED17 protein taxon:9606 20180616 InterPro
UniProtKB A0A1W2PRB8 MED17 GO:0006351 GO_REF:0000038 IEA UniProtKB-KW:KW-0804 P Mediator of RNA polymerase II transcription subunit 17 MED17|MED17 protein taxon:9606 20180616 UniProt
UniProtKB A0A1W2PRB8 MED17 GO:0006357 GO_REF:0000002 IEA InterPro:IPR019313 P Mediator of RNA polymerase II transcription subunit 17 MED17|MED17 protein taxon:9606 20180616 InterPro
UniProtKB A0A1W2PRB8 MED17 GO:0016592 GO_REF:0000002 IEA InterPro:IPR019313 C Mediator of RNA polymerase II transcription subunit 17 MED17|MED17 protein taxon:9606 20180616 InterPro
UniProtKB Q9NVC6 MED17 GO:0003712 PMID:10198638 IDA F Mediator of RNA polymerase II transcription subunit 17 MED17|MED17|ARC77|CRSP6|DRIP77|DRIP80|TRAP80 protein taxon:9606 20030822 UniProt
UniProtKB Q9NVC6 MED17 GO:0003712 PMID:12218053 IDA F Mediator of RNA polymerase II transcription subunit 17 MED17|MED17|ARC77|CRSP6|DRIP77|DRIP80|TRAP80 protein taxon:9606 20030822 UniProt
UniProtKB Q9NVC6 MED17 GO:0003713 PMID:12037571 IDA F Mediator of RNA polymerase II transcription subunit 17 MED17|MED17|ARC77|CRSP6|DRIP77|DRIP80|TRAP80 protein taxon:9606 20101104 MGI
[snip]
@tonysawfordebi and @alexsign can you take a look (I assigned you Alex, but Tony can reassign to himself if appropriate)
@dougli1sqrd and @pgaudet should we have a soft check / warning for >1 ID with the same symbol in a species? It would be useful to have this kind of reporting information up-front.
should we have a soft check / warning for >1 ID with the same symbol in a species?
yes please, that would be a useful QC check
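A soft check of this kind could be sketched roughly as follows. This is only an illustration, not the GO pipeline's implementation: it assumes tab-separated GAF 2.x rows (column 1 = DB, column 2 = object ID, column 3 = symbol, column 13 = taxon), and the names `find_shared_symbols` and `make_row` are hypothetical.

```python
from collections import defaultdict


def find_shared_symbols(gaf_lines):
    """Return {(taxon, symbol): sorted IDs} for symbols used by >1 ID in a taxon."""
    ids_by_symbol = defaultdict(set)
    for line in gaf_lines:
        # Skip blank lines and GAF header/comment lines, which start with "!".
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 13:
            continue  # not a complete annotation row
        db, obj_id, symbol = cols[0], cols[1], cols[2]
        # Column 13 may hold "taxon:A|taxon:B"; the first value is the
        # species of the annotated gene product.
        taxon = cols[12].split("|")[0]
        ids_by_symbol[(taxon, symbol)].add(f"{db}:{obj_id}")
    return {key: sorted(ids) for key, ids in ids_by_symbol.items() if len(ids) > 1}


def make_row(accession, symbol="MED17", taxon="taxon:9606"):
    """Build a minimal 15-column GAF row (hypothetical sample data)."""
    return "\t".join([
        "UniProtKB", accession, symbol, "", "GO:0003712", "GO_REF:0000002",
        "IEA", "", "F", "", "", "protein", taxon, "20180616", "UniProt",
    ])


sample = [
    "!gaf-version: 2.1",
    make_row("A0A1W2PRB8"),
    make_row("Q9NVC6"),
    make_row("Q9NVC6"),  # a second annotation for the same ID is not a duplicate
]

for (taxon, symbol), ids in find_shared_symbols(sample).items():
    print(f"WARNING {taxon} {symbol}: {', '.join(ids)}")
```

Because the check only warns, legitimate cases (identical genes at different loci that happen to share a symbol) would still pass through and could be reviewed by hand.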
@cmungall Not sure if this is relevant, but I know in PANTHER there is a many-to-many relationship between genes and proteins (though never both at once: either >1 gene to 1 protein, or >1 protein to 1 gene).
-S
found it https://github.com/pantherdb/db-PAINT/issues/1
According to the data that we get from UniProt, both Q9NVC6 (Swiss-Prot) and A0A1W2PRB8 (TrEMBL) are canonical entries in the human GCRP, and they both have MED17 as the gene name, which doesn't seem right. I'll raise this with UniProt.
Thanks @tonysawfordebi, could you ask them to look at the full list?
Roger that, @ValWood
Proteins are nearly good, but the same issues are looming for RNAs: https://github.com/geneontology/amigo/issues/511 I filed this on the AmiGO tracker, because that's where I saw the problem. But it's clearly the wrong place. Who would be the correct person for this part of the pipeline?
I just checked the list of genes from the top of this thread, and this is what I found:
Gene Name | Entry | Type |
---|---|---|
ATP6AP2 | O75787 | Swiss-Prot |
ATP6AP2 | A0A1C7CYW4 | TrEMBL |
CALM1 | P0DP23 | Swiss-Prot |
EIF3F | O00303 | Swiss-Prot |
GABARAP | O95166 | Swiss-Prot |
GABARAP | H6UMI1 | TrEMBL |
HOXD4 | P09016 | Swiss-Prot |
HOXD4 | A0A087WSZ3 | TrEMBL |
IDS | P22304 | Swiss-Prot |
IDS | B3KWA1 | TrEMBL |
KLK9 | Q9UKQ9 | Swiss-Prot |
KLK9 | Q2XQG4 | TrEMBL |
MED17 | Q9NVC6 | Swiss-Prot |
MED17 | A0A1W2PRB8 | TrEMBL |
MUC21 | Q5SSG8 | Swiss-Prot |
MUC21 | A0A0G2JKD1 | TrEMBL |
MUC21 | A0A140T8X8 | TrEMBL |
NSG1 | P42857 | Swiss-Prot |
PI4K2B | Q8TCG2 | Swiss-Prot |
PI4K2B | G5E9Z4 | TrEMBL |
SUPT3H | O75486 | Swiss-Prot |
TMSB15B | P0CG35 | Swiss-Prot |
TMSB15B | A0A087X1C1 | TrEMBL |
TRAPPC2L | Q9UL33 | Swiss-Prot |
TRAPPC2L | H3BP13 | TrEMBL |
So, it appears that there's no ambiguity as far as CALM1, EIF3F, NSG1, and SUPT3H are concerned (there's only one canonical entry in the GCRP), but for the others there definitely appears to be something amiss (particularly MUC21).
This is what we have (taken from UniProt):
Gene | Entry | Type | Proteome | Canonical Entry |
---|---|---|---|---|
CALM1 | P0DP23 | Swiss-Prot | Canonical | |
CALM1 | B4DJ51 | TrEMBL | none | |
CALM1 | G3V479 | TrEMBL | Isoform | P0DP23 |
CALM1 | E7ETZ0 | TrEMBL | Isoform | P0DP23 |
CALM1 | Q96HY3 | TrEMBL | Isoform | P0DP23 |
CALM1 | M0QZ52 | TrEMBL | Isoform | P0DP23 |
CALM1 | G3V226 | TrEMBL | Isoform | P0DP23 |
CALM1 | G3V361 | TrEMBL | Isoform | P0DP23 |
EIF3F | O00303 | Swiss-Prot | Canonical | |
EIF3F | A0A1W2PP79 | TrEMBL | Isoform | O00303 |
EIF3F | E9PQV8 | TrEMBL | Isoform | O00303 |
EIF3F | B3KSH1 | TrEMBL | none | |
EIF3F | B4DMT5 | TrEMBL | none | |
EIF3F | H0YDT6 | TrEMBL | Isoform | O00303 |
NSG1 | P42857 | Swiss-Prot | Canonical | |
NSG1 | A0A0A6YYJ2 | TrEMBL | Isoform | P42857 |
SUPT3H | O75486 | Swiss-Prot | Canonical | |
SUPT3H | Q5VWT9 | TrEMBL | Isoform | O75486 |
SUPT3H | B4E1H0 | TrEMBL | Isoform | O75486 |
SUPT3H | Q5U608 | TrEMBL | none | |
SUPT3H | A0A024RD67 | TrEMBL | none | |
Actually, these are different proteins with the same name. How does that happen? http://amigo.geneontology.org/amigo/gene_product/UniProtKB:A0A0A6YYJ2 Neuron-specific protein family member 1
http://amigo.geneontology.org/amigo/gene_product/UniProtKB:P42857 Neuronal vesicle trafficking-associated protein 1
CALM1 https://www.uniprot.org/uniprot/P62158 is obsolete
I filed this on the AmiGO tracker, because that's where I saw the problem. But it's clearly the wrong place. Who would be the correct person for this part of the pipeline?
In this case it's inputs to the pipeline rather than the pipeline itself, and Tony is already on it. In general I would say the go-annotations tracker is good for coordinating with any contributing group about their annotations. helpdesk always fine for triaging, and I suggest keeping this discussion here to avoid breaking history.
@cmungall Yeah we could do something like that. So, the check would have to find different gene product ids that have the same label? Do we have labels in the RDF for gene products? I can take a look.
@cmungall As whole tickets are ported, including comments, there would be no break in history.
@dougli1sqrd symbols are not globally unique; in fact, occasionally not locally either--it may be worth asking whether there should be a mechanism per-species.
I was referring to this ticket being in the wrong place. There are no comments and no assignee.
geneontology/amigo#511
Should I move this one to the annotation tracker?
@ValWood I'll go ahead and move it.
Looks like I won't be moving it: https://github.com/google/github-issue-mover/issues/128#issuecomment-344261982
This issue was moved to geneontology/go-annotation#2021
The mover workaround (removing the assignee) finally worked after closing and restarting https://github-issue-mover.appspot.com/
Hi,
Sorry I never checked this, but it isn't fixed. I just checked the first 2 in the list and these human entries are still in duplicate in AmiGO.
Or is there another open ticket? I followed the other tickets around but I could not find another ticket that described this precise problem, or any indication what the fix is.
Please let me know when the fix will be coming along, or if there is no open ticket, tell me where to log it.
Thanks
@cmungall @kltm @tonysawfordebi
@dustine32 Can you have a look at that ? If I understand the ticket correctly these may be coming from PAINT.
Still current. Didn't we go through a release cycle yet?
@ValWood How do you find these ? I can't search on gene names (easily) in AmiGO.
This does not have anything to do with PAINT.
I provided an explanation for this in my comment on July 2 (https://github.com/geneontology/helpdesk/issues/139#issuecomment-401950612).
As can be seen there, different human IDs share the same symbol. This is still the case: you can re-run the command I provided in that comment and get the same results.
I re-assigned @alexsign / @tonysawfordebi (looks like this was de-assigned when we tried to move this ticket to go-annotation. I removed the go-annotation copy rather than fork the discussion).
Tony, on July 3 you said you would raise this with UniProtKB - are we any the wiser as to why this is happening?
All: is this something we should take preventative measures against in the GO Central pipeline? It seems like something that should be handled upstream.
@ValWood How do you find these ? I can't search on gene names (easily) in AmiGO.
Search on a name and hit enter (don't select from the drop down, since you would need to guess). You will get to a landing page which has a link to ALL gene products with the name. Select this, and then filter on human.
@cmungall I did raise it with UniProt, and I haven't had a (satisfactory) response to date; I'll give them another prod.
I've just run a query in our database, and it looks like we have 50 genes in the human GCRP for which there are multiple UniProt accessions tagged as being the canonical entry for the gene, the winner being HERVK_113, for which these five Swiss-Prot entries are listed as being the canonical one: Q902F9, P63121, P62684, P63132, and P61574
Hi, I will try to explain the issue from our point of view. Let's take an example from table GENE_CENTRIC_ENTRY, for gene name MED17 (Q9NVC6 x A0A1W2PRB8):
ACCESSION | ENTRY_TYPE | NAME | LENGTH | GROUP_ID | TAX_ID | IS_CANONICAL | RELEASE | GENE_NAME_TYPE | UPID |
---|---|---|---|---|---|---|---|---|---|
A0A1W2PRB8 | 1 | MED17 | 838 | 46703535 | 9606 | | 2018_09 | 5 | UP000005640 |
A0A1W2PRB8 | 1 | ENSG00000284057 | 838 | 46703535 | 9606 | 1 | 2018_09 | 2 | UP000005640 |
Q9NVC6 | 0 | HGNC:2375 | 651 | 2571471 | 9606 | 1 | 2018_09 | 1 | UP000005640 |
Q9NVC6 | 0 | ENSG00000042429 | 651 | 2571471 | 9606 | | 2018_09 | 2 | UP000005640 |
Q9NVC6 | 0 | MED17 | 651 | 2571471 | 9606 | | 2018_09 | 5 | UP000005640 |
Q9NVC6-2 | 0 | HGNC:2375 | 145 | 2571471 | 9606 | | 2018_09 | 1 | UP000005640 |
Q9NVC6-2 | 0 | MED17 | 145 | 2571471 | 9606 | | 2018_09 | 5 | UP000005640 |
The heuristic rules we have to generate such data above are:
Gene-Centric UniProt Reference Proteomes are created according to the summarised following criteria:
So, bottom line, I cannot group Q9NVC6 and A0A1W2PRB8 together because the only evidence for that is the gene name MED17, and that's not enough! It needs higher-ranked evidence (an Ensembl ID). AFAIK, for this particular case at least, the issue originates in our Automatic Annotation pipeline, where RuleBase:RU364140 has apparently decided to assign this GN to it.
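The grouping argument can be illustrated with a toy version of the MED17 rows. This is only a sketch, not UniProt's pipeline: it assumes that GENE_NAME_TYPE 1 is an HGNC ID, 2 an Ensembl gene ID, and 5 a plain gene name (as the rows suggest), and that only types 1 and 2 are strong enough to merge accessions. The function `merge_groups` and the constant `STRONG_TYPES` are hypothetical.

```python
from collections import defaultdict

# (accession, identifier, gene_name_type), copied from the MED17 rows.
ROWS = [
    ("A0A1W2PRB8", "MED17", 5),
    ("A0A1W2PRB8", "ENSG00000284057", 2),
    ("Q9NVC6", "HGNC:2375", 1),
    ("Q9NVC6", "ENSG00000042429", 2),
    ("Q9NVC6", "MED17", 5),
    ("Q9NVC6-2", "HGNC:2375", 1),
    ("Q9NVC6-2", "MED17", 5),
]

# Assumption: HGNC and Ensembl IDs may merge entries; plain gene names may not.
STRONG_TYPES = frozenset({1, 2})


def merge_groups(rows, strong_types=STRONG_TYPES):
    """Group accessions that share at least one strong identifier (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # identifier -> first accession that carried it
    for acc, ident, name_type in rows:
        find(acc)  # register every accession, even without strong evidence
        if name_type not in strong_types:
            continue
        if ident in first_seen:
            parent[find(acc)] = find(first_seen[ident])
        else:
            first_seen[ident] = acc

    groups = defaultdict(set)
    for acc in parent:
        groups[find(acc)].add(acc)
    return sorted(sorted(group) for group in groups.values())


print(merge_groups(ROWS))                   # gene name ignored
print(merge_groups(ROWS, {1, 2, 5}))        # gene name treated as evidence
```

With gene names ignored, Q9NVC6 and Q9NVC6-2 group together through HGNC:2375 while A0A1W2PRB8 stays on its own, matching the two GROUP_ID values in the table. Treating the gene name as evidence would merge all three, which is the kind of over-grouping the name-based rule risks.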
I will discuss with AA team about it and the other potential cases.
But why is https://www.uniprot.org/uniprot/A0A1W2PRB8, an unreviewed TrEMBL entry, included at all, if Swiss-Prot already contains all of the validated human proteins, one per locus?
Presumably if a new protein is identified it is included rapidly in Swiss-Prot so this dataset would be a more robust set for the reference proteome? We shouldn't need to consider the Trembl entries at all?
This is a good point. We aim, for human, to address all TR cases, but I don't know when we'll get there. So, in a sense, if you want to play safe, you should consider only SP ones. TR are less than 5% of the cases.
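The "play safe" option could look something like the sketch below. The entries are copied from the gene table earlier in this thread; the function name `prefer_reviewed` is hypothetical, and the fallback to unreviewed entries when a gene has no Swiss-Prot entry is an assumption, not something UniProt or GO prescribes.

```python
from collections import defaultdict

# (gene name, accession, entry type), taken from the list posted earlier.
ENTRIES = [
    ("MED17", "Q9NVC6", "Swiss-Prot"),
    ("MED17", "A0A1W2PRB8", "TrEMBL"),
    ("MUC21", "Q5SSG8", "Swiss-Prot"),
    ("MUC21", "A0A0G2JKD1", "TrEMBL"),
    ("MUC21", "A0A140T8X8", "TrEMBL"),
    ("NSG1", "P42857", "Swiss-Prot"),
]


def prefer_reviewed(entries):
    """Per gene, keep Swiss-Prot accessions if any exist; otherwise keep all."""
    by_gene = defaultdict(list)
    for gene, accession, entry_type in entries:
        by_gene[gene].append((accession, entry_type))

    kept = {}
    for gene, accessions in by_gene.items():
        reviewed = [acc for acc, kind in accessions if kind == "Swiss-Prot"]
        # Fall back to the unreviewed entries rather than dropping the gene.
        kept[gene] = reviewed if reviewed else [acc for acc, _ in accessions]
    return kept


for gene, accessions in prefer_reviewed(ENTRIES).items():
    print(gene, accessions)
```

Note that this keys on the gene name, so it only helps where the name itself is trustworthy; it would not resolve cases where several Swiss-Prot entries share one name.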
so in the mediator case, the protein comes from https://www.ebi.ac.uk/ena/data/view/AP001894 which is a very old DNA sequence (2000)
These aren't very useful. I wonder how many similar situations are included.
I had assumed that these days sequences which were not mapped unambiguously onto the reference genome assembly were ignored for the purposes of creating the reference proteome?
I checked with AA and there's nothing wrong with the method to apply the GN MED17 to A0A1W2PRB8. Are you dealing only with Human? If so, then I'd re-assert to use only SP accessions in your project.
OK, it makes sense to use only the UniProt proteins (which is what we did for our analyses, hence the discrepancies). I agree that there is nothing wrong with the mapping of MED17 to A0A1W2PRB8. But, the problem is that it is present in the Q for O reference proteome, and hence represented twice in GO.
In this case why not export only Swiss-Prot entries to the Quest for Orthologs reference proteome?
@selewis tagging Suzi for common interest in the Q for O dataset.
Also cc-ing @thomaspd
Are you dealing only with Human? If so, then I'd re-assert to use only SP accessions in your project.
The GO deals with all species
While in principle we should simply elect to only take the SP subset for human, we don't want to be making these decisions for N genomes. The idea is that the QfO/GCRP group makes these decisions for each genome, and we defer to them.
I would second Val's question, shouldn't the QfO GCRP selection be more restrictive here?
The problem is how to be more restrictive while avoiding hacking particular solutions for each species. So, I understand now that you use all 78 QfO species, and our pipeline here uses the same rules for all of them: no exceptions, no tweaks, no special hacks. Currently we have this:
OSCODE | TAX_ID | TOTAL | TR | SP | SP_PERCENT |
---|---|---|---|---|---|
ANOGA | 7165 | 12428 | 12179 | 249 | 2% |
AQUAE | 224324 | 1553 | 769 | 784 | 50% |
ARATH | 3702 | 27581 | 11937 | 15644 | 57% |
ASPFU | 330879 | 9648 | 8877 | 771 | 8% |
BACSU | 224308 | 4260 | 75 | 4185 | 98% |
BACTN | 226186 | 4782 | 4410 | 372 | 8% |
BATDJ | 684364 | 8610 | 8604 | 6 | 0% |
BOVIN | 9913 | 22008 | 16009 | 5999 | 27% |
BRADU | 224911 | 8253 | 7591 | 662 | 8% |
BRAFL | 7739 | 28542 | 28501 | 41 | 0% |
CAEEL | 6239 | 19921 | 15917 | 4004 | 20% |
CANAL | 237561 | 6035 | 5034 | 1001 | 17% |
CANLF | 9615 | 20269 | 19452 | 817 | 4% |
CHICK | 9031 | 19122 | 16832 | 2290 | 12% |
CHLAA | 324602 | 3850 | 3604 | 246 | 6% |
CHLRE | 3055 | 17608 | 17533 | 75 | 0% |
CHLTR | 272561 | 895 | 481 | 414 | 46% |
CIOIN | 7719 | 16678 | 16653 | 25 | 0% |
CRYNJ | 214684 | 6603 | 6235 | 368 | 6% |
DANRE | 7955 | 25289 | 22246 | 3043 | 12% |
DEIRA | 243230 | 3085 | 2572 | 513 | 17% |
DICDI | 44689 | 12738 | 8607 | 4131 | 32% |
DICTD | 515635 | 1743 | 1590 | 153 | 9% |
DROME | 7227 | 13767 | 10320 | 3447 | 25% |
ECOLI | 83333 | 4313 | 0 | 4313 | 100% |
FUSNN | 190304 | 2046 | 1690 | 356 | 17% |
GEOSL | 243231 | 3402 | 2990 | 412 | 12% |
GIAIC | 184922 | 7154 | 7137 | 17 | 0% |
GLOVI | 251221 | 4406 | 3965 | 441 | 10% |
GORGO | 9595 | 21796 | 21514 | 282 | 1% |
HALSA | 64091 | 2426 | 1935 | 491 | 20% |
HELPY | 85962 | 1553 | 959 | 594 | 38% |
HELRO | 6412 | 23328 | 23327 | 1 | 0% |
HUMAN | 9606 | 20996 | 820 | 20176 | 96% |
IXOSC | 6945 | 20468 | 20448 | 20 | 0% |
KORCO | 374847 | 1602 | 1525 | 77 | 5% |
LEIMA | 5664 | 8038 | 7989 | 49 | 1% |
LEPIN | 189518 | 3676 | 3292 | 384 | 10% |
LEPOC | 7918 | 18320 | 18320 | 0 | 0% |
MAIZE | 4577 | 39442 | 38658 | 784 | 2% |
METAC | 188937 | 4468 | 3958 | 510 | 11% |
METJA | 243232 | 1787 | 0 | 1787 | 100% |
MONBE | 81824 | 9188 | 9153 | 35 | 0% |
MONDO | 13616 | 21272 | 21229 | 43 | 0% |
MOUSE | 10090 | 22296 | 5386 | 16910 | 76% |
MYCGE | 243273 | 483 | 0 | 483 | 100% |
MYCTU | 83332 | 3993 | 1827 | 2166 | 54% |
NEIMB | 122586 | 2001 | 1419 | 582 | 29% |
NEMVE | 45351 | 24321 | 24197 | 124 | 1% |
NEUCR | 367110 | 9759 | 8906 | 853 | 9% |
NITMS | 436308 | 1795 | 1704 | 91 | 5% |
ORYLA | 8090 | 19698 | 19611 | 87 | 0% |
ORYSJ | 39947 | 43569 | 39645 | 3924 | 9% |
PANTR | 9598 | 23056 | 22363 | 693 | 3% |
PARTE | 5888 | 39461 | 39407 | 54 | 0% |
PHANO | 321614 | 15998 | 15757 | 241 | 2% |
PHYPA | 3218 | 30837 | 30763 | 74 | 0% |
PHYRM | 164328 | 15349 | 15349 | 0 | 0% |
PLAF7 | 36329 | 5441 | 5278 | 163 | 3% |
PSEAE | 208964 | 5563 | 4239 | 1324 | 24% |
PUCGT | 418459 | 15688 | 15673 | 15 | 0% |
RAT | 10116 | 21465 | 13462 | 8003 | 37% |
RHOBA | 243090 | 7271 | 6909 | 362 | 5% |
SCHPO | 284812 | 5142 | 1 | 5141 | 100% |
SCLS1 | 665079 | 14445 | 14279 | 166 | 1% |
STRCO | 100226 | 8038 | 7263 | 775 | 10% |
SULSO | 273057 | 2938 | 2466 | 472 | 16% |
SYNY3 | 1111708 | 3507 | 2444 | 1063 | 30% |
THAPS | 35128 | 11717 | 11612 | 105 | 1% |
THEKO | 69014 | 2301 | 1882 | 419 | 18% |
THEMA | 243274 | 1852 | 1304 | 548 | 30% |
THEYD | 289376 | 1982 | 1779 | 203 | 10% |
TRICA | 7070 | 16564 | 16562 | 2 | 0% |
TRIVA | 5722 | 50190 | 50177 | 13 | 0% |
USTMA | 237631 | 6788 | 6430 | 358 | 5% |
XENTR | 8364 | 24208 | 22503 | 1705 | 7% |
YARLI | 284591 | 6448 | 5798 | 650 | 10% |
YEAST | 559292 | 6049 | 0 | 6049 | 100% |
So, even if SP accessions were all fine (they aren't, see HERVK_113 and HERV-K104, human endogenous retrovirus group K), we have species with 0% SP accessions.
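To make the point concrete, here is a small sketch over a few rows copied from the table. It assumes SP_PERCENT is simply SP / TOTAL rounded to the nearest percent (which agrees with the rows shown); the function names and the 90% threshold are hypothetical, chosen only for illustration.

```python
# (OSCODE, TOTAL, TR, SP) for a handful of rows copied from the table.
ROWS = [
    ("ECOLI", 4313, 0, 4313),
    ("HUMAN", 20996, 820, 20176),
    ("MOUSE", 22296, 5386, 16910),
    ("LEPOC", 18320, 18320, 0),
]


def sp_percent(total, sp):
    """Share of Swiss-Prot entries in a proteome, as a whole percentage."""
    return round(100 * sp / total)


def mostly_reviewed(rows, threshold=90):
    """Species where an SP-only filter would keep at least `threshold` percent."""
    return [code for code, total, _tr, sp in rows if sp_percent(total, sp) >= threshold]


for code, total, tr, sp in ROWS:
    # Sanity check on the copied numbers: TR and SP should add up to TOTAL.
    assert tr + sp == total
    print(f"{code}: {sp_percent(total, sp)}% Swiss-Prot")

print("SP-only filter plausible for:", mostly_reviewed(ROWS))
```

A rule like "keep only Swiss-Prot" loses little for ECOLI or HUMAN but would empty the proteome for a species such as LEPOC, which is why a single species-independent rule is hard to define.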
Is the issue you have with duplicates only in human? Anyway, your issue reminds us of the problem we had 4 years ago, when Paul Thomas pointed out that we were wrongly grouping accessions because of a common GN (rule #6 here). For that reason we don't rely on GN when sorting gene-centric groups. For MOD species, like human, we use their MOD ID (the HGNC ID in this case), but that's only 16 species so far.
We need further discussions, I may even meet Val in Cambridge.
Happy to discuss, but we should see what people really need for this set.
It is important to get the human proteome representative, and 820 entries from Trembl seems a lot if most canonical human sequences are represented in UniProt
HUMAN 9606 20996 820 20176 96%
It would be great if the heuristic could be tweaked without detrimentally affecting the non-model species...
The number 20996 seems very inflated for human? At least, the number we got after using the UniProt recommendations was 19737 (after filtering transposons)
Do we have a next concrete action for this ticket? Does it have components to spin out into other trackers so there are more (domain-specific) eyes on it?
Does uniprot have a github tracker for things like this? If not I suggest we host the discussion on go-annotation.
I would not attempt to move the ticket. I would create a new one and reference this one
We still have duplicate entries in the GO database, which makes analyses difficult
These 15 identifiers were found to be ambiguous: ATP6AP2 CALM1 EIF3F GABARAP HIST1H2AI HOXD4 IDS KLK9 MED17 MUC21 NSG1 PI4K2B SUPT3H TMSB15B TRAPPC2L
ATP6AP2
and an unreviewed Trembl entry? http://amigo.geneontology.org/amigo/gene_product/UniProtKB:A0A1C7CYW4 and the annotated swiss prot entry https://www.uniprot.org/uniprot/O75787
Calm1, the uniprot entry http://amigo.geneontology.org/amigo/gene_product/UniProtKB:P0DP23 and an unannotated PR? entry http://amigo.geneontology.org/amigo/term/PR:000004978
Can we get rid of the duplicates? How do they get in?