geneontology / helpdesk

The Gene Ontology Helpdesk
http://help.geneontology.org

duplicate human entries in the GO database #139

Closed ValWood closed 5 years ago

ValWood commented 6 years ago

We still have duplicate entries in the GO database, which makes analyses difficult.

These 15 identifiers were found to be ambiguous: ATP6AP2 CALM1 EIF3F GABARAP HIST1H2AI HOXD4 IDS KLK9 MED17 MUC21 NSG1 PI4K2B SUPT3H TMSB15B TRAPPC2L

ATP6AP2: an unreviewed TrEMBL entry(?) http://amigo.geneontology.org/amigo/gene_product/UniProtKB:A0A1C7CYW4 and the annotated Swiss-Prot entry https://www.uniprot.org/uniprot/O75787

CALM1: the UniProt entry http://amigo.geneontology.org/amigo/gene_product/UniProtKB:P0DP23 and an unannotated PR(?) entry http://amigo.geneontology.org/amigo/term/PR:000004978

Can we get rid of the duplicates? How do they get in?

ValWood commented 6 years ago

FYI @Antonialock

kltm commented 6 years ago

I'm not quite sure what the question here is. For example, "GABARAP" is a symbol coming in from UniProt, MGI, and RGD. Symbols are quite often duplicates, which is why many services use namespaced identifiers or filtering mechanisms to isolate the actual entity they want.

ValWood commented 6 years ago

These are duplicate human entries. We should only have one entry per GP in GO.

kltm commented 6 years ago

Human symbol "GABARAP": http://amigo.geneontology.org/amigo/gene_product/UniProtKB:O95166 http://amigo.geneontology.org/amigo/gene_product/UniProtKB:H6UMI1

kltm commented 6 years ago

@ValWood You'd consider the above to be the same entity? https://www.uniprot.org/uniprot/O95166 https://www.uniprot.org/uniprot/H6UMI1

ValWood commented 6 years ago

Yes,

https://www.uniprot.org/uniprot/O95166 https://www.uniprot.org/uniprot/H6UMI1 are the same entity. Now I look more closely one is unreviewed, so it shouldn't get into GO?

Some might be exact duplicates at different loci (calmodulin histones and elongation factors), but these should be distinguished by having different names.

I think it's a question for UniProt...

ValWood commented 6 years ago

So MED17, there is only one copy in the human genome, but we have

http://amigo.geneontology.org/amigo/gene_product/UniProtKB:A0A1W2PRB8 and ~http://amigo.geneontology.org/amigo/gene_product/UniProtKB:A0A1W2PRB8~

http://amigo.geneontology.org/amigo/gene_product/UniProtKB:Q9NVC6 in GO

ValWood commented 6 years ago

I know it is only a small number (now), it is much improved, but we should find out what the problem is with the ingest that makes this possible. Presumably there is only one entry in reference proteomes. It's really really important to represent the human proteome uniquely and correctly for analysis.

cmungall commented 6 years ago

I think you meant this as the 2nd URL

http://amigo.geneontology.org/amigo/gene_product/UniProtKB:Q9NVC6

(which is the correct one)

This is how we get it from GOA:

curl -s -L ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human.gaf.gz | gzip -dc | grep MED17

e.g.

UniProtKB       A0A1W2PRB8      MED17           GO:0003712      GO_REF:0000002  IEA     InterPro:IPR019313      F       Mediator of RNA polymerase II transcription subunit 17  MED17|MED17     protein taxon:9606      20180616        InterPro
UniProtKB       A0A1W2PRB8      MED17           GO:0006351      GO_REF:0000038  IEA     UniProtKB-KW:KW-0804    P       Mediator of RNA polymerase II transcription subunit 17  MED17|MED17     protein taxon:9606      20180616        UniProt         
UniProtKB       A0A1W2PRB8      MED17           GO:0006357      GO_REF:0000002  IEA     InterPro:IPR019313      P       Mediator of RNA polymerase II transcription subunit 17  MED17|MED17     protein taxon:9606      20180616        InterPro                
UniProtKB       A0A1W2PRB8      MED17           GO:0016592      GO_REF:0000002  IEA     InterPro:IPR019313      C       Mediator of RNA polymerase II transcription subunit 17  MED17|MED17     protein taxon:9606      20180616        InterPro                
UniProtKB       Q9NVC6  MED17           GO:0003712      PMID:10198638   IDA             F       Mediator of RNA polymerase II transcription subunit 17  MED17|MED17|ARC77|CRSP6|DRIP77|DRIP80|TRAP80    protein taxon:9606      20030822        UniProt         
UniProtKB       Q9NVC6  MED17           GO:0003712      PMID:12218053   IDA             F       Mediator of RNA polymerase II transcription subunit 17  MED17|MED17|ARC77|CRSP6|DRIP77|DRIP80|TRAP80    protein taxon:9606      20030822        UniProt         
UniProtKB       Q9NVC6  MED17           GO:0003713      PMID:12037571   IDA             F       Mediator of RNA polymerase II transcription subunit 17  MED17|MED17|ARC77|CRSP6|DRIP77|DRIP80|TRAP80    protein taxon:9606      20101104        MGI             
[snip]

@tonysawfordebi and @alexsign can you take a look (I assigned you Alex, but Tony can reassign to himself if appropriate)

@dougli1sqrd and @pgaudet should we have a soft check / warning for >1 ID with the same symbol in a species? It would be useful to have this kind of reporting information up-front.
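
A soft check of this kind could look roughly like the sketch below. It is only an illustration, not the GO pipeline's actual check: the function name `duplicate_symbols` and the helper `gaf_row` are hypothetical, the column positions assume the standard tab-separated GAF layout (DB, DB Object ID and DB Object Symbol in columns 1 to 3, Taxon in column 13), and every value in the sample rows other than the accessions and symbols is filler.

```python
from collections import defaultdict


def duplicate_symbols(gaf_lines):
    """Return {(taxon, symbol): [ids]} for symbols carried by more than one ID."""
    ids_by_symbol = defaultdict(set)
    for line in gaf_lines:
        # Skip blank lines and GAF header/comment lines, which start with "!".
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 13:
            continue  # not a complete annotation row
        db, db_id, symbol = cols[0], cols[1], cols[2]
        taxon = cols[12].split("|")[0]  # first taxon is the annotated species
        ids_by_symbol[(taxon, symbol)].add(f"{db}:{db_id}")
    return {key: sorted(ids) for key, ids in ids_by_symbol.items() if len(ids) > 1}


def gaf_row(db_id, symbol):
    """Hypothetical helper: build a 17-column GAF row with filler annotation values."""
    cols = ["UniProtKB", db_id, symbol, "", "GO:0003712", "GO_REF:0000002", "IEA",
            "", "F", "", "", "protein", "taxon:9606", "20180616", "InterPro", "", ""]
    return "\t".join(cols)


sample = [
    "!gaf-version: 2.1",
    gaf_row("A0A1W2PRB8", "MED17"),
    gaf_row("Q9NVC6", "MED17"),
    gaf_row("Q9NVC6", "MED17"),   # a second annotation for the same ID is not a duplicate
    gaf_row("O00303", "EIF3F"),
]

report = duplicate_symbols(sample)
for (taxon, symbol), ids in report.items():
    print(f"WARNING {taxon} {symbol}: {', '.join(ids)}")
```

Keying on (taxon, symbol) rather than symbol alone keeps the report per species, so the same symbol coming in for human, mouse and rat would not be flagged.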

ValWood commented 6 years ago

should we have a soft check / warning for >1 ID with the same symbol in a species?

yes please, that would be a useful QC check

selewis commented 6 years ago

@cmungall Not sure if this is relevant, but I know in PANTHER there is a many2many relationship between genes and proteins (though never both at once, just 1 gene to 1 protein, or >1 protein to a gene)

-S

ValWood commented 6 years ago

found it https://github.com/pantherdb/db-PAINT/issues/1

tonysawfordebi commented 6 years ago

According to the data that we get from UniProt, both Q9NVC6 (Swiss-Prot) and A0A1W2PRB8 (TrEMBL) are canonical entries in the human GCRP, and they both have MED17 as the gene name, which doesn't seem right. I'll raise this with UniProt.

ValWood commented 6 years ago

Thanks @tonysawfordebi, could you ask them to look at the full list?

tonysawfordebi commented 6 years ago

Roger that, @ValWood

ValWood commented 6 years ago

Proteins are nearly good, but the same issues are looming for RNAs: https://github.com/geneontology/amigo/issues/511 I filed this on the AmiGO tracker, because that's where I saw the problem. But it's clearly the wrong place. Who would be the correct person for this part of the pipeline?

tonysawfordebi commented 6 years ago

I just checked the list of genes from the top of this thread, and this is what I found:

Gene Name   Entry       Type
ATP6AP2     O75787      Swiss-Prot
ATP6AP2     A0A1C7CYW4  TrEMBL
CALM1       P0DP23      Swiss-Prot
EIF3F       O00303      Swiss-Prot
GABARAP     O95166      Swiss-Prot
GABARAP     H6UMI1      TrEMBL
HOXD4       P09016      Swiss-Prot
HOXD4       A0A087WSZ3  TrEMBL
IDS         P22304      Swiss-Prot
IDS         B3KWA1      TrEMBL
KLK9        Q9UKQ9      Swiss-Prot
KLK9        Q2XQG4      TrEMBL
MED17       Q9NVC6      Swiss-Prot
MED17       A0A1W2PRB8  TrEMBL
MUC21       Q5SSG8      Swiss-Prot
MUC21       A0A0G2JKD1  TrEMBL
MUC21       A0A140T8X8  TrEMBL
NSG1        P42857      Swiss-Prot
PI4K2B      Q8TCG2      Swiss-Prot
PI4K2B      G5E9Z4      TrEMBL
SUPT3H      O75486      Swiss-Prot
TMSB15B     P0CG35      Swiss-Prot
TMSB15B     A0A087X1C1  TrEMBL
TRAPPC2L    Q9UL33      Swiss-Prot
TRAPPC2L    H3BP13      TrEMBL

So, it appears that there's no ambiguity as far as CALM1, EIF3F, NSG1, and SUPT3H are concerned (there's only one canonical entry in the GCRP), but for the others there definitely appears to be something amiss (particularly MUC21).
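
For anyone wanting to re-check that reading of the list, a small Python tally is enough. This is just a sketch over the rows copied from the list in this comment; the names `ENTRIES` and `entries_per_gene` are made up for the illustration.

```python
from collections import defaultdict

# (gene name, accession, type) rows copied from the list in this comment.
ENTRIES = [
    ("ATP6AP2", "O75787", "Swiss-Prot"), ("ATP6AP2", "A0A1C7CYW4", "TrEMBL"),
    ("CALM1", "P0DP23", "Swiss-Prot"),
    ("EIF3F", "O00303", "Swiss-Prot"),
    ("GABARAP", "O95166", "Swiss-Prot"), ("GABARAP", "H6UMI1", "TrEMBL"),
    ("HOXD4", "P09016", "Swiss-Prot"), ("HOXD4", "A0A087WSZ3", "TrEMBL"),
    ("IDS", "P22304", "Swiss-Prot"), ("IDS", "B3KWA1", "TrEMBL"),
    ("KLK9", "Q9UKQ9", "Swiss-Prot"), ("KLK9", "Q2XQG4", "TrEMBL"),
    ("MED17", "Q9NVC6", "Swiss-Prot"), ("MED17", "A0A1W2PRB8", "TrEMBL"),
    ("MUC21", "Q5SSG8", "Swiss-Prot"), ("MUC21", "A0A0G2JKD1", "TrEMBL"),
    ("MUC21", "A0A140T8X8", "TrEMBL"),
    ("NSG1", "P42857", "Swiss-Prot"),
    ("PI4K2B", "Q8TCG2", "Swiss-Prot"), ("PI4K2B", "G5E9Z4", "TrEMBL"),
    ("SUPT3H", "O75486", "Swiss-Prot"),
    ("TMSB15B", "P0CG35", "Swiss-Prot"), ("TMSB15B", "A0A087X1C1", "TrEMBL"),
    ("TRAPPC2L", "Q9UL33", "Swiss-Prot"), ("TRAPPC2L", "H3BP13", "TrEMBL"),
]


def entries_per_gene(rows):
    """Group accessions by gene name, keeping the order they were listed in."""
    by_gene = defaultdict(list)
    for gene, accession, _entry_type in rows:
        by_gene[gene].append(accession)
    return dict(by_gene)


by_gene = entries_per_gene(ENTRIES)
unambiguous = sorted(g for g, accs in by_gene.items() if len(accs) == 1)
ambiguous = {g: accs for g, accs in by_gene.items() if len(accs) > 1}

print("one canonical entry:", unambiguous)
for gene, accs in ambiguous.items():
    print(f"{gene}: {len(accs)} canonical entries ({', '.join(accs)})")
```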

ValWood commented 6 years ago

For CALM1 in AmiGO I see

http://amigo.geneontology.org/amigo/gene_product/UniProtKB:P0DP23 and http://amigo.geneontology.org/amigo/gene_product/UniProtKB:P62158

ValWood commented 6 years ago

For NSG1 I see http://amigo.geneontology.org/amigo/gene_product/UniProtKB:A0A0A6YYJ2 http://amigo.geneontology.org/amigo/gene_product/UniProtKB:P42857

tonysawfordebi commented 6 years ago

This is what we have (taken from UniProt):

Gene    Entry       Type        Proteome    Canonical Entry
CALM1   P0DP23      Swiss-Prot  Canonical
CALM1   B4DJ51      TrEMBL      none
CALM1   G3V479      TrEMBL      Isoform     P0DP23
CALM1   E7ETZ0      TrEMBL      Isoform     P0DP23
CALM1   Q96HY3      TrEMBL      Isoform     P0DP23
CALM1   M0QZ52      TrEMBL      Isoform     P0DP23
CALM1   G3V226      TrEMBL      Isoform     P0DP23
CALM1   G3V361      TrEMBL      Isoform     P0DP23
EIF3F   O00303      Swiss-Prot  Canonical
EIF3F   A0A1W2PP79  TrEMBL      Isoform     O00303
EIF3F   E9PQV8      TrEMBL      Isoform     O00303
EIF3F   B3KSH1      TrEMBL      none
EIF3F   B4DMT5      TrEMBL      none
EIF3F   H0YDT6      TrEMBL      Isoform     O00303
NSG1    P42857      Swiss-Prot  Canonical
NSG1    A0A0A6YYJ2  TrEMBL      Isoform     P42857
SUPT3H  O75486      Swiss-Prot  Canonical
SUPT3H  Q5VWT9      TrEMBL      Isoform     O75486
SUPT3H  B4E1H0      TrEMBL      Isoform     O75486
SUPT3H  Q5U608      TrEMBL      none
SUPT3H  A0A024RD67  TrEMBL      none

ValWood commented 6 years ago

actually these are different proteins, with the same name. How does that happen? http://amigo.geneontology.org/amigo/gene_product/UniProtKB:A0A0A6YYJ2 Neuron-specific protein family member 1

http://amigo.geneontology.org/amigo/gene_product/UniProtKB:P42857 Neuronal vesicle trafficking-associated protein 1

ValWood commented 6 years ago

CALM1 https://www.uniprot.org/uniprot/P62158 is obsolete

cmungall commented 6 years ago

I filed this on the AmiGO tracker, because that's where I saw the problem. But it's clearly the wrong place. Who would be the correct person for this part of the pipeline?

In this case it's inputs to the pipeline rather than the pipeline itself, and Tony is already on it. In general I would say the go-annotations tracker is good for coordinating with any contributing group about their annotations. The helpdesk is always fine for triaging, and I suggest keeping this discussion here to avoid breaking history.

dougli1sqrd commented 6 years ago

@cmungall Yeah we could do something like that. So, the check would have to find different gene product ids that have the same label? Do we have labels in the RDF for gene products? I can take a look.

kltm commented 6 years ago

@cmungall As whole tickets are ported, including comments, there would be no break in history.

@dougli1sqrd symbols are not globally unique; in fact, occasionally not locally either--it may be worth asking whether there should be a mechanism per-species.

ValWood commented 6 years ago

I was referring to this ticket being in the wrong place. There are no comments and no assignee.

geneontology/amigo#511

Should I move this one to the annotation tracker?

kltm commented 6 years ago

@ValWood I'll go ahead and move it.

kltm commented 6 years ago

Looks like I won't be moving it: https://github.com/google/github-issue-mover/issues/128#issuecomment-344261982

kltm commented 6 years ago

This issue was moved to geneontology/go-annotation#2021

kltm commented 6 years ago

The mover workaround (removing the assignee) finally worked after closing and restarting https://github-issue-mover.appspot.com/

ValWood commented 6 years ago

Hi,

Sorry I never checked this, but it isn't fixed. I just checked the first 2 in the list and these human entries are still in duplicate in AmiGO.

atp6ap2

calm1

Or is there another open ticket? I followed the other tickets around but I could not find another ticket that described this precise problem, or any indication what the fix is.

Please let me know when the fix will be coming along, or if there is no open ticket, tell me where to log it.

Thanks

@cmungall @kltm @tonysawfordebi

pgaudet commented 6 years ago

@dustine32 Can you have a look at that ? If I understand the ticket correctly these may be coming from PAINT.

ValWood commented 6 years ago

Still current. Didn't we go through a release cycle yet?

pgaudet commented 6 years ago

@ValWood How do you find these ? I can't search on gene names (easily) in AmiGO.

cmungall commented 6 years ago

This does not have anything to do with PAINT.

I provided an explanation for this in my comment on July 2 (https://github.com/geneontology/helpdesk/issues/139#issuecomment-401950612)

as can be seen, there are different human IDs that share the same symbol. This is still the case, you can re-run the command I provided in my previous comment and get the same results.

cmungall commented 6 years ago

I re-assigned @alexsign / @tonysawfordebi (looks like this was de-assigned when we tried to move this ticket to go-annotation. I removed the go-annotation copy rather than fork the discussion).

Tony, on July 3 you said you would raise this with UniProtKB - are we any the wiser as to why this is happening?

All: is this something we should take preventative measures against in the GO Central pipeline? It seems like something that should be handled upstream.

ValWood commented 6 years ago

@ValWood How do you find these ? I can't search on gene names (easily) in AmiGO.

Search on a name and hit enter (don't select from the drop down, since you would need to guess). You will get to a landing page which has a link to ALL gene products with the name. Select this, and then filter on human.

tonysawfordebi commented 6 years ago

@cmungall I did raise it with UniProt, and I haven't had a (satisfactory) response to date; I'll give them another prod.

I've just run a query in our database, and it looks like we have 50 genes in the human GCRP for which there are multiple UniProt accessions tagged as being the canonical entry for the gene, the winner being HERVK_113, for which these five Swiss-Prot entries are listed as being the canonical one: Q902F9, P63121, P62684, P63132, and P61574

alanwilter commented 6 years ago

Hi, I will try to explain the issue from our point of view. Let's take an example from table GENE_CENTRIC_ENTRY, for Gene Name MED17 (Q9NVC6 x A0A1W2PRB8):

ACCESSION   ENTRY_TYPE  NAME            LENGTH  GROUP_ID    TAX_ID  IS_CANONICAL    RELEASE GENE_NAME_TYPE  UPID
A0A1W2PRB8  1           MED17           838     46703535    9606                    2018_09 5               UP000005640
A0A1W2PRB8  1           ENSG00000284057 838     46703535    9606    1               2018_09 2               UP000005640
Q9NVC6      0           HGNC:2375       651     2571471     9606    1               2018_09 1               UP000005640
Q9NVC6      0           ENSG00000042429 651     2571471     9606                    2018_09 2               UP000005640
Q9NVC6      0           MED17           651     2571471     9606                    2018_09 5               UP000005640
Q9NVC6-2    0           HGNC:2375       145     2571471     9606                    2018_09 1               UP000005640
Q9NVC6-2    0           MED17           145     2571471     9606                    2018_09 5               UP000005640

The heuristic rules we have to generate such data above are:

Gene-Centric UniProt Reference Proteomes are created according to the summarised following criteria:

  1. Considering only Reference Proteomes species (UPID labeled reference), select all TR and SP accessions (ACC) that are alive and have keyword Reference Proteome, mapping any Gene Symbol (GS) these ACCs may have (MOD, ENSG, OLN, ORF or GN Name), plus sequence length, sorting them by UPID, from SWPPRO DB (see GeneCentricDao.java);
  2. Add an auxiliary mapping for isoforms (ACC-X) to ENSG using tables ensembl_uniprot, ensembl_gene, which only works for Ensembl species;
  3. Group the ACCs per "gene" by sorting all the ACCs whose GS intersect between them;
  4. Sort ACCs for a given "gene-group" by longest SP, then longest TR, the top one will be the canonical representative ACC for this given "gene-group";
  5. Two or more SP cannot be grouped together unless they share a common MOD;
  6. Two or more TR cannot be grouped together if they share only a common GN name and nothing else (paralogs), but they will be grouped if GN name is the only GS symbol they share;
  7. ACCs that failed to have a mapping to a GS will receive a '-' (dash) in the GS field, meaning 'unknown gene' for these ACCs;
  8. Create the UPID_TAXID.fasta file with only canonical representative ACCs;
  9. Create the UPID_TAXID_additional.fasta file with the rest of the ACCs. Their Fasta header will contain the canonical ACC as well;
  10. Create the UPID_TAXID.gene2acc mapping file (with both canonical and additional ACCs) linking only the GSs of the best class (obeying this order: MOD, ENSG, OLN, ORF or GN Name) to a given ACC, where the GS of the canonical ACC will identify the whole group (in a 3rd column);
  11. Create the UPID_TAXID_DNA.fasta file, which contains the Coding DNA sequences (CDS) for the proteins sequences in the canonical UPID_TAXID.fasta (i.e. only canonical ACCs);
  12. Create the UPID_TAXID.idmapping file, which contains the UniProtKB cross-references for all ACCs, both canonical and additional.
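
Steps 3 and 4, applied to the MED17 rows from the table, can be sketched roughly as follows. This is a simplified Python illustration and not the UniProt pipeline code. Several things in it are assumptions made for the example: the symbol class labels (MOD, ENSG, GN) attached to each row, the names `ENTRIES`, `gene_groups` and `canonical`, and above all the shortcut that a shared GN name on its own never links two accessions, which stands in for the finer distinctions of rules 5 and 6.

```python
from collections import defaultdict

# accession -> (is_swissprot, sequence length, {(symbol class, symbol)})
# Values taken from the GENE_CENTRIC_ENTRY rows; class labels are assumed.
ENTRIES = {
    "A0A1W2PRB8": (False, 838, {("ENSG", "ENSG00000284057"), ("GN", "MED17")}),
    "Q9NVC6": (True, 651, {("MOD", "HGNC:2375"), ("ENSG", "ENSG00000042429"),
                           ("GN", "MED17")}),
    "Q9NVC6-2": (True, 145, {("MOD", "HGNC:2375"), ("GN", "MED17")}),
}


def find(parent, x):
    """Union-find lookup with path compression."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def gene_groups(entries):
    """Step 3 (simplified): group accessions whose non-GN symbols intersect."""
    parent = {acc: acc for acc in entries}
    first_seen = {}
    for acc, (_is_sp, _length, symbols) in entries.items():
        for symbol in symbols:
            if symbol[0] == "GN":
                continue  # assumption: a gene name alone is not enough evidence
            if symbol in first_seen:
                parent[find(parent, acc)] = find(parent, first_seen[symbol])
            else:
                first_seen[symbol] = acc
    groups = defaultdict(list)
    for acc in entries:
        groups[find(parent, acc)].append(acc)
    return list(groups.values())


def canonical(group, entries):
    """Step 4: Swiss-Prot ranks above TrEMBL, then the longest sequence wins."""
    return max(group, key=lambda acc: (entries[acc][0], entries[acc][1]))


groups = gene_groups(ENTRIES)
canonicals = sorted(canonical(group, ENTRIES) for group in groups)
print(canonicals)
```

Under these assumptions Q9NVC6 and Q9NVC6-2 fall into one group through HGNC:2375, while A0A1W2PRB8 stays in a group of its own, so two canonical accessions end up carrying the gene name MED17.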

So, bottom line, I cannot group Q9NVC6 and A0A1W2PRB8 together because the only evidence for that is the Gene Name MED17, and it's not enough! It needs a higher rank evidence (Ensembl_id). This issue is spawned AFAIK, for this particular case at least, from our Automatic Annotation pipeline, where, apparently RuleBase:RU364140 has decided to assign such GN to it.

I will discuss with AA team about it and the other potential cases.

ValWood commented 6 years ago

But why is https://www.uniprot.org/uniprot/A0A1W2PRB8, an unreviewed TrEMBL entry, included at all?

Doesn't Swiss-Prot already contain all of the validated human proteins, one per locus?

Presumably if a new protein is identified it is included rapidly in Swiss-Prot so this dataset would be a more robust set for the reference proteome? We shouldn't need to consider the Trembl entries at all?

alanwilter commented 6 years ago

This is a good point. We aim, for human, to address all TR cases, but I don't know when we'll get there. So, in a sense, if you want to play safe, you should consider only SP ones. TR are less than 5% of the cases.

ValWood commented 6 years ago

so in the mediator case, the protein comes from https://www.ebi.ac.uk/ena/data/view/AP001894 which is a very old DNA sequence (2000)

These aren't very useful. I wonder how many similar situations are included.

I had assumed that these days sequences which were not mapped unambiguously onto the reference genome assembly were ignored for the purposes of creating the reference proteome?

alanwilter commented 6 years ago

I checked with the AA team and there's nothing wrong with the method used to apply the GN MED17 to A0A1W2PRB8. Are you dealing only with Human? If so, then I'd re-assert to use only SP accessions in your project.

ValWood commented 6 years ago

OK, it makes sense to use only the UniProt proteins (which is what we did for our analyses, hence the discrepancies). I agree that there is nothing wrong with the mapping of MED17 to A0A1W2PRB8. But, the problem is that it is present in the Q for O reference proteome, and hence represented twice in GO.

In this case why not export only Swiss-Prot entries to the Quest for Orthologs reference proteome?

@selewis tagging Suzi for common interest in the Q for O dataset.

cmungall commented 6 years ago

Also cc-ing @thomaspd

Are you dealing only with Human? If so, then I'd re-assert to use only SP accessions in your project.

The GO deals with all species

While in principle we should simply elect to only take the SP subset for human, we don't want to be making these decisions for N genomes. The idea is that the QfO/GCRP group makes these decisions for each genome, and we defer to them.

I would second Val's question, shouldn't the QfO GCRP selection be more restrictive here?

alanwilter commented 6 years ago

The problem is how to be more restrictive while avoiding hacking particular solutions for every species' case. So, I understand now that you use all 78 QfO species, and our pipeline here uses the same rules for all of them: no exceptions, no tweaks, no special hacks. Currently we have this:

OSCODE  TAX_ID  TOTAL   TR      SP      SP_PERCENT
ANOGA   7165    12428   12179   249     2%
AQUAE   224324  1553    769     784     50%
ARATH   3702    27581   11937   15644   57%
ASPFU   330879  9648    8877    771     8%
BACSU   224308  4260    75      4185    98%
BACTN   226186  4782    4410    372     8%
BATDJ   684364  8610    8604    6       0%
BOVIN   9913    22008   16009   5999    27%
BRADU   224911  8253    7591    662     8%
BRAFL   7739    28542   28501   41      0%
CAEEL   6239    19921   15917   4004    20%
CANAL   237561  6035    5034    1001    17%
CANLF   9615    20269   19452   817     4%
CHICK   9031    19122   16832   2290    12%
CHLAA   324602  3850    3604    246     6%
CHLRE   3055    17608   17533   75      0%
CHLTR   272561  895     481     414     46%
CIOIN   7719    16678   16653   25      0%
CRYNJ   214684  6603    6235    368     6%
DANRE   7955    25289   22246   3043    12%
DEIRA   243230  3085    2572    513     17%
DICDI   44689   12738   8607    4131    32%
DICTD   515635  1743    1590    153     9%
DROME   7227    13767   10320   3447    25%
ECOLI   83333   4313    0       4313    100%
FUSNN   190304  2046    1690    356     17%
GEOSL   243231  3402    2990    412     12%
GIAIC   184922  7154    7137    17      0%
GLOVI   251221  4406    3965    441     10%
GORGO   9595    21796   21514   282     1%
HALSA   64091   2426    1935    491     20%
HELPY   85962   1553    959     594     38%
HELRO   6412    23328   23327   1       0%
HUMAN   9606    20996   820     20176   96%
IXOSC   6945    20468   20448   20      0%
KORCO   374847  1602    1525    77      5%
LEIMA   5664    8038    7989    49      1%
LEPIN   189518  3676    3292    384     10%
LEPOC   7918    18320   18320   0       0%
MAIZE   4577    39442   38658   784     2%
METAC   188937  4468    3958    510     11%
METJA   243232  1787    0       1787    100%
MONBE   81824   9188    9153    35      0%
MONDO   13616   21272   21229   43      0%
MOUSE   10090   22296   5386    16910   76%
MYCGE   243273  483     0       483     100%
MYCTU   83332   3993    1827    2166    54%
NEIMB   122586  2001    1419    582     29%
NEMVE   45351   24321   24197   124     1%
NEUCR   367110  9759    8906    853     9%
NITMS   436308  1795    1704    91      5%
ORYLA   8090    19698   19611   87      0%
ORYSJ   39947   43569   39645   3924    9%
PANTR   9598    23056   22363   693     3%
PARTE   5888    39461   39407   54      0%
PHANO   321614  15998   15757   241     2%
PHYPA   3218    30837   30763   74      0%
PHYRM   164328  15349   15349   0       0%
PLAF7   36329   5441    5278    163     3%
PSEAE   208964  5563    4239    1324    24%
PUCGT   418459  15688   15673   15      0%
RAT     10116   21465   13462   8003    37%
RHOBA   243090  7271    6909    362     5%
SCHPO   284812  5142    1       5141    100%
SCLS1   665079  14445   14279   166     1%
STRCO   100226  8038    7263    775     10%
SULSO   273057  2938    2466    472     16%
SYNY3   1111708 3507    2444    1063    30%
THAPS   35128   11717   11612   105     1%
THEKO   69014   2301    1882    419     18%
THEMA   243274  1852    1304    548     30%
THEYD   289376  1982    1779    203     10%
TRICA   7070    16564   16562   2       0%
TRIVA   5722    50190   50177   13      0%
USTMA   237631  6788    6430    358     5%
XENTR   8364    24208   22503   1705    7%
YARLI   284591  6448    5798    650     10%
YEAST   559292  6049    0       6049    100%

So, even if SP accessions were all fine (they aren't, see HERVK_113 and HERV-K104, human endogenous retrovirus group K), we have species with 0% SP accessions.
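
The SP_PERCENT column, and the effect a Swiss-Prot-only rule would have, can be reproduced with a few lines of Python. This is a sketch over five rows copied from the table; the names `ROWS` and `sp_percent` are made up here, and rounding to the nearest whole percent is an assumption that happens to match the figures shown.

```python
# (OSCODE, TR, SP) for a handful of rows copied from the table above.
ROWS = [
    ("HUMAN", 820, 20176),
    ("MOUSE", 5386, 16910),
    ("LEPOC", 18320, 0),
    ("HELRO", 23327, 1),
    ("YEAST", 0, 6049),
]


def sp_percent(tr, sp):
    """Share of Swiss-Prot accessions, rounded to a whole percent."""
    return round(100 * sp / (tr + sp))


summary = {code: sp_percent(tr, sp) for code, tr, sp in ROWS}

# Species whose gene-centric proteome would be empty under a Swiss-Prot-only rule.
emptied = [code for code, _tr, sp in ROWS if sp == 0]

for code, percent in summary.items():
    print(f"{code}: {percent}% SP")
print("empty if restricted to SP:", emptied)
```

For human the same numbers give 820 / 20996, a little under 4% TrEMBL.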

Is the issue you have with duplicates only in human? Anyway, your issue reminds us of the problem we had 4 years ago, when Paul Thomas pointed out we were wrongly grouping accessions because of a common GN (rule #6 above). For that reason we don't rely on GN when sorting gene-centric groups. For MOD species, like human, we use their MOD id (HGNC_id in this case), but that's only 16 species so far.

We need further discussions, I may even meet Val in Cambridge.

ValWood commented 6 years ago

Happy to discuss, but we should see what people really need for this set.

It is important to get the human proteome representative, and 820 entries from Trembl seems a lot if most canonical human sequences are represented in UniProt

HUMAN 9606 20996 820 20176 96%

It would be great if the heuristic could be tweaked without detrimentally affecting the non-model species...

ValWood commented 6 years ago

The number 20996 seems very inflated for human? At least, the number we got after using the UniProt recommendations was 19737 (after filtering transposons).

kltm commented 6 years ago

Do we have a next concrete action for this ticket? Does it have components to spin out into other trackers so there are more (domain-specific) eyes on it?

cmungall commented 6 years ago

Does uniprot have a github tracker for things like this? If not I suggest we host the discussion on go-annotation.

I would not attempt to move the ticket. I would create a new one and reference this one