Open ftwkoopmans opened 2 years ago
Just to add a tiny bit more info. I suspect the difference in behavior between P63044 and P23819 is due to the lack of an Entrez Gene mapping in the UniProt file for P23819.
The source file for the uniprot data plugin appears to be https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz.
From the README, the column headings for this file are as follows:
1. UniProtKB-AC
2. UniProtKB-ID
3. GeneID (EntrezGene)
4. RefSeq
5. GI
6. PDB
7. GO
8. UniRef100
9. UniRef90
10. UniRef50
11. UniParc
12. PIR
13. NCBI-taxon
14. MIM
15. UniGene
16. PubMed
17. EMBL
18. EMBL-CDS
19. Ensembl
20. Ensembl_TRS
21. Ensembl_PRO
22. Additional PubMed
Note the difference in column 3 (GeneID) of the records below: P63044 has an Entrez Gene ID there, while the field is empty for P23819.
$ gzip -cd idmapping_selected.tab.gz | awk '$1=="P63044"' | tr "\t" "\n" | cat -n | head
1 P63044
2 VAMP2_MOUSE
3 22318
4 NP_033523.1
5 51704193; 6678551
6
7
8 UniRef100_P63044
9 UniRef90_P63044
10 UniRef50_P63044
$ gzip -cd idmapping_selected.tab.gz | awk '$1=="P23819"' | tr "\t" "\n" | cat -n | head
1 P23819
2 GRIA2_MOUSE
3
4
5 496139; 22096313; 26335713; 496140; 12852206
6 7LDD:B; 7LDD:D; 7LDE:B; 7LDE:D; 7LEP:B; 7LEP:D
7
8 UniRef100_P23819
9 UniRef90_P19491-3
10 UniRef50_P19491
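The one-liners above match on column 1 only, which works with awk's default whitespace splitting. Reading column 3 directly needs an explicit tab separator, because empty columns would otherwise collapse and shift later fields to the left. A minimal sketch for checking several accessions at once, assuming the file has been downloaded locally; the function name `entrez_for` and the `<none>` placeholder are made up for illustration:

```shell
# Print accession, entry name, and GeneID (column 3) for each accession
# passed as an argument. Reads idmapping_selected.tab-formatted text on stdin.
entrez_for() {
  # -F'\t' keeps empty columns in place; with the default whitespace
  # splitting, a blank GeneID would make $3 pick up a later column.
  awk -F'\t' -v want="$*" '
    BEGIN { n = split(want, a, " "); for (i = 1; i <= n; i++) keep[a[i]] = 1 }
    ($1 in keep) { print $1 "\t" $2 "\t" ($3 == "" ? "<none>" : $3) }
  '
}

# Usage (assumes idmapping_selected.tab.gz is in the current directory):
# gzip -cd idmapping_selected.tab.gz | entrez_for P63044 P23819
```

Accessions without an Entrez Gene mapping show up with `<none>` in the third column, which makes it easy to scan a longer list.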
This difference can also be seen on the corresponding UniProt web pages.
Having said that, the reciprocal links do exist in NCBI Gene (likely through a mapping to RefSeq Protein).
Some UniProt accessions cannot be used for querying and are also missing from the output of the "uniprot" field/scope. To illustrate, I've included two examples: one accession that works (P63044) and one that fails (P23819).
this works via https://mygene.info/v3/api#/query/get_query ; "q" input: P63044 "fields" input: symbol,name,taxid,entrezgene,uniprot
returns:
this also works via https://mygene.info/v3/api#/query/get_query ; "q" input: P23819 "fields" input: symbol,name,taxid,entrezgene,uniprot
and returns:
However, note that for the latter query, the UniProt ID that I queried (a Swiss-Prot record) is not included in the "uniprot" output field! So there seems to be a problem with the mygene.info database: possibly a subset of UniProt accessions/IDs is not stored/linked under "uniprot". Affected examples include P23819, Q61941, and Q8VHW2.
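The two GET queries can also be reproduced from the command line. The following is a rough sketch, assuming `curl` is available and that the query endpoint behind the web UI is `https://mygene.info/v3/query`; the helper names `mygene_get_url` and `response_mentions` are hypothetical, and the check is a plain string match rather than real JSON parsing:

```shell
# Build the GET URL with the same "q" and "fields" inputs as above.
mygene_get_url() {
  printf 'https://mygene.info/v3/query?q=%s&fields=%s\n' \
    "$1" "symbol,name,taxid,entrezgene,uniprot"
}

# Crude check: succeeds if the accession occurs as a quoted JSON string
# anywhere in the response read from stdin.
response_mentions() {
  grep -qF "\"$1\""
}

# Usage (needs network access):
# for acc in P63044 P23819; do
#   if curl -s "$(mygene_get_url "$acc")" | response_mentions "$acc"; then
#     echo "$acc: present in response"
#   else
#     echo "$acc: missing from response"
#   fi
# done
```

A string match is enough here because the question is only whether the queried accession comes back at all. A JSON-aware tool would be a better fit for anything beyond that.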
Furthermore, POST queries against these accessions fail even though they should not (probably the same root cause).
this works via https://mygene.info/v3/api#/query/post_query ;
{ "q": "P63044", "scopes": "uniprot" }
returns:
this query fails, but it should not, as this is a valid UniProt accession that is in the mygene.info dataset (see GET query above) ;
{ "q": "P23819", "scopes": "uniprot" }
returns:
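To reproduce the two POST queries above outside the web UI, something like the following should work. It is a sketch only: it assumes `curl` is available, that the endpoint is `https://mygene.info/v3/query`, and that the service accepts a form-encoded body equivalent to the JSON shown above. The helper names `mygene_post_body` and `mygene_post` are hypothetical.

```shell
# Form-encoded equivalent of { "q": "<accession>", "scopes": "uniprot" }.
mygene_post_body() {
  printf 'q=%s&scopes=uniprot\n' "$1"
}

# Send the POST query for one accession and print the raw response.
mygene_post() {
  curl -s -X POST 'https://mygene.info/v3/query' -d "$(mygene_post_body "$1")"
}

# Usage (needs network access):
# mygene_post P63044
# mygene_post P23819
```

Running both calls side by side makes it easy to compare the response for the working accession with the one for the failing accession.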