biothings / mygene.info

MyGene.info: A BioThings API for gene annotations
http://mygene.info

queries fail for some uniprot accessions #128

Open ftwkoopmans opened 2 years ago

ftwkoopmans commented 2 years ago

Some UniProt accessions are available neither for querying nor as output in the "uniprot" field/scope. To illustrate, I've included two examples: one accession that works (P63044) and one that fails (P23819).

this works via https://mygene.info/v3/api#/query/get_query ; "q" input: P63044 "fields" input: symbol,name,taxid,entrezgene,uniprot

returns:

{
  "took": 16,
  "total": 1,
  "max_score": 17.406927,
  "hits": [
    {
      "_id": "22318",
      "_score": 17.406927,
      "entrezgene": "22318",
      "name": "vesicle-associated membrane protein 2",
      "symbol": "Vamp2",
      "taxid": 10090,
      "uniprot": {
        "Swiss-Prot": "P63044",
        "TrEMBL": "Q8CHR4"
      }
    }
  ]
}

this also returns a hit via https://mygene.info/v3/api#/query/get_query ; "q" input: P23819 "fields" input: symbol,name,taxid,entrezgene,uniprot

and returns:

{
  "took": 13,
  "total": 1,
  "max_score": 7.8478303,
  "hits": [
    {
      "_id": "14800",
      "_score": 7.8478303,
      "entrezgene": "14800",
      "name": "glutamate receptor, ionotropic, AMPA2 (alpha 2)",
      "symbol": "Gria2",
      "taxid": 10090,
      "uniprot": {
        "TrEMBL": "Q4LG64"
      }
    }
  ]
}

However, note that for the latter query, the UniProt ID that I queried (a Swiss-Prot record) is not included in the "uniprot" output field! So there seems to be a problem with the mygene.info database: possibly a subset of UniProt accessions are not stored or linked under "uniprot". Affected accessions include P23819, Q61941, and Q8VHW2.
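
The check described above (is the queried accession actually listed under the hit's "uniprot" field?) can be automated. This is only a minimal Python sketch, not part of mygene.info or its client library. The helper names `uniprot_accessions` and `hits_missing_query` are hypothetical, and it assumes the response shape shown in the two GET examples, with each of "Swiss-Prot" and "TrEMBL" holding either a single string or a list of strings.

```python
def uniprot_accessions(hit):
    """Collect every accession listed under a hit's "uniprot" field."""
    found = set()
    for value in hit.get("uniprot", {}).values():
        # Assumption: a value is either one accession or a list of them.
        if isinstance(value, str):
            found.add(value)
        else:
            found.update(value)
    return found


def hits_missing_query(query, response):
    """Return the _id of each hit whose "uniprot" field lacks the queried accession."""
    return [
        hit["_id"]
        for hit in response.get("hits", [])
        if query not in uniprot_accessions(hit)
    ]


# Trimmed copies of the two GET responses shown above.
vamp2 = {"hits": [{"_id": "22318", "uniprot": {"Swiss-Prot": "P63044", "TrEMBL": "Q8CHR4"}}]}
gria2 = {"hits": [{"_id": "14800", "uniprot": {"TrEMBL": "Q4LG64"}}]}

print(hits_missing_query("P63044", vamp2))  # []
print(hits_missing_query("P23819", gria2))  # ['14800']
```

Running the same check over a longer list of accessions would show how large the affected subset is.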

Furthermore, POST queries against these accessions fail even though they should not (probably same root cause).

this works via https://mygene.info/v3/api#/query/post_query ; { "q": "P63044", "scopes": "uniprot" } returns:

[
  {
    "query": "P63044",
    "_id": "22318",
    "_score": 16.7524,
    "entrezgene": "22318",
    "name": "vesicle-associated membrane protein 2",
    "symbol": "Vamp2",
    "taxid": 10090
  }
]

this query fails, but it should not, as this is a valid UniProt accession that is in the mygene.info dataset (see GET query above) ; { "q": "P23819", "scopes": "uniprot" } returns:

[
  {
    "query": "P23819",
    "notfound": true
  }
]
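
To test more accessions at once, the POST results can be split into found and not-found queries. The sketch below is illustrative only: `post_query` and `split_results` are hypothetical helpers, the endpoint URL and the form-encoded body are assumptions based on the post_query documentation page linked above, and the `fields` default simply reuses the fields from the earlier examples. The network call is defined but not executed here.

```python
import json
import urllib.parse
import urllib.request

QUERY_URL = "https://mygene.info/v3/query"  # assumed endpoint behind the post_query docs page


def post_query(accessions, scopes="uniprot", fields="symbol,name,taxid,entrezgene"):
    """Send one batch POST query; needs network access, so it is not called here."""
    body = urllib.parse.urlencode(
        {"q": ",".join(accessions), "scopes": scopes, "fields": fields}
    ).encode()
    with urllib.request.urlopen(QUERY_URL, data=body) as resp:
        return json.load(resp)


def split_results(results):
    """Map found queries to gene _ids and list the queries flagged "notfound"."""
    found, missing = {}, []
    for item in results:
        if item.get("notfound"):
            missing.append(item["query"])
        else:
            found.setdefault(item["query"], []).append(item["_id"])
    return found, missing


# The two POST results from above, trimmed and combined into one list.
sample = [
    {"query": "P63044", "_id": "22318"},
    {"query": "P23819", "notfound": True},
]
found, missing = split_results(sample)
print(found)    # {'P63044': ['22318']}
print(missing)  # ['P23819']
```

With network access, `split_results(post_query(["P63044", "P23819"]))` would apply the same split to a live response.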

andrewsu commented 2 years ago

Just to add a tiny bit more info. I suspect the difference in behavior between P63044 and P23819 is due to the lack of an Entrez Gene mapping in the UniProt file for P23819.

The source file for the uniprot data plugin appears to be https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz.

From the README, the column headings for this file are as follows:

1. UniProtKB-AC
2. UniProtKB-ID
3. GeneID (EntrezGene)
4. RefSeq
5. GI
6. PDB
7. GO
8. UniRef100
9. UniRef90
10. UniRef50
11. UniParc
12. PIR
13. NCBI-taxon
14. MIM
15. UniGene
16. PubMed
17. EMBL
18. EMBL-CDS
19. Ensembl
20. Ensembl_TRS
21. Ensembl_PRO
22. Additional PubMed

Note the difference between the two records below in column 3 (GeneID), which should hold the mapping to Entrez Gene: it is populated for P63044 but empty for P23819.

$ gzip -cd idmapping_selected.tab.gz | awk '$1=="P63044"' | tr "\t" "\n" | cat -n | head
     1  P63044
     2  VAMP2_MOUSE
     3  22318
     4  NP_033523.1
     5  51704193; 6678551
     6
     7  GO:0030136; GO:0060203; GO:0005737; GO:0031410; GO:0030659; GO:0030285; GO:0043231; GO:0043229; GO:0016020; GO:0043005; GO:0044306; GO:0048471; GO:0005886; GO:0030141; GO:0030667; GO:0031201; GO:0000322; GO:0045202; GO:0008021; GO:0030672; GO:0070044; GO:0070032; GO:0070033; GO:0005802; GO:0031982; GO:0042589; GO:0048306; GO:0005516; GO:0042802; GO:0017022; GO:0005543; GO:0008022; GO:0044877; GO:0005484; GO:0000149; GO:0019905; GO:0017075; GO:0044325; GO:0017156; GO:0032869; GO:0043308; GO:0098967; GO:0043001; GO:0046879; GO:0060291; GO:0061025; GO:0090316; GO:0015031; GO:0065003; GO:0045055; GO:0017158; GO:1902259; GO:0017157; GO:1903421; GO:0060627; GO:0009749; GO:0035493; GO:0016081; GO:0048488; GO:0016079; GO:0006906; GO:0016192
     8  UniRef100_P63044
     9  UniRef90_P63044
    10  UniRef50_P63044
$ gzip -cd idmapping_selected.tab.gz | awk '$1=="P23819"' | tr "\t" "\n" | cat -n | head
     1  P23819
     2  GRIA2_MOUSE
     3
     4
     5  496139; 22096313; 26335713; 496140; 12852206
     6  7LDD:B; 7LDD:D; 7LDE:B; 7LDE:D; 7LEP:B; 7LEP:D
     7  GO:0032281; GO:0032279; GO:0009986; GO:0030425; GO:0032839; GO:0043198; GO:0043197; GO:0005783; GO:0005789; GO:0098978; GO:0030426; GO:0005887; GO:0099061; GO:0099055; GO:0099056; GO:0016020; GO:0043005; GO:0043025; GO:0043204; GO:0099544; GO:0005886; GO:0014069; GO:0098839; GO:0045211; GO:0042734; GO:0032991; GO:0098685; GO:0036477; GO:0045202; GO:0097060; GO:0008021; GO:0030672; GO:0043195; GO:0004971; GO:0001540; GO:0051117; GO:0008092; GO:0005234; GO:0035254; GO:0042802; GO:0019865; GO:0004970; GO:0015277; GO:0015276; GO:0030165; GO:0019901; GO:0038023; GO:0000149; GO:1904315; GO:0007268; GO:0045184; GO:0035235; GO:0050806; GO:0051262; GO:0031623; GO:0001919; GO:0051966
     8  UniRef100_P23819
     9  UniRef90_P19491-3
    10  UniRef50_P19491
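
The same column-3 check can be scripted for a whole list of accessions. This is a rough Python sketch rather than anything the uniprot data plugin actually does: `geneid_of` and `accessions_without_geneid` are hypothetical names, and it assumes the tab-separated column order listed from the README above.

```python
import gzip

GENEID_COL = 2  # zero-based index of column 3, "GeneID (EntrezGene)"


def geneid_of(line):
    """Return the GeneID column of one tab-separated record, or "" if it is empty."""
    cols = line.rstrip("\n").split("\t")
    return cols[GENEID_COL].strip() if len(cols) > GENEID_COL else ""


def accessions_without_geneid(path, wanted):
    """Scan the gzipped mapping file and return the wanted accessions lacking a GeneID."""
    missing = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            accession = line.split("\t", 1)[0]
            if accession in wanted and not geneid_of(line):
                missing.append(accession)
    return missing


# First four columns of the two records shown above.
print(repr(geneid_of("P63044\tVAMP2_MOUSE\t22318\tNP_033523.1\n")))  # '22318'
print(repr(geneid_of("P23819\tGRIA2_MOUSE\t\t\n")))                   # ''
```

Against the downloaded file, a call such as `accessions_without_geneid("idmapping_selected.tab.gz", {"P23819", "Q61941", "Q8VHW2"})` would show whether the other failing accessions lack a GeneID as well.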

This difference can also be seen on the corresponding UniProt web pages.

Having said that, the reciprocal links do exist in NCBI Gene (likely through a mapping to Refseq Protein):