Closed vincerubinetti closed 3 years ago
Here's the PR that I'm working on to show you how I'm hard coding it currently
Looks like I already have the info I need to do this myself with existing APIs:
A better way to obtain all the species that contain at least one gene: http://t.biothings.io/v1/query?q=has_gene:true&fields=taxid,scientific_name
Since there are so many organisms (> 24k), I won't be able to do the nice icons for each. However, perhaps we could do a selection of the most commonly used ones. @ravila4 could you put the top 100 organisms here in order?
In terms of what metric is considered "top", would that be "most genes of that organism"? Or "most searched for"? Or some other metric?
What about number of gene sets? I'm not sure if we have a way to query that efficiently.
We can count the number of genesets we have for each species using this query: https://mygeneset.info/v1/query?q=*&facets=taxid&facet_size=100
It's not a perfect method, because `facet_size` has a maximum value of 1000, but it's not a big problem for now, as we don't have that many species in mygeneset.
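To make the `facet_size` limit concrete, here is a minimal sketch of a helper that builds the facet query URL and clamps the requested size to 1000. The function name `facet_url` is hypothetical (it is not part of any BioThings client), and the cap is taken from the comment above rather than checked against the API.

```shell
# Hypothetical helper: build the mygeneset facet query URL.
# Assumption: facet_size is capped at 1000, as stated above.
facet_url() {
  size=${1:-100}   # default to 100, as in the query above
  max=1000
  if [ "$size" -gt "$max" ]; then
    size=$max      # clamp instead of sending an oversized value
  fi
  printf 'https://mygeneset.info/v1/query?q=*&facets=taxid&facet_size=%s\n' "$size"
}

# Example (needs network access):
#   curl -s "$(facet_url 5000)"
```

If the number of returned terms ever equals the requested `facet_size`, the list may be truncated and the counts would need another approach.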
I wrote a small bash script to get a csv of counts for each species in mygeneset and their scientific names:
#!/bin/bash
# Genesets aggregated by taxid
aggs=$(curl -s "https://mygeneset.info/v1/query?q=*&facets=taxid&facet_size=100")
# Quote "$aggs" so the shell does not word-split or glob the JSON
taxids=$(echo "$aggs" | jq -r '.facets.taxid.terms | map(.term) | @csv')
counts=$(echo "$aggs" | jq -r '.facets.taxid.terms | map(.count) | @csv')
# Query scientific name for each taxid (single batch POST)
resp=$(curl -s -X POST -d "q=${taxids}" "http://t.biothings.io/v1/query")
species=$(echo "$resp" | jq -r 'map(.scientific_name) | @csv')
# Transpose the three CSV rows into one row per species
echo "${taxids}
${species}
${counts}" | rs -c, -C, -T
Here's the output:
taxid | scientific name | number of genesets |
---|---|---|
9606 | homo sapiens | 59283 |
9031 | gallus gallus | 14771 |
9913 | bos taurus | 14520 |
9823 | sus scrofa | 14158 |
9615 | canis lupus familiaris | 13942 |
10090 | mus musculus | 716 |
10116 | rattus norvegicus | 669 |
7955 | danio rerio | 434 |
559292 | saccharomyces cerevisiae s288c | 401 |
3702 | arabidopsis thaliana | 366 |
6239 | caenorhabditis elegans | 339 |
208964 | pseudomonas aeruginosa pao1 | 329 |
7227 | drosophila melanogaster | 328 |
9598 | pan troglodytes | 46 |
180454 | anopheles gambiae str. pest | 14 |
3694 | populus trichocarpa | 5 |
9796 | equus caballus | 5 |
39947 | oryza sativa japonica group | 1 |
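The script above relies on `rs` for the final transpose, which is not installed by default on many Linux systems. As a rough alternative, here is a sketch that does the same row-to-column flip with `awk`. The function name `transpose_csv` is made up for this example, and it assumes no field contains a comma (it splits naively on `,`), which holds for the taxids, counts, and names shown here.

```shell
# transpose_csv: read comma-separated lines on stdin, print them transposed.
# Assumption: fields never contain commas (no real CSV quoting support).
transpose_csv() {
  awk -F, '
    {
      # Store every cell by (row, column) and track the widest row
      for (i = 1; i <= NF; i++) cell[NR, i] = $i
      if (NF > ncols) ncols = NF
    }
    END {
      # Emit one output line per input column
      for (i = 1; i <= ncols; i++) {
        line = cell[1, i]
        for (r = 2; r <= NR; r++) line = line "," cell[r, i]
        print line
      }
    }'
}
```

In the script it would replace the last pipeline, for example `printf '%s\n' "$taxids" "$species" "$counts" | transpose_csv`.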
@vincerubinetti, if you need to query many taxids at the same time, it will be faster to use a batch POST request as in the example.
@ravila4 Is there a reason you did `http://t.biothings.io/v1/query?q=TAXID` rather than `http://t.biothings.io/v1/taxon/TAXID`? With the former, tax id `9031`, for example, returns "carpinus sp. wen 9031" (tax id of 559512) as the first result, and "gallus gallus" (tax id of 9031) as the second result.
@vincerubinetti Because POST queries only work with the `query` endpoint, not `taxon`.
There is also some difference in the fields that GET and POST queries search by default.
The first query, using GET (`GET: http://t.biothings.io/v1/query?q=9031`), would search all the default fields, including taxid and scientific name. The way to limit it to searching only taxid is to use `q=taxid:9031`. I suspect that is why you are getting "carpinus sp. wen 9031" as a response.
However, if you use the POST method, the default search field is `_id`, which is the same as taxid for this API, but you can also be explicit and pass a `scopes=taxid` field.
Here's an example:
POST: http://t.biothings.io/v1/query?q=9031,9606&scopes=taxid
Here's some documentation on POST queries from mygene.info: https://docs.mygene.info/en/latest/doc/query_service.html?highlight=post#batch-queries-via-post
The same should apply to the taxonomy API, and mygeneset.
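To put the two request styles side by side, here is a small sketch that only builds the request pieces, so it can be tried without hitting the API. The helper names `taxid_get_url` and `taxid_post_body` and the `TAXON_API` variable are hypothetical. The `q=taxid:...` and `scopes=taxid` syntax is taken from the comments above.

```shell
TAXON_API="http://t.biothings.io/v1/query"

# GET: restrict the search to the taxid field with q=taxid:<id>,
# so a name that merely contains the number does not match.
taxid_get_url() {
  printf '%s?q=taxid:%s\n' "$TAXON_API" "$1"
}

# POST: join all arguments with commas and pass scopes=taxid explicitly.
taxid_post_body() {
  ids=$(IFS=,; printf '%s' "$*")   # "$*" joins args with the first char of IFS
  printf 'q=%s&scopes=taxid\n' "$ids"
}

# Examples (need network access):
#   curl -s "$(taxid_get_url 9031)"
#   curl -s -X POST -d "$(taxid_post_body 9031 9606)" "$TAXON_API"
```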
Currently in the web app, I'm hardcoding in the species. It'd be good to have an endpoint to get the following info:
Also, side note: Google seems to say that "fruit fly" is two words, so `query?species=fruitfly` should probably be `query?species=fruit-fly`.
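If the species parameter did move to hyphenated names, the web app could derive the value from the common name rather than hardcoding it. A minimal sketch, assuming lowercase words joined by hyphens is the convention (the helper name `species_slug` is hypothetical, and this is not something the API currently documents):

```shell
# Hypothetical helper: turn a common name into a hyphenated slug.
# Lowercases the input, then maps spaces to '-' and squeezes repeats.
species_slug() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -s ' ' '-'
}

# Example:
#   echo "query?species=$(species_slug 'Fruit fly')"
```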