Query to return species metadata

vincerubinetti commented 3 years ago

Currently in the web app, I'm hardcoding the species. It'd be good to have an endpoint that returns the following info:

[
  ...,
  {
    "common": "Brewer's Yeast", // common name, human friendly, capitalized, for display purposes
    "scientific": "Saccharomyces cerevisiae", // scientific name, also for display purposes
    "search": "brewers-yeast" // key that is put into species search endpoint url
  },
  ...
]

Also, side note: Google seems to say that "fruit fly" is two words, so query?species=fruitfly should probably be query?species=fruit-fly.

vincerubinetti commented 3 years ago

Here's the PR I'm working on, to show you how I'm currently hardcoding it:

https://github.com/biothings/mygeneset.info-website/pull/14/files#diff-ae13e3f616decf65761c7604f81fcf8e157404f63e6837b4405967a04c5766ae

vincerubinetti commented 3 years ago

Looks like I already have the info I need to do this myself with existing APIs:

https://mygene.info/v3/query?q=*&facets=taxid

http://t.biothings.io/v1/taxon/9913

ravila4 commented 3 years ago

A better way to obtain all the species that contain at least one gene: http://t.biothings.io/v1/query?q=has_gene:true&fields=taxid,scientific_name
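As a rough illustration, the query above could be wrapped like this in Python. This is only a sketch: the `size` parameter and the top-level `hits` key in the response are assumptions about the taxonomy API, and the function names are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

TAXON_QUERY_URL = "http://t.biothings.io/v1/query"  # endpoint quoted in this thread


def species_query_url(size=100):
    """Build the GET URL for taxa that have at least one gene."""
    params = {
        "q": "has_gene:true",
        "fields": "taxid,scientific_name",
        "size": size,  # assumption: `size` controls how many hits come back
    }
    return TAXON_QUERY_URL + "?" + urlencode(params)


def parse_species(response):
    """Turn a parsed query response into (taxid, scientific_name) pairs.

    Assumption: matching taxa are listed under a top-level "hits" key.
    """
    return [(hit["taxid"], hit["scientific_name"]) for hit in response.get("hits", [])]


def fetch_species(size=100):
    """Send the request and parse the result (needs network access)."""
    with urlopen(species_query_url(size)) as resp:
        return parse_species(json.load(resp))
```

Calling `fetch_species(20)` would return up to 20 pairs if those assumptions hold.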

vincerubinetti commented 3 years ago

Since there are so many organisms (> 24k), I won't be able to do the nice icons for each. However, perhaps we could do a selection of the most commonly used ones. @ravila4 could you put the top 100 organisms here in order?

In terms of what metric is considered "top", would that be "most genes of that organism"? Or "most searched for"? Or some other metric?

cgreene commented 3 years ago

What about number of gene sets? I'm not sure if we have a way to query that efficiently.

ravila4 commented 3 years ago

We can count the number of genesets we have for each species using this query: https://mygeneset.info/v1/query?q=*&facets=taxid&facet_size=100

It's not a perfect method, because facet_size has a maximum value of 1000, but it's not a big problem for now, as we don't have that many species in mygeneset.
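For reference, here is a small Python sketch of reading those facet counts and noticing when the facet_size cap has cut the list short. It relies on the same `.facets.taxid.terms` layout that the script later in this thread reads with jq. The `other` counter (documents that fell outside the returned terms) is an assumption about the response, and `geneset_counts` is a name invented for this example.

```python
def geneset_counts(response):
    """Return ([(taxid, count), ...], truncated) from a facets=taxid response.

    `response` is the parsed JSON body of the facet query. The list is sorted
    by count, highest first. `truncated` is True when the (assumed) `other`
    counter reports documents outside the returned terms.
    """
    facet = response["facets"]["taxid"]
    counts = [(term["term"], term["count"]) for term in facet["terms"]]
    counts.sort(key=lambda pair: pair[1], reverse=True)
    truncated = facet.get("other", 0) > 0
    return counts, truncated
```

If `truncated` comes back True, the ranking is incomplete and a larger facet_size (up to the limit) would be needed.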

I wrote a small bash script to get a csv of counts for each species in mygeneset and their scientific names:


#!/bin/bash

# Genesets aggregated by taxid (facet_size caps the number of taxids returned)
aggs=$(curl -s "https://mygeneset.info/v1/query?q=*&facets=taxid&facet_size=100")
taxids=$(echo "$aggs" | jq -r '.facets.taxid.terms | map(.term) | @csv')
counts=$(echo "$aggs" | jq -r '.facets.taxid.terms | map(.count) | @csv')

# Query the scientific name for each taxid with one batch POST
resp=$(curl -s -X POST -d "q=${taxids}" "http://t.biothings.io/v1/query")
species=$(echo "$resp" | jq -r 'map(.scientific_name) | @csv')

# Transpose the three CSV rows into one row per taxid
echo "${taxids}
${species}
${counts}" | rs -c, -C, -T

Here's the output:

taxid	scientific name	number of genesets
9606	homo sapiens	59283
9031	gallus gallus	14771
9913	bos taurus	14520
9823	sus scrofa	14158
9615	canis lupus familiaris	13942
10090	mus musculus	716
10116	rattus norvegicus	669
7955	danio rerio	434
559292	saccharomyces cerevisiae s288c	401
3702	arabidopsis thaliana	366
6239	caenorhabditis elegans	339
208964	pseudomonas aeruginosa pao1	329
7227	drosophila melanogaster	328
9598	pan troglodytes	46
180454	anopheles gambiae str. pest	14
3694	populus trichocarpa	5
9796	equus caballus	5
39947	oryza sativa japonica group	1

@vincerubinetti, if you need to query many taxids at once, a batch POST request, as in the example, will be faster.
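For anyone doing the same join outside of bash, here is a hedged Python sketch of the last step: pairing the facet counts with the names returned by the batch POST. It assumes each entry in the POST response echoes its input term in a `query` field and that misses are flagged with `notfound`; both are assumptions about the response format, and the function names are invented for this example.

```python
def join_counts_with_names(counts, batch_hits):
    """Pair each (taxid, count) with its scientific name.

    counts:     list of (taxid, count) tuples taken from the taxid facet
    batch_hits: parsed JSON list from the batch POST; each entry is assumed
                to echo the input term in "query" and carry "scientific_name"
    """
    names = {}
    for hit in batch_hits:
        if hit.get("notfound"):
            continue  # no taxon matched this term
        # Keep only the first hit for each query term
        names.setdefault(str(hit["query"]), hit.get("scientific_name", ""))
    return [(taxid, names.get(str(taxid), ""), count) for taxid, count in counts]


def to_tsv(rows):
    """Format the joined rows as tab-separated text with a header line."""
    header = "taxid\tscientific name\tnumber of genesets"
    lines = ["\t".join(str(value) for value in row) for row in rows]
    return "\n".join([header] + lines)
```

Matching on the echoed term rather than on list position keeps the rows aligned even if a taxid returns no hit or more than one.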

vincerubinetti commented 3 years ago

@ravila4 Is there a reason you did http://t.biothings.io/v1/query?q=TAXID rather than http://t.biothings.io/v1/taxon/TAXID? With the former, tax id 9031, for example, returns "carpinus sp. wen 9031" (tax id of 559512) as the first result, and "gallus gallus" (tax id of 9031) as the second result.

ravila4 commented 3 years ago

@vincerubinetti Because POST queries only work with the query endpoint, not taxon.

There is also a difference in the fields that GET and POST queries search by default. A GET query such as http://t.biothings.io/v1/query?q=9031 searches all the default fields, including taxid and scientific name. To limit the search to taxid only, use q=taxid:9031. I suspect that is why you are getting "carpinus sp. wen 9031" as a response.

However, if you use the POST method, the default search field is _id, which is the same as taxid for this API. You can also be explicit and pass a scopes=taxid parameter.

Here's an example:

POST: http://t.biothings.io/v1/query?q=9031,9606&scopes=taxid
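To make the two variants concrete, here is a minimal Python sketch that builds both requests without sending them. It assumes the POST parameters can go in a form-encoded body, as the curl `-d` call earlier in this thread does. The function names are invented for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request

QUERY_URL = "http://t.biothings.io/v1/query"


def get_by_taxid_url(taxid):
    """GET URL that restricts the search to the taxid field (q=taxid:<id>)."""
    return QUERY_URL + "?" + urlencode({"q": f"taxid:{taxid}"})


def post_by_taxids(taxids):
    """Unsent batch POST: comma-separated ids with an explicit scopes=taxid."""
    body = urlencode({"q": ",".join(map(str, taxids)), "scopes": "taxid"})
    return Request(QUERY_URL, data=body.encode("ascii"), method="POST")
```

Passing the result of `post_by_taxids([9031, 9606])` to `urllib.request.urlopen` would send the batch query.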

ravila4 commented 3 years ago

Here's some documentation on POST queries from mygene.info: https://docs.mygene.info/en/latest/doc/query_service.html?highlight=post#batch-queries-via-post

The same should apply to the taxonomy API, and mygeneset.

biothings / mygeneset.info

Query to return species metadata #26