biothings / mygeneset.info

Query to return species metadata #26

Closed vincerubinetti closed 3 years ago

vincerubinetti commented 3 years ago

Currently in the web app, I'm hardcoding the species. It'd be good to have an endpoint to get the following info:

[
  ...,
  {
    "common": "Brewer's Yeast", // common name, human friendly, capitalized, for display purposes
    "scientific": "Saccharomyces cerevisiae", // scientific name, also for display purposes
    "search": "brewers-yeast" // key that is put into species search endpoint url
  },
  ...
]

Also side note, Google seems to say that "fruit fly" is two words, so query?species=fruitfly should probably be query?species=fruit-fly.
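
The `search` key in the example above looks like a slug of the common name. Here is a minimal Python sketch of that derivation, assuming the key is just the lowercased common name with apostrophes dropped and other separators turned into hyphens. The helper name `to_search_key` is hypothetical, and the real endpoint may use a different rule.

```python
import re


def to_search_key(common_name: str) -> str:
    """Turn a display name into a URL-friendly key (hypothetical rule)."""
    # Drop apostrophes so "Brewer's" becomes "brewers" rather than "brewer-s".
    cleaned = common_name.lower().replace("'", "").replace("\u2019", "")
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", cleaned).strip("-")


print(to_search_key("Brewer's Yeast"))  # brewers-yeast
print(to_search_key("Fruit fly"))       # fruit-fly
```

Deriving the key from the display name would keep the two from drifting apart, at the cost of changing URLs whenever a common name is edited.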

vincerubinetti commented 3 years ago

Here's the PR I'm working on, which shows how I'm currently hardcoding it:

https://github.com/biothings/mygeneset.info-website/pull/14/files#diff-ae13e3f616decf65761c7604f81fcf8e157404f63e6837b4405967a04c5766ae

vincerubinetti commented 3 years ago

Looks like I already have the info I need to do this myself with existing APIs:

https://mygene.info/v3/query?q=*&facets=taxid

http://t.biothings.io/v1/taxon/9913

ravila4 commented 3 years ago

A better way to obtain all the species that contain at least one gene: http://t.biothings.io/v1/query?q=has_gene:true&fields=taxid,scientific_name
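
As a rough illustration, the query above could be built and its result unpacked in Python as follows. This is a sketch under assumptions: it assumes the decoded response carries its matches in a `hits` list, the function names are hypothetical, and the sample response is made up for illustration rather than copied from the API. It does not handle paging, so a real client would probably need more than one request to cover every species.

```python
from urllib.parse import urlencode

TAXONOMY_QUERY_URL = "http://t.biothings.io/v1/query"


def species_with_genes_url(fields=("taxid", "scientific_name")):
    """Build the URL for 'all species that have at least one gene'."""
    params = {"q": "has_gene:true", "fields": ",".join(fields)}
    # Keep ":" and "," unescaped so the URL matches the one in this thread.
    return f"{TAXONOMY_QUERY_URL}?{urlencode(params, safe=':,')}"


def extract_species(response):
    """Pull (taxid, scientific_name) pairs out of a decoded query response."""
    return [(hit["taxid"], hit["scientific_name"]) for hit in response.get("hits", [])]


# Illustrative response shape only (assumed), not real API output.
sample = {
    "total": 2,
    "hits": [
        {"_id": "9606", "taxid": 9606, "scientific_name": "homo sapiens"},
        {"_id": "9913", "taxid": 9913, "scientific_name": "bos taurus"},
    ],
}

print(species_with_genes_url())
print(extract_species(sample))
```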

vincerubinetti commented 3 years ago

Since there are so many organisms (> 24k), I won't be able to do the nice icons for each. However, perhaps we could do a selection of the most commonly used ones. @ravila4 could you put the top 100 organisms here in order?

In terms of what metric is considered "top", would that be "most genes of that organism"? Or "most searched for"? Or some other metric?

cgreene commented 3 years ago

What about number of gene sets? I'm not sure if we have a way to query that efficiently.

ravila4 commented 3 years ago

We can count the number of genesets we have for each species using this query: https://mygeneset.info/v1/query?q=*&facets=taxid&facet_size=100

It's not a perfect method, because facet_size has a maximum value of 1000, but it's not a big problem for now, as we don't have that many species in mygeneset.
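
For readers who prefer Python to jq, here is a small sketch of reading the counts out of that facet response. The `facets.taxid.terms` layout, with a `term` and a `count` per entry, is taken from the jq filters in the script in this comment. The function name is hypothetical and the sample is a hand-written fragment.

```python
def geneset_counts(response):
    """Return (taxid, count) pairs from a facet response, largest count first."""
    terms = response["facets"]["taxid"]["terms"]
    pairs = [(entry["term"], entry["count"]) for entry in terms]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hand-written fragment in the assumed response shape.
sample = {
    "facets": {
        "taxid": {
            "terms": [
                {"term": 10090, "count": 716},
                {"term": 9606, "count": 59283},
            ]
        }
    }
}

print(geneset_counts(sample))
```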

I wrote a small bash script to get a csv of counts for each species in mygeneset and their scientific names:


#!/bin/bash

# Genesets aggregated by taxid
aggs=$(curl -s "https://mygeneset.info/v1/query?q=*&facets=taxid&facet_size=100")
taxids=$(echo "$aggs" | jq -r '.facets.taxid.terms | map(.term) | @csv')
counts=$(echo "$aggs" | jq -r '.facets.taxid.terms | map(.count) | @csv')

# Query the scientific name for each taxid with one batch POST request
resp=$(curl -s -X POST -d "q=${taxids}" "http://t.biothings.io/v1/query")
species=$(echo "$resp" | jq -r 'map(.scientific_name) | @csv')

# Transpose the three comma-separated rows into one CSV row per species
echo "${taxids}
${species}
${counts}" | rs -c, -C, -T

Here's the output:

taxid    scientific name                   number of genesets
9606     homo sapiens                      59283
9031     gallus gallus                     14771
9913     bos taurus                        14520
9823     sus scrofa                        14158
9615     canis lupus familiaris            13942
10090    mus musculus                      716
10116    rattus norvegicus                 669
7955     danio rerio                       434
559292   saccharomyces cerevisiae s288c    401
3702     arabidopsis thaliana              366
6239     caenorhabditis elegans            339
208964   pseudomonas aeruginosa pao1       329
7227     drosophila melanogaster           328
9598     pan troglodytes                   46
180454   anopheles gambiae str. pest       14
3694     populus trichocarpa               5
9796     equus caballus                    5
39947    oryza sativa japonica group       1

@vincerubinetti, if you need to query many taxids at the same time, it will be faster to use a batch POST request, as in the script above.
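
The script above pairs names with counts purely by position. Here is a hedged Python sketch of a keyed join instead. It assumes each item in the batch response echoes the submitted term in a `query` field and marks misses with `notfound`, as described in the BioThings batch-query documentation. Both function names and the sample data are hypothetical.

```python
def names_by_taxid(batch_results):
    """Map each queried taxid to its scientific name, skipping misses."""
    names = {}
    for item in batch_results:
        if item.get("notfound"):
            continue  # nothing matched this term
        names[str(item["query"])] = item["scientific_name"]
    return names


def merge_rows(counts, names):
    """Build (taxid, scientific_name, count) rows in the order of `counts`."""
    return [(taxid, names.get(str(taxid), "unknown"), n) for taxid, n in counts]


# Hand-written sample in the assumed batch-response shape.
sample_results = [
    {"query": "9606", "scientific_name": "homo sapiens"},
    {"query": "9031", "scientific_name": "gallus gallus"},
    {"query": "0", "notfound": True},
]
counts = [(9606, 59283), (9031, 14771)]

for row in merge_rows(counts, names_by_taxid(sample_results)):
    print(*row, sep="\t")
```

Joining on the echoed term means a missing or reordered result cannot silently shift names onto the wrong taxid.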

vincerubinetti commented 3 years ago

@ravila4 Is there a reason you did http://t.biothings.io/v1/query?q=TAXID rather than http://t.biothings.io/v1/taxon/TAXID? With the former, tax id 9031, for example, returns "carpinus sp. wen 9031" (tax id of 559512) as the first result, and "gallus gallus" (tax id of 9031) as the second result.

ravila4 commented 3 years ago

@vincerubinetti Because POST queries only work with the query endpoint, not taxon.

There is also a difference in the fields that GET and POST queries search by default. A GET query such as http://t.biothings.io/v1/query?q=9031 searches all the default fields, including taxid and scientific name. To limit the search to taxid only, use q=taxid:9031. I suspect that is why you are getting "carpinus sp. wen 9031" as a response.

However, if you use the POST method, the default search field is _id, which is the same as taxid for this API. You can also be explicit and pass a scopes=taxid parameter.

Here's an example:

POST: http://t.biothings.io/v1/query?q=9031,9606&scopes=taxid
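
To make the two query styles concrete, here is a minimal Python sketch that builds both requests without sending them. It assumes the POST parameters go in a form-encoded body, as `curl -d` sends them in the earlier script, rather than in the URL as in the example line above. The function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

QUERY_URL = "http://t.biothings.io/v1/query"


def get_by_taxid_url(taxid):
    """Field-scoped GET query: matches the taxid field only."""
    return f"{QUERY_URL}?{urlencode({'q': f'taxid:{taxid}'}, safe=':')}"


def batch_post_request(taxids):
    """Batch POST query with an explicit scopes=taxid parameter."""
    body = urlencode(
        {"q": ",".join(str(t) for t in taxids), "scopes": "taxid"},
        safe=",",
    )
    return Request(
        QUERY_URL,
        data=body.encode("ascii"),
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


req = batch_post_request([9031, 9606])
print(get_by_taxid_url(9031))
print(req.get_method(), req.full_url, req.data.decode())
```

Passing `req` to `urllib.request.urlopen` would send the batch query. The sketch stops short of that so it runs without network access.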

ravila4 commented 3 years ago

Here's some documentation on POST queries from mygene.info: https://docs.mygene.info/en/latest/doc/query_service.html?highlight=post#batch-queries-via-post

The same should apply to the taxonomy API, and mygeneset.