Make gene functions discoverable

sidneymbell commented 6 years ago

I think it's important to contextualize diffexp results by what these genes do, biologically. There are multiple "levels" at which we could support this:
1) Just link out to wikigene / human protein atlas / gene ontologies as Ben suggests below.
2) Group differentially expressed genes by ontology classification/tags, make this an option to color by (e.g., I find that cell set 1 is expressing high levels of a bunch of genes that are involved in "lipid metabolism". Color all cells by their mean expression of genes with this GO tag.)
3) Support GO enrichment analysis in-app
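
A rough sketch of what the color-by value in (2) could be: one score per cell, taken as the mean expression over the genes carrying a given tag. The function name, the toy matrix, and the "lipid metabolism" membership list are all made up for illustration; nothing here reflects how cellxgene stores expression data.

```python
def mean_expression_for_gene_set(expression, var_names, gene_set):
    """Return one value per cell: the mean expression over genes in gene_set.

    expression: list of rows (one per cell), each with one value per gene.
    var_names:  gene name for each column of `expression`.
    Genes in gene_set that are missing from var_names are ignored.
    """
    wanted = set(gene_set)
    columns = [i for i, name in enumerate(var_names) if name in wanted]
    if not columns:
        raise ValueError("none of the requested genes are in the matrix")
    return [sum(row[i] for i in columns) / len(columns) for row in expression]


# Toy data: 3 cells x 4 genes (values are invented).
var_names = ["APOD", "CD8A", "LPL", "ACTB"]
expression = [
    [4.0, 0.0, 2.0, 9.0],
    [0.0, 5.0, 0.0, 8.0],
    [6.0, 1.0, 4.0, 7.0],
]
# Hypothetical tag membership; a real list would come from GO annotations.
lipid_metabolism = ["APOD", "LPL", "NOT_IN_MATRIX"]

scores = mean_expression_for_gene_set(expression, var_names, lipid_metabolism)
print(scores)  # one score per cell, usable as a continuous color scale
```

A plain mean is only one possible summary; a scaled or rank-based score may behave better when the tagged genes have very different expression ranges.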

From my notes:

What do these genes do (e.g., gene ontologies)?

From Ben Humphreys' notes:

It would be really helpful on the Expression tab if you could include a link on the gene name that would open a new window into something about that gene - whether WikiGene, or Human Protein Atlas for that gene (my preference), or some other source of info - so I don’t need to just google it myself.

sidneymbell commented 6 years ago

A related comment from Ben:

Another idea - would be SUPER COOL if I could select two cell types that I know are adjacent to one another in the kidney, then click a button such that one cluster then shows all receptors it expresses, the other cluster all the ligands it expresses - so I could get at intercellular communication…that would be amazing…

In a practical sense, this could be very similar to the above idea for color-by-function (both involve filtering genes by function / GO classifications)
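
To make the "filter genes by function" framing concrete, here is a minimal sketch of the receptor/ligand idea: given the genes expressed in two clusters and a curated list of ligand-receptor pairs, keep the pairs whose ligand is expressed in one cluster and whose receptor is expressed in the other. The function name, the cluster contents, and the short pair list are illustrative assumptions; a real version would need a proper ligand-receptor resource and an expression threshold.

```python
def candidate_interactions(expressed_a, expressed_b, ligand_receptor_pairs):
    """Pairs whose ligand is expressed in cluster A and receptor in cluster B."""
    a, b = set(expressed_a), set(expressed_b)
    return [(lig, rec) for lig, rec in ligand_receptor_pairs if lig in a and rec in b]


# A few well-known ligand-receptor pairs, used only as example input.
pairs = [("VEGFA", "KDR"), ("PDGFB", "PDGFRB"), ("EGF", "EGFR")]

# Hypothetical "expressed gene" sets for two neighboring clusters.
cluster_a = {"VEGFA", "EGF", "NPHS1"}
cluster_b = {"KDR", "PDGFRB", "PECAM1"}

print(candidate_interactions(cluster_a, cluster_b, pairs))
```

Swapping the two cluster arguments gives the opposite signaling direction, so a UI would probably want to show both.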

sidneymbell commented 5 years ago

This was raised as a desired feature by the most recent feedback session w/ Ben Humphreys' lab, led by @neuromusic w/ @fionagriffin. I think it's worth revisiting.

The suggested GO annotations have a REST API, available here. I believe this would satisfy the requirements for an API that @colinmegill has been enthusiastic about finding / searching for?

One option for a tooltip as suggested by Colin could look like this: gene-ontologies

neuromusic commented 5 years ago

another API discovered w/ @colinmegill this afternoon: mygene.info by @andrewsu's group

the API query would involve (at least) two steps:

query for gene name to get entrez name
get aggregated gene info from entrez name
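
The two steps could look roughly like this against the mygene.info v3 REST endpoints, using only the standard library. The helper names and the injectable `fetch` argument are my own additions (the latter so the logic can be exercised without a network connection), and the chosen `fields` are just an example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://mygene.info/v3"


def fetch_json(url):
    """Default fetcher: GET a URL and decode the JSON body."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def lookup_gene(symbol, species="mouse", fetch=fetch_json):
    """Resolve a gene symbol to an Entrez id, then fetch aggregated gene info.

    Returns None when the symbol does not resolve to an Entrez id.
    """
    # Step 1: gene name -> Entrez id.
    params = urlencode({"q": f"symbol:{symbol}", "species": species,
                        "fields": "entrezgene", "size": 1})
    hits = fetch(f"{BASE}/query?{params}").get("hits", [])
    if not hits or "entrezgene" not in hits[0]:
        return None
    entrez_id = hits[0]["entrezgene"]

    # Step 2: Entrez id -> aggregated annotation.
    params = urlencode({"fields": "symbol,name,summary"})
    return fetch(f"{BASE}/gene/{entrez_id}?{params}")


# Usage (needs network access):
# info = lookup_gene("Cd8a")
# print(info.get("summary") if info else "no Entrez id found")
```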

querying mygene.info for each gene in the tabula muris h5ad file resulted in...

92% of all genes yielding an entrez id
60% of all genes yielding a refseq summary through the API

example using the python wrapper for the API: https://gist.github.com/neuromusic/6ab7769c2030eec573b61b03a8021620

andrewsu commented 5 years ago

A few quick notes...

instead of looping over mg.query, you can also perform batch queries via mg.querymany, as described in https://pypi.org/project/mygene/
to perform the query and get specific annotation fields in one step, use the currently undocumented fields parameter (e.g., mg.querymany(['1500015L24Rik','1500016L03Rik','Zhx1', 'Zrsr2'],scopes='symbol',fields='entrezgene,summary,symbol'))
- EDIT: Note that the sparse documentation of the python client notwithstanding, I believe it does implement all the features described in the main mygene.info documentation
mygene.info and the whole suite of BioThings APIs is primarily led by @newgene
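
Putting the batch suggestion together with the coverage numbers reported earlier in the thread, a sketch might look like the following. `annotate_symbols` wraps the `querymany` call shown above (it needs the `mygene` package and network access); `coverage` is a hypothetical helper that relies on each returned hit carrying a `query` key, with unmatched symbols simply lacking `entrezgene` and `summary`. The toy results at the bottom use placeholder values, not real identifiers.

```python
def annotate_symbols(symbols, species="mouse"):
    """Batch-query mygene.info for Entrez ids and summaries in one call."""
    import mygene  # imported lazily so `coverage` below works without it
    mg = mygene.MyGeneInfo()
    return mg.querymany(symbols, scopes="symbol",
                        fields="entrezgene,summary,symbol", species=species)


def coverage(results):
    """Fraction of distinct queried symbols with an Entrez id / a summary."""
    by_query = {}
    for hit in results:
        entry = by_query.setdefault(hit["query"], {"entrez": False, "summary": False})
        # A symbol counts as covered if any of its hits has the field.
        entry["entrez"] = entry["entrez"] or "entrezgene" in hit
        entry["summary"] = entry["summary"] or "summary" in hit
    total = len(by_query)
    if total == 0:
        return {"entrez": 0.0, "summary": 0.0}
    return {
        "entrez": sum(e["entrez"] for e in by_query.values()) / total,
        "summary": sum(e["summary"] for e in by_query.values()) / total,
    }


# Offline toy input shaped like querymany output (all values are placeholders).
toy_results = [
    {"query": "Zhx1", "entrezgene": "placeholder", "summary": "placeholder text"},
    {"query": "Zrsr2", "entrezgene": "placeholder"},
    {"query": "1500016L03Rik", "entrezgene": "placeholder"},
    {"query": "1500015L24Rik", "notfound": True},
]
print(coverage(toy_results))
```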

Further questions and feedback are of course always welcome!

cornhundred commented 4 years ago

This is similar to what we're doing with Clustergrammer-JS's and Clustergrammer2's biology-specific features. Mousing over a gene row looks up the gene name and refseq via the Harmonizome. Similarly, enrichment analysis is done via Enrichr. We have back-end (Python) and front-end (JavaScript) implementations for Enrichr.

Let us know if that sounds like what you would like to implement and if we can help.

sidneymbell commented 4 years ago

Hi @cornhundred -- thank you so much for the suggestion!

We'll have to look into whether their license is compatible with ours, but I super appreciate the pointer! It looks like a great resource (and Clustergrammer looks like a cool tool :).

cornhundred commented 4 years ago

Hi @sidneymbell, feel free to contact the Ma'ayan lab about their licenses (I'm pretty sure they're permissive), Harmonizome-license.

We're glad you like Clustergrammer! The Clustergrammer2 widget we are working on has a lot of similarities with cellxgene: we're using regl, Python back-end, built for single cell gene expression data. Feel free to check out the Clustergrammer2-notebooks repo: https://github.com/ismms-himc/clustergrammer2-notebooks for some example workflows (see video below):

[image: 2,700 PBMC scRNA-seq] http://www.youtube.com/watch?v=BEPspcC7vIY

We would love feedback and I'm sure we will reach out to you all about cross-tool compatibility, etc. in the future :)

sidneymbell commented 4 years ago

👍👍

sidneymbell commented 4 years ago

Another option that was suggested today by the GO folks: https://github.com/biolink/ontobio

I haven't looked into this extensively, but it's got a permissive license (BSD-3)

sidneymbell commented 4 years ago

We want to make gene functions discoverable from within the app by pulling in data from public databases.

Implementation:

launch will need to map input var_names to gene identifiers; many of the APIs listed below take care of mapping between the various naming schemes, but it is still possible that a user would input a matrix with names like zebra, in which case we should not try to fetch gene function data.
The client should lazy-load gene function data as a given column is accessed. Ideally, this would be via an API to an external database. Alternatively, basic gene ontology tags could be pulled from a static copy of GO that is packaged with the repo. Species (at least human and mouse) should be specified via the CLI so the correct annotations are fetched.
The initial idea was a tooltip that surfaces the short gene description and top ontology tag, and expands on click to show the full summary and link out to NCBI and Humanbase.
We should inform users that we may send their gene names to an external data source, point them to the relevant privacy policies and terms of service, and make it possible to disable this feature from the CLI.
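
A minimal sketch of how these pieces could fit together, written in Python for illustration even though the lazy-loading would live in the client. The flag names (`--gene-info-species`, `--disable-gene-info`), the `LazyGeneInfo` class, and the stub fetcher are all hypothetical; none of them exist in cellxgene.

```python
import argparse


def build_parser():
    """CLI options for species selection and for opting out of lookups."""
    parser = argparse.ArgumentParser(prog="launch-sketch")
    parser.add_argument("--gene-info-species", choices=["human", "mouse"],
                        default="human")
    parser.add_argument("--disable-gene-info", action="store_true")
    return parser


class LazyGeneInfo:
    """Fetch gene function data on first access and cache it afterwards."""

    def __init__(self, known_ids, fetch, enabled=True):
        self.known_ids = known_ids  # var_name -> gene identifier, built at launch
        self.fetch = fetch          # callable taking a gene identifier
        self.enabled = enabled
        self._cache = {}

    def get(self, var_name):
        if not self.enabled:
            return None  # user opted out: never contact the external source
        gene_id = self.known_ids.get(var_name)
        if gene_id is None:
            return None  # e.g. "zebra": not a recognized gene, so do not fetch
        if gene_id not in self._cache:
            self._cache[gene_id] = self.fetch(gene_id)
        return self._cache[gene_id]


args = build_parser().parse_args(["--gene-info-species", "mouse"])
store = LazyGeneInfo(
    known_ids={"Cd8a": "id-for-Cd8a"},  # placeholder identifier
    fetch=lambda gene_id: {"id": gene_id, "species": args.gene_info_species},
    enabled=not args.disable_gene_info,
)
print(store.get("Cd8a"))   # fetched once, then served from the cache
print(store.get("zebra"))  # None: unmapped names are never sent out
```

Keeping the opt-out check inside `get` means a disabled feature cannot leak gene names even if some caller forgets to check the flag.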

Data source options

An ideal data source for this feature would:
Be regularly updated
Have a compatible open-source license
Have a robust, RESTful API
Be reliable (e.g., provided by a relatively stable institution)
Bonus: provide additional derived information about genes / relevant networks / etc. (i.e., surface reasonable metaanalysis of multiple public data sources)

Initial landscaping surfaced quite a few options for data sources. I’ve highlighted some of the most appealing options below with pros/cons, but there are probably also other good options out there. (See appendix for a list of options I don’t think are a good fit.)

Recommendation

NCBI gene database: entrez API
URL: https://www.ncbi.nlm.nih.gov/gene
About: https://www.ncbi.nlm.nih.gov/books/NBK25501/
License: https://www.ncbi.nlm.nih.gov/home/about/policies/
Pros: Direct access to a wide range of frequently-updated descriptive information of gene function in many species
Cons: I haven’t yet found a set of JS-based wrapper functions, although the Python API is quite robust

Humanbase
URL: https://hb.flatironinstitute.org/api/
About: https://hb.flatironinstitute.org/about
License: CC-BY 4.0 (per direct communication, in process of adding to docs)
Diligence in progress: compatible licensing and methods validation
Pros:
Surfaces interacting genes, functional processes, and tissue-specific expression
I would imagine that support from the flatiron institute is pretty stable?
Cons: License is not yet publicly documented on their site

Other sources I considered

Gene Ontology Consortium: AmiGO (GOlr)
URL: http://wiki.geneontology.org/index.php/AmiGO_2_Manual:_JavaScript
About: https://link.springer.com/protocol/10.1007/978-1-4939-3743-1_11
License: Creative Commons Attribution 4.0 Unported License
Pros:
Direct access to the most up-to-date gene ontologies.
Offers API for on-demand queries OR direct download of ontologies file that could be packaged into each release (~8MB; advantage is that this would not require an internet connection or sending information outside of the app).
Cons:
Only pulls from the GO consortium / doesn’t offer any additional information directly
API appears somewhat confusing

Mygene.info
URL: https://mygene.info/
About: https://mygene.info/about
License: Apache 2.0
Pros:
Weekly updated gene ontologies access
API is RESTful and documentation is good
Cons:
Only pulls from the GO consortium / doesn’t offer any additional information directly
Unclear how stable the source is

Harmonizome
URL: https://amp.pharm.mssm.edu/Harmonizome/gene/BRCA1
Pros: Nice visual display of most of the information present in the NCBI gene database + a few others
Cons: Doesn’t offer a huge amount of additional information compared to NCBI, and adds another layer of dependency

GeneNetwork.nl
URL: https://www.genenetwork.nl/faq
All associations putatively based on co-regulation in bulk RNAseq

Other resources with a different use-case
Reverse search (function → genes): https://amp.pharm.mssm.edu/geneshot/api.html
Mendelian disease focus: https://www.omim.org/about
Commercial: Gene Cards
Gene set enrichment: webgestalt.org, geneweaver, DiVenn

neuromusic commented 4 years ago

re: mygene.info

Only pulls from the GO consortium / doesn’t offer any additional information directly

Doesn't this service aggregate a bunch of data sources? https://docs.mygene.info/en/latest/doc/data.html

Was this a mis-copy from the "Gene Ontology Consortium: AmiGO (GOlr)" entry above?

andrewsu commented 4 years ago

Just a bit more info on mygene.info in case it's useful:

We pull GO annotations from NCBI's gene2go file, so should have the same data as the Entrez API
We generally refresh all sources weekly, more info about the data we load at https://mygene.info/metadata
For each GO annotation, we report evidence code, PMID(s), and qualifiers (e.g., "NOT") (but true that we do not pull in PMID journal name / title / authors, for example)
30-day uptime is tracked at https://mygene.info/#api-status and https://stats.uptimerobot.com/7y2AFWAE, currently at 100%
If "function" might include roles in biological pathways and/or presence of interpro domains, we've got that too!

Clearly I'm biased, but seems like you've got several good options for your use case here!

sidneymbell commented 4 years ago

@neuromusic -- yes, that was a copy/paste error, thanks for catching :) @andrewsu -- thanks for sharing! Mygene.info sounds like an awesome tool. I think in this case, we can get the data we need directly from entrez without needing an additional dependency. I'll definitely keep mygene.info in mind if that changes, though!

colinmegill commented 3 years ago

@sidneymbell @neuromusic @ambrosejcarr it seems as if someone has done the thing:

http://amp.pharm.mssm.edu/Harmonizome/api/1.0/gene/apod

{"symbol":"APOD","synonyms":[],"name":"apolipoprotein D","description":"This gene encodes a component of high density lipoprotein that has no marked similarity to other apolipoprotein sequences. It has a high degree of homology to plasma retinol-binding protein and other members of the alpha 2 microglobulin protein superfamily of carrier proteins, also known as lipocalins. This glycoprotein is closely associated with the enzyme lecithin:cholesterol acyltransferase - an enzyme involved in lipoprotein metabolism. [provided by RefSeq, Aug 2008]","ncbiEntrezGeneId":347,"ncbiEntrezGeneUrl":"http://www.ncbi.nlm.nih.gov/gene/347","proteins":[{"symbol":"APOD_HUMAN","href":"/api/1.0/protein/APOD_HUMAN"}],"hgncRootFamilies":[{"name":"Calycin structural superfamily","href":"/api/1.0/gene_family/Calycin+structural+superfamily"},{"name":"Apolipoproteins (APO)","href":"/api/1.0/gene_family/Apolipoproteins+%28APO%29"}]}
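
For the tooltip idea discussed earlier, a response like the one above could be reduced to a short description, a full summary, and an NCBI link along these lines. The function name, the output keys, and the truncation rule are assumptions for illustration; the sample payload is an abridged copy of the APOD response.

```python
import json


def tooltip_from_harmonizome(payload, max_chars=120):
    """Turn a Harmonizome gene record into the pieces a tooltip would need."""
    record = json.loads(payload) if isinstance(payload, str) else payload
    description = record.get("description", "")
    if len(description) <= max_chars:
        short = description
    else:
        # Cut at the last word boundary before the limit.
        short = description[:max_chars].rsplit(" ", 1)[0] + "..."
    return {
        "title": f'{record["symbol"]}: {record["name"]}',
        "short": short,        # shown on hover
        "full": description,   # shown when the tooltip is expanded
        "link": record.get("ncbiEntrezGeneUrl"),
        "families": [f["name"] for f in record.get("hgncRootFamilies", [])],
    }


# Abridged copy of the response above.
sample = json.dumps({
    "symbol": "APOD",
    "name": "apolipoprotein D",
    "description": "This gene encodes a component of high density lipoprotein "
                   "that has no marked similarity to other apolipoprotein "
                   "sequences. [provided by RefSeq, Aug 2008]",
    "ncbiEntrezGeneUrl": "http://www.ncbi.nlm.nih.gov/gene/347",
    "hgncRootFamilies": [{"name": "Calycin structural superfamily"},
                         {"name": "Apolipoproteins (APO)"}],
})

tip = tooltip_from_harmonizome(sample)
print(tip["title"])
print(tip["short"])
```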

cornhundred commented 3 years ago

gene_info

@colinmegill @sidneymbell @neuromusic @ambrosejcarr Yes, when we were building the Harmonizome at the Ma'ayan lab we made sure to make it CORS compatible (https://clustergrammer.readthedocs.io/biology_specific_features.html#mouseover-gene-name-and-description).

We have this example on ObservableHQ (https://observablehq.com/@ismms-himc/covid-19-transcriptional-signature-tenoever-data-a549?collection=@ismms-himc/ismms-himc-covid-19) that shows you can talk to Enrichr (for enrichment analysis) and Harmonizome via Clustergrammer-GL and some REST GET requests.

colinmegill commented 3 years ago

I do apologize for not realizing this was JSON, in the thread above :)

ambrosejcarr commented 3 years ago

The NCBI recommendation cited by @sidneymbell has a relatively simple set of web tools.

CD8A: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=925
APOD: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=347

Of note, search for the "representative expression" section at the bottom: it has recorded tissues in which expression of the gene has been established.
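
For programmatic use, the same E-utilities can be called with the standard library. The sketch below builds the efetch URLs shown above and also tries `esummary` with `retmode=json`. The helper names and the injectable `fetch` argument are my own, and the assumed response layout (`result` keyed by gene id, with `name`, `description`, and `summary` fields) is my understanding of the esummary output and should be checked against a live response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def eutils_url(tool, **params):
    """Build an E-utilities URL, e.g. tool='efetch' or tool='esummary'."""
    return f"{EUTILS}/{tool}.fcgi?{urlencode(params)}"


def http_get_json(url):
    """Default fetcher: GET a URL and decode the JSON body."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def gene_summary(entrez_id, fetch=http_get_json):
    """Fetch name, description, and summary for one Entrez gene id."""
    url = eutils_url("esummary", db="gene", id=entrez_id, retmode="json")
    record = fetch(url)["result"][str(entrez_id)]  # assumed response layout
    return {key: record.get(key) for key in ("name", "description", "summary")}


print(eutils_url("efetch", db="gene", id=925))  # the CD8A link from this comment

# Usage (needs network access):
# print(gene_summary(347))
```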

ambrosejcarr commented 3 years ago

@Alokito mentioned that it would be a good idea for us to enable cellxgene to read from multiple cell databases. For companies, this would let them connect cellxgene to their own internal metadata repositories. For us, it would facilitate easier swapping between feature namespaces (protein, DNA, transcripts, genes) and ensure cellxgene remains a general tool -- the requirement would be that the database index overlaps with the var index in cellxgene. We could also enable the feature to read from .var metadata as a default.
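
One way to picture this is a small interface that any metadata source implements, plus a check that the source's index overlaps the var index. Everything below (`FeatureMetadataSource`, `VarTableSource`, `index_overlap`, `first_hit`) is a hypothetical sketch of that shape, not an existing cellxgene API.

```python
from typing import Mapping, Optional, Protocol


class FeatureMetadataSource(Protocol):
    """Anything that can describe a feature given its identifier."""

    def lookup(self, feature_id: str) -> Optional[Mapping[str, str]]:
        ...


class VarTableSource:
    """Default source: serve whatever metadata is already present in .var."""

    def __init__(self, var_rows):
        self.var_rows = var_rows  # feature id -> dict of .var columns

    def lookup(self, feature_id):
        return self.var_rows.get(feature_id)


def index_overlap(var_index, source_index):
    """Fraction of the var index that the source can describe."""
    var_set = set(var_index)
    if not var_set:
        return 0.0
    return len(var_set & set(source_index)) / len(var_set)


def first_hit(sources, feature_id):
    """Ask each configured source in order; return the first answer."""
    for source in sources:
        record = source.lookup(feature_id)
        if record is not None:
            return record
    return None


# Toy .var content (made up).
var_source = VarTableSource({"APOD": {"feature_type": "gene"}})
print(index_overlap(["APOD", "CD8A"], var_source.var_rows))
print(first_hit([var_source], "APOD"))
```

An overlap check at launch would let the tool warn early when a configured database uses a different namespace than the matrix.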

Munfred commented 3 years ago

Hello, g:Profiler is another source for you to look at: https://biit.cs.ut.ee/gprofiler/gost

It supports all ensembl organisms and already has a python API: https://pypi.org/project/gprofiler-official/ https://biit.cs.ut.ee/gprofiler/page/apis
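
For reference, a call through the Python client might look like the sketch below. `enrich` wraps `GProfiler.profile` from the `gprofiler-official` package (it needs that package and network access); `top_terms` is a hypothetical helper that assumes each result is a dict with `source`, `native`, `name`, and `p_value` keys. The toy rows use real GO term names with invented p-values and a placeholder KEGG row.

```python
def enrich(genes, organism="hsapiens"):
    """Run a g:Profiler enrichment query and return a list of result dicts."""
    from gprofiler import GProfiler  # lazy import: pip install gprofiler-official
    gp = GProfiler(return_dataframe=False)
    return gp.profile(organism=organism, query=list(genes))


def top_terms(results, source="GO:BP", limit=5):
    """Most significant terms from one annotation source, best first."""
    rows = [r for r in results if r.get("source") == source]
    rows.sort(key=lambda r: r["p_value"])
    return [(r["native"], r["name"], r["p_value"]) for r in rows[:limit]]


# Offline toy input shaped like the client's output (p-values are invented).
toy = [
    {"source": "GO:BP", "native": "GO:0008152", "name": "metabolic process", "p_value": 0.04},
    {"source": "GO:BP", "native": "GO:0006629", "name": "lipid metabolic process", "p_value": 0.001},
    {"source": "KEGG", "native": "KEGG:placeholder", "name": "placeholder pathway", "p_value": 0.0001},
]
print(top_terms(toy, limit=1))

# Usage (needs network access):
# print(top_terms(enrich(["APOD", "LPL"])))
```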

signechambers1 commented 3 years ago

A good example of protein contextualization here (thanks Jonah and @ambrosejcarr): https://opencell.czbiohub.org/

Hrovatin commented 3 years ago

Is there a way for a user to browse var (gene metadata) in CellXGene (e.g. to decide which genes to plot later on)?

signechambers1 commented 3 years ago

Hi @Hrovatin, there is not a way to browse var in cellxgene. You can see if a gene exists in a dataset using the "Autosuggest gene" functionality in the top right corner which will autocomplete genes from the var index.
