chanzuckerberg / cellxgene

An interactive explorer for single-cell transcriptomics data
https://chanzuckerberg.github.io/cellxgene/
MIT License

Make gene functions discoverable #96

Open sidneymbell opened 6 years ago

sidneymbell commented 6 years ago

I think it's important to contextualize diffexp results by what these genes do, biologically. There are multiple "levels" at which we could support this:
1) Just link out to wikigene / human protein atlas / gene ontologies as Ben suggests below.
2) Group differentially expressed genes by ontology classification/tags, make this an option to color by (e.g., I find that cell set 1 is expressing high levels of a bunch of genes that are involved in "lipid metabolism". Color all cells by their mean expression of genes with this GO tag.)
3) Support GO enrichment analysis in-app
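
A rough sketch of option 2, in case it helps make the idea concrete: given a cells-by-genes expression matrix and the set of genes that share a GO tag, compute one mean-expression value per cell, which could then drive the color scale. The function name `mean_expression_by_cell`, the plain-list matrix layout, and the placeholder gene names are all assumptions for illustration; none of this exists in cellxgene.

```python
def mean_expression_by_cell(expression, gene_names, gene_set):
    """Mean expression per cell over the genes in gene_set.

    expression: list of rows, one row per cell, columns aligned with gene_names.
    gene_set:   genes sharing a tag (hypothetical input, e.g. from a GO lookup).
    """
    columns = [i for i, gene in enumerate(gene_names) if gene in gene_set]
    if not columns:
        raise ValueError("none of the tagged genes are present in this dataset")
    return [sum(row[i] for i in columns) / len(columns) for row in expression]


# Toy data: two cells, three genes (placeholder names, not real annotations).
gene_names = ["GENE_A", "GENE_B", "GENE_C"]
expression = [
    [1.0, 2.0, 3.0],  # cell 1
    [0.0, 4.0, 2.0],  # cell 2
]
tagged_genes = {"GENE_A", "GENE_C"}  # pretend these share one GO tag

color_values = mean_expression_by_cell(expression, gene_names, tagged_genes)
print(color_values)  # [2.0, 1.0]
```

The same gene-set filter would also be the starting point for option 3, since enrichment analysis needs the tag-to-genes mapping as well.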

From my notes:

What do these genes do (e.g., gene ontologies)?

From Ben Humphreys' notes:

It would be really helpful on the Expression tab if you could include a link on the gene name that would open a new window into something about that gene - whether WikiGene, or Human Protein Atlas for that gene (my preference), or some other source of info - so I don’t need to just Google it myself.

sidneymbell commented 6 years ago

A related comment from Ben:

Another idea - would be SUPER COOL if I could select two cell types that I know are adjacent to one another in the kidney, then click a button such that one cluster then shows all receptors it expresses, the other cluster all the ligands it expresses - so I could get at intercellular communication…that would be amazing…

In a practical sense, this could be very similar to the above idea for color-by-function (both involve filtering genes by function / GO classifications)

sidneymbell commented 5 years ago

This was raised as a desired feature by the most recent feedback session w/ Ben Humphreys' lab, led by @neuromusic w/ @fionagriffin. I think it's worth revisiting.

The suggested GO annotations have a REST API, available here. I believe this would satisfy the requirements for an API that @colinmegill has been enthusiastic about finding / searching for?

One option for a tooltip as suggested by Colin could look like this: [image: gene-ontologies]

neuromusic commented 5 years ago

another API discovered w/ @colinmegill this afternoon: mygene.info by @andrewsu's group

the API query would involve (at least) two steps:

  1. query for gene name to get Entrez ID
  2. get aggregated gene info from Entrez ID

querying mygene.info for each gene in the Tabula Muris h5ad file resulted in...

  1. 92% of all genes yielding an entrez id
  2. 60% of all genes yielding a refseq summary through the API

example using the python wrapper for the API: https://gist.github.com/neuromusic/6ab7769c2030eec573b61b03a8021620
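
The two-step lookup described above could be sketched roughly as follows, using only the standard library rather than the Python wrapper. The `/v3/query` and `/v3/gene/<id>` endpoints are mygene.info's public REST API; the helper names, the `species="mouse"` default (chosen only because Tabula Muris is mouse data), and the injectable `fetch` argument are assumptions made for this sketch. This is not the code from the gist.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

MYGENE_BASE = "https://mygene.info/v3"


def fetch_json(url):
    """Fetch a URL and decode its JSON body (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)


def query_url(symbol, species="mouse"):
    """Step 1: search by gene symbol, requesting only the Entrez ID."""
    params = {"q": f"symbol:{symbol}", "species": species, "fields": "entrezgene"}
    return f"{MYGENE_BASE}/query?{urlencode(params)}"


def gene_url(entrez_id, fields=("symbol", "name", "summary")):
    """Step 2: fetch aggregated annotation for one Entrez ID."""
    return f"{MYGENE_BASE}/gene/{entrez_id}?{urlencode({'fields': ','.join(fields)})}"


def lookup_gene(symbol, species="mouse", fetch=fetch_json):
    """Run both steps; return None when no hit carries an Entrez ID."""
    hits = fetch(query_url(symbol, species)).get("hits", [])
    entrez_ids = [hit["entrezgene"] for hit in hits if "entrezgene" in hit]
    if not entrez_ids:
        return None
    # Take the first hit that has an Entrez ID (a simplification).
    return fetch(gene_url(entrez_ids[0]))
```

Calling `lookup_gene("Apod")` would issue two requests. Genes with no Entrez ID return `None`, which corresponds to the genes that did not yield an ID in the numbers above; a gene record without a `summary` field corresponds to the second, lower percentage.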

andrewsu commented 5 years ago

A few quick notes...

Further questions and feedback are of course always welcome!

cornhundred commented 4 years ago

This is similar to what we're doing with Clustergrammer-JS's and Clustergrammer2's biology-specific features. Mousing over a gene row looks up the gene name and refseq via the Harmonizome. Similarly, enrichment analysis is done via Enrichr. We have back-end (Python) and front-end (JavaScript) implementations for Enrichr.

Let us know if that sounds like what you would like to implement and if we can help.

sidneymbell commented 4 years ago

Hi @cornhundred -- thank you so much for the suggestion!

We'll have to look into whether their license is compatible with ours, but I super appreciate the pointer! It looks like a great resource (and Clustergrammer looks like a cool tool :).

cornhundred commented 4 years ago

Hi @sidneymbell, feel free to contact the Ma'ayan lab about their licenses (I'm pretty sure they're permissive); see the Harmonizome license: https://github.com/MaayanLab/harmonizome/blob/master/LICENSE

We're glad you like Clustergrammer! The Clustergrammer2 widget we are working on has a lot of similarities with cellxgene: we're using regl, Python back-end, built for single cell gene expression data. Feel free to check out the Clustergrammer2-notebooks repo: https://github.com/ismms-himc/clustergrammer2-notebooks for some example workflows (see video below):

[video: 2,700 PBMC scRNA-seq] http://www.youtube.com/watch?v=BEPspcC7vIY

We would love feedback and I'm sure we will reach out to you all about cross-tool compatibility, etc. in the future :)

sidneymbell commented 4 years ago

👍👍

On Mon, Jul 22, 2019 at 5:12 PM Nicolas Fernandez notifications@github.com wrote:

Hi @sidneymbell https://github.com/sidneymbell, feel free to contact the Ma'ayan lab about their licenses (I'm pretty sure they're permissive), Harmonizome-license https://github.com/MaayanLab/harmonizome/blob/master/LICENSE.

We're glad you like Clustergrammer! The Clustergrammer2 widget we are working on has a lot of similarities with cellxgene: we're using regl, Python back-end, built for single cell gene expression data. Feel free to check out the Clustergrammer2-notebooks repo: https://github.com/ismms-himc/clustergrammer2-notebooks for some example workflows (see video below):

[image: 2,700 PBMC scRNA-seq] http://www.youtube.com/watch?v=BEPspcC7vIY

We would love feedback and I'm sure we will reach out to you all about cross-tool compatibility, etc. in the future :)


sidneymbell commented 4 years ago

Another option that was suggested today by the GO folks: https://github.com/biolink/ontobio

I haven't looked into this extensively, but it's got a permissive license (BSD-3)

sidneymbell commented 4 years ago

We want to make gene functions discoverable from within the app by pulling in data from public databases.

Implementation:

Data source options

Initial landscaping surfaced quite a few options for data sources. I’ve highlighted some of the most appealing options below with pros/cons, but there are probably also other good options out there. (See appendix for a list of options I don’t think are a good fit.)

Recommendation

NCBI gene database: entrez API
URL: https://www.ncbi.nlm.nih.gov/gene
About: https://www.ncbi.nlm.nih.gov/books/NBK25501/
License: https://www.ncbi.nlm.nih.gov/home/about/policies/
Pros: Direct access to a wide range of frequently-updated descriptive information of gene function in many species
Cons: I haven’t yet found a set of JS-based wrapper functions, although the Python API is quite robust

Humanbase
URL: https://hb.flatironinstitute.org/api/
About: https://hb.flatironinstitute.org/about
License: CC-BY 4.0 (per direct communication, in process of adding to docs)
Diligence in progress: compatible licensing and methods validation
Pros:
  - Surfaces interacting genes, functional processes, and tissue-specific expression
  - I would imagine that support from the Flatiron Institute is pretty stable?
Cons: License is not yet publicly documented on their site

Other sources I considered

Gene Ontology Consortium: AmiGO (GOlr)
URL: http://wiki.geneontology.org/index.php/AmiGO_2_Manual:_JavaScript
About: https://link.springer.com/protocol/10.1007/978-1-4939-3743-1_11
License: Creative Commons Attribution 4.0 Unported License
Pros:
  - Direct access to the most up-to-date gene ontologies.
  - Offers API for on-demand queries OR direct download of ontologies file that could be packaged into each release (~8MB; advantage is that this would not require an internet connection or sending information outside of the app).
Cons:
  - Only pulls from the GO consortium / doesn’t offer any additional information directly
  - API appears somewhat confusing

Mygene.info
URL: https://mygene.info/
About: https://mygene.info/about
License: Apache 2.0
Pros:
  - Weekly updated gene ontologies access
  - API is RESTful and documentation is good
Cons:
  - Only pulls from the GO consortium / doesn’t offer any additional information directly
  - Unclear how stable the source is

Harmonizome
URL: https://amp.pharm.mssm.edu/Harmonizome/gene/BRCA1
Pros: Nice visual display of most of the information present in the NCBI gene database + a few others
Cons: Doesn’t offer a huge amount of additional information compared to NCBI, and adds another layer of dependency

GeneNetwork.nl
URL: https://www.genenetwork.nl/faq
All associations putatively based on co-regulation in bulk RNAseq

Other resources with a different use-case
  - Reverse search (function → genes): https://amp.pharm.mssm.edu/geneshot/api.html
  - Mendelian disease focus: https://www.omim.org/about
  - Commercial: Gene Cards
  - Gene set enrichment: webgestalt.org, geneweaver, DiVenn

neuromusic commented 4 years ago

re: mygene.info

Only pulls from the GO consortium / doesn’t offer any additional information directly

Doesn't this service aggregate a bunch of data sources? https://docs.mygene.info/en/latest/doc/data.html

Was this a mis-copy from the "Gene Ontology Consortium: AmiGO (GOlr)" entry above?

andrewsu commented 4 years ago

Just a bit more info on mygene.info in case it's useful:

Clearly I'm biased, but seems like you've got several good options for your use case here!

sidneymbell commented 4 years ago

@neuromusic -- yes, that was a copy/paste error, thanks for catching :) @andrewsu -- thanks for sharing! Mygene.info sounds like an awesome tool. I think in this case, we can get the data we need directly from entrez without needing an additional dependency. I'll definitely keep mygene.info in mind if that changes, though!

colinmegill commented 3 years ago

@sidneymbell @neuromusic @ambrosejcarr it seems as if someone has done the thing:

http://amp.pharm.mssm.edu/Harmonizome/api/1.0/gene/apod

{"symbol":"APOD","synonyms":[],"name":"apolipoprotein D","description":"This gene encodes a component of high density lipoprotein that has no marked similarity to other apolipoprotein sequences. It has a high degree of homology to plasma retinol-binding protein and other members of the alpha 2 microglobulin protein superfamily of carrier proteins, also known as lipocalins. This glycoprotein is closely associated with the enzyme lecithin:cholesterol acyltransferase - an enzyme involved in lipoprotein metabolism. [provided by RefSeq, Aug 2008]","ncbiEntrezGeneId":347,"ncbiEntrezGeneUrl":"http://www.ncbi.nlm.nih.gov/gene/347","proteins":[{"symbol":"APOD_HUMAN","href":"/api/1.0/protein/APOD_HUMAN"}],"hgncRootFamilies":[{"name":"Calycin structural superfamily","href":"/api/1.0/gene_family/Calycin+structural+superfamily"},{"name":"Apolipoproteins (APO)","href":"/api/1.0/gene_family/Apolipoproteins+%28APO%29"}]}
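
As a minimal sketch of how a response like the one above could feed a gene tooltip: parse the JSON and keep only the fields worth displaying. The base URL is copied from the link above and the field names come from that response; the helper names and the choice of fields are assumptions for illustration, and the sample string is an abridged copy of the response (description and `href` values omitted).

```python
import json

# Base URL copied from the link above.
HARMONIZOME_GENE_API = "http://amp.pharm.mssm.edu/Harmonizome/api/1.0/gene"


def harmonizome_gene_url(symbol):
    """Build the per-gene endpoint URL for a gene symbol."""
    return f"{HARMONIZOME_GENE_API}/{symbol}"


def tooltip_fields(response_text):
    """Pick the fields a tooltip might show; missing fields become None or []."""
    gene = json.loads(response_text)
    return {
        "symbol": gene.get("symbol"),
        "name": gene.get("name"),
        "description": gene.get("description"),
        "ncbi_url": gene.get("ncbiEntrezGeneUrl"),
        "families": [family["name"] for family in gene.get("hgncRootFamilies", [])],
    }


# Abridged copy of the APOD response shown above.
sample = (
    '{"symbol":"APOD","synonyms":[],"name":"apolipoprotein D",'
    '"ncbiEntrezGeneId":347,'
    '"ncbiEntrezGeneUrl":"http://www.ncbi.nlm.nih.gov/gene/347",'
    '"hgncRootFamilies":[{"name":"Calycin structural superfamily"},'
    '{"name":"Apolipoproteins (APO)"}]}'
)

fields = tooltip_fields(sample)
print(fields["name"])  # apolipoprotein D
```
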

cornhundred commented 3 years ago

[image: gene_info]

@colinmegill @sidneymbell @neuromusic @ambrosejcarr Yes, when we were building the Harmonizome at the Ma'ayan lab we made sure to make it CORS compatible (https://clustergrammer.readthedocs.io/biology_specific_features.html#mouseover-gene-name-and-description).

We have this example on ObservableHQ (https://observablehq.com/@ismms-himc/covid-19-transcriptional-signature-tenoever-data-a549?collection=@ismms-himc/ismms-himc-covid-19) that shows you can talk to Enrichr (for enrichment analysis) and Harmonizome via Clustergrammer-GL and some REST GET requests.

colinmegill commented 3 years ago

I do apologize for not realizing this was JSON, in the thread above :)

ambrosejcarr commented 3 years ago

The NCBI recommendation cited by @sidneymbell has a relatively simple set of web tools.

CD8A: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=925
APOD: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=347

Of note, search for the "representative expression" section at the bottom: it has recorded tissues in which expression of the gene has been established.
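
For reference, the two URLs above follow one pattern, so building them from an Entrez gene ID is straightforward. A minimal sketch: `efetch.fcgi`, `db`, `id`, and `retmode` are E-utilities names, while the helper name `efetch_gene_url` is made up here.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_gene_url(entrez_id, retmode=None):
    """Build an efetch URL for one record in the NCBI gene database."""
    params = {"db": "gene", "id": entrez_id}
    if retmode is not None:
        # Optional output format, e.g. "xml"; left out by default to match the URLs above.
        params["retmode"] = retmode
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


print(efetch_gene_url(925))  # CD8A
print(efetch_gene_url(347))  # APOD
```

Note that efetch takes an Entrez ID, so a gene symbol from the var index would first have to be mapped to an ID by some other lookup.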

ambrosejcarr commented 3 years ago

@Alokito mentioned that it would be a good idea for us to enable cellxgene to read from multiple cell databases. For companies, this would let them interface cellxgene with their own internal metadata repositories. For us, it would make it easier to swap between feature namespaces (protein, DNA, transcripts, genes) and ensure cellxgene remains a general tool -- the requirement would be that the database index overlaps with the var index in cellxgene. We could also enable the feature to read from .var metadata as a default.
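
One way to picture the overlap requirement and the .var fallback, as a sketch only: treat each database as a mapping keyed by feature ID, check how much of the var index it covers, and fall back to .var metadata when no database has an entry. The function names, the dict-based "databases", and the example annotations are all hypothetical; nothing here reflects an existing cellxgene interface.

```python
def index_overlap(var_index, database_index):
    """Return the shared features and the fraction of the var index they cover."""
    var_ids = list(var_index)
    database_ids = set(database_index)
    shared = [feature for feature in var_ids if feature in database_ids]
    coverage = len(shared) / len(var_ids) if var_ids else 0.0
    return shared, coverage


def lookup_annotation(feature, databases, var_metadata):
    """Check each database in order, then fall back to .var metadata."""
    for database in databases:
        if feature in database:
            return database[feature]
    return var_metadata.get(feature)


# Hypothetical inputs: a var index, one internal database, and .var metadata.
var_index = ["APOD", "CD8A", "GENE_X"]
internal_db = {"APOD": {"note": "internal annotation"}}
var_metadata = {"CD8A": {"note": "from .var"}}

shared, coverage = index_overlap(var_index, internal_db)
print(shared)  # ['APOD']
print(lookup_annotation("CD8A", [internal_db], var_metadata))  # {'note': 'from .var'}
```

Checking coverage up front would let the app warn when a configured database shares few or no IDs with the dataset, for example because it uses a different feature namespace.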

Munfred commented 3 years ago

Hello, g:Profiler is another source for you to look at: https://biit.cs.ut.ee/gprofiler/gost

It supports all ensembl organisms and already has a python API: https://pypi.org/project/gprofiler-official/ https://biit.cs.ut.ee/gprofiler/page/apis

signechambers1 commented 3 years ago

A good example of protein contextualization here (thanks Jonah and @ambrosejcarr): https://opencell.czbiohub.org/

Hrovatin commented 3 years ago

Is there a way for a user to browse var (gene metadata) in CellXGene (e.g. to decide which genes to plot later on)?

signechambers1 commented 3 years ago

Hi @Hrovatin, there is not a way to browse var in cellxgene. You can see if a gene exists in a dataset using the "Autosuggest gene" functionality in the top right corner which will autocomplete genes from the var index.