chanzuckerberg / cellxgene

An interactive explorer for single-cell transcriptomics data
https://chanzuckerberg.github.io/cellxgene/
MIT License
615 stars 116 forks

Log/Linear y-axis scale toggle for right sidebar gene expression #1211

Open ambrosejcarr opened 4 years ago

ambrosejcarr commented 4 years ago

After differential expression has been calculated, cellxgene displays histograms to visualize the distribution of a selected gene across all cells on the right sidebar.

However, many genes are sparsely observed across all but the largest expressing populations, producing a very large "zero component" that scales with the size of the dataset. Adding the ability to toggle between log and linear scale for gene expression would enable categorical gene expression differences to be more readily observed across different populations.
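To make the effect concrete, here is a minimal sketch (not cellxgene code) of how the same bin counts map to bar heights under a linear and a log y-axis. The data, the `histogram` and `bar_heights` helpers, and the use of `log1p` so that empty bins stay at zero height are all assumptions made for illustration.

```python
import math


def histogram(values, n_bins, lo, hi):
    """Count values into n_bins equal-width bins over [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if v < lo or v > hi:
            continue
        # Clamp so that v == hi falls into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def bar_heights(counts, scale="linear"):
    """Map bin counts to relative bar heights in [0, 1].

    scale="log" uses log1p (an assumption) so empty bins stay at height 0.
    """
    if scale == "log":
        scaled = [math.log1p(c) for c in counts]
    elif scale == "linear":
        scaled = [float(c) for c in counts]
    else:
        raise ValueError(f"unknown scale: {scale!r}")
    top = max(scaled, default=0.0)
    return [s / top if top > 0 else 0.0 for s in scaled]


# Hypothetical dataset: 100,000 non-expressing cells and a small
# cluster of 200 cells with expression around x = 6.
expression = [0.0] * 100_000 + [6.0] * 200
counts = histogram(expression, n_bins=10, lo=0.0, hi=10.0)

linear = bar_heights(counts, "linear")
log = bar_heights(counts, "log")

print(f"linear height of the x=6 bin: {linear[6]:.4f}")
print(f"log height of the x=6 bin:    {log[6]:.4f}")
```

On the linear axis the small cluster's bar is a fraction of a percent of the zero bar; on the log axis it is a substantial fraction of it, which is the difference the toggle is meant to expose.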

The following is an example of an extremely strong differential expression effect in a very well contained cluster that is almost invisible in the UI due to the large size of the dataset producing a very large "zero" component that swamps the large expression in the small cluster being compared (see the distribution at x=6).

image

By contrast, the linear distribution works well in the left sidebar, at least in this instance:

image

colinmegill commented 4 years ago

Thanks @ambrosejcarr. Going further, I think we should consider whether this could reasonably replace clipping.

ambrosejcarr commented 4 years ago

There are also cases where the histograms do not work, which makes me want to be able to adjust my xmin/xmax:

image

vals commented 4 years ago

Hi everyone,

Including log scaling of any quantitative scale in cellxgene would be immensely helpful. I think a lot of things that are easy to make clear with custom plots are hard to see on the linear scale used in cellxgene visualizations.

Let me illustrate with some examples on a data set I have been working with recently. (It's not super important what these cells are I think.)

Not discussed in this issue is log scaling of color scales. A very good thing to check is your count depth, the total number of UMIs per cell.

Here is a tSNE colored by total counts:

image

And here is the same plot on a log scale for the color:

image

In this case, I think it is much easier to see that the cluster to the right is a set of low-quality cells.

Similarly, when coloring by gene expression: here I am coloring the data by the proliferation marker Mki67. First on a linear scale:

image

Then on a log scale (for the color):

image

Again, I think it is easier here to see which cells are cycling and which are not.

Keeping with the example of Mki67, let's look at a histogram, as discussed above. Here is a histogram of the linear counts:

image

And here it is with a log-scaled x-axis:

image

(This example would be clearer if the histogram were broken up by, for example, cluster.)

As another example, let's look at the histogram of total_count. Linear:

image

Log:

image

As a summary, I think it would be useful to put in scaling switches between linear scale and log scale for any quantitative axis and color scales.
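As a rough illustration of what such a switch would change for a color scale, the sketch below maps a value to a position in [0, 1] along a color ramp under either scale. The `normalize` function and the example count depths are hypothetical and not taken from cellxgene.

```python
import math


def normalize(value, vmin, vmax, scale="linear"):
    """Map value to a position in [0, 1] along a color ramp.

    For scale="log" the bounds must be strictly positive; values below
    vmin are clamped to vmin so that the logarithm is defined.
    """
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    if scale == "log":
        if vmin <= 0:
            raise ValueError("log scale needs vmin > 0")
        value = max(value, vmin)
        t = (math.log(value) - math.log(vmin)) / (math.log(vmax) - math.log(vmin))
    elif scale == "linear":
        t = (value - vmin) / (vmax - vmin)
    else:
        raise ValueError(f"unknown scale: {scale!r}")
    # Clamp to the ends of the ramp.
    return min(max(t, 0.0), 1.0)


# Hypothetical count depths: most cells sit far below the maximum.
vmin, vmax = 500, 50_000
for total in (500, 5_000, 50_000):
    lin = normalize(total, vmin, vmax, "linear")
    log = normalize(total, vmin, vmax, "log")
    print(f"{total:>6}: linear={lin:.3f} log={log:.3f}")
```

With these made-up bounds, a cell at 5,000 counts sits in the bottom tenth of the linear ramp but at the midpoint of the log ramp, which is why low-depth clusters become easier to tell apart.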

colinmegill commented 4 years ago

cc @sidneymbell: I believe we've concluded that we have not implemented this, and may never be able to, because we don't know which values may have been sent to the client already log-transformed (in which case the affordance would offer the user the option to log values that were already logged).

ambrosejcarr commented 4 years ago

I see a few ways around this. 

First, the client could consider the data that it receives as f(x) and offer to take the log. As you say, if f(x) = log(x), then the client would create log(log(x)). I don't think there's anything wrong with that. However, it would have the disadvantage of not enabling users to convert their log data back to linear scale. I'm not clear if there's a need for the backwards conversion. 

Second, we could add a checkbox to the UI that says "data are log scale". It could either start unchecked, or we could apply some simple heuristics to guess whether it should be checked by doing a rough goodness-of-fit test. Better fit to a Gaussian distribution? Check the box. Better fit to an exponential distribution? Leave it empty. And the user can always correct us if we get it wrong. That gives us our answer, and we can apply log(x) or e^x in the background.
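One possible reading of that heuristic, as a sketch only: fit both distributions by maximum likelihood to the nonzero values and pre-check the box when the Gaussian has the higher log-likelihood. The function names, the decision to ignore zeros, and the use of a plain likelihood comparison as the "goodness of fit" are assumptions, not an agreed design.

```python
import math
from statistics import NormalDist, fmean, pvariance


def gaussian_loglik(xs):
    """Maximized Gaussian log-likelihood (MLE mean and variance)."""
    n = len(xs)
    var = pvariance(xs)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def exponential_loglik(xs):
    """Maximized exponential log-likelihood (MLE rate = 1 / mean)."""
    n = len(xs)
    return -n * (math.log(fmean(xs)) + 1)


def guess_is_log_scale(values):
    """Guess whether values look already log-transformed.

    Only strictly positive values are used (assumption: zeros are
    dropouts and carry no information about the scale). Returns False
    when there is too little data to decide.
    """
    xs = [v for v in values if v > 0]
    if len(xs) < 2 or pvariance(xs) == 0:
        return False
    return gaussian_loglik(xs) > exponential_loglik(xs)


# Deterministic stand-ins for real data, built from quantiles.
n = 1000
probs = [(i + 0.5) / n for i in range(n)]
exponential_like = [-math.log(1 - p) for p in probs]
gaussian_like = [NormalDist(mu=5, sigma=1).inv_cdf(p) for p in probs]

print(guess_is_log_scale(exponential_like))
print(guess_is_log_scale(gaussian_like))
```

The guess would only set the checkbox's initial state; the user's choice would still override it.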

@vals what do you think about these options? See any problems? Can you think of a better solution?

vals commented 4 years ago

My opinion would be to trust the user to know what they are looking at. I'm not aware of any plotting/visualization software that prevents using log scales because the user might have log scaled the values already. Even Excel allows the user to specify log scaled axes. If the aim is to be this dogmatic about inputs and units you should not allow anything except counts as input and enforce very specific data transformations.

The only way I can see anything using the (quantitative) coloring in cellxgene right now is by moving the clipping thing back and forth and looking at which cells are blinking. I guess a lot of people store log scaled values in the adata files. I would prefer not doing this though. Since I want to keep the original counts for statistical analysis and future reproducibility I would need to store twice the data only to save the work of applying a log on the fly: one of the cheapest and simplest operations a computer can do.
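A small sketch of what I mean by applying the log on the fly: keep only the raw counts and derive the logged view at display time. The helper names are hypothetical, and `log1p` is used here (an assumption) so that zero counts stay defined.

```python
import math


def log_view(counts):
    """Derive display values from raw counts without storing them."""
    return [math.log1p(c) for c in counts]


def linear_view(logged):
    """Invert log_view, recovering the original scale."""
    return [math.expm1(v) for v in logged]


raw_counts = [0, 1, 10, 1000]  # hypothetical UMI counts for one gene
displayed = log_view(raw_counts)
recovered = linear_view(displayed)

print(displayed)
print(recovered)
```

Only `raw_counts` needs to live in the file; the logged values are recomputed whenever the scale is switched.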

It has happened when I've made plots that I accidentally specify scale_y_log10() at the same time as y=log(variable) (making y=log(log(variable)), which is pretty useless). But I don't think that is a larger problem than e.g. accidentally plotting the wrong variable from a typo or so.

brianraymor commented 3 years ago

Per PM triage, moving to Product Backlog for prioritization.

inodb commented 2 years ago

@ambrosejcarr @brianraymor This feature would be great! I've had it requested a couple of times while demoing cellxgene for exploring HTAN datasets (https://cellxgene.cziscience.com/collections/62e8f058-9c37-48bc-9200-e767f318a8ec). Is there still a chance this will be implemented?

signechambers1 commented 2 years ago

Hi @inodb! Great to hear this would be helpful. We have not prioritized this feature yet but I will take your feedback into consideration!