JEFworks-Lab / STdeconvolve

Reference-free cell-type deconvolution of multi-cellular spatially resolved transcriptomics data
http://jef.works/STdeconvolve/

Using stdeconvolve with normalized / integrated data #25

Open Acaro12 opened 1 year ago

Acaro12 commented 1 year ago

Dear Brendan,

I am using STdeconvolve with an 8-sample integrated Visium Seurat object. All samples were individually normalized with Seurat's SCTransform algorithm before anchor-based integration (also with the Seurat toolkit).

The data output from SCTransform (and consequently also after integration) can be negative and is stored as doubles. Hence, it cannot be used with STdeconvolve.

I have two questions: 1) Why are non-negative integers required for STdeconvolve? 2) Could you think of a way to transform the data in accordance with the algorithm's requirements? Would a simple as.integer() + x be a valid way to do this?

Thank you so much in advance for your time! Best, Christoph

bmill3r commented 1 year ago

Hi @Acaro12,

Thanks so much for using STdeconvolve and for your questions!

The reason why STdeconvolve requires non-negative integers basically boils down to the fact that latent Dirichlet allocation (LDA) requires frequency counts of words or terms, specified as a matrix of non-negative integers. In this case, our terms are genes, but the same idea holds.
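
For that reason, rather than rounding or shifting the SCTransform/integrated values, one option is to hand STdeconvolve the raw UMI counts, which are already non-negative integers. A minimal sketch, assuming your object is called `seurat_obj`, the raw counts live in a "Spatial" assay, and you are on Seurat v4-style accessors (use `layer = "counts"` on Seurat v5):

```r
library(Seurat)

## pull the raw UMI counts (genes x spots) back out of the integrated object;
## the assay name "Spatial" is an assumption -- adjust to your object
cd <- GetAssayData(seurat_obj, assay = "Spatial", slot = "counts")

## sanity check: values should be non-negative integers
all(cd@x >= 0 & cd@x == floor(cd@x))
```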

With respect to combining multiple datasets, you could follow our strategy when analyzing the 4 breast cancer sections. Essentially, we take the union of overdispersed genes determined for each of the sections separately, then fit LDA models on the merged dataset, which is all the spots and the combined set of overdispersed genes. Note that in this case, all of the sections were taken from the same biopsy and so it is reasonable to assume that the technical variation between them should be low. If the sections are from different samples, then it might be more appropriate to analyze each separately. We have done this on datasets generated from different samples from the same tissue type (mouse olfactory bulb) and we have found high concordance between the deconvolved cell types (see Supplementary Figure S7). So although each sample is processed separately, STdeconvolve will likely find similar cell types if their gene expression profiles are distinct in the different datasets.
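
As a rough sketch of that merging strategy, assuming two sections with raw count matrices `cd1` and `cd2` (genes x spots) that share the same gene annotation; the thresholds and range of Ks are illustrative, not prescriptive:

```r
library(STdeconvolve)

## remove poorly captured genes and low-quality spots in each section
counts1 <- cleanCounts(cd1, min.lib.size = 100)
counts2 <- cleanCounts(cd2, min.lib.size = 100)

## overdispersed genes selected within each section separately, then unioned
od1 <- rownames(restrictCorpus(counts1, removeAbove = 1.0, removeBelow = 0.05))
od2 <- rownames(restrictCorpus(counts2, removeAbove = 1.0, removeBelow = 0.05))
odGenes <- intersect(union(od1, od2),
                     intersect(rownames(counts1), rownames(counts2)))

## merged corpus: all spots, combined set of overdispersed genes
## (make sure spot names are unique across sections before cbind-ing)
corpus <- cbind(as.matrix(counts1[odGenes, ]), as.matrix(counts2[odGenes, ]))

## fit LDA models on the merged dataset (fitLDA expects spots as rows)
ldas <- fitLDA(t(corpus), Ks = seq(5, 15, by = 1))
optLDA <- optimalModel(models = ldas, opt = "min")
```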

Hope this helps and let me know if you have any other questions, Brendan

joachimsiaw commented 1 year ago

Hi, I am also using STdeconvolve and I find it very interesting. Thanks for the great work on this tool. I have a question, though, regarding normalisation of gene expression across the spots. In Seurat, it is recommended to first normalize the data in order to account for variance in sequencing depth across spots/data points. It is known that, for instance in 10X Visium, variance in molecular counts per spot can be substantial, particularly if there are differences in cell density across the tissue. I haven't seen any such normalisation in the STdeconvolve pipeline. Could you please explain how the concern of variance in molecular counts or uneven cell density across the tissue is accounted for in your pipeline, or why it may not be needed?

How do you think such variance could potentially affect the gene expression profiles of the STdeconvolve resolved topics or cell types? Thank you in advance.

bmill3r commented 1 year ago

Hi @joachimsiaw

Thanks for your question!

Essentially, the total counts per spot are treated as independent from all the other data-generating variables in the LDA model. Therefore, there is no need to depth-normalize the total counts in each spot as there is for scRNA-seq data, for example. Additionally, LDA requires frequency counts of words or terms, specified as a matrix of non-negative integers, so transforming the values to non-integers would be incompatible.

We do, however, preprocess the data to remove poorly captured genes and low-quality spots. We also feature-select for overdispersed genes across spots as a proxy for cell-type-specific gene expression. It's possible that large variations in cell density, and thus total gene counts in spots, could affect the genes that are detected as being overdispersed.
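
To make that concrete, here is a minimal sketch of this preprocessing on a raw counts matrix `cd` (genes x spots); the thresholds are only illustrative and should be tuned to your data:

```r
library(STdeconvolve)

## how much does per-spot depth (a rough proxy for cell density) vary?
summary(Matrix::colSums(cd))

## remove poorly captured genes and low-quality spots
counts <- cleanCounts(cd, min.lib.size = 100)

## feature-select for overdispersed genes across spots
corpus <- restrictCorpus(counts, removeAbove = 1.0, removeBelow = 0.05)
dim(corpus)  ## overdispersed genes x retained spots
```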

I'll also add that we tested STdeconvolve on the same simulated dataset using different spot sizes (thus varying the cell density from roughly 1 to 20+ cells per spot) and observed that the accuracy was stable across spot resolutions. So it seems that cell density does not have a major effect on the deconvolution, as long as cell-type-specific groups of co-occurring genes are captured efficiently.

Hope this helps, Brendan

joachimsiaw commented 1 year ago

@bmill3r Thank you for your quick and insightful response. It is clear to me now why normalization is not needed for the deconvolution.

Can you comment on how the gene expression profiles of the topics are generated?

  1. For example, for each gene, does the expression value represent a mean or median expression across all spots?
  2. And if so, how do you think varying cell density could influence this?
  3. If the expression values are means or medians, don't you think this could lead to a loss of spatially constrained cell-cell communication information? I want to use the tool MERINGUE from your group to perform spatially-informed transcriptional clustering and I was wondering how this could be possible, in light of my question in 1 above. Thank you in advance.

bmill3r commented 1 year ago

Hi @joachimsiaw

The gene expression profiles of the deconvolved cell types are essentially probability distributions of each deconvolved cell type over the genes (and not means or medians). In the context of STdeconvolve, they can be thought of as the probability of a gene being expressed by a given cell type. I would recommend checking out some background on LDA for more information.
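
Concretely, a minimal sketch of how those profiles are pulled out of a fitted model, assuming `optLDA` is the model selected with optimalModel(); the perc.filt and betaScale values are illustrative:

```r
library(STdeconvolve)

results <- getBetaTheta(optLDA, perc.filt = 0.05, betaScale = 1000)

deconProp <- results$theta  ## spots x cell types: proportions in each spot
deconGexp <- results$beta   ## cell types x genes: gene probability profiles

## each row of beta is a (scaled) probability distribution over genes,
## so with betaScale = 1000 the rows sum to roughly 1000
rowSums(deconGexp)
```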

Hope this answers your question, Brendan

JEFworks commented 1 year ago

Hi everyone,

This blog post and accompanying video walking through a simulation-based approach for exploring why we don't need normalization with STdeconvolve may be useful for you as you explore these interesting questions in the context of your own research pursuits: https://jef.works/blog/2023/05/04/normalization-clustering-vs-deconvolution/

Hope it helps, Jean