broadinstitute / Tangram

Spatial alignment of single cell transcriptomic data.
BSD 3-Clause "New" or "Revised" License
249 stars 50 forks

scRNAseq cells < spatial cells, curious about how mapping works #108

Closed jayypaul closed 7 months ago

jayypaul commented 10 months ago

Hello,

I have a spatial dataset that is twice as large as my single-cell dataset: 60,000 vs. 30,000 cells.

Attempting to map the cells crashes for me, but considering that my single cell data has multiple patient samples, I'm applying the clustering approach first which seems like a pseudobulk attempt to handle the batch effects in the single cell data. I understand that I could split the spatial dataset up and do imputation sequentially, which I can give a shot to compare.

So far so good, but I'm still curious about the use case where the algorithm is attempting to map single cells from scRNAseq to spatial when scRNAseq cells < spatial cells. When it's the opposite, a filter can be applied to find the optimal subset, and mapping occurs according to softmax probabilities. Philosophically that makes sense. However, when there are fewer cells to map than there are spatial spots, how does the algorithm handle this? Is it capable of mapping the same cell to multiple spots, and transferring that cell's gene expression info to those spots?

In this case, would it be theoretically better to split the spatial data and impute sequentially? I'm also curious about this idea from the cluster perspective, where I think it's even more relevant, since it averages expression across each cluster and then maps that info. In that case, I imagine the number of pseudobulks is always less than the number of spatial spots, so a given cluster maps to multiple spots as well?

If someone can shed light on how to think about this and how it might affect the results, I'd greatly appreciate it!

Thanks.

gaddamshreya1 commented 8 months ago

Hello @jayypaul, thank you again for your interest in Tangram!!

It's super cool that you used the cluster mode to tackle huge datasets! The other way you tried, where you split the spatial data, is a recommended method too; we have used this in the past to map large sections. It would be interesting to see the comparison between mapping using these two methods. Coming to your question: Tangram always maps the same cells to multiple spots. That's also the rationale behind clusters mode, since in that case we would have one cell per cell type.
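To make the "same cell, multiple spots" point concrete, here is a minimal NumPy sketch. It is not Tangram's code; it assumes the mapping is a cells-by-spots matrix whose rows are softmax probabilities, and that spot expression is recovered by projecting cell expression through that matrix. The sizes, the random logits, and the helper name `softmax_rows` are made up for illustration.

```python
import numpy as np

def softmax_rows(logits):
    """Row-wise softmax: each cell's row becomes a probability
    distribution over spots (hypothetical helper, not Tangram's API)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_cells, n_spots, n_genes = 3, 6, 4           # fewer cells than spots

S = rng.random((n_cells, n_genes))            # toy cells x genes expression
logits = rng.normal(size=(n_cells, n_spots))  # stand-in for learned parameters
M = softmax_rows(logits)                      # cells x spots mapping probabilities

# Every cell has nonzero probability at every spot, so each spot receives
# a weighted mix of cells even though there are fewer cells than spots.
projected = M.T @ S                           # spots x genes
```

Under these assumptions nothing requires a one-to-one assignment: a single cell can contribute expression to several spots, weighted by its row of probabilities.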

I hope this answers your question! I'm tagging @Hejin0701 here as his thoughts on this would be great!!

Hejin0701 commented 8 months ago

Hi @jayypaul, thank you for your question. As you and @gaddamshreya1 mentioned, splitting the spatial data can be a good approach when scRNAseq cells < spatial cells. However, how to split the spatial data can be nontrivial: splitting out a small subsection with a cell type composition vastly different from the scRNA-seq data can have a negative effect on Tangram mapping.

A potentially better way is to preprocess the scRNA-seq data before mapping. The preprocessing step can be similar to the cluster-level mapping mentioned in your question. For example, suppose there are 10,000 scRNA-seq cells clustered into 50 cell types. In the Tangram cluster-level mapping code, we don't feed only 50 averaged cells into the mapping. Instead, with pseudobulking we are still mapping 10,000 cells to space, but now each cell's gene expression is the average gene expression of the cells belonging to its cell type. (For example, if cell type A has 20 cells, you can think of it as mapping 20 copies of cell type A to spatial.) Thus, for scRNAseq cells < spatial cells, you can simply create duplicates of the scRNA-seq cells before mapping. If spatial cells ~= 2 * scRNA-seq cells, making one duplicate of every cell in the scRNA-seq data will work.
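The two preprocessing ideas above can be sketched on a plain expression matrix. This is only an illustration in NumPy under stated assumptions, not the Tangram cluster-mode implementation: the function names `cluster_average_per_cell` and `duplicate_cells`, and the toy data, are hypothetical.

```python
import numpy as np

def cluster_average_per_cell(X, labels):
    """Replace each cell's expression with the mean of its cluster,
    keeping one row per original cell (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    out = np.empty_like(X)
    for lab in np.unique(labels):
        mask = labels == lab
        out[mask] = X[mask].mean(axis=0)  # broadcast the cluster mean to its cells
    return out

def duplicate_cells(X, labels, copies=2):
    """Repeat every cell `copies` times so the number of cells
    approaches the number of spatial cells (hypothetical helper)."""
    X_dup = np.repeat(np.asarray(X), copies, axis=0)
    labels_dup = np.repeat(np.asarray(labels), copies)
    return X_dup, labels_dup

# Toy data: 3 cells x 2 genes, two cells of type A and one of type B.
X = np.array([[1.0, 0.0],
              [3.0, 2.0],
              [0.0, 4.0]])
labels = np.array(["A", "A", "B"])

X_avg = cluster_average_per_cell(X, labels)               # still 3 rows
X_dup, labels_dup = duplicate_cells(X, labels, copies=2)  # 6 rows
```

With real data the same duplication would be applied to the single-cell object itself, and the duplicated cells would need unique names before mapping; that bookkeeping is left out of this sketch.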

jayypaul commented 7 months ago

Hi @Hejin0701, thanks for the explanation. This is very helpful advice for making the best use of the algorithm. Hope all is well, thanks everyone!