Oshlack / splatter

Simple simulation of single-cell RNA sequencing data
GNU General Public License v3.0

Increased heterogeneity is only described on first PC #172

Open vali301s opened 1 month ago

vali301s commented 1 month ago

Hi,

thank you for the development of Splatter, it's been very exciting exploring your package so far.

I have been using Splatter to simulate data for a single group with varying heterogeneity. To set the heterogeneity level I adjust the BCV parameters (for higher heterogeneity: increase bcv.common and decrease bcv.df). In the following picture you can see that the cells are more dispersed in the PCA plots at higher heterogeneity, as expected. However, when I plotted the elbow plots (below each PCA plot), I noticed that the increase in heterogeneity comes mainly from the first PC.

[Screenshot 2024-08-13 105004]

This looks super unnatural to me and I have never seen it in real scRNA-seq data. Do you know why this is happening? And despite this, do you think I can still use the datasets I have created, i.e. is it a problem that the elbow plot looks like this?

PS: Apart from changing only the BCV parameters, I also estimated the parameters from real data: immune cells with low heterogeneity (naive T/B cells) and high heterogeneity (macrophages/monocytes), and simulated new scRNA-seq datasets with those parameters. Once again, the additional heterogeneity of the macrophages and monocytes is described mostly by PC1. Since the elbow plot of the simulated macrophages/monocytes (estimated from real data) looks like this, it really seems that it's a feature of Splatter to describe the heterogeneity on only the first PC...

[Screenshot 2024-08-13 110415]

Thank you very much in advance.

lazappi commented 1 month ago

Hi @vali301s

Thanks for giving {splatter} a go. Modifying the variation within a single population hasn't come up very often, and it may be something the splat simulation doesn't do very well. As you have seen, the bcv parameters have some effect, but maybe not the one you would like; introducing enough different kinds of variation is something many simulations struggle with. I would be curious to see what this looks like in real data, though. If you subset a real dataset to a population of similar cells, do you see a similar effect in the PCA?
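For reference, a minimal sketch of how the effect of the bcv parameters on the elbow plot could be compared. It assumes {scater} is used for normalisation and PCA (the thread does not say which tools were used), `pc_variance` is a hypothetical helper, and the gene/cell counts and BCV values are arbitrary illustrations rather than recommendations.

```r
library(splatter)
library(scater)

# Hypothetical helper: simulate one group with the given BCV settings and
# return the percentage of variance explained by each of the first 20 PCs.
pc_variance <- function(bcv.common, bcv.df, seed = 1) {
    params <- newSplatParams(nGenes = 2000, batchCells = 500, seed = seed)
    params <- setParams(params, bcv.common = bcv.common, bcv.df = bcv.df)
    sim <- splatSimulate(params, method = "single", verbose = FALSE)
    sim <- logNormCounts(sim)             # adds the "logcounts" assay
    sim <- runPCA(sim, ncomponents = 20)  # PCA on the most variable genes
    attr(reducedDim(sim, "PCA"), "percentVar")
}

# Illustrative "low" and "high" heterogeneity settings (assumed values)
low  <- pc_variance(bcv.common = 0.1, bcv.df = 60)
high <- pc_variance(bcv.common = 0.5, bcv.df = 10)

# Side-by-side elbow values for the first five PCs
round(cbind(low = low, high = high)[1:5, ], 2)
```

The same `percentVar` attribute can be read after running `runPCA()` on a subset of real cells, which would give a like-for-like comparison with the simulated elbow plots.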

An alternative approach that has been used previously is to simulate a single path rather than one homogeneous group. This gives you access to more parameters, which you can manipulate to get something closer to what you want, for example by reducing the amount of differential expression along the path so that it gives your cells some variation but not enough to create two separate populations.
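A rough sketch of the path approach, assuming a single path that starts from the origin. The specific values for `de.prob`, `de.facLoc` and `de.facScale` are illustrative guesses chosen to be below the package defaults, not tuned settings.

```r
library(splatter)

# One short path instead of one homogeneous group.
# de.prob, de.facLoc and de.facScale are kept small so that the path adds
# gradual variation without splitting the cells into two populations.
params <- newSplatParams(nGenes = 2000, batchCells = 500, seed = 1)
sim <- splatSimulate(
    params,
    method = "paths",
    de.prob = 0.05,      # fraction of genes that change along the path
    de.facLoc = 0.05,    # location of the log-normal DE factors
    de.facScale = 0.2,   # scale of the log-normal DE factors
    path.from = 0,       # the path starts at the origin
    path.nSteps = 100,   # number of steps along the path
    verbose = FALSE
)

# Each cell records its position along the path in the "Step" column
table(cut(colData(sim)$Step, breaks = 4))
```

Raising or lowering `de.prob` is probably the simplest knob to try first, checking the PCA after each change to see whether the cells still form one population.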