Oshlack / splatter

Simple simulation of single-cell RNA sequencing data
http://oshlacklab.com/splatter/
GNU General Public License v3.0
215 stars 57 forks source link

How to simulate cell state heterogeneity too #117

Closed alexandruioanvoda closed 3 years ago

alexandruioanvoda commented 3 years ago

Hi! Your package is great! I really hope I can use it for my purposes too.

In the simulations I'd be looking to run, cells are attributed to a cell-type group, but they can also have true gene expression heterogeneity (due to biology, not due to mixed factors like library size) within said group as well! From my initial skimming of the docs and examples, it doesn't seem like splatter does this by default.

> var(colSums(sim2@assays@data@listData[["BatchCellMeans"]]))
[1] 0
> var(colSums(sim2@assays@data@listData[["BaseCellMeans"]]))
[1] 164078003
> var(colSums(sim2@assays@data@listData[["CellMeans"]]))
[1] 165348343

From my understanding, BaseCellMeans have differences in gene expression between cells only because of:

  1. The cell group
  2. The library size

Is there any recommended way to simulate some heterogeneity between single cells in a group (due to say, some being slightly more hypoxic than others, but still remaining in the same group, same markers, etc.), and to have the underlying ground truth on hand?

lazappi commented 3 years ago

Hi @alexandruioanvoda

Thanks for giving splatter a go 🎉! I think your intuition of what is going on is pretty close. You are correct that the Splat simulation has concepts of cell "group" (type) and "batch" but not really "condition". This is something I'm hoping to add but it's probably still a while off.

Depending on your use case you might be able to use "batch" as your condition, but this is a fairly random effect. I would also suggest having a look at the SplatPop simulation https://www.bioconductor.org/packages/release/bioc/vignettes/splatter/inst/doc/splatPop.html. This allows for a more complex condition structure and @azodichr might be able to help with suggesting parameters for your situation.

Hope that helps!

alexandruioanvoda commented 3 years ago

Dear @lazappi ,

Thank you for your reply!

On this question of continuous variance between cells within a group, I was wondering: There are several matrices in the splatter package's outputs, namely

[1] "BatchCellMeans" "BaseCellMeans"  "BCV"            "CellMeans"      "TrueCounts"     "DropProb"      
[7] "Dropout"        "counts"         "logcounts"

BatchCellMeans should only have 1 column (cell-type), repeated for ncol=number_of_simulated_cells (if there's no batch-effect simulation, of course). BaseCellMeans, on the other hand, does have mostly unique gene expressions for each cell individually. However, all the variation present there, is simulated into splatter only through library size differences between cells in a group, right? Arguably, biological variance would exist between cells in a group regardless of library size, and not necessarily due to a condition (e.g. stimulation with some cytokine), but also due to temporal variability (i.e. some cycling genes) and to other essentially random factors too.

Would there be a way to simulate such small deviations of true expression (independent of library size) in the BaseCellMeans without going for specific conditions (e.g. some binary treatment), but rather going for an additional rnorm factor for each gene within the cell group (or some other distribution that we may expect)?

alexandruioanvoda commented 3 years ago

Sorry for adding a question on top of this, although it is related:

Where are the ground truth expressions (before adjustments for different library sizes, so not BaseCellMeans) stored for the cells in a path simulation?

What I mean by this is whether there's a way to get the cell.props.gene matrix from this splatter function: https://github.com/Oshlack/splatter/blob/bef776a105d49621545d237cf671f4de990d8b0d/R/splat-simulate.R#L530

And if we'd get it, would it show variation in gene expression between cells in the same group?

lazappi commented 3 years ago

BatchCellMeans should only have 1 column (cell-type), repeated for ncol=number_of_simulated_cells (if there's no batch-effect simulation, of course).

BatchCellMeans is the base gene mean multiplied by the batch factor for each cell. If there is only one batch all the factors are 1 so it will just be the gene mean repeated for each cell.

BaseCellMeans, on the other hand, does have mostly unique gene expressions for each cell individually. However, all the variation present there, is simulated into splatter only through library size differences between cells in a group, right? Arguably, biological variance would exist between cells in a group regardless of library size, and not necessarily due to a condition (e.g. stimulation with some cytokine), but also due to temporal variability (i.e. some cycling genes) and to other essentially random factors too.

BaseCellMeans is BatchCellMeans * DEFactors * LibrarySizeFactors for each cell. The DEFactors come from cell group or path (depending on what has been simulated). So it incorporates variability from batch, cell state and library size.

Would there be a way to simulate such small deviations of true expression (independent of library size) in the BaseCellMeans without going for specific conditions (e.g. some binary treatment), but rather going for an additional rnorm factor for each gene within the cell group (or some other distribution that we may expect)?

I would argue that this is what the BCV step is designed to do this, where the means for each cell are resampled with additional variability. This should mean that even if two cells have the same batch, group, library size etc. they will have different gene means before the final sampling process.

If you did want to manually add more variability for whatever reason it should be possible to do it by:

  1. Running a simulation as normal
  2. Modifying BaseCellMeans (or whichever stage you want)
  3. Running the simulation stages after that by calling the internal splat simulation function using :::

I'm open to arguments if you still think this doesn't cover what might be happening. Cell cycle is a special case which would need to be treated differently I think, but open to that as well.

Where are the ground truth expressions (before adjustments for different library sizes, so not BaseCellMeans) stored for the cells in a path simulation?

What I mean by this is whether there's a way to get the cell.props.gene matrix from this splatter function:

https://github.com/Oshlack/splatter/blob/bef776a105d49621545d237cf671f4de990d8b0d/R/splat-simulate.R#L530

And if we'd get it, would it show variation in gene expression between cells in the same group?

The line you pointed to is in the splatSimGroupCellMeans() so won't be used in paths simulation but there is an equivalent line in splatSimPathCellMeans().

These three lines do the library size adjustment:

https://github.com/Oshlack/splatter/blob/bef776a105d49621545d237cf671f4de990d8b0d/R/splat-simulate.R#L623-L625

You can see where some of the variables come from at the start of the function:

https://github.com/Oshlack/splatter/blob/bef776a105d49621545d237cf671f4de990d8b0d/R/splat-simulate.R#L556-L557

Undoing these steps would get you back to cell.means.gene, which is the cell means matrix after batch effects and path/group effects but before library size adjustment. Possibly it would be helpful to have this or cell.facs.gene stored in the object but I think I was trying to limit the number of saved matrices so skipped this one. The BCV step is later though so each cell from the same batch and group/(path + step) would still have the same means at this stage (probably why I skipped storing the matrix, ok for groups but probably not for paths).

Hopefully that helps. Happy to clarify further if that's still not clear 😸.

alexandruioanvoda commented 3 years ago

Thanks for the help! Very useful!