Closed alexandruioanvoda closed 3 years ago
Hi @alexandruioanvoda
Thanks for giving splatter a go 🎉! I think your intuition of what is going on is pretty close. You are correct that the Splat simulation has concepts of cell "group" (type) and "batch" but not really "condition". This is something I'm hoping to add but it's probably still a while off.
Depending on your use case you might be able to use "batch" as your condition, but this is a fairly random effect. I would also suggest having a look at the SplatPop simulation https://www.bioconductor.org/packages/release/bioc/vignettes/splatter/inst/doc/splatPop.html. This allows for a more complex condition structure and @azodichr might be able to help with suggesting parameters for your situation.
Hope that helps!
Dear @lazappi ,
Thank you for your reply!
On this question of continuous variance between cells within a group, I was wondering: There are several matrices in the splatter package's outputs, namely
[1] "BatchCellMeans" "BaseCellMeans" "BCV" "CellMeans" "TrueCounts" "DropProb"
[7] "Dropout" "counts" "logcounts"
BatchCellMeans
should only have 1 column (cell-type), repeated for ncol=number_of_simulated_cells (if there's no batch-effect simulation, of course).
BaseCellMeans
, on the other hand, does have mostly unique gene expressions for each cell individually. However, all the variation present there, is simulated into splatter only through library size differences between cells in a group, right? Arguably, biological variance would exist between cells in a group regardless of library size, and not necessarily due to a condition (e.g. stimulation with some cytokine), but also due to temporal variability (i.e. some cycling genes) and to other essentially random factors too.
Would there be a way to simulate such small deviations of true expression (independent of library size) in the BaseCellMeans
without going for specific conditions (e.g. some binary treatment), but rather going for an additional rnorm
factor for each gene within the cell group (or some other distribution that we may expect)?
Sorry for adding a question on top of this, although it is related:
Where are the ground truth expressions (before adjustments for different library sizes, so not BaseCellMeans
) stored for the cells in a path simulation?
What I mean by this is whether there's a way to get the cell.props.gene
matrix from this splatter function: https://github.com/Oshlack/splatter/blob/bef776a105d49621545d237cf671f4de990d8b0d/R/splat-simulate.R#L530
And if we'd get it, would it show variation in gene expression between cells in the same group?
BatchCellMeans
should only have 1 column (cell-type), repeated for ncol=number_of_simulated_cells (if there's no batch-effect simulation, of course).
BatchCellMeans
is the base gene mean multiplied by the batch factor for each cell. If there is only one batch all the factors are 1 so it will just be the gene mean repeated for each cell.
BaseCellMeans
, on the other hand, does have mostly unique gene expressions for each cell individually. However, all the variation present there, is simulated into splatter only through library size differences between cells in a group, right? Arguably, biological variance would exist between cells in a group regardless of library size, and not necessarily due to a condition (e.g. stimulation with some cytokine), but also due to temporal variability (i.e. some cycling genes) and to other essentially random factors too.
BaseCellMeans
is BatchCellMeans * DEFactors * LibrarySizeFactors
for each cell. The DEFactors
come from cell group or path (depending on what has been simulated). So it incorporates variability from batch, cell state and library size.
Would there be a way to simulate such small deviations of true expression (independent of library size) in the
BaseCellMeans
without going for specific conditions (e.g. some binary treatment), but rather going for an additionalrnorm
factor for each gene within the cell group (or some other distribution that we may expect)?
I would argue that this is what the BCV step is designed to do this, where the means for each cell are resampled with additional variability. This should mean that even if two cells have the same batch, group, library size etc. they will have different gene means before the final sampling process.
If you did want to manually add more variability for whatever reason it should be possible to do it by:
BaseCellMeans
(or whichever stage you want):::
I'm open to arguments if you still think this doesn't cover what might be happening. Cell cycle is a special case which would need to be treated differently I think, but open to that as well.
Where are the ground truth expressions (before adjustments for different library sizes, so not
BaseCellMeans
) stored for the cells in a path simulation?What I mean by this is whether there's a way to get the
cell.props.gene
matrix from this splatter function:And if we'd get it, would it show variation in gene expression between cells in the same group?
The line you pointed to is in the splatSimGroupCellMeans()
so won't be used in paths simulation but there is an equivalent line in splatSimPathCellMeans()
.
These three lines do the library size adjustment:
You can see where some of the variables come from at the start of the function:
Undoing these steps would get you back to cell.means.gene
, which is the cell means matrix after batch effects and path/group effects but before library size adjustment. Possibly it would be helpful to have this or cell.facs.gene
stored in the object but I think I was trying to limit the number of saved matrices so skipped this one. The BCV step is later though so each cell from the same batch and group/(path + step) would still have the same means at this stage (probably why I skipped storing the matrix, ok for groups but probably not for paths).
Hopefully that helps. Happy to clarify further if that's still not clear 😸.
Thanks for the help! Very useful!
Hi! Your package is great! I really hope I can use it for my purposes too.
In the simulations I'd be looking to run, cells are attributed to a cell-type group, but they can also have true gene expression heterogeneity (due to biology, not due to mixed factors like library size) within said group as well! From my initial skimming of the docs and examples, it doesn't seem like splatter does this by default.
From my understanding,
BaseCellMeans
have differences in gene expression between cells only because of:Is there any recommended way to simulate some heterogeneity between single cells in a group (due to say, some being slightly more hypoxic than others, but still remaining in the same group, same markers, etc.), and to have the underlying ground truth on hand?