Oshlack / splatter

Simple simulation of single-cell RNA sequencing data
http://oshlacklab.com/splatter/
GNU General Public License v3.0

Improvements to biological variation #97

Open lazappi opened 4 years ago

lazappi commented 4 years ago

I observe much higher dispersions in simulated data than in the underlying real data. As far as I understand, the underlying biological assumption is that each gene has an associated biological variation, captured by the negative binomial overdispersion parameter, which is a decreasing function of the gene's average expression. This is also why the splatter paper talks about gene-wise dispersion.
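For reference, the mean-variance relation behind this wording: under the negative binomial model, Var(Y) = mu + phi * mu^2, where phi is the dispersion and BCV = sqrt(phi) in edgeR's terminology. A quick base-R check (parameter values are purely illustrative):

```r
# NB mean-variance relation: Var(Y) = mu + phi * mu^2, BCV = sqrt(phi).
# Illustrative values only.
mu  <- 50
phi <- 0.09                                 # dispersion, i.e. BCV = 0.3
nb.var <- mu + phi * mu^2                   # theoretical variance

set.seed(1)
y <- rnbinom(1e5, mu = mu, size = 1 / phi)  # R's size = 1 / dispersion
c(theoretical = nb.var, empirical = var(y))
```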

That would mean that the proper generative model is to sample from the dispersion prior and then scale it by the gene's expression, where expression means the underlying biological expression (before library-size scaling).

But the way Splatter's model is set up, the sampled dispersion is scaled by the gene expression times the library size. I think this is where the additional dispersion comes from: the library size prior now also affects the NB dispersion parameter.

My question now is: am I completely misunderstanding how this is supposed to work? If not, I would be happy to open a pull request that moves the library size scaling to a later step.
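One way to see the concern is a rough base-R sketch (not splatter's actual code; all parameter values are made up): simulate counts with the BCV applied to the library-scaled mean, then estimate gene-wise NB dispersion from the raw counts by the method of moments. The estimates come out above the input bcv^2 because the library-size variation also ends up in the gene-wise variance.

```r
# Rough sketch, not splatter's actual code; all values illustrative.
set.seed(1)
nGenes <- 200; nCells <- 500
p <- rgamma(nGenes, shape = 0.6, rate = 0.3)
p <- p / sum(p)                                # expression proportions
L <- rlnorm(nCells, meanlog = 9, sdlog = 0.2)  # library sizes
bcv <- 0.3                                     # input BCV, so phi = 0.09

mu <- outer(p, L)                              # library-scaled means
M  <- matrix(rgamma(length(mu), shape = 1 / bcv^2, scale = mu * bcv^2),
             nrow = nGenes)
Y  <- matrix(rpois(length(M), M), nrow = nGenes)

# Method-of-moments NB dispersion per gene: phi = (var - mean) / mean^2
xbar <- rowMeans(Y)
phi  <- (apply(Y, 1, var) - xbar) / xbar^2

# The median estimate exceeds the input bcv^2 = 0.09: library-size
# variation inflates the raw gene-wise dispersion
median(phi, na.rm = TRUE)
```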

Originally posted by @ilia-kats in https://github.com/Oshlack/splatter/issues/71#issuecomment-625800536

lazappi commented 4 years ago

Hi @ilia-kats

Thanks for the comment and your thoughts on the problem. Excess variability has been observed in various places, so it is a known weakness of the model. I have looked at it a couple of times and been unable to come up with a good fix, so anything you can add would be much appreciated.

It sounds like you have a good understanding of this already so the next bit is mostly for me to get my head around it (and for anyone else who follows this conversation).

These are the (somewhat simplified) steps in the current Splat model:

  1. Sample a mean for each gene (m_i) from a Gamma distribution
  2. Sample a library size for each cell (L_j) from a log-normal distribution
  3. Adjust the gene mean for each cell using the library size (m_i,j = L_j * m_i / sum_i(m_i))
  4. Add variability to each gene based on the mean expression and sample a new mean (M_i,j)
  5. Sample counts from a Poisson distribution (Y_i,j = Poi(M_i,j))
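The steps above can be sketched in base R (a minimal illustration, not the actual splatter implementation; all parameter values are invented):

```r
# Minimal sketch of the simplified Splat steps; illustrative values only.
set.seed(1)
nGenes <- 100; nCells <- 50

# 1. Gene means from a Gamma distribution
m <- rgamma(nGenes, shape = 0.6, rate = 0.3)

# 2. Cell library sizes from a log-normal distribution
L <- rlnorm(nCells, meanlog = 9, sdlog = 0.2)

# 3. Library-size-adjusted mean for each gene/cell pair:
#    m_ij = L_j * m_i / sum(m)
m.ij <- outer(m / sum(m), L)

# 4. Add gene-wise variability by sampling a new mean from a Gamma
#    centred on m_ij (a single common BCV here for simplicity; splatter
#    draws gene-wise values with a scaled inverse chi-square factor)
bcv  <- 0.2
M.ij <- matrix(rgamma(length(m.ij), shape = 1 / bcv^2,
                      scale = m.ij * bcv^2),
               nrow = nGenes)

# 5. Counts from a Poisson distribution
Y <- matrix(rpois(length(M.ij), M.ij), nrow = nGenes)
dim(Y)  # nGenes x nCells
```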

I think what you are suggesting is that Step 4 should be performed before Step 3 (correct me if I'm wrong)? We considered this at the time and decided it made sense to apply the variability adjustment to the library-size-adjusted mean, but I can see how that could lead to excess variability. This should be a fairly easy change to make, but it would be nice to have some examples showing that it makes a difference.

Would you be willing to help out with that?

ilia-kats commented 4 years ago

Yes, that is exactly what I was saying. However, I have played around with this a little in the meantime, and after exchanging steps 3 and 4 the dispersions still don't match the real data. I was able to generate something realistic by additionally adjusting the trend function. To summarise, what I did (for the data set I'm currently working on) was:

assays(sim)$BaseCellMeans <- cell.props.gene

in splatSimSingleCellMeans, and

bcv <- (bcv.common + sqrt(1 / (base.means.cell * lib.loc))) * 0.5 *
    sqrt(bcv.df / rchisq(nGenes, df = bcv.df))
means.cell <- matrix(
    rgamma(as.numeric(nGenes) * as.numeric(nCells),
           shape = 1 / (bcv ^ 2),
           scale = t(t(base.means.cell) * exp.lib.sizes) * (bcv ^ 2)),
    nrow = nGenes, ncol = nCells)

in splatSimBCVMeans. After running edgeR on the simulated data and plotting the trended dispersions together with estimates from the real data, the curves were almost identical. This is very ad hoc; I was just trying to understand where the excess dispersion comes from. I think a proper solution would be to estimate the additional multiplication factor from edgeR's trended dispersion values, but I haven't come up with a good way to do that so far.
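The edgeR comparison described above can be sketched roughly like this, using edgeR's `estimateDisp()` interface (`sim.counts` and `real.counts` are placeholders for the simulated and real count matrices; synthetic data is generated here only so the snippet runs):

```r
library(edgeR)

# Placeholder matrices; substitute the simulated and real count data
set.seed(1)
sim.counts  <- matrix(rnbinom(200 * 50, mu = 20, size = 2), nrow = 200)
real.counts <- matrix(rnbinom(200 * 50, mu = 20, size = 5), nrow = 200)

fit.disp <- function(counts) {
    y <- DGEList(counts = counts)
    y <- calcNormFactors(y)
    estimateDisp(y)    # adds trended.dispersion and AveLogCPM
}

sim.fit  <- fit.disp(sim.counts)
real.fit <- fit.disp(real.counts)

# Overlay the trended BCV curves (BCV = sqrt(dispersion))
plot(real.fit$AveLogCPM, sqrt(real.fit$trended.dispersion),
     pch = 16, cex = 0.5,
     xlab = "Average log CPM", ylab = "Trended BCV")
points(sim.fit$AveLogCPM, sqrt(sim.fit$trended.dispersion),
       pch = 16, cex = 0.5, col = "red")
legend("topright", c("real", "simulated"), col = c("black", "red"),
       pch = 16)
```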

lazappi commented 4 years ago

Thanks for the update! I think this is about the point I got to before. The BCV equation/estimation probably needs to be modified in some way, but I wasn't able to come up with anything that generalised to different datasets. If you come up with anything that seems to work, I would be very keen to see it!