How to simulate cells from multiple biological samples

eleanorahowe commented 7 years ago

Hi there, I'm writing to ask for your advice on the best way to use the Splatter package to simulate single-cell data from multiple biological samples. I'd like to determine how many tissue samples I need in order to get enough power to detect differential expression between a given cell type taken from two groups of tissues. Splatter seems beautifully designed for simulating cells of a single type or of multiple types, but seems to only be able to simulate them coming from one source of origin. How can I add another layer of variability that comes from the biological differences between animals that a tissue might be isolated from?

lazappi commented 7 years ago

Hi Eleanor

Thanks for giving Splatter a go, I hope it can be useful for your work.

As you have probably gathered when designing the Splat simulation we had in mind a single sample with multiple cell types. As more people more people are giving it a go it is clear that there is a demand for more complex experimental designs.

In the development version we have added an additional layer of "batch" effects. These are designed to represent multiple samples of the same tissue type. I think this would work for your scenario, you can use the batch to represent the two samples and the groups to represent cell types in those samples.

If you want to give it a go you can install the development version using devtools::install_github("Oshlack/splatter"). You should notice that a few of the parameters have been changed to allow for the batches, hopefully the documentation makes it clear but let me know if you have any problems. Please be aware that I will probably be making changes in the next week or so that will make the development version incompatible with the current Bioconductor version (3.5) in preparation for the next release in October. If you want to make sure you install the development version as it is now use devtools::install_github("Oshlack/splatter@v1.1.4").

Hope that helps, good luck with your project

Luke

eleanorahowe commented 7 years ago

Luke, thanks so much for the swift reply. I will definitely check out the dev version - I'm so glad to hear that I'm not alone in my interest in multiple samples, and that you're so on top of what the community needs.

Good luck with development, and I'm sure we'll talk again soon.

lazappi commented 7 years ago

I'm going to close this issue but feel free to comment/reopen/open a new one if you have any more questions.

eleanorahowe commented 7 years ago

Hello there, Luke,

It turns out I do have a question - how do I go about choosing a value for the batch.facLoc and batch.facScale parameters?

I don't think that splatEstimate is trying to estimate those values, which makes sense since it has no way of knowing which of the cells in my training data belong to which tissue samples.

If I need to specify those values manually, how do I calculate them from training data, and if I can get splatEstimate to do the estimation for me, how do I tell splatEstimate about which tissue sample each cell came from?

Eleanor

lazappi commented 7 years ago

Hi Eleanor

You are correct, splatEstimate doesn't know about samples so it currently can't estimate the batch parameters.

I haven't really given much thought about how you would estimate them. Personally I have just adjusted them until a PCA plot looks like what I want, but obviously that isn't very scientific.

The batch effects try to replicate the global difference between samples so the first idea that comes to mind would be to do something like this:

Create a "meta-cell" for each sample by averaging expression for each gene across all the cells (possibly after simple library size normalisation)
Subtract one meta-cell from the other to get the differences (kind of like the batch effect factors in the simulation)
Fit a log-normal distribution to the (absolute) differences

I think that should be a reasonable approach but I haven't tried it yet. If you are able to give it a go and report back that would be great and if it works I will look at including it in splatEstimate. I would be interested in other ideas you might have as well, or if that doesn't make sense I am happy to discuss it further.

eleanorahowe commented 7 years ago

Hi Luke,

I think your suggestion here sounds real reasonable. I'll give it a shot.

I think this would work as-is if I have two samples only, but if I have three or more a couple of questions get raised:

For step 2, there, if I've got more than two samples, I think I'd take the mean of the meta-cells and subtract each meta-cell from that, then use those absolute values in step 3. Sound good?

The for step 3, when the fit is done, I'd just collect all of those differences together in one vector to fit the lognormal distribution to. I wouldn't try to fit different distributions to each set of differences or anything. It seems ok to me, does it seem ok to you?

Eleanor

eleanorahowe commented 7 years ago

Hi Luke,

Another question for you: how do you normally handle estimating a distribution from the data, given that it's so zero-inflated? Do you just add say, 10e-16 to each value or do you perhaps filter out the zeroes?

Eleanor

lazappi commented 7 years ago

Your approach for more than two groups sounds pretty good. That's similar to the way the simulation works, the batches are done by adding the factors to a fictional base cell so the average of the meta-cells should be a reasonable approximation of that.

When fitting means we filter out genes that are all zero, so I think it would be reasonable to remove any differences that are zero. It be worth playing around with a couple of approaches and seeing how they affect the fit. For doing the fitting itself I usually use the functions in the fitdistrplus package.

eleanorahowe commented 7 years ago

Hi Luke,

Right now I'm taking the mean of all the cells from each tissue sample, and subtracting that from the mean of the measurements of all tissue samples. That means I've got a matrix of n x p where n = number of genes and p=number of tissue samples. Next it seems right to combine the differences in each tissue sample into a single average, so that in the end I model a distribution with one measurement per tissue sample. Does this make sense to you?

Eleanor

lazappi commented 7 years ago

Yes, that sounds good to me. I'm not sure if you have thought about how to combine the vectors of differences but there are a couple of ideas that come to mind. One is to average them and the other is to just concatenate them (treat it as one big vector of differences). I think the second approach might work better, but it would need to be checked.

Alternatively both the batch.facLoc and batch.facScale can be supplied as vectors with different values for each "batch". To do that I would fit a log-normal to each column of your n x p matrix of differences. I think this would be a better approach as it some batches to be more different than others.

I hope that answers your question, happy to discuss it more if you need. I think we are heading in the right direction, if it looks like it is working I would be keen to add it to the splatEstimate function.

eleanorahowe commented 7 years ago

Hi Luke,

Ah, I didn't realize those parameters could be specified with a vector! That's really good to know. Very interesting. Thank you for calling that out!

I've got two questions for you about modeling generally:

First question: Do you know enough about the distributions of expression in scRNA-seq vs bulk RNA-seq to know whether we can use the distribution of expression from bulk experiments to generate biological variance parameters to put into Splatter? I'm currently doing a lit search on this but I'm not having much luck. If you've seen something I'd be grateful for the insight.

Second question: as I understand it right now, Splatter models batch effects as a value that is uniformly added or subtracted to all genes in cells from that batch. But in the case of animal-to-animal variation, we would expect that the batch effect to be different on each gene: some genes will be higher in one animal than another, and some lower. Do you have plans to incorporate that kind of variability or should I plan on adding that in after-the-fact?

And I agree, we're getting somewhere here and I'd love to help you get interesting new functionality into the Splatter package.

lazappi commented 7 years ago

First question: My gut feeling is that bulk RNA-seq will be much less variable than scRNA-seq so the estimates might not transfer very well, but I'm can't think of any studies that have looked at that in much detail.

Second question: Each batch has it's own log-normal distribution and then we choose a factor from that distribution for each gene. So the way it works now it is possible for the batch effect to increase the expression of a gene in one sample (batch) and decrease the expression of the same gene in another. So let's say we have two samples and four genes the batch effect factors could look something like:

        Batch1  Batch2
Gene1     0.1     0.5
Gene2     1.2     1.4
Gene3     1.1     0.9
Gene4     0.8     1.3

Where some genes have a positive effect in one batch and negative in another (a factor of 1 would be no effect). These would then be multiply the means for each gene before generating counts. Does that answer you question? I feel like I might have misunderstood what you were getting at?

eleanorahowe commented 7 years ago

Hi Luke,

For the first question, yeah, I think you may be right about that. I'm still going to do some digging but I'm not super optimistic.

For the second question - I think you understand me, and I hope I understand you as well. It makes more sense now that I realize that you can provide Splatter with a logmean and logsd for each sample.

So what I am planning to do is this: I'll fit a lognormal distribution to the mean gene counts of each of my input samples. I have only five samples, so I'm fitting five distributions. Now, I could easily just put those five logmean/logsd values into Splatter as my biological variance numbers and generate another five simulated batches, and I bet that would work great. However, the whole point of this exercise for me is to simulate more samples, and figure out how many samples I really need to do the differential expression tests I want.

So what I need to do is create more logmean/logsd values for more batches. To do that I think I want to fit a distribution to the five logmean and five logsd values I got from my input data. Then from that distribution I can create as many logmean/logsd pairs I want to give to Splatter and make as many batches as I want. Does that sound right? If I fit a distribution on those logmean/logsd values, do I expect that to also be a lognormal distribution?

lazappi commented 7 years ago

That sounds like a reasonable approach to generate more values. I'm not sure how those values should be distributed, I think that's reaching the limit of my statistical knowledge. The problem I can see is fitting any distribution with only five point. Perhaps an alternative would be to use the values you have will some random perturbation? So something like new_mean = sample(estimated_means) + N(0, 1).

eleanorahowe commented 7 years ago

I think you're right - the low number of samples is going to be a challenge. I'm going to be looking around for other datasets I can use that might have more of them. We'll see!

lazappi commented 6 years ago

Hi @eleanorahowe. I know this was a long time ago now but just wondering if you had any luck with a method for estimating the batch effects. The deadline for updates before the next Bioconductor release is coming up pretty soon and if you have something that works I'd be keen to incorporate it into Splatter.

eleanorahowe commented 6 years ago

Hi Luke,

No, unfortunately, we weren't able to come up with a method we liked. Thank you for asking, and good luck with the new release!

Eleanor

On Mon, Apr 9, 2018, 2:16 AM Luke Zappia notifications@github.com wrote:

Hi @eleanorahowe https://github.com/eleanorahowe. I know this was a long time ago now but just wondering if you had any luck with a method for estimating the batch effects. The deadline for updates before the next Bioconductor release is coming up pretty soon and if you have something that works I'd be keen to incorporate it into Splatter.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Oshlack/splatter/issues/22#issuecomment-379645058, or mute the thread https://github.com/notifications/unsubscribe-auth/AAzaTUdlfvIaoczlW4tDbDJloOoxuyH0ks5tmvy4gaJpZM4PQmk2 .

lazappi commented 6 years ago

Ok, thanks for getting back to me. I'm going to close this now but feel free to get back in touch if you have any other questions/issues with Splatter.

Oshlack / splatter

How to simulate cells from multiple biological samples #22