normalization with 5000 or 10000 filtered genes

abasseville commented 6 years ago

Hi, thank you very much for your package and your tutorial. It is very useful and very well explained. I have a question about the normalization step in the zinbwave function.

I followed your tutorial with the public data you used from https://www.bioconductor.org/help/course-materials/2017/BioC2017/Day2/Workshops/singleCell/doc/workshop.html, but I tried it with different numbers of filtered genes for core (1000, 5000, 10000, 15000), and ran the following line for each:

print(system.time(se <- zinbwave(core, K = 5, X = "~ Batch",
residuals = TRUE, normalizedValues = TRUE)))

When I looked at boxplots of normalized expression measures (deviance residuals) color-coded by batch, I saw differences between batches when the number of filtered genes is greater than 1000 (see attached pdf).

I also tried with my own samples and saw exactly the same thing when I changed the number of filtered genes. The gene signature analysis that I performed afterwards was clearly biased by the normalization when I used 10000 filtered genes.

This is my first time doing single-cell analysis, so I don't know whether this “issue” with normalization is usual or whether there is something I have forgotten to do.

Thanks

normalization_zinbwave2.pdf

drisso commented 6 years ago

Hi,

the "normalized values" that are computed by zinbwave are only for visualization purposes. The fact that the residuals show some batch effects is somewhat problematic, and may indicate that there are some non-linear batch effects that remain after regressing out the batch indicator.

I'm not sure why including more genes would result in such residual batch effects, but it may just reflect different groups of genes having a different signal-to-noise ratio.

What does the W matrix look like? Does it exhibit batch effects? If not, and since the rest of the analysis is based on W, it's probably not too much of a problem that the residuals are not perfect.
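As a rough sketch of one way to check W numerically rather than by eye: regress each column of W on the batch label and look at the R². The helper `batch_r2` and the simulated matrix are hypothetical and only there to make the snippet run on its own. With a recent zinbwave release, W would presumably come from `reducedDim(se, "zinbwave")`; older releases stored the W columns in `colData(se)`.

```r
# Hypothetical helper: fraction of variance in each latent factor
# (column of W) that is explained by the batch label.
batch_r2 <- function(W, batch) {
  batch <- factor(batch)
  apply(W, 2, function(w) summary(lm(w ~ batch))$r.squared)
}

# Simulated stand-in for W (60 cells, K = 2) so the sketch is self-contained.
set.seed(1)
batch <- rep(c("b1", "b2", "b3"), each = 20)
W_sim <- cbind(
  W1 = rnorm(60),                        # no batch signal
  W2 = rnorm(60) + 3 * (batch == "b2")   # strong shift in batch b2
)

r2 <- batch_r2(W_sim, batch)
print(round(r2, 2))

# With real output one would use something like (assumption, not run here):
# W <- SingleCellExperiment::reducedDim(se, "zinbwave")
# batch_r2(W, SummarizedExperiment::colData(se)$Batch)
```

An R² close to 0 for every column suggests W is free of a linear batch effect; a large value for any column is worth plotting.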

You may also want to look for covariates that correlate with batches (e.g., number of detected genes) and see if including those in the model helps.
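A minimal sketch of that check, using simulated counts in place of `assay(core)`. The variable name `detected` and the simulated dropout rates are assumptions for illustration, and the final zinbwave call is shown only as a comment because it has not been run here.

```r
# Simulated counts (genes x cells) standing in for the real count matrix.
set.seed(2)
n_genes <- 500
batch <- factor(rep(c("b1", "b2"), each = 30))

# Assumption for illustration: cells in batch b2 have a higher dropout rate.
p_zero <- ifelse(batch == "b2", 0.7, 0.4)
counts <- sapply(p_zero, function(p) {
  rnbinom(n_genes, mu = 5, size = 1) * rbinom(n_genes, 1, 1 - p)
})

# Number of detected genes per cell (genes with at least one count).
detected <- colSums(counts > 0)

# Does this covariate differ between batches?
print(tapply(detected, batch, median))
fit <- summary(lm(detected ~ batch))
print(fit$r.squared)

# If it does, it could be added to the design (assumption, not run here):
# core$detected <- scale(colSums(SummarizedExperiment::assay(core) > 0))[, 1]
# se <- zinbwave(core, K = 5, X = "~ Batch + detected",
#                residuals = TRUE, normalizedValues = TRUE)
```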

abasseville commented 6 years ago

Hi, thank you for your answer. Indeed, the W matrix looks fine, and the "0 counts" genes are the ones inducing this batch effect in the normalized values. Why the number of detected genes is linked to the sample is a mystery (it is the case for the samples in your tutorial as well as for the samples I used from GEO (GSE75688); see the attached file below). However, I can't rely on the W matrix alone: I run gene signature algorithms such as ESTIMATE or PAM50 on my samples, and these require normalised expression data for a certain number of genes.

normalization_zinb_3.pdf

drisso commented 6 years ago

If your goal is normalization, you can have a look at scone, which is specifically designed for normalization (unlike zinbwave).

abasseville commented 6 years ago

ok, thank you for your help and your time! I really appreciate it.

drisso / zinbwave

normalization with 5000 or 10000 filtered genes #29