MarioniLab / scran

Clone of the Bioconductor repository for the scran package.
https://bioconductor.org/packages/devel/bioc/html/scran.html

scran normalize #101

Open jayypaul opened 2 years ago

jayypaul commented 2 years ago

Hello,

I have a heterogeneous dataset consisting of stroma and immune cells. For now, I'm interested in the stroma cells, and I was wondering if running scran again after subsetting to the compartment of interest would lead to more accurate size factor estimation, since heterogeneous data can produce negative estimates in some cases (which I witnessed but was able to address).

After subsetting, I have this many cells per sample:

[image: table of cell counts per sample after subsetting]

I would imagine that this could be a problem as well. Prior to subsetting, I had this many cells per sample: [image: table of cell counts per sample before subsetting]

I've also read that a low number of cells per sample can be problematic for scran normalization, but I'd like to get the authors' opinion on the better route forward: run normalization prior to subsetting, or re-run it on the subset?

Thanks!

LTLA commented 2 years ago

If you're considering analyzing each sample separately, then yes, the small number of cells in some of the samples will make normalization difficult. More specifically, this will introduce some instability into the estimates; the question is whether that instability is offset by the (assumed) improvement in accuracy once the heterogeneity is out of the picture.

Having said that, if you've already subsetted down to the stroma cells and the subpopulations within the stroma subset are reasonably similar, you could just go with library size normalization (e.g., scuttle::librarySizeFactors). The expectation would be that there aren't large composition biases that would motivate the use of scran's pooling normalization in the first place.
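A minimal sketch of that route, assuming `sce` is your SingleCellExperiment and `is_stroma` is a logical vector marking the stroma cells (both hypothetical names):

```r
library(scuttle)

# Subset to the stroma compartment (hypothetical objects 'sce' and 'is_stroma').
stroma <- sce[, is_stroma]

# Library size factors: simple scaling by the total counts per cell,
# adequate when composition biases are expected to be small.
sizeFactors(stroma) <- librarySizeFactors(stroma)

# Log-normalize using the stored size factors.
stroma <- logNormCounts(stroma)
```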

Alternatively, if you're analyzing all samples together and the batch effects are modest, you could run pooledSizeFactors on the set of all stroma cells. Any composition biases introduced by minor DE between batches would then be handled by the pooling normalization, while ensuring you have enough cells to get stable estimates.
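As a rough sketch of that second option, assuming `stroma` is the subsetted SingleCellExperiment containing stroma cells from all samples (a hypothetical name):

```r
library(scran)

# Cluster cells first so that pooling happens within roughly homogeneous
# groups; this guards against strong DE between groups/batches.
clusters <- quickCluster(stroma)

# Pooling-based size factors across all stroma cells.
sizeFactors(stroma) <- scuttle::pooledSizeFactors(stroma, clusters = clusters)

# Sanity check: negative or extreme factors suggest too few cells per pool
# or residual heterogeneity.
summary(sizeFactors(stroma))

stroma <- scuttle::logNormCounts(stroma)
```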