multiBatchNorm: Error in .rescale_size_factors(stats$averages, stats$size.factors, min.mean = min.mean) : median ratio of averages between batches is not finite

Yunuuuu commented 1 year ago

Hi, when I use multiBatchNorm to do normalization and adjust for the sequencing depth, I tried to figure out it by changing the argument min.mean, but it cannot work. And it seems this problem has never been reported since I googled and searched this in github without any result.

When we encountered this error, does it mean multiBatchNorm is not fit for out data and we should do something else like cosineNorm function?

Yunuuuu commented 1 year ago

Hi, @LTLA, by the way, some questions about the usage of multiBatchNorm after scaning the bioconductor support site and github. I'm trying to find the most right and practice workflow for multi-sample single cell analysis. Much doubts about the sequential order of multiBatchNorm and modeling variance (modelGeneVar)

from https://bioconductor.riken.jp/packages/3.8/workflows/vignettes/simpleSingleCell/inst/doc/work-5-mnn.html

Technically, we should have performed variance modelling and feature selection after calling multiBatchNorm(). This ensures that the variance components are estimated from the same values to be used in the batch correction. In practice, this makes little difference, and it tends to be easier to process each batch separately and consolidate all results in one step as shown above)

and from https://support.bioconductor.org/p/9145895/#9145905

# NOTE: not tested!
sce.sub <- sce[,some_kind_of_subset]
dec.sub <- modelGeneVar(sce.sub, block=sce.sub$batch)
sce.sub <- multiBatchNorm(sce.sub, batch=sce.sub$batch)
sce.sub <- fastMNN(sce.sub, batch=sce.sub$batch, subset.row=getTopHVGs(dec.sub))

and the same in the http://bioconductor.org/books/3.16/OSCA.multisample/integrating-datasets.html#slower-setup

All of these indicates it's much better to do modelGeneVar in the original normalized couns and then do multiBatchNorm for fastMNN, but the source code of quickCorrect function (https://github.com/LTLA/batchelor/blob/master/R/quickCorrect.R) runs modelGeneVar in the normalized counts of multiBatchNorm.

And from the comment of https://github.com/LTLA/batchelor/issues/16#issuecomment-613271986:

For findMarkers, I typically do my DE analyses on the normalized expression values before running multiBatchNorm. It probably doesn't make a big practical difference whether you run it before or after; I guess I do it before for the philosophical reason that I like my multi-batch DE comparisons to be computed off the same values that were used for the single-batch comparisons.

it seems we should do most operation in the original normalized counts but not the normalized counts by multiBatchNorm, in this way, does it more help to keep the original logcounts? Someting like this:

Incuding multiBatchNorm in fastMNN function since we most operate multiBatchNorm function with fastMNN function and we don't need the normalized counts for other downstream analysis (what we want mostly is the corrected matrix returned by fastMNN? Maybe wrong) and from the help page cosineNorm is secondary compared with multiBatchNorm.
changing the default behaviour of multiBatchNorm and returned an assay name other than "logcounts" by using norm.args = list(name = "mbncounts") ?

LTLA commented 1 year ago

Hi, when I use multiBatchNorm to do normalization and adjust for the sequencing depth, I tried to figure out it by changing the argument min.mean, but it cannot work. And it seems this problem has never been reported since I googled and searched this in github without any result.

If you recall how multiBatchNorm() works, it creates an average pseudo-cell for each batch by averaging together all the cells, and then it tries to normalize the average pseudo-cells across batches. The second step is done using a DESeq-like approach where we take the median of the ratios of the averages. This ensures that the scaling between batches is not overly affected by composition biases, and usually it works well because the average vectors are no longer sparse (especially after filtering for a minimum min.mean).

In your case, somehow one or the other average pseudo-cell has >50% of zeros, even after filtering for a minimum mean. This implies that one batch is completely silent for many genes that are expressed in another batch. I don't know why this might happen, but it seems pretty extreme; you might consider finding these genes and removing them, because it probably won't be sensible to include them in a batch correction step.

Yunuuuu commented 1 year ago

Thanks for your reply. I found this situation commonly when we mixed samples with different tissure types. I mixed blood samples and tumor samples, then I gained epithelial cells after QC, normalization, MNN, clustering and annotation, in this way, there will be small set of epithelial cells from blood (1-5 cells on average in each blood sample, I think this is false positive). subsequently, I subsetted epithelial cells and run multiBatchNorm and MNN. I thought the error was due to the small set of cells in one batch. After filtering batches with only a few cells, I run the workflow successfully, I don't know if this is correct.

By following your advice, is it good to use HVGs to run multiBatchNorm ?

LTLA / batchelor

multiBatchNorm: Error in .rescale_size_factors(stats$averages, stats$size.factors, min.mean = min.mean) : median ratio of averages between batches is not finite #43