Open Yunuuuu opened 1 year ago
Hi, @LTLA, by the way, some questions about the usage of multiBatchNorm
after scaning the bioconductor support site and github. I'm trying to find the most right and practice workflow for multi-sample single cell analysis. Much doubts about the sequential order of multiBatchNorm
and modeling variance (modelGeneVar
)
Technically, we should have performed variance modelling and feature selection after calling multiBatchNorm(). This ensures that the variance components are estimated from the same values to be used in the batch correction. In practice, this makes little difference, and it tends to be easier to process each batch separately and consolidate all results in one step as shown above)
and from https://support.bioconductor.org/p/9145895/#9145905
# NOTE: not tested!
sce.sub <- sce[,some_kind_of_subset]
dec.sub <- modelGeneVar(sce.sub, block=sce.sub$batch)
sce.sub <- multiBatchNorm(sce.sub, batch=sce.sub$batch)
sce.sub <- fastMNN(sce.sub, batch=sce.sub$batch, subset.row=getTopHVGs(dec.sub))
and the same in the http://bioconductor.org/books/3.16/OSCA.multisample/integrating-datasets.html#slower-setup
All of these indicates it's much better to do modelGeneVar
in the original normalized couns and then do multiBatchNorm
for fastMNN
, but the source code of quickCorrect
function (https://github.com/LTLA/batchelor/blob/master/R/quickCorrect.R) runs modelGeneVar
in the normalized counts of multiBatchNorm
.
And from the comment of https://github.com/LTLA/batchelor/issues/16#issuecomment-613271986:
For findMarkers, I typically do my DE analyses on the normalized expression values before running multiBatchNorm. It probably doesn't make a big practical difference whether you run it before or after; I guess I do it before for the philosophical reason that I like my multi-batch DE comparisons to be computed off the same values that were used for the single-batch comparisons.
it seems we should do most operation in the original normalized counts but not the normalized counts by multiBatchNorm
, in this way, does it more help to keep the original logcounts
? Someting like this:
multiBatchNorm
in fastMNN
function since we most operate multiBatchNorm
function with fastMNN
function and we don't need the normalized counts for other downstream analysis (what we want mostly is the corrected matrix returned by fastMNN? Maybe wrong) and from the help page cosineNorm
is secondary compared with multiBatchNorm
.multiBatchNorm
and returned an assay name other than "logcounts" by using norm.args = list(name = "mbncounts")
?Hi, when I use
multiBatchNorm
to do normalization and adjust for the sequencing depth, I tried to figure out it by changing the argumentmin.mean
, but it cannot work. And it seems this problem has never been reported since I googled and searched this in github without any result.
If you recall how multiBatchNorm()
works, it creates an average pseudo-cell for each batch by averaging together all the cells, and then it tries to normalize the average pseudo-cells across batches. The second step is done using a DESeq-like approach where we take the median of the ratios of the averages. This ensures that the scaling between batches is not overly affected by composition biases, and usually it works well because the average vectors are no longer sparse (especially after filtering for a minimum min.mean
).
In your case, somehow one or the other average pseudo-cell has >50% of zeros, even after filtering for a minimum mean. This implies that one batch is completely silent for many genes that are expressed in another batch. I don't know why this might happen, but it seems pretty extreme; you might consider finding these genes and removing them, because it probably won't be sensible to include them in a batch correction step.
Thanks for your reply. I found this situation commonly when we mixed samples with different tissure types. I mixed blood samples and tumor samples, then I gained epithelial cells after QC, normalization, MNN, clustering and annotation, in this way, there will be small set of epithelial cells from blood (1-5 cells on average in each blood sample, I think this is false positive). subsequently, I subsetted epithelial cells and run multiBatchNorm
and MNN
. I thought the error was due to the small set of cells in one batch. After filtering batches with only a few cells, I run the workflow successfully, I don't know if this is correct.
By following your advice, is it good to use HVG
s to run multiBatchNorm
?
Hi, when I use
multiBatchNorm
to do normalization and adjust for the sequencing depth, I tried to figure out it by changing the argumentmin.mean
, but it cannot work. And it seems this problem has never been reported since I googled and searched this in github without any result.When we encountered this error, does it mean
multiBatchNorm
is not fit for out data and we should do something else likecosineNorm
function?