MarioniLab / scran

Clone of the Bioconductor repository for the scran package.
https://bioconductor.org/packages/devel/bioc/html/scran.html

multiBatchNorm with subset.row #37

Closed erbader closed 5 years ago

erbader commented 5 years ago

Hi, I am using the multiBatchNorm function on two SingleCellExperiment objects with different numbers of rows. I want to use the subset.row argument to specify the common rows between the datasets to use for the normalization, but I keep getting an error that the number of rows is different between the datasets.

```
> sce_cptp28_test
class: SingleCellExperiment
dim: 11745 1552
metadata(0):
assays(1): counts
rownames(11745): ENSG00000237491 ENSG00000225880 ... ENSG00000278817 ENSG00000271254
rowData names(0):
colnames(1552): AAACGGGCATGCAATC AAACGGGTCGAATGCT ... TTTGTCATCAAGATCC TTTGTCATCCTATGTT
colData names(0):
reducedDimNames(0):
spikeNames(0):

> sce_cptp29_test
class: SingleCellExperiment
dim: 12200 2009
metadata(0):
assays(1): counts
rownames(12200): ENSG00000237491 ENSG00000225880 ... ENSG00000278384 ENSG00000271254
rowData names(0):
colnames(2009): AAACCTGCAAGTAGTA AAACCTGCATCTATGG ... TTTGTCAGTTATGCGT TTTGTCAGTTGCTCCT
colData names(0):
reducedDimNames(0):
spikeNames(0):

> str(var_genes)
 chr [1:2000] "ENSG00000143546" "ENSG00000115523" "ENSG00000101439" ...

> rescaled_test <- multiBatchNorm(sce_cptp28_test, sce_cptp29_test, subset.row = var_genes)
Error in .check_batch_consistency(batches, byrow = TRUE) :
  number of rows is not the same across batches
```

LTLA commented 5 years ago

Note that multiBatchNorm has been moved to https://github.com/LTLA/batchelor. Unfortunately I can't transfer this issue to a repository outside of this organization, so I'll just answer here.

As you've noticed, multiBatchNorm() checks that the number of rows and the row names are the same across all inputs. Subsetting by subset.row is performed only after that check has passed, so it has no effect if the check fails. This is necessary because subset.row can also be an integer or logical vector, and if your input objects had differently ordered rows, the results of subsetting would be gibberish.
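
To make the ordering problem concrete, here is a minimal base R sketch. It uses toy matrices with made-up gene names (`geneA` and so on are hypothetical, not from the issue) in place of SingleCellExperiment objects, and it does not call multiBatchNorm() at all. It only shows what positional subsetting does when rows are ordered differently:

```r
# Toy matrices (hypothetical data): the same three genes, stored in a different order.
batch1 <- matrix(1:6, nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2")))
batch2 <- matrix(7:12, nrow = 3,
                 dimnames = list(c("geneC", "geneA", "geneB"), c("cell3", "cell4")))

# An integer subset selects rows by position, not by gene identity.
idx <- 1:2
by_position1 <- rownames(batch1[idx, , drop = FALSE])
by_position2 <- rownames(batch2[idx, , drop = FALSE])

print(by_position1)  # "geneA" "geneB"
print(by_position2)  # "geneC" "geneA"

# The two batches now describe different genes in each row.
same_genes <- identical(by_position1, by_position2)
print(same_genes)    # FALSE
```

Because the same index picks different genes in each batch, any per-gene comparison across the two subsets would be meaningless.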

The correct approach is to ensure that all inputs have the same rows in the same order before supplying them to multiBatchNorm(). This is not particularly difficult: just subset the arguments beforehand. After all, if you've managed to get count matrices with the same gene IDs, then 99% of the work is already done.
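
A rough sketch of that pre-subsetting step, using base R only. The toy matrices, gene names (`g1`, `g2`, ...) and the `var_genes` vector are hypothetical stand-ins for the objects in the issue. Indexing by row name with `[common, ]` should work the same way on SingleCellExperiment objects, and the final multiBatchNorm() call is left as a comment because it assumes real SingleCellExperiment inputs and an installed batchelor package:

```r
# Toy count matrices standing in for the two batches (hypothetical data;
# rows are genes, columns are cells).
set.seed(1)
batch1 <- matrix(rpois(12, 5), nrow = 4,
                 dimnames = list(c("g1", "g2", "g3", "g4"), paste0("a", 1:3)))
batch2 <- matrix(rpois(10, 5), nrow = 5,
                 dimnames = list(c("g5", "g4", "g2", "g1", "g6"), paste0("b", 1:2)))

# Genes present in both batches, in the order they appear in batch1.
common <- intersect(rownames(batch1), rownames(batch2))

# Subsetting by name gives both objects the same rows in the same order.
batch1_common <- batch1[common, , drop = FALSE]
batch2_common <- batch2[common, , drop = FALSE]
stopifnot(identical(rownames(batch1_common), rownames(batch2_common)))

# Keep only the genes of interest that survive the intersection.
var_genes <- c("g2", "g4", "g7")  # hypothetical gene set; "g7" is in neither batch
var_genes_common <- intersect(var_genes, common)

# With SingleCellExperiment inputs, the call would then be along the lines of:
# rescaled <- batchelor::multiBatchNorm(batch1_common, batch2_common,
#                                       subset.row = var_genes_common)
```

Restricting `var_genes` to the shared genes is a precaution in this sketch, so that subset.row never refers to a row that was dropped by the intersection.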

You might then ask why we have a subset.row argument at all, if a user is expected to subset manually. The reason is that subset.row was designed to make it convenient to test out different gene sets, provided that all of the inputs were already consistent. For example, I might want to use the top 5000 HVGs, the top 1000 HVGs, my collaborator's favorite gene set, etc.

We require that the inputs are consistent, just to be on the safe side. If the inputs aren't consistent, the function won't try to make sense of them; it'll just fail. I feel this is the best approach to protect people from themselves. Otherwise, users would just throw in data from diverse sources without thinking about the differences in annotation. This could lead to some undesired outcomes, e.g., where PCA is performed on only the mitochondrial genes, because these are the only ones that are named consistently across batches!
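
As an illustration of that failure mode, the following base R sketch intersects two toy sets of gene identifiers and checks how much survives. The identifiers, the `min_frac` threshold and the variable names are all assumptions made up for this example. This is not something multiBatchNorm() does for you:

```r
# Toy identifiers (hypothetical annotation mismatch): batch 1 uses gene symbols
# throughout, batch 2 uses Ensembl IDs for nuclear genes but symbols for
# mitochondrial genes.
ids1 <- c("ACTB", "GAPDH", "MT-CO1", "MT-ND1")
ids2 <- c("ENSG00000075624", "ENSG00000111640", "MT-CO1", "MT-ND1")

# A blind intersection only keeps the genes that happen to be named the same way.
shared <- intersect(ids1, ids2)
print(shared)

# Fraction of the smaller batch that survives the intersection.
frac_shared <- length(shared) / min(length(ids1), length(ids2))

# Hypothetical threshold; choose whatever is sensible for your data.
min_frac <- 0.8
enough_overlap <- frac_shared >= min_frac
if (!enough_overlap) {
  message("Only ", length(shared), " shared genes; check the annotation before normalizing.")
}
```

A check like this before subsetting makes it obvious when the common set is suspiciously small, rather than silently normalizing on a handful of genes.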

LTLA commented 5 years ago

I'm going to assume that this got explained satisfactorily...? Closing.