MarioniLab / FurtherMNN2018

Code for further development of the mutual nearest neighbours batch correction method, as implemented in the batchelor package.

Is MNN suitable for TPM or CPM? How to set the value of k? #4

Closed yhting closed 6 years ago

yhting commented 6 years ago

In the paper “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors”, the input data are counts, but our expression profiles contain TPM or CPM values. So, is MNN suitable for TPM or CPM? If so, how should I normalize the dataset to remove cell-specific biases? Also, none of our batches reaches 1000 cells, and one batch has only 24 single cells, so I don’t know how to set the value of the parameter k. Would you mind giving some suggestions?

LTLA commented 6 years ago
  1. I suppose there's no reason that the MNN approach couldn't be used on log-TPM or log-CPM values. However, your mileage may vary because the pseudo-count has no obvious interpretation when you add it to a CPM or TPM value (as compared to adding it to the original counts) - see my rant here.
  2. "How should I normalize the dataset to remove cell-specific biases?" I don't understand this question. If you've computed TPMs or CPMs, you've already performed library size normalization, so you've already removed most technical biases except for composition bias. If you want to account for composition bias (which is probably the safe thing to do), you could try using the CPMs/TPMs in computeSumFactors; but again, your mileage may vary as the function expects counts.
  3. k is the smallest expected size of any subpopulation in the data. The default of 20 means that we expect at least 20 cells in any given subpopulation. In your case, this is unlikely to be true for the smallest batch, which has only 24 cells in total. One could try merging the smallest batch with a larger batch using a low k, and then using the default k for subsequent merges. An example of how to manually specify a non-linear merge sequence is provided in pancreas/plotCorrection.R for fastMNN.
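
To make point 3 concrete, here is a minimal brute-force sketch of mutual nearest neighbour pair detection on toy 2D data. It is not how batchelor implements the search; the function names `nearest` and `mnn_pairs` and the coordinates are made up for illustration. The small batch contains a single population of 3 cells, while the large batch contains a matching population plus a second one that is absent from the small batch.

```python
import math

def nearest(query, reference, k):
    """For each query point, return the index set of its k nearest reference points."""
    out = []
    for q in query:
        order = sorted(range(len(reference)),
                       key=lambda j: math.dist(q, reference[j]))
        out.append(set(order[:k]))  # if k > len(reference), this is every point
    return out

def mnn_pairs(batch1, batch2, k):
    """Return (i, j) pairs where cell i of batch1 and cell j of batch2
    are each among the other's k nearest neighbours (hypothetical helper)."""
    nn12 = nearest(batch1, batch2, k)  # batch2 neighbours of each batch1 cell
    nn21 = nearest(batch2, batch1, k)  # batch1 neighbours of each batch2 cell
    return [(i, j)
            for i in range(len(batch1))
            for j in sorted(nn12[i])
            if i in nn21[j]]

# Toy data: the small batch has one population of 3 cells.
small = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
# The large batch has the same population (shifted by a "batch effect",
# indices 0-2) plus an unrelated population (indices 3-5).
large = [(0.5, 0.5), (0.6, 0.5), (0.5, 0.6),
         (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

pairs_low_k = mnn_pairs(small, large, k=2)
pairs_high_k = mnn_pairs(small, large, k=4)

# Pairs that involve the unrelated population in the large batch.
spurious_low = [p for p in pairs_low_k if p[1] >= 3]
spurious_high = [p for p in pairs_high_k if p[1] >= 3]
print("k=2 spurious pairs:", spurious_low)
print("k=4 spurious pairs:", spurious_high)
```

With k below the size of the only population in the small batch, all pairs stay within the matching population. Once k exceeds it, every cell of the small batch is trivially a "neighbour" of every cell in the large batch, so the mutual requirement stops filtering and pairs with the unrelated population appear. This is the toy version of why a low k is suggested when merging a very small batch.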

P.S. Life would be a lot easier for you if you just started with the counts. If you have CPMs, you must have had counts at some point in your processing pipeline.
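
One way to see the pseudo-count issue from point 1: adding a fixed value on the CPM scale corresponds to adding a different number of counts in each cell, depending on its library size. The sketch below only does this arithmetic; the helper names and library sizes are illustrative assumptions, not part of any package.

```python
import math

def cpm(count, library_size):
    """Counts per million for a single gene in a single cell."""
    return count / library_size * 1e6

def pseudo_count_in_counts(pseudo_cpm, library_size):
    """Count-scale equivalent of a pseudo-count added on the CPM scale
    (hypothetical helper for illustration)."""
    return pseudo_cpm * library_size / 1e6

results = {}
for lib in (10_000, 1_000_000):
    count = 1  # the same observed count in both cells
    log_cpm = math.log2(cpm(count, lib) + 1)    # log2(CPM + 1)
    log_count = math.log2(count + 1)            # log2(count + 1)
    equiv = pseudo_count_in_counts(1, lib)      # what "+1 CPM" means in counts
    results[lib] = (log_cpm, log_count, equiv)
    print(f"library={lib}: log2(CPM+1)={log_cpm:.3f}, "
          f"log2(count+1)={log_count:.3f}, "
          f"+1 CPM is +{equiv} counts")
```

On the count scale, a pseudo-count of 1 has the same meaning in every cell. On the CPM scale, the same "+1" is equivalent to a small fraction of a count in a shallowly sequenced cell and a full count in a cell with a million reads, so the amount of shrinkage towards zero varies from cell to cell.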

yhting commented 6 years ago

@LTLA The TPM expression profile has already been log-transformed as log2(tpm+1). So you mean that I can use the log-TPM or log-CPM values as the input data without further processing, right?

LTLA commented 6 years ago

Maybe. As I said, your mileage may vary. I've never tried it myself.

yhting commented 6 years ago

Thanks for your reply.