MarioniLab / FurtherMNN2018

Code for further development of the mutual nearest neighbours batch correction method, as implemented in the batchelor package.

Do I need to run MNN correction again after subsetting the data? #6

Closed zacharylau10 closed 5 years ago

zacharylau10 commented 5 years ago

Hi, thanks for developing this fantastic package. I have a question: I processed my data using MNN correction, and I now want to select some subclusters for further analysis. Should I run MNN correction again?

LTLA commented 5 years ago

No, you don't have to re-correct, and re-correcting may actually be harmful. In the most extreme case, if you select subclusters that are unique to each batch, re-correcting would incorrectly eliminate genuine differences between those subclusters. Conversely, the definition of your subclusters depends on the initial MNN correction, so if that correction wasn't satisfactory, then you're already in trouble.

I can't see a clear use case for correcting, subsetting and correcting again, for the reasons discussed above. The only situation where I would perform multiple rounds of correction is when I am doing a hierarchical merge across batches, and this requires some care to avoid repeating unnecessary steps.

zacharylau10 commented 5 years ago

Thank you~

GangLiTarheel commented 5 years ago

Hi,

Thanks for the discussion. Your answers also solved my problem.

But I am a little confused about the results of MNN. When I used a small subset of the data to find neighbors, I found even more neighbors than when using the whole dataset. I thought the number should always be smaller. Any thoughts or comments on that?

Thank you for your time. Gang

LTLA commented 5 years ago

Here's one possible explanation.

I have subpopulations A and B in batch 1, and subpopulations B' and C' in batch 2. Assume that they're arranged with a batch effect orthogonal to the within-batch differences:

A---B         # Batch 1
    B'---C'   # Batch 2

With the full data, the MNNs are correctly identified between the matching subpopulations B and B', while A and C' have no MNNs. However, if you subset the data to remove B and B', you get:

A             # Batch 1
         C'   # Batch 2

... where the MNNs form (probably incorrectly) between A and C'. This reflects my previous warnings about subsetting. If A and C' happen to be larger than B and B', you'll get more MNNs with the former.
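The scenario above can be reproduced with a toy simulation. This is a minimal sketch in Python with NumPy, not the batchelor implementation: the helper names `make_cluster` and `find_mnn_pairs`, the cluster positions, the cell counts, and `k=5` are all assumptions chosen for illustration. The x-axis plays the role of the within-batch (biological) differences and the y-axis the orthogonal batch effect.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_cluster(center, n):
    """Hypothetical helper: n cells scattered uniformly within 0.3 of `center`."""
    return np.asarray(center, dtype=float) + rng.uniform(-0.3, 0.3, size=(n, 2))

def find_mnn_pairs(batch1, batch2, k=5):
    """Brute-force mutual nearest neighbours: return (i, j) pairs where
    batch1[i] and batch2[j] are each among the other's k nearest neighbours."""
    # Pairwise Euclidean distances, shape (n1, n2).
    dists = np.linalg.norm(batch1[:, None, :] - batch2[None, :, :], axis=2)
    k1 = min(k, batch2.shape[0])
    k2 = min(k, batch1.shape[0])
    nn_of_1 = np.argsort(dists, axis=1)[:, :k1]    # batch 2 neighbours of each batch 1 cell
    nn_of_2 = np.argsort(dists, axis=0)[:k2, :].T  # batch 1 neighbours of each batch 2 cell
    return [(i, int(j))
            for i in range(batch1.shape[0])
            for j in nn_of_1[i]
            if i in nn_of_2[j]]

# Batch 1 sits at y = 0 and batch 2 at y = 3 (the batch effect).
# A and C' are deliberately larger than B and B'.
A = make_cluster((0, 0), 30)
B = make_cluster((5, 0), 10)
B_prime = make_cluster((5, 3), 10)
C_prime = make_cluster((10, 3), 30)

batch1 = np.vstack([A, B])
batch2 = np.vstack([B_prime, C_prime])
labels1 = np.array(["A"] * len(A) + ["B"] * len(B))
labels2 = np.array(["B'"] * len(B_prime) + ["C'"] * len(C_prime))

# Full data: every MNN pair links B to B'.
full_pairs = find_mnn_pairs(batch1, batch2)
full_labels = {(str(labels1[i]), str(labels2[j])) for i, j in full_pairs}
print("full data:", len(full_pairs), "pairs between", full_labels)

# Subset with B and B' removed: pairs now form between A and C'.
subset_pairs = find_mnn_pairs(A, C_prime)
print("subset:", len(subset_pairs), "pairs, all between A and C'")
```

In this layout A's nearest cells in batch 2 are in B', but B' always prefers B, so A is left without MNNs until B and B' are removed. How many pairs appear in each case depends on the subpopulation sizes and on `k`.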

GangLiTarheel commented 5 years ago

Thank you!