Teichlab / bbknn

Batch balanced KNN
MIT License

Understanding `neighbors_within_batch` parameter? #19

Closed chris-rands closed 4 years ago

chris-rands commented 4 years ago

Thanks for the nice tool! I'm trying to understand the `neighbors_within_batch` parameter conceptually. I read the docstring, but I'm still not clear on exactly what it means. Is it 'k' when `approx=True`? Setting this value higher leads to a more spread-out UMAP (i.e. less correction), which may be preferable for some datasets. Is there a reason for the default value of 3?

https://github.com/Teichlab/bbknn/blob/7e736d4eea36369b1ad426667eb1d7b90ad0fd9f/bbknn/__init__.py#L216-L218

ktpolanski commented 4 years ago

Thanks for the kind words, sorry for the slightly delayed reply - I need to start regularly checking the email tied to my GitHub again.

BBKNN performs a KNN search for each batch individually, and then merges the resulting neighbour lists together. This parameter is the k for that search, for each batch. The default of 3 stems from the fact that, when computing the KNN within the batch a particular cell comes from, the returned neighbours will include the cell itself, regardless of the neighbour identification algorithm. That leaves two genuine neighbours in the cell's own batch, and going any lower than that feels excessive. The value can be adjusted if desired, but is kept low as it tends to lead to better correction (as you noticed) while also improving run time.
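
As a rough illustration of the per-batch search and merge described above, here is a minimal brute-force sketch in plain Python. It is not BBKNN's actual implementation: the function name `batch_balanced_knn`, the toy `points` and `batches` data, and the use of Euclidean distance are all assumptions made for the example. It only shows how a per-batch k turns into a merged neighbour list, and why each cell shows up among its own neighbours.

```python
import math

def batch_balanced_knn(points, batches, neighbors_within_batch=3):
    """Toy per-batch KNN (hypothetical helper, brute force, illustration only).

    For every cell, take its k nearest cells in each batch separately,
    then concatenate those per-batch lists into one neighbour list.
    """
    batch_labels = sorted(set(batches))
    # Indices of the cells belonging to each batch.
    members = {
        b: [i for i, label in enumerate(batches) if label == b]
        for b in batch_labels
    }
    neighbours = []
    for p in points:
        merged = []
        for b in batch_labels:
            # Rank this batch's cells by Euclidean distance to the query cell.
            ranked = sorted(members[b], key=lambda j: math.dist(p, points[j]))
            merged.extend(ranked[:neighbors_within_batch])
        neighbours.append(merged)
    return neighbours

# Made-up 2D coordinates standing in for cells in a reduced space.
points = [(0.0, 0.0), (0.1, 0.0), (0.3, 0.0), (0.9, 0.0),   # batch "a"
          (0.05, 0.1), (0.2, 0.1), (0.6, 0.1), (1.0, 0.1)]  # batch "b"
batches = ["a"] * 4 + ["b"] * 4

knn = batch_balanced_knn(points, batches, neighbors_within_batch=3)
print(knn[0])  # neighbours of cell 0: three from batch "a", three from batch "b"
```

In this sketch each cell ends up with `neighbors_within_batch` neighbours per batch, so six in total for two batches, and the search within a cell's own batch always returns the cell itself at distance zero. With k=3 that leaves only two other cells from the same batch, which is the reasoning behind the default.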