Teichlab / bbknn

Batch balanced KNN
MIT License
142 stars 23 forks source link

considering using `pynndescent` for appox nn search? #40

Closed giovp closed 3 years ago

giovp commented 3 years ago

no more annoy dependency, pynndescent is also default for umap, and it performs much better

https://github.com/lmcinnes/pynndescent

image

ktpolanski commented 3 years ago

This is a pretty good idea actually, thanks for the suggestion - I've been annoyed by annoy for a while as the most recent version throws segfaults all over the place. This also resolves the whole angular metric discrepancy between annoy and UMAP. Need to see how easy this is to install and how efficient index construction is (I remember some of the advertised fast query approaches spending forever crafting theirs, negating the point), but I'm hopeful!

What do you mean by this being a UMAP default by the way?

giovp commented 3 years ago

right, had those segfaults issues just recently 😅

What do you mean by this being a UMAP default by the way?

since umap 0.5 pynndescent is a hard dependency https://umap-learn.readthedocs.io/en/latest/release_notes.html#what-s-new-in-0-5

ktpolanski commented 3 years ago

Just dropping a line that I haven't forgotten about this, but I've had other stuff on my plate recently. I've finally done the long-overdue BBKNN refactor in preparation for this, creating a matrix division like in RBCDE and MultiMAP.

I'll give pynndescent a shot at some point soon and very likely switch over.

ktpolanski commented 3 years ago

I've finished this up, and it's live in 1.5.0.

Personally, I've found pynndescent run times weirdly slow. The pancreas data is 4x slower than annoy, and even 2x slower than cKDTree - an exact neighbour search algorithm. However, pynndescent does natively support a lot of metrics, and ingests custom functions too, so supporting it increases the package's functionality.