elbamos / largeVis

An implementation of the largeVis algorithm for visualizing large, high-dimensional datasets, for R

Import K-nn feature (external K-nn) #42

Closed vmarkovtsev closed 6 years ago

vmarkovtsev commented 7 years ago

Hey, the author of kmcuda here.

I've got a very fast K-nn implementation on GPU which scales to millions of samples by hundreds of dimensions. I guess this project would benefit from it. The easiest way to integrate it seems to be adding the ability to import K-nn assignments into the largeVis API (that is, an additional function argument).

I can try to make a PR myself (in the near future). Alternatively, it should be straightforward for you to add it. What do you prefer?

elbamos commented 7 years ago

Sure, happy to try it out! You can submit a PR or we can work on it together.

To get it integrated I'll need the new code, code that turns it off and on depending on whether the required hardware is present, appropriate tests, and any required modifications to the travis-ci configuration to test it.

If you really got a speed-up this way, great job! I tried it for a few days but wasn't able to get better performance out of the GPU than using the current algorithm.

vmarkovtsev commented 7 years ago

Great. Actually, my plan is to avoid all this additional code, tests, build complications, etc. I suggest adding an "import" feature, a universal solution which allows using any Knn algorithm in the world (including kmcuda). Much like how Kmeans initialization usually lets you specify custom centroids to start with.

Is there any standard dataset I can benchmark on?

elbamos commented 7 years ago

Well, I benchmark on MNIST.

But have you looked at my benchmarks page? The annoy benchmarks page? The closest thing to a standard knn benchmark dataset is SIFT. This is one of the reasons I think a GPU implementation is challenging: just putting the data onto the GPU will consume quite a few GB of GPU RAM.

I can certainly appreciate the logic behind a universal knn interface, but there are currently something like a dozen packages implementing different knn algorithms with different APIs.

vmarkovtsev commented 7 years ago

My GPU has 12 GB, and that is enough provided that I am running Knn on 4 in parallel :) Normally a single GPU is a 100x improvement over a single CPU core, though of course the CPU is not following the brute-force approach. At the same time, since it takes about 5 minutes to cluster 10M samples in 256 dimensions, Knn leverages the clusters, typically reducing the brute-force work by a factor of 10 (depending on the dataset).
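To illustrate the general idea of letting clusters cut down brute-force work, here is a minimal Python sketch. It is not kmcuda's actual algorithm: the function name `cluster_pruned_knn` and the `n_probe` parameter are assumptions made for this example, and the pruning rule (search only the clusters whose centroids are closest) is one simple, approximate way to do it.

```python
import math


def cluster_pruned_knn(points, centroids, assignments, k, n_probe=1):
    """Approximate k-nn: search only the n_probe clusters nearest to each point.

    Hypothetical helper for illustration; not taken from kmcuda or largeVis.
    """
    # Group point indices by the cluster they were assigned to.
    members = {c: [] for c in range(len(centroids))}
    for idx, c in enumerate(assignments):
        members[c].append(idx)

    result = []
    for i, p in enumerate(points):
        # Rank clusters by centroid distance and keep the closest n_probe.
        nearest = sorted(
            range(len(centroids)),
            key=lambda c: math.dist(p, centroids[c]),
        )[:n_probe]
        # Brute force only over the members of those clusters.
        candidates = [j for c in nearest for j in members[c] if j != i]
        candidates.sort(key=lambda j: math.dist(p, points[j]))
        result.append(candidates[:k])
    return result


points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centroids = [(0.33, 0.33), (10.33, 10.33)]
assignments = [0, 0, 0, 1, 1, 1]
neighbors = cluster_pruned_knn(points, centroids, assignments, k=2)
```

With two equally sized clusters and `n_probe=1`, each point is compared against roughly half of the dataset instead of all of it. The price is that true neighbors lying in an unprobed cluster are missed, so the result is approximate.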

I am not suggesting any universal API. Consider the kmeans function in R. It has a centers argument, which is either a number or a matrix. The latter turns off the internal centroid initialization and imports the provided centroids instead. It will always work with any number of the craziest libraries and modules, e.g. AFK-MC2, forever. I am suggesting the same feature for largeVis: by default, calculate Knn using the internal algorithm, but be able to simply use raw Knn assignments as-is.
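A minimal sketch of that "number or matrix" pattern, written in Python only for illustration. Nothing here is largeVis code: `resolve_neighbors` and `brute_force_knn` are hypothetical names, and the brute-force search merely stands in for whatever internal algorithm the package uses.

```python
import math


def brute_force_knn(points, k):
    """Return, for each point, the indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        # Sort every other point by Euclidean distance to p.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(others[:k])
    return neighbors


def resolve_neighbors(points, knn):
    """Hypothetical entry point: `knn` is an int k or precomputed neighbor lists."""
    if isinstance(knn, int):
        # Default path: compute the neighbors internally.
        return brute_force_knn(points, knn)
    # Import path: accept externally computed assignments, validating them only.
    n = len(points)
    if len(knn) != n:
        raise ValueError("need one neighbor list per point")
    for i, row in enumerate(knn):
        for j in row:
            if not 0 <= j < n or j == i:
                raise ValueError(f"invalid neighbor {j} for point {i}")
    return [list(row) for row in knn]


points = [(0, 0), (1, 0), (5, 0), (6, 0)]
computed = resolve_neighbors(points, 1)
imported = resolve_neighbors(points, [[2], [3], [0], [1]])
```

The import path never looks at how the assignments were produced, so any external Knn library can feed it, which is the point of the proposal.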

elbamos commented 7 years ago

Ok. I'm not sure I completely understand but happy to take a look at whatever you propose!

vmarkovtsev commented 7 years ago

Today I discovered https://github.com/facebookresearch/faiss, which is a much more advanced K-nn implementation from Facebook AI.