facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

How do multiple GPUs merge each of their top-k results? #1770

Closed sunhongmin225 closed 3 years ago

sunhongmin225 commented 3 years ago


Summary

Hi, while analyzing GPU searches I came across the expression below in `GpuIndex::search` in `GpuIndex.cu`:

```cpp
DeviceScope scope(config_.device);
auto stream = resources_->getDefaultStream(config_.device);
auto outDistances = toDeviceTemporary(
        resources_.get(), config_.device, distances, stream, {(int)n, (int)k});
auto outLabels = toDeviceTemporary(
        resources_.get(), config_.device, labels, stream, {(int)n, (int)k});
```

It seems that each GPU holds its own `outDistances` and `outLabels` and accumulates top-k results in them separately. However, to return the **final** top-k output, these separate per-GPU top-k results must be merged whenever more than one GPU is collecting results. I've tried to locate where this happens, but couldn't. Could you tell me where or how the final top-k results of multiple GPUs are merged?

An additional question: if, for example, 4 GPUs are active, does each GPU work on a quarter of the dataset? I'm curious how the GPUs divide the workload.

Thanks for your very helpful search implementation.

Best, Min.
wickedfoo commented 3 years ago

The kernels to do this (i.e., concatenate the partial results, then k-select the concatenated data) exist in the Faiss library; however, they are not currently wired together to perform the merge entirely on the GPUs. This is something we could do if enough users are interested in it.

The implementation at the moment covers both CPU and GPU indices, but performs the merge itself on the CPU, here:

https://github.com/facebookresearch/faiss/blob/master/faiss/IndexShards.cpp#L45

If IndexShards is used for GPU indices, then the data will be copied to the CPU and merged on the CPU using this function.
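The merge described above (concatenate each shard's partial top-k, then k-select the concatenation) can be sketched as follows. This is an illustrative stdlib-only sketch, not Faiss's actual `IndexShards` code; the function name `merge_topk` and its list-based inputs are assumptions for the example, and it assumes a metric where smaller distance is better (e.g., L2).

```python
import heapq

def merge_topk(shard_distances, shard_labels, k):
    """Merge per-shard top-k results for a single query.

    shard_distances: one list of distances per shard
    shard_labels: matching lists of global vector IDs per shard
    Returns the overall (distances, labels) top-k, sorted ascending.
    """
    # Concatenate all partial (distance, label) candidates...
    candidates = []
    for dists, labels in zip(shard_distances, shard_labels):
        candidates.extend(zip(dists, labels))
    # ...then k-select over the concatenated data.
    best = heapq.nsmallest(k, candidates, key=lambda dl: dl[0])
    return [d for d, _ in best], [l for _, l in best]

# Two shards each return their local top-3 for one query:
dists, labels = merge_topk(
    [[0.1, 0.5, 0.9], [0.2, 0.3, 1.0]],
    [[10, 11, 12], [20, 21, 22]],
    k=3,
)
# dists → [0.1, 0.2, 0.3], labels → [10, 20, 21]
```

For an inner-product metric the same sketch applies with `heapq.nlargest` instead, since larger scores are better.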

sunhongmin225 commented 3 years ago

Great. Lots of thanks for your super clear explanation.

One more question in addition to what I wrote above: say I'm using 4 GPUs on the sift1m dataset. Is it correct that the workload is divided according to the size of the dataset? I.e., does the first GPU handle the first 1M/4 = 250K rows of sift1m, the second the next 250K rows, and so on? If this mechanism is correct, where can I find the code that actually divides the workload across multiple GPUs?

Best, Min.

mdouze commented 3 years ago

This is done via IndexShards and IndexReplicas, see the doc here:

https://github.com/facebookresearch/faiss/wiki/Faiss-on-the-GPU#using-multiple-gpus

sunhongmin225 commented 3 years ago

Thanks a lot for your help, @wickedfoo and @mdouze. I sincerely appreciate it.

Best, Min.