Closed (Prabhat1808 closed this issue 2 months ago)
It is not safe to use one GPU on multiple processes.
@mdouze I see. Is the above issue because FAISS does not support multiple processes using the same GPU (as it's unsafe)? Or is it something else?
P.S. I would like to understand why it is unsafe, to see whether it works for my use case even if it is not recommended in general. Can you point me to the relevant resources?
Thanks
According to the message, cudaHostAlloc returned error 3, which possibly means the GPU is not available.
@Prabhat1808 maybe you could check the below code:
clustering_index = faiss.index_cpu_to_gpu(res, 1, clustering_index_cpu)
Here device=1, which means the 2nd card on your system. Do you really have at least 2 GPUs? The device parameter starts from 0.
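A quick way to rule out an out-of-range device index is to compare it against the number of GPUs Faiss reports. This is only a sketch and assumes the GPU build of Faiss is installed; `check_device` and `to_gpu` are hypothetical helper names, not part of Faiss.

```python
def check_device(device, num_gpus):
    """Return True if `device` is a valid zero-based GPU index."""
    return 0 <= device < num_gpus


def to_gpu(index_cpu, device):
    """Move a CPU index to GPU `device`, failing early on a bad device index."""
    import faiss  # imported lazily; requires the GPU build of Faiss

    num_gpus = faiss.get_num_gpus()
    if not check_device(device, num_gpus):
        raise ValueError(f"device={device} is out of range: {num_gpus} GPU(s) visible")
    res = faiss.StandardGpuResources()
    # Return the resources object too, so the caller keeps it alive with the index.
    return res, faiss.index_cpu_to_gpu(res, device, index_cpu)
```

On a machine with 3 GPUs, `device=1` passes this check, so a failure with that value would have to come from somewhere else.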
@matrixji the machine has 3 GPUs, so that is not the cause of the error. Moreover, the 2nd card is available and the index.train() step works normally.
The issue occurs when I create multiple processes and use them to train multiple indexes in parallel, as mentioned above.
num_proc = 4
proc_pool = Pool(num_proc)
res = proc_pool.map(build_index, [data]*num_proc)
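For comparison, here is a minimal sketch of a pool that gives each GPU to exactly one task and starts its workers with the "spawn" method. It rests on two assumptions that this thread does not confirm: that forking after CUDA has been touched in the parent can produce an "initialization error", and that spreading the work over the 3 GPUs is acceptable for the use case. `split_by_device`, `build_on_device` and `run` are hypothetical names, and `IndexFlatL2` is only a stand-in for the real index, which is not shown in the report.

```python
import multiprocessing as mp


def split_by_device(num_tasks, num_gpus):
    """Group task ids by GPU so that each GPU is handled by a single task."""
    groups = {dev: [] for dev in range(num_gpus)}
    for task_id in range(num_tasks):
        groups[task_id % num_gpus].append(task_id)
    return groups


def build_on_device(args):
    """Hypothetical worker: build all datasets assigned to one GPU, one after another."""
    device, datasets = args
    import faiss  # imported inside the worker so CUDA is initialized per process
    import numpy as np

    res = faiss.StandardGpuResources()
    sizes = []
    for data in datasets:
        xb = np.ascontiguousarray(data, dtype="float32")  # Faiss expects float32
        index = faiss.index_cpu_to_gpu(res, device, faiss.IndexFlatL2(xb.shape[1]))
        index.add(xb)
        sizes.append(index.ntotal)  # return plain numbers, not GPU index objects
    return sizes


def run(datasets, num_gpus):
    groups = split_by_device(len(datasets), num_gpus)
    tasks = [(dev, [datasets[i] for i in ids]) for dev, ids in groups.items() if ids]
    if not tasks:
        return []
    ctx = mp.get_context("spawn")  # fresh interpreter per worker, no forked state
    with ctx.Pool(len(tasks)) as pool:
        return pool.map(build_on_device, tasks)
```

With "spawn", the call has to sit under an `if __name__ == "__main__":` guard, for example `run([data] * 4, num_gpus=3)`. Four datasets on three GPUs are grouped as `{0: [0, 3], 1: [1], 2: [2]}`, so no GPU is used by two processes at the same time.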
Got it. Actually, I tried running your code on faiss (compiled from master), and it succeeds. So, which GPU card are you using, and how much host memory do you have? I noticed your code may require about 10GB of GPU memory and about 8GB of host memory (since it failed in hostAlloc, the host memory probably does not meet the requirement).
Summary
Using the Python multiprocessing library to train multiple indexes on a GPU in parallel throws the following error ->
RuntimeError: Error in virtual void faiss::gpu::StandardGpuResourcesImpl::initializeForDevice(int) at /root/miniconda3/conda-bld/faiss-pkg_1623030479928/work/faiss/gpu/StandardGpuResources.cpp:283: Error: 'err == cudaSuccess' failed: failed to cudaHostAlloc 268435456 bytes for CPU <-> GPU async copy buffer (error 3 initialization error) """
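The allocation size in the message is easier to read in MiB. The arithmetic below is just a sanity check. It assumes that each of the 4 worker processes creates its own `StandardGpuResources` and therefore requests its own buffer of this size, which the thread does not confirm.

```python
PINNED_BYTES = 268435456  # size reported in the error message above
NUM_PROC = 4              # number of worker processes in the report

pinned_mib = PINNED_BYTES / 2**20
total_mib = pinned_mib * NUM_PROC
print(f"{pinned_mib:.0f} MiB per process, {total_mib:.0f} MiB for {NUM_PROC} processes")
# prints "256 MiB per process, 1024 MiB for 4 processes"
```

About 1 GiB of pinned host memory in total is small next to the host memory figures discussed in this thread, and the message itself labels error 3 as an initialization error rather than an out-of-memory condition.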
Platform
OS: Ubuntu 18.04.6 LTS
Faiss version: 1.7.1
Installed from: anaconda
Faiss compilation options:
Running on:
Interface:
Reproduction instructions
Index Creation and Training Function ->
Creating random data, for ease of issue reproduction ->
data = np.random.randint(1, 200, (99529, 72), dtype='uint8')
Running on single process ->
Output on single process ->
Running on multiple processes (using a Python multiprocessing pool) ->
Throws the following error ->