facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

Unable to use multiple GPUs: 'config_.device < getNumDevices()' failed: Invalid GPU device #3550

Open FalsoMoralista opened 1 week ago

FalsoMoralista commented 1 week ago

I'm trying to replace the CPU index with a GPU one, but I can't get it to work in a distributed context.

Faiss version:
faiss      1.8.0  pypi_0                        pypi
faiss-gpu  1.8.0  py3.12_h4c7d538_0_cuda12.1.1  pytorch

Installed from miniconda.

Running on:

Interface:

Context

After initializing the K-means centroids for each value of K, I try to replace the default (CPU) index with a GPU one. This works on a single GPU but fails when using multiple devices.

import faiss
import torch


class KMeansModule:

    def __init__(self, nb_classes, dimensionality=256, n_iter=50, tol=1e-4, k_range=(2, 3, 4, 5), resources=None, config=None):

        self.resources = resources
        self.config = config

        self.k_range = list(k_range)
        self.d = dimensionality
        self.max_iter = n_iter
        self.tol = tol

        # Create the K-means objects
        if len(self.k_range) == 1:
            # Wrap each object in a one-element list so that n_kmeans[class_id][k]
            # can be indexed the same way as in the multi-K case below.
            self.n_kmeans = [[faiss.Kmeans(d=dimensionality, k=self.k_range[0], niter=1, verbose=True, min_points_per_centroid=1)] for _ in range(nb_classes)]
        else:
            # For each class, create n K-means objects (one for each value of K), where n = len(k_range)
            # (this will be used to select the best K).
            self.n_kmeans = []
            for _ in range(nb_classes):
                self.n_kmeans.append([faiss.Kmeans(d=dimensionality, k=k, niter=1, verbose=False, min_points_per_centroid=1) for k in self.k_range])

    def initialize_centroids(self, batch_x, class_id, resources, rank, device, config, cached_features):
        image_list = cached_features[class_id]  # Use the features cached from the previous epoch
        batch_x = torch.stack(image_list)

        # For each K (model selection)
        for k in range(len(self.k_range)):
            # Train the K-means model for one iteration to initialize the centroids
            self.n_kmeans[class_id][k].train(batch_x.detach().cpu())

            # Replace the regular (CPU) index with a GPU one
            index_flat = self.n_kmeans[class_id][k].index

            # `rank` is passed as the GPU device id; this is the call that fails
            gpu_index_flat = faiss.index_cpu_to_gpu(resources, rank, index_flat)
            self.n_kmeans[class_id][k].index = gpu_index_flat

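# Called in each process; k_means_module is the KMeansModule instance
res = faiss.StandardGpuResources()
k_means_module.initialize_centroids(batch_x=None, class_id=class_id, resources=res, rank=rank, device=device, config=None, cached_features=cached_features)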

Each rank (0, 1, 2, ..., 7) is used as the corresponding GPU device id.

Output

RuntimeError: Error in faiss::gpu::GpuIndex::GpuIndex(std::shared_ptr<faiss::gpu::GpuResources>, int, faiss::MetricType, float, faiss::gpu::GpuIndexConfig) at /home/circleci/miniconda/conda-bld/faiss-pkg_1709244513520/work/faiss/gpu/GpuIndex.cu:65: Error: 'config_.device < getNumDevices()' failed: Invalid GPU device 7
Process Process-4:
Traceback (most recent call last):
  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/main_deeper_cluster.py", line 52, in process_main
    app_main(args=params)
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/engine_deeper_cluster.py", line 401, in main
    k_means_module.init(resources=res, rank=rank, cached_features=cached_features_last_epoch, config=cfg, device=device) # E-step
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/src/KMeans.py", line 98, in init
    self.initialize_centroids(batch_x=None,
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/src/KMeans.py", line 92, in initialize_centroids
    gpu_index_flat = faiss.index_cpu_to_gpu(resources, rank, index_flat)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/site-packages/faiss/swigfaiss_avx2.py", line 12799, in index_cpu_to_gpu
    return _swigfaiss_avx2.index_cpu_to_gpu(provider, device, index, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Error in faiss::gpu::GpuIndex::GpuIndex(std::shared_ptr<faiss::gpu::GpuResources>, int, faiss::MetricType, float, faiss::gpu::GpuIndexConfig) at /home/circleci/miniconda/conda-bld/faiss-pkg_1709244513520/work/faiss/gpu/GpuIndex.cu:65: Error: 'config_.device < getNumDevices()' failed: Invalid GPU device 3
Process Process-7:
Traceback (most recent call last):
  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/main_deeper_cluster.py", line 52, in process_main
    app_main(args=params)
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/engine_deeper_cluster.py", line 401, in main
    k_means_module.init(resources=res, rank=rank, cached_features=cached_features_last_epoch, config=cfg, device=device) # E-step
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/src/KMeans.py", line 98, in init
    self.initialize_centroids(batch_x=None,
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/src/KMeans.py", line 92, in initialize_centroids
    gpu_index_flat = faiss.index_cpu_to_gpu(resources, rank, index_flat)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/site-packages/faiss/swigfaiss_avx2.py", line 12799, in index_cpu_to_gpu
    return _swigfaiss_avx2.index_cpu_to_gpu(provider, device, index, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Error in faiss::gpu::GpuIndex::GpuIndex(std::shared_ptr<faiss::gpu::GpuResources>, int, faiss::MetricType, float, faiss::gpu::GpuIndexConfig) at /home/circleci/miniconda/conda-bld/faiss-pkg_1709244513520/work/faiss/gpu/GpuIndex.cu:65: Error: 'config_.device < getNumDevices()' failed: Invalid GPU device 6
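
Note on the assertion: `config_.device < getNumDevices()` fails when the requested device id is not smaller than the number of GPUs that Faiss can see in the current process. "Invalid GPU device 3" therefore means that the process saw at most three GPUs. A minimal sketch for checking this in each worker follows. The helper names (`parse_visible_devices`, `local_device_for_rank`, `report_and_move`) are hypothetical, and the `rank % num_visible` mapping is an assumption that fits setups where each process is restricted to a subset of GPUs (for example through `CUDA_VISIBLE_DEVICES`); it is not taken from the original code.

```python
import os


def parse_visible_devices(value):
    """Return the entries of a CUDA_VISIBLE_DEVICES string, or None if unset."""
    if value is None:
        return None  # variable not set: no masking is applied to this process
    return [entry.strip() for entry in value.split(",") if entry.strip()]


def local_device_for_rank(rank, num_visible):
    """Map a global rank to a device id that exists in this process (assumed mapping)."""
    if num_visible <= 0:
        raise RuntimeError("no GPU is visible to this process")
    return rank % num_visible


def report_and_move(cpu_index, rank):
    """Print what this process sees, then clone the index to a valid device."""
    import faiss  # imported lazily so the helpers above run without a GPU build

    num_visible = faiss.get_num_gpus()
    masked = parse_visible_devices(os.environ.get("CUDA_VISIBLE_DEVICES"))
    print(f"rank={rank} CUDA_VISIBLE_DEVICES={masked} faiss sees {num_visible} GPU(s)")

    device = local_device_for_rank(rank, num_visible)
    res = faiss.StandardGpuResources()
    # Return the resources as well so the caller can keep them alive with the index.
    return res, faiss.index_cpu_to_gpu(res, device, cpu_index)
```

If the printed GPU count is 1 in every worker, each process is probably pinned to a single GPU, and device 0 is the only valid id inside that process regardless of its rank.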

Attempts

I have also tried the approach used in DeeperCluster (https://github.com/facebookresearch/DeeperCluster/blob/main/src/distributed_kmeans.py#L182), expecting that each process would initialize its own resources and set the device number accordingly, but the same error occurs.

# In the caller (one process per rank):
res = faiss.StandardGpuResources()
cfg = faiss.GpuIndexFlatConfig()
cfg.device = rank

# Inside initialize_centroids, where res and cfg arrive as `resources` and `config`:
# Replace the regular index with a GPU one
index_flat = self.n_kmeans[class_id][k].index
gpu_index_flat = faiss.GpuIndexFlatL2(resources, self.d, config)
self.n_kmeans[class_id][k].index = gpu_index_flat

Output

  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/main_deeper_cluster.py", line 52, in process_main
    app_main(args=params)
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/engine_deeper_cluster.py", line 401, in main
    logger.info('Initializing centroids...')
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/src/KMeans.py", line 98, in init
    self.initialize_centroids(batch_x=None,
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/src/KMeans.py", line 91, in initialize_centroids
    gpu_index_flat = faiss.GpuIndexFlatL2(resources, self.d, config)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/site-packages/faiss/swigfaiss_avx2.py", line 11575, in __init__
    _swigfaiss_avx2.GpuIndexFlatL2_swiginit(self, _swigfaiss_avx2.new_GpuIndexFlatL2(*args))
                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Error in faiss::gpu::GpuIndex::GpuIndex(std::shared_ptr<faiss::gpu::GpuResources>, int, faiss::MetricType, float, faiss::gpu::GpuIndexConfig) at /home/circleci/miniconda/conda-bld/faiss-pkg_1709244513520/work/faiss/gpu/GpuIndex.cu:65: Error: 'config_.device < getNumDevices()' failed: Invalid GPU device 7
Process Process-6:

I have also tried the solution proposed in the wiki (https://github.com/facebookresearch/faiss/wiki/Faiss-on-the-GPU):

resources = [faiss.StandardGpuResources() for _ in range(world_size)]

index_flat = self.n_kmeans[class_id][k].index
gpu_index_flat = faiss.index_cpu_to_gpu_multiple(resources, devices=[0,1,2,3,4,5,6,7], index=index_flat)
self.n_kmeans[class_id][k].index = gpu_index_flat

Which generates:

  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/engine_deeper_cluster.py", line 401, in main
    k_means_module.init(resources=resources, rank=rank, cached_features=cached_features_last_epoch, config=None, device=device) # E-step
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/src/KMeans.py", line 100, in init
    self.initialize_centroids(batch_x=None,
  File "/home/rtcalumby/adam/luciano/LifeCLEFPlant2022/DeepCluster/src/KMeans.py", line 93, in initialize_centroids
    gpu_index_flat = faiss.index_cpu_to_gpu_multiple(resources, devices=[0,1,2,3,4,5,6,7],index=index_flat)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rtcalumby/miniconda3/envs/py382/lib/python3.12/site-packages/faiss/swigfaiss_avx2.py", line 12802, in index_cpu_to_gpu_multiple
    return _swigfaiss_avx2.index_cpu_to_gpu_multiple(provider, devices, index, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Wrong number or type of arguments for overloaded function 'index_cpu_to_gpu_multiple'.
  Possible C/C++ prototypes are:
    faiss::gpu::index_cpu_to_gpu_multiple(std::vector< faiss::gpu::GpuResourcesProvider * > &,std::vector< int > &,faiss::Index const *,faiss::gpu::GpuMultipleClonerOptions const *)
    faiss::gpu::index_cpu_to_gpu_multiple(std::vector< faiss::gpu::GpuResourcesProvider * > &,std::vector< int > &,faiss::Index const *)
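
The prototypes listed in this TypeError show that the raw binding expects `std::vector` arguments, which plain Python lists do not satisfy. The Faiss Python package also ships list-friendly wrappers such as `faiss.index_cpu_to_gpus_list` and `faiss.index_cpu_to_gpu_multiple_py`. The sketch below uses the former with its default cloner options; `usable_gpu_ids` and `clone_to_gpus` are hypothetical helper names, and filtering the ids against `faiss.get_num_gpus()` is an assumption added to avoid the "Invalid GPU device" assertion, not something the wiki prescribes.

```python
def usable_gpu_ids(requested, num_visible):
    """Keep only the requested device ids that exist in this process."""
    return [g for g in requested if 0 <= g < num_visible]


def clone_to_gpus(cpu_index, requested):
    """Clone a CPU index onto the requested GPUs that this process can see."""
    import faiss  # imported lazily so usable_gpu_ids runs without a GPU build

    gpus = usable_gpu_ids(requested, faiss.get_num_gpus())
    if not gpus:
        raise RuntimeError("none of the requested GPUs is visible to this process")
    # The wrapper accepts a plain Python list and creates one
    # StandardGpuResources object per device internally.
    return faiss.index_cpu_to_gpus_list(cpu_index, gpus=gpus)
```

For example, `clone_to_gpus(index_flat, [0, 1, 2, 3, 4, 5, 6, 7])` would use all eight devices only in a process that actually sees eight GPUs.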

FalsoMoralista commented 1 week ago

I also tried this gist (from issue #878), without success: https://gist.github.com/mdouze/bfa06e7dc0869f0c0495928aab25800f

brendon-ribeiro918 commented 2 days ago

It depends on which devices you use.