tsnecuda fails with a large number of points using FAISS 1.7

DavidMChan commented 3 years ago

It seems like tsnecuda is experiencing the same issues as in https://github.com/facebookresearch/faiss/issues/1793. Running the code with ./tsne -k 500000 (500000 2D points drawn from a pair of gaussians) gives:

Starting TSNE calculation with 500000 points.
Initializing cuda handles... done.
KNN Computation... Faiss assertion 'err__ == cudaSuccess' failed in void faiss::gpu::ivfInterleavedScanImpl_32_(faiss::gpu::Tensor<float, 2, true>&, faiss::gpu::Tensor<int, 2, true>&, thrust::device_vector<void*>&, thrust::device_vector<void*>&, faiss::gpu::IndicesOptions, thrust::device_vector<int>&, int, faiss::MetricType, bool, faiss::gpu::Tensor<float, 3, true>&, faiss::gpu::GpuScalarQuantizer*, faiss::gpu::Tensor<float, 2, true>&, faiss::gpu::Tensor<long int, 2, true>&, faiss::gpu::GpuResources*) at /home/davidchan/Repos/faiss/faiss/gpu/impl/scan/IVFInterleaved32.cu:13; details: CUDA error 9 invalid configuration argument
Aborted (core dumped)

Originally posted by @kernfel in https://github.com/CannyLab/tsne-cuda/issues/95#issuecomment-824528732

DavidMChan commented 3 years ago

@kernfel - What version of CUDA/GCC are you using? Also, are you installing FAISS with the conda installation, or the from-scratch FAISS install?

kernfel commented 3 years ago

CUDA toolkit: 11.3
GCC: I may have inadvertently used v10 here... seems my update-alternatives weren't up to date.
FAISS: building from source.

DavidMChan commented 3 years ago

I'm able to reproduce with 500,000 points with CUDA 11.2 and gcc 9.3, building both tsnecuda and FAISS from source. Downgrading to a CPU index does seem to fix the problem, which suggests that the issue is with the FAISS GPU index and not with our downstream code.

For anyone at FAISS, the offending code is here:

    const int32_t kNumCells = static_cast<int32_t>(
        std::sqrt(static_cast<float>(num_points)));
    const int32_t kNumCellsToProbe = 20;

    // Construct the CPU version of the index
    faiss::IndexFlatL2 quantizer(num_dims);
    faiss::IndexIVFFlat cpu_index(&quantizer, num_dims, kNumCells, faiss::METRIC_L2);
    cpu_index.nprobe = kNumCellsToProbe;

    if (num_near_neighbors < 1024)
    {
        int ngpus = faiss::gpu::getNumDevices();
        std::vector<faiss::gpu::GpuResourcesProvider *> res;
        std::vector<int> devs;
        for (int i = 0; i < ngpus; i++)
        {
            res.push_back(new faiss::gpu::StandardGpuResources);
            devs.push_back(i);
        }

        // Convert the CPU index to GPU index
        faiss::Index *search_index = faiss::gpu::index_cpu_to_gpu_multiple(res, devs, &cpu_index);

        search_index->train(num_points, points);
        search_index->add(num_points, points);
        search_index->search(num_points, points, num_near_neighbors, distances, indices);

        delete search_index;
        for (int i = 0; i < ngpus; i++)
        {
            delete res[i];
        }
    }
    else
    {
        // Construct the index table on the CPU (since the GPU
        // can only handle 1023 neighbors)
        cpu_index.train(num_points, points);
        cpu_index.add(num_points, points);
        // Perform the KNN query
        cpu_index.search(num_points, points, num_near_neighbors,
                         distances, indices);
    }

The CPU path (if forced, even with fewer than 1024 neighbors) works, while the GPU path doesn't.
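
One possible mitigation, not verified in this thread, is to keep the GPU index but issue the search in bounded batches so that no single call sees all 500,000 queries. The sketch below is plain C++ and does not depend on FAISS: `search_in_batches`, `SearchFn`, and the `max_batch` parameter are hypothetical names introduced for illustration. The callback takes its arguments in the same order as the `search` calls in the snippet above (count, query pointer, k, distances, labels).

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <stdexcept>

// Hypothetical helper, not part of tsne-cuda or FAISS.
// The callback mirrors the argument order of the search() calls above:
// (n, x, k, distances, labels).
using SearchFn = std::function<void(int64_t n, const float* x, int64_t k,
                                    float* distances, int64_t* labels)>;

// Runs `search` over `num_points` row-major queries of width `num_dims`,
// at most `max_batch` queries per call. Each batch writes into its own
// slice of the output arrays (k entries per query).
// Returns the number of calls made.
inline int64_t search_in_batches(const SearchFn& search, int64_t num_points,
                                 int64_t num_dims, int64_t k,
                                 const float* points, float* distances,
                                 int64_t* indices, int64_t max_batch) {
    if (max_batch <= 0) {
        throw std::invalid_argument("max_batch must be positive");
    }
    int64_t calls = 0;
    for (int64_t start = 0; start < num_points; start += max_batch) {
        // The final batch may be shorter than max_batch.
        const int64_t n = std::min(max_batch, num_points - start);
        search(n, points + start * num_dims, k,
               distances + start * k, indices + start * k);
        ++calls;
    }
    return calls;
}
```

In the GPU branch above, the callback would be a lambda that forwards to `search_index->search(n, x, k, d, l)`. Whether batching actually avoids the assertion depends on which launch parameter FAISS is overflowing, so treat this as something to try rather than a confirmed fix.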

DavidMChan commented 3 years ago

Second update: The problem doesn't seem to be limited to the IVFFlat index. The IVFPQ index fails with the same CUDA error 9, this time raised from a different function:

Starting TSNE calculation with 500000 points.
Initializing cuda handles... done.
KNN Computation... Faiss assertion 'err__ == cudaSuccess' failed in void faiss::gpu::runTransposeAny(faiss::gpu::Tensor<OtherT, OtherDim, true, int, faiss::gpu::traits::DefaultPtrTraits>&, int, int, faiss::gpu::Tensor<OtherT, OtherDim, true, int, faiss::gpu::traits::DefaultPtrTraits>&, cudaStream_t) [with T = float; int Dim = 3; cudaStream_t = CUstream_st*] at /home/davidchan/Repos/faiss/faiss/gpu/utils/Transpose.cuh:218; details: CUDA error 9 invalid configuration argument
Aborted (core dumped)
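
For context, CUDA error 9 is cudaErrorInvalidConfiguration: the kernel launch asked for a grid or block shape the device cannot provide. A plausible but unconfirmed cause in both traces is a launch dimension that scales with the number of query points, since 500,000 exceeds the 65,535 limit on the y and z grid dimensions. The sketch below only illustrates that arithmetic. `Dim3` and `fits_grid_limits` are hypothetical stand-ins, not FAISS or CUDA runtime code, and the limits are the ones documented for compute capability 3.0 and later.

```cpp
#include <cstdint>

// Hypothetical stand-in for CUDA's dim3, so the check runs without a GPU.
struct Dim3 {
    int64_t x;
    int64_t y;
    int64_t z;
};

// Documented grid-size limits for compute capability 3.0 and later.
constexpr int64_t kMaxGridX = 2147483647;  // 2^31 - 1
constexpr int64_t kMaxGridYZ = 65535;

// True if a grid of this shape is within the launch limits above.
constexpr bool fits_grid_limits(const Dim3& g) {
    return g.x >= 1 && g.x <= kMaxGridX &&
           g.y >= 1 && g.y <= kMaxGridYZ &&
           g.z >= 1 && g.z <= kMaxGridYZ;
}
```

Under that assumption, a grid shaped (20, 500000, 1) is rejected while (20, 65535, 1) is accepted, which would be consistent with the failure appearing only for large point counts.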

kernfel commented 3 years ago

Got my build issues under control and can confirm that FAISS v1.6.5 does not have this issue.

DavidMChan commented 1 year ago

Resolved in latest.

CannyLab / tsne-cuda

tsnecuda fails with a large number of points using FAISS 1.7 #98