facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

Assertion fails in compute_centroids() #2619

Closed EmielBoss closed 1 year ago

EmielBoss commented 1 year ago

Summary

An assertion fails, namely assert(ci >= 0 && ci < k + k_frozen); in compute_centroids() in Clustering.cpp. In the debugger I see that ci is indeed -1 for every i in the array assign. As far as I can tell, this means that the datapoints/training vectors were not assigned to any centroid? I am not very good at reading code, and tracing this back further is rather cumbersome for me, so I was wondering if anyone can tell me a likely cause.

I have used Faiss with other datasets (EMNIST, self-generated datasets, etc.) without issues, but this assertion fails when I try it on hyperspectral images (using only labeled texels as datapoints).

Faiss version: 1.7.1 (I use vcpkg and I don't know which commit it uses, but it's the one associated with this commit of vcpkg.)

Installed from: vcpkg, compiled with Visual Studio Community 2022 compiler

Faiss compilation options: I simply use vcpkg's manifest mode like this:

"dependencies": [
...
 {
   "name": "faiss",
   "features": [ "gpu" ]
 }
]

Reproduction instructions

#include <exception>
#include <stdexcept>
#include <cstdlib>
#include <string>
#include <vector>
#include <cmath>
#include <iostream>
#include <fstream>
#include <cuda_runtime.h>
#include <faiss/gpu/StandardGpuResources.h>
#include <faiss/gpu/GpuIndexIVFFlat.h>

using uint = unsigned int;

int main(int argc, char** argv) {

  try {
    const uint n = 10249;
    const uint d = 220;

    // Read data
    std::vector<float> data(n * d);
    std::ifstream ifs("./indian_pines_10249n_220d.txt", std::ios::in | std::ios::binary);
    if (!ifs) { throw std::runtime_error("Input file cannot be accessed."); }
    // ifs.read((char *) data.data(), data.size() * sizeof(float)); // For binary data
    // For text data
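    std::size_t counter = 0;
    std::string token;
    // Values are separated by '|'; stop once the buffer is full to avoid writing out of bounds
    while (counter < data.size() && std::getline(ifs, token, '|')) {
      data[counter++] = std::stof(token); // stof instead of stoi, so fractional values are not truncated
    }
    if (counter != data.size()) { throw std::runtime_error("Input file holds fewer than n * d values."); }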

    faiss::gpu::StandardGpuResources faissResources;
    faiss::gpu::GpuIndexIVFFlatConfig faissConfig;
    faissConfig.device = 0;
    faissConfig.indicesOptions = faiss::gpu::INDICES_32_BIT;
    faissConfig.flatConfig.useFloat16 = true;
    faissConfig.interleavedLayout = false;

    faiss::gpu::GpuIndexIVFFlat faissIndex(
      &faissResources,
      d,
      2 * static_cast<uint>(std::sqrt(n)),
      faiss::METRIC_L2,
      faissConfig
    );
    faissIndex.setNumProbes(12);
    faissIndex.train(n, data.data());

  } catch (const std::exception& e) {
    std::cerr << e.what() << std::endl;
    return EXIT_FAILURE;
  }
  return EXIT_SUCCESS;
}

The problem happens for the currently hardcoded indian_pines_10249n_220d.txt file as well as paviaU_42776n_103d.txt. Please change the hardcoded n and d variables (with values from the filename) when trying another dataset. Other datasets that I encoded in a similar manner work fine.
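
Since every entry of assign is -1, one cheap check before calling train() is to scan the input buffer for non-finite values and look at its value range. The sketch below is only a diagnostic aid: summarize, DataSummary, fitsInHalf and kMaxHalf are hypothetical names, not part of Faiss. The 65504 threshold is the largest finite half-precision value and is included only because the configuration above sets useFloat16 = true; whether the value range is what triggers the assertion is an assumption, not something established in this thread.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not a Faiss API): summary of a flat float buffer.
struct DataSummary {
  std::size_t nonFinite = 0;  // number of NaN or +/-Inf entries
  float minValue = std::numeric_limits<float>::infinity();
  float maxValue = -std::numeric_limits<float>::infinity();
};

// Counts non-finite entries and tracks the min/max of the finite ones.
inline DataSummary summarize(const std::vector<float>& data) {
  DataSummary s;
  for (float v : data) {
    if (!std::isfinite(v)) {
      ++s.nonFinite;
      continue;
    }
    if (v < s.minValue) s.minValue = v;
    if (v > s.maxValue) s.maxValue = v;
  }
  return s;
}

// Largest finite IEEE 754 half-precision (float16) value.
constexpr float kMaxHalf = 65504.0f;

// True if the buffer has no NaN/Inf and every value is representable
// as a finite float16. This checks the raw values only, not any
// intermediate results such as squared distances.
inline bool fitsInHalf(const DataSummary& s) {
  return s.nonFinite == 0 && s.minValue >= -kMaxHalf && s.maxValue <= kMaxHalf;
}
```

In the reproduction program this would be called as summarize(data) right after the read loop, printing minValue, maxValue and nonFinite before faissIndex.train(n, data.data()).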

mdouze commented 1 year ago

Sorry, it is not straightforward (Faiss is not supposed to crash). Please post reproduction code.

EmielBoss commented 1 year ago

I updated the original post.

EmielBoss commented 1 year ago

Normalizing my input data between 0 and 1 resolved the problem.
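
For anyone landing here with the same assertion, a minimal sketch of such a normalization follows. It assumes a single global minimum and maximum over the whole buffer; the thread does not say whether the scaling was global or per dimension, and normalizeMinMax is a hypothetical helper, not a Faiss function.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper (not a Faiss API): rescales all values in place
// to [0, 1] using one global minimum and maximum.
inline void normalizeMinMax(std::vector<float>& data) {
  if (data.empty()) return;

  auto mm = std::minmax_element(data.begin(), data.end());
  // Copy the extremes before modifying the buffer they point into.
  const float lo = *mm.first;
  const float range = *mm.second - lo;

  if (range == 0.0f) {
    // All values are identical; map them to 0 to avoid dividing by zero.
    std::fill(data.begin(), data.end(), 0.0f);
    return;
  }
  for (float& v : data) {
    v = (v - lo) / range;
  }
}
```

In the reproduction program this would run on data right before faissIndex.train(n, data.data()). Any vectors added or searched later would need the same offset and range applied, so in practice those two values should be kept rather than recomputed.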