facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

code_size mismatch between index and faiss.extract_index_ivf(index).invlists in case of IndexIVFPQFastScan. #2407

Open ololo123321 opened 2 years ago

ololo123321 commented 2 years ago

Summary

Hello! I noticed some strange behavior with the code_size of IndexIVFPQFastScan:

import faiss
index = faiss.index_factory(128, "IVF1024,PQ64x4fs", faiss.METRIC_INNER_PRODUCT)
index_ivf = faiss.extract_index_ivf(index)
assert index.code_size == 64 * 4 // 8
assert index_ivf.code_size == index.code_size
assert index_ivf.code_size != index_ivf.invlists.code_size
assert index_ivf.invlists.code_size == 2 ** 64 - 1

The 3rd and 4th asserts surprise me. The problem is that this mismatch causes the following error in the function faiss.contrib.ondisk.merge_ondisk:

RuntimeError: Error in size_t faiss::OnDiskInvertedLists::merge_from(const faiss::InvertedLists**, int, bool) at /project/faiss/faiss/invlists/OnDiskInvertedLists.cpp:578: Error: 'il->nlist == nlist && il->code_size == code_size' failed
  1. Could you please explain the last assert?
  2. If this behavior is expected, maybe one more type check should be added here?

Platform

OS: macOS 12.4, ubuntu 20.04

Faiss version: 1.7.1, 1.7.2

Installed from: conda, pip

Reproduction instructions

import faiss
index = faiss.index_factory(128, "IVF1024,PQ64x4fs", faiss.METRIC_INNER_PRODUCT)
index_ivf = faiss.extract_index_ivf(index)
assert index_ivf.code_size == index_ivf.invlists.code_size

BarclayII commented 2 years ago

code_size == 2 ** 64 - 1 (or code_size == (unsigned long)(-1)) means that the code size is invalid for the inverted lists. That's because IndexIVFPQFastScan uses BlockInvertedLists (set up in IndexIVFFastScan::init_fastscan()), for which code_size is not valid (as noted in a comment in faiss::InvertedLists). So I guess merge_from can't be used for indices that use fast scan.
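
For anyone wondering where the number comes from, here is a minimal sketch of the wrap-around behind it. It uses only Python's ctypes and assumes a 64-bit size_t, as on the platforms listed in this issue; the variable name is just for illustration and is not a faiss identifier.

```python
import ctypes

# Storing -1 in an unsigned 64-bit integer wraps around to the largest
# representable value. Assumption: size_t is 64 bits wide on this platform.
invalid_code_size = ctypes.c_uint64(-1).value

# Same value as the one seen on index_ivf.invlists.code_size above.
print(invalid_code_size == 2 ** 64 - 1)
```

So the value is a sentinel for "no valid code size", not a real size in bytes.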

ololo123321 commented 2 years ago

Thanks for the reply! Yes, I found this comment, and now it's clear to me why code_size has this value. Interestingly, if I add the line ivf.code_size = index_ivf.code_size before this one, the index is created and search runs, but the search results are random.

mdouze commented 2 years ago

To clarify: the TODO is to implement merge (and remove, while we are at it) for fast_scan indexes.
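
Until that lands, a caller could fail early with a clearer message than the merge_from assertion. The sketch below is only an illustration: check_mergeable and INVALID_CODE_SIZE are hypothetical names defined here, not part of faiss or faiss.contrib.ondisk, and stand-in objects are used so it runs without faiss installed. It relies only on the code_size and invlists.code_size attributes shown earlier in this thread.

```python
from types import SimpleNamespace

# Sentinel value reported in this thread for invlists.code_size on
# fast-scan indexes (hypothetical constant name, defined locally).
INVALID_CODE_SIZE = 2 ** 64 - 1


def check_mergeable(index_ivf):
    """Hypothetical pre-check: raise ValueError if a merge would fail."""
    il_code_size = index_ivf.invlists.code_size
    if il_code_size == INVALID_CODE_SIZE:
        raise ValueError("inverted lists have no valid code_size (fast-scan index?)")
    if il_code_size != index_ivf.code_size:
        raise ValueError(
            f"code_size mismatch: {index_ivf.code_size} != {il_code_size}"
        )


# Stand-ins for the result of faiss.extract_index_ivf(index).
fast_scan_like = SimpleNamespace(
    code_size=32, invlists=SimpleNamespace(code_size=INVALID_CODE_SIZE)
)
regular_like = SimpleNamespace(code_size=32, invlists=SimpleNamespace(code_size=32))

check_mergeable(regular_like)  # passes silently
try:
    check_mergeable(fast_scan_like)
except ValueError as err:
    print(err)
```

With a real index, the same function could be called on faiss.extract_index_ivf(index) before passing it to merge_ondisk.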