facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License
31.58k stars 3.65k forks source link

How to compare sub vector codes in FAISS IndexIVFPQ #3844

Closed ashleyabraham closed 2 months ago

ashleyabraham commented 2 months ago

I have a use case where I need to compare a given query vector's PQ code to the results PQ codes to see how many sub vectors match. I use M=5, where M is the # of sub vectors, I am expecting the encoded codes to have 5 centroids, but when I use sa_encode I get more than 5, I think it is from the IVF or includes IVF. Is sa_encode() the right way to encode to compare sub vector to see how many of them match. I tried index.pq.compute_codes() and index.pq.decode() the recall error rates are too big. The sa_encode and sa_decode recall rates are so much better. Appreciate any help! Thanks

mengdilin commented 2 months ago

The official API for encoding and decoding is via sa_encode and sa_decode. See wiki for examples of usage: https://github.com/facebookresearch/faiss/wiki/Vector-codecs

You are right, the reason returned codes from IVFPQ differs from M is due to the fact that we prepend the centroid id in front of the code. This is why you should always use sa_encode and sa_decode side-by-side so you don't have to worry about internal details like this.

Do you have a repro for low recall when using PQ's compute_codes() and decode() when compared against IVFPQ's standard encoding APIs? From reading the implementations, both heavily share the underlying implementations

ashleyabraham commented 2 months ago

Here is small example, encoded and decoded using stand alone(SA) and index.pq and calculated the RMSE.

I was expecting the IVFPQ and PQ code to be same except for the IVF ID, but it is not.

import faiss
import numpy as np

def calculate_rmse(y_pred, y_true):
  mse = np.mean((y_pred - y_true) ** 2)
  rmse = np.sqrt(mse)
  return rmse

np.random.seed(123)

# Generate sample data
n_samples, dim = 100000, 15
X = np.random.rand(n_samples, dim).astype(np.float32)

# Create an inverted files index with PQ quantization
nlist = 1000  # Number of inverted lists
m = 5        # Number of subquantizers
quantizer = faiss.IndexFlatL2(dim)  # Quantizer index
index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, 8)  # 8 bits per subquantizer

# Add data to the index
index.train(X)  # Train the index
index.add(X)

# Search for nearest neighbors
k = 10
id = 99696
original_vec = X[id:id+1]

pq_coded = index.pq.compute_codes(original_vec)
pq_decoded = index.pq.decode(pq_coded)

print('PQ RMSE:', calculate_rmse(pq_decoded, original_vec)) 

sa_coded = index.sa_encode(original_vec)
sa_decoded = index.sa_decode(sa_coded)

#stand Alone
print('SA RMSE: ', calculate_rmse(sa_decoded, original_vec)) 
PQ RMSE: 0.36062226
SA RMSE:  0.049256746
mengdilin commented 2 months ago

It looks like the recall difference comes from encoding and decoding by residual. By default, IndexIVFPQ will encode/decode by residual while encoding directly with pq always means not by residual.

You can get the same recall when you turn off residual

index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, 8)  # 8 bits per subquantizer
index.by_residual = False
mdouze commented 2 months ago

In IVFPQ, the PQ encodes the residual, which is the vector relative to the centroid it is assigned to (unless by_residual is set to false), see Product quantization for NN search, IV.C. This is generally more accurate than directly encoding the vector with the PQ. In this case, it is incorrect to apply PQ to the vectors themselves because the PQ was trained to handle residuals, so it results in a high reconstruction error. Calling sa_encode / sa_decode generates PQ codes prefixed with the centroid id encoded on 2 bytes (thus sa_coded is also larger than pq_coded)

ashleyabraham commented 2 months ago

That makes sense, appreciate the clarification, thank you!

On Thu, Sep 12, 2024 at 10:31 PM Matthijs Douze @.***> wrote:

In IVFPQ, the PQ encodes the residual, which is the vector relative to the centroid it is assigned to (unless by_residual is set to false), see Product quantization for NN search https://inria.hal.science/inria-00514462/file/jegou_pq_postprint.pdf, IV.C. This is generally more accurate than directly encoding the vector with the PQ. In this case, it is incorrect to apply PQ to the vectors themselves because the PQ was trained to handle residuals, so it results in a high reconstruction error. Calling sa_encode / sa_decode generates PQ codes prefixed with the centroid id encoded on 2 bytes (thus sa_coded is also larger than pq_coded)

— Reply to this email directly, view it on GitHub https://github.com/facebookresearch/faiss/issues/3844#issuecomment-2347967576, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACXC25FQT6M5ZKBRHNUK4WDZWJMC7AVCNFSM6AAAAABNZQN6HSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNBXHE3DONJXGY . You are receiving this because you authored the thread.Message ID: @.***>