Closed ashleyabraham closed 2 months ago
The official API for encoding and decoding is via sa_encode
and sa_decode
. See wiki for examples of usage: https://github.com/facebookresearch/faiss/wiki/Vector-codecs
You are right, the reason returned codes from IVFPQ differs from M
is due to the fact that we prepend the centroid id in front of the code. This is why you should always use sa_encode
and sa_decode
side-by-side so you don't have to worry about internal details like this.
Do you have a repro for low recall when using PQ's compute_codes()
and decode()
when compared against IVFPQ's standard encoding APIs? From reading the implementations, both heavily share the underlying implementations
Here is small example, encoded and decoded using stand alone(SA) and index.pq and calculated the RMSE.
I was expecting the IVFPQ and PQ code to be same except for the IVF ID, but it is not.
import faiss
import numpy as np
def calculate_rmse(y_pred, y_true):
mse = np.mean((y_pred - y_true) ** 2)
rmse = np.sqrt(mse)
return rmse
np.random.seed(123)
# Generate sample data
n_samples, dim = 100000, 15
X = np.random.rand(n_samples, dim).astype(np.float32)
# Create an inverted files index with PQ quantization
nlist = 1000 # Number of inverted lists
m = 5 # Number of subquantizers
quantizer = faiss.IndexFlatL2(dim) # Quantizer index
index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, 8) # 8 bits per subquantizer
# Add data to the index
index.train(X) # Train the index
index.add(X)
# Search for nearest neighbors
k = 10
id = 99696
original_vec = X[id:id+1]
pq_coded = index.pq.compute_codes(original_vec)
pq_decoded = index.pq.decode(pq_coded)
print('PQ RMSE:', calculate_rmse(pq_decoded, original_vec))
sa_coded = index.sa_encode(original_vec)
sa_decoded = index.sa_decode(sa_coded)
#stand Alone
print('SA RMSE: ', calculate_rmse(sa_decoded, original_vec))
PQ RMSE: 0.36062226
SA RMSE: 0.049256746
It looks like the recall difference comes from encoding and decoding by residual. By default, IndexIVFPQ will encode/decode by residual while encoding directly with pq always means not by residual.
You can get the same recall when you turn off residual
index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, 8) # 8 bits per subquantizer
index.by_residual = False
In IVFPQ, the PQ encodes the residual, which is the vector relative to the centroid it is assigned to (unless by_residual
is set to false), see Product quantization for NN search, IV.C. This is generally more accurate than directly encoding the vector with the PQ.
In this case, it is incorrect to apply PQ to the vectors themselves because the PQ was trained to handle residuals, so it results in a high reconstruction error.
Calling sa_encode
/ sa_decode
generates PQ codes prefixed with the centroid id encoded on 2 bytes (thus sa_coded is also larger than pq_coded)
That makes sense, appreciate the clarification, thank you!
On Thu, Sep 12, 2024 at 10:31 PM Matthijs Douze @.***> wrote:
In IVFPQ, the PQ encodes the residual, which is the vector relative to the centroid it is assigned to (unless by_residual is set to false), see Product quantization for NN search https://inria.hal.science/inria-00514462/file/jegou_pq_postprint.pdf, IV.C. This is generally more accurate than directly encoding the vector with the PQ. In this case, it is incorrect to apply PQ to the vectors themselves because the PQ was trained to handle residuals, so it results in a high reconstruction error. Calling sa_encode / sa_decode generates PQ codes prefixed with the centroid id encoded on 2 bytes (thus sa_coded is also larger than pq_coded)
— Reply to this email directly, view it on GitHub https://github.com/facebookresearch/faiss/issues/3844#issuecomment-2347967576, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACXC25FQT6M5ZKBRHNUK4WDZWJMC7AVCNFSM6AAAAABNZQN6HSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNBXHE3DONJXGY . You are receiving this because you authored the thread.Message ID: @.***>
I have a use case where I need to compare a given query vector's PQ code to the results PQ codes to see how many sub vectors match. I use M=5, where M is the # of sub vectors, I am expecting the encoded codes to have 5 centroids, but when I use sa_encode I get more than 5, I think it is from the IVF or includes IVF. Is
sa_encode()
the right way to encode to compare sub vector to see how many of them match. I tried index.pq.compute_codes() and index.pq.decode() the recall error rates are too big. The sa_encode and sa_decode recall rates are so much better. Appreciate any help! Thanks