facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

Wrong variable description in wiki Faiss building blocks: clustering, PCA, quantization? #2709

Open gitgithan opened 1 year ago

gitgithan commented 1 year ago

I'm referring to the code at https://github.com/facebookresearch/faiss/wiki/Faiss-building-blocks:-clustering,-PCA,-quantization#example-pq-encoding--decoding

import numpy as np
import faiss

d = 32  # data dimension
cs = 4  # code size (bytes)

# train set 
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt)

# encode 
codes = pq.compute_codes(x)

# decode
x2 = pq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()

The 2nd and 3rd inputs to faiss.ProductQuantizer here are cs and 8; however, checking https://github.com/bitsun/faiss-windows/blob/master/ProductQuantizer.h#L27 , the three parameters are d, M, nbits.

Problem

I think the 2nd parameter M is the number of segments, not the code size cs as expressed on the wiki.

Other related questions

  1. Why, when I change the 3rd input from 8 to 4, does codes.shape drop from 4 columns to 2? I thought the number of columns in codes is exactly M (the 2nd input) and should not vary with the 3rd input.
  2. Is there supposed to be some interaction between M and nbits that determines the number of columns in codes?
  3. Why is codes.max() 15 when nbits (the 3rd parameter) is 1 but 255 when nbits > 2? I'm expecting codes.max() to be exactly 2**nbits.
mdouze commented 1 year ago

The codes are packed, see #2285

gitgithan commented 1 year ago

Thanks, that explains all my shape problems.

code_size vs M

On code_size, I see inconsistent interpretations at https://github.com/facebookresearch/faiss/wiki/Faiss-indexes. In the PQ row, Bytes/Vector column of the indexes table, there is ceil(M * nbit / 8). This contributed to me raising this issue, thinking code_size should be called M. I also agree with the use of code_size in this statement: "Flat indexes just encode the vectors into codes of a fixed size and store them in an array of ntotal * code_size bytes." However, later down that page, in index = faiss.IndexIVFPQ(coarse_quantizer, d, ncentroids, code_size, 8), I believe code_size should be M instead.

I found the clearest disambiguation of code size and M in this statement at https://github.com/facebookresearch/faiss/wiki/FAQ#is-pq-encoding-with-more-or-less-than-8-bits-per-components-supported: "The code size of the M PQ indices is rounded up to a whole number of bytes, ie. PQ3x4 uses 2 bytes."
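That rounding rule from the indexes table can be sketched in plain Python (a sanity check on ceil(M * nbit / 8), not faiss code):

```python
def pq_code_size(M: int, nbits: int) -> int:
    """Bytes needed to pack M sub-quantizer indices of nbits bits each."""
    return (M * nbits + 7) // 8  # integer ceil(M * nbits / 8)

print(pq_code_size(4, 8))  # 4 bytes -> 4 columns in codes
print(pq_code_size(4, 4))  # 2 bytes -> 2 columns
print(pq_code_size(3, 4))  # PQ3x4 -> 2 bytes, matching the FAQ
```

This also accounts for the column counts in the questions above: with M = 4 the code array is 4 bytes wide at nbits = 8 but only 2 bytes wide at nbits = 4.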

Verifying BitstringReader

To verify what BitstringReader is doing, I tried my own slicing of the bytes: join the packed codes from faiss into one bit string, split it every nbits bits, and expect to see the same numbers from BitstringReader and from my code:

def convert_codes(codes, nbits):
    print('First Code: ', codes[0])

    binary = [f'{c:08b}' for c in codes[0]]  # always printing first code only as example
    print('Binary: ',binary)

    joined_binary = ''.join(binary)
    split_binary = [joined_binary[i:i+nbits] for i in range(0, len(joined_binary), nbits)]
    print('Split Binary: ',split_binary)

    return [int(b,2) for b in split_binary]

convert_codes(codes,nbits)
bs = faiss.BitstringReader(faiss.swig_ptr(codes[0]), codes.shape[1])
for i in range(cs): 
    print(bs.read(nbits))

Seeded random inputs

d = 32  # data dimension
cs = 4  # code size (bytes)
nbits = 8
# train set 
nt = 10000

np.random.seed(0)
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

pq = faiss.ProductQuantizer(d, cs, nbits)
pq.train(xt)

What's strange is that the result depends on nbits:

nbits = 8: I match the faiss output

First Code:  [161 140   7 114]
Binary:  ['10100001', '10001100', '00000111', '01110010']
Split Binary:  ['10100001', '10001100', '00000111', '01110010']
[161, 140, 7, 114]
161
140
7
114

nbits = 4: I match the values but in the wrong order (not sure what the 2nd parameter of faiss.BitstringReader is doing?)

First Code:  [ 89 246]
Binary:  ['01011001', '11110110']
Split Binary:  ['0101', '1001', '1111', '0110']
[5, 9, 15, 6]
9
5
6
15

nbits = 3: I get completely different codes (both length and values)

First Code:  [82 15]
Binary:  ['01010010', '00001111']
Split Binary:  ['010', '100', '100', '000', '111', '1']
[2, 4, 4, 0, 7, 1]
2
2
5
7

What explains the previous 3 scenarios? I suspect it has to do with whether 0s are prepended or appended for values < 128 that don't take up a full 8 bits, but even considering this I still can't explain the patterns above.
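One explanation consistent with all three outputs (an inference from the printed values, not verified against the faiss source): BitstringReader consumes bits least-significant-bit first within each byte, whereas the string slicing above treats each byte as most-significant-bit first. A pure-Python reader using the LSB-first convention reproduces every faiss output shown above:

```python
def read_lsb_first(code_bytes, nbits, n_fields):
    """Unpack n_fields values of nbits bits each from a list of bytes,
    consuming bits LSB-first within each byte (the order that appears
    to match BitstringReader's output)."""
    values = []
    bit_pos = 0
    for _ in range(n_fields):
        v = 0
        for k in range(nbits):
            byte, offset = divmod(bit_pos, 8)
            v |= ((code_bytes[byte] >> offset) & 1) << k
            bit_pos += 1
        values.append(v)
    return values

print(read_lsb_first([161, 140, 7, 114], 8, 4))  # [161, 140, 7, 114]
print(read_lsb_first([89, 246], 4, 4))           # [9, 5, 6, 15]
print(read_lsb_first([82, 15], 3, 4))            # [2, 2, 5, 7]
```

At nbits = 8 both conventions coincide (a whole byte reads the same either way); at nbits = 4 the two nibbles of each byte swap, which is the "wrong order" seen above; at nbits = 3 fields straddle byte boundaries, so MSB-first slicing produces entirely different values.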

Why pq.compute_codes?

Also, in what cases do users use the codes from pq.compute_codes(x)? I have only seen them used in pq.decode(codes) so far. They are unintuitive because the numbers don't indicate which k-th centroid is in which segment, so why not directly expose the outputs of BitstringReader? Knowing which k-th centroid is in which segment would help make use of the results from faiss.vector_to_array(pq.centroids).reshape(pq.M, pq.ksub, pq.dsub) to find relationships between each database vector and all the centroids from all segments.
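For what it's worth, once the indices are unpacked, decoding is just a per-segment table lookup into that reshaped centroid array. A minimal numpy sketch (the shapes mirror pq.M, pq.ksub, pq.dsub for d = 32, cs = 4, nbits = 8; the centroid values here are random placeholders, not a trained quantizer):

```python
import numpy as np

M, ksub, dsub = 4, 256, 8  # segments, centroids per segment, dims per segment
rng = np.random.default_rng(0)
# stand-in for faiss.vector_to_array(pq.centroids).reshape(pq.M, pq.ksub, pq.dsub)
centroids = rng.random((M, ksub, dsub)).astype('float32')

indices = np.array([161, 140, 7, 114])  # unpacked PQ indices for one vector
# reconstruct the vector segment by segment: centroid indices[m] of segment m
x2 = np.concatenate([centroids[m, indices[m]] for m in range(M)])
print(x2.shape)  # (32,)
```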

Wiki improvements

Could the wiki have some comments added so new readers wouldn't face these same questions?