Open gitgithan opened 1 year ago
The codes are packed, see #2285
Thanks that explains all my shape problems.
code_size
vs M
On code_size
, i see inconsistent interpretations at
https://github.com/facebookresearch/faiss/wiki/Faiss-indexes.
In the PQ row and Bytes/Vector column of the Indexes table, there is ceil(M * nbit / 8)
. This contributed to me raising this issue, thinking code_size
should be called M
.
I also agree with the use of code_size
in this statement Flat indexes just encode the vectors into codes of a fixed size and store them in an array of ntotal * code_size bytes.
However later down that page, index = faiss.IndexIVFPQ (coarse_quantizer, d, ncentroids, code_size, 8)
I believe code_size
here should be M instead.
https://github.com/facebookresearch/faiss/wiki/FAQ#is-pq-encoding-with-more-or-less-than-8-bits-per-components-supported.
I found the clearest disambiguation of code size and M in this statement on this page:
The code size of the M PQ indices is rounded up to a whole number of bytes, ie. PQ3x4 uses 2 bytes;
To verify what BitstringReader is doing, i tried doing my own slicing of bytes by joining the packed codes from faiss and splitting every nbits, and expect to see the same numbers produced from BitstringReader and my code:
def convert_codes(codes, nbits):
print('First Code: ', codes[0])
binary = [f'{c:08b}' for c in codes[0]] # always printing first code only as example
print('Binary: ',binary)
joined_binary = ''.join(binary)
split_binary = [joined_binary[i:i+nbits] for i in range(0, len(joined_binary), nbits)]
print('Split Binary: ',split_binary)
return [int(b,2) for b in split_binary]
convert_codes(codes,nbits)
bs = faiss.BitstringReader(faiss.swig_ptr(codes[0]), codes.shape[1])
for i in range(cs):
print(bs.read(nbits))
Seeded random inputs
d = 32 # data dimension
cs = 4 # code size (bytes)
nbits = 8
# train set
nt = 10000
np.random.seed(0)
xt = np.random.rand(nt, d).astype('float32')
# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')
pq = faiss.ProductQuantizer(d, cs, nbits)
pq.train(xt)
What's strange is depending on nbits:
nbits = 8 i match faiss output
First Code: [161 140 7 114]
Binary: ['10100001', '10001100', '00000111', '01110010']
Split Binary: ['10100001', '10001100', '00000111', '01110010']
[161, 140, 7, 114]
161
140
7
114
nbits = 4 i match the values but wrong order (Not sure what 2nd parameter of faiss.BitstringReader
is doing?)
First Code: [ 89 246]
Binary: ['01011001', '11110110']
Split Binary: ['0101', '1001', '1111', '0110']
[5, 9, 15, 6]
9
5
6
15
nbits = 3 i get completely different (both length and value) codes
First Code: [82 15]
Binary: ['01010010', '00001111']
Split Binary: ['010', '100', '100', '000', '111', '1']
[2, 4, 4, 0, 7, 1]
2
2
5
7
What explains the previous 3 scenarios? I feel it got to do with whether i prepend 0s or append 0s for values < 128 that don't take up a full 8 bits, but considering this I still can't explain above patterns.
Also in what cases do users use the codes from pq.compute_codes(x)
? I only see them used in pq.decode(codes)
so far. They are unintuitive because the numbers don't indicate which k-th centroid is in which segment , so why not directly expose the outputs of BitstringReader? Knowing which k-th centroid is in which segment would help make use of results from faiss.vector_to_array(pq.centroids).reshape(pq.M, pq.ksub, pq.dsub)
to find relationships between each database vector and the all centroids from all segments.
Could the wiki have some comments added so new readers wouldn't face these same questions.
I'm refering to the code at https://github.com/facebookresearch/faiss/wiki/Faiss-building-blocks:-clustering,-PCA,-quantization#example-pq-encoding--decoding
The 2nd and 3rd inputs in
faiss.ProductQuantizer
were cs and 8, however when checking https://github.com/bitsun/faiss-windows/blob/master/ProductQuantizer.h#L27 , the 3 inputs are d, M, nbits.Problem
I think the 2nd parameter M is number of segments and not
cs
code size as expressed on the wikiOther related questions