facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License
31.43k stars 3.64k forks source link

Question: relationship between PCA and PQ parameters? Trying to get intuition... #248

Closed billkle1n closed 6 years ago

billkle1n commented 7 years ago

Sorry, this is not really an issue, more a theoretical question.

Let's say that I have a 2048D vector (for the sake of argument, let's assume that the vector is uint8 data type) and want to reduce that representation down to 16 bytes. On one end of the scale, I could apply PCA (or other dimensionality reduction technique) to get 16 dimensions. On the other end, I could use PQ directly on the 2048D vectors and obtain 16 byte codes.

There is probably a sweet spot between those two extremes (e.g. PCA to 128D and then PQ). That sweet spot could probably be found through parameter search and testing on a validation set.

What bothers me is I have zero intuition for what those parameters might be.

mdouze commented 7 years ago

Hi,

A practical answer: as a rule of thumb for a code of size c, first apply a OPQ transform (not PCA) to 4*c or 8*c, then encode with PQ.

opq = OPQMatrix(2048, 16, 4*16) 
pq = ProductQuantizer(4*16, 16, 8) 
opq.train(x)
xt = opq.apply_py(x)
pq.train(xt)
codes = pq.compute_codes(xt)

To decode:

x_decoded = opq.reverse_transform(pq.decode(codes))
mdouze commented 6 years ago

Closing.