Hi! I want to get cosine similarity between vectors. I expect the distances of the found vectors to be close to 1 (something like 0.99), but I get about 0.1.
The code and output are below. The returned ids are correct, but the distances are too small.
Platform
OS: Windows 11
Faiss version: 1.7.2
Installed from: pip
Faiss compilation options:
Running on:
[x] CPU
[ ] GPU
Interface:
[ ] C++
[x] Python
Reproduction instructions
import numpy as np
import faiss
from faiss import normalize_L2
dim = 512 # dimension
nb = 65536 # size of dataset
np.random.seed(228)
vectors = np.random.random((nb, dim)).astype('float32')
query = vectors[:5]
ids = np.array(range(0, nb)).astype(np.int64)
M = 64
D = M * 4
clusters = 4096 # ~16 * math.sqrt(nb)
vector_size = D * 4 + M * 2 * 4
total_size_gb = round(vector_size * nb / (1024**3), 2)
factory = f"IDMap,OPQ{M}_{D},IVF{clusters}_HNSW32,PQ{M}"
print(f"factory: {factory}, {vector_size} bytes per vector, {total_size_gb} gb total")
faiss.omp_set_num_threads(10)
index = faiss.index_factory(dim, factory, faiss.METRIC_INNER_PRODUCT)
normalize_L2(vectors)
index.train(vectors)
print(f'Index trained')
index.add_with_ids(vectors, ids)
print(f'{index.ntotal} vectors have been added to index')
k = 1
nprobe = 1
normalize_L2(query)
index.nprobe = nprobe
dist, idx = index.search(query, k)
print(idx)
print(dist)
OUTPUT:
factory: IDMap,OPQ64_256,IVF4096_HNSW32,PQ64, 1536 bytes per vector, 0.09 gb total
Index trained
65536 vectors have been added to index
[[0]
[1]
[2]
[3]
[4]]
[[0.11132257]
[0.13959643]
[0.13129388]
[0.12439864]
[0.1243098 ]]
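For reference, this is a minimal sketch of how the exact cosine similarity for the returned ids could be recomputed with NumPy alone, to compare against the `dist` values printed by the index. The helper name `exact_cosine` is hypothetical, the dataset here is smaller than in the repro, and `idx` is a stand-in for the ids returned by `index.search`, not real Faiss output.

```python
import numpy as np

def exact_cosine(queries, database, idx):
    """Recompute cosine similarity between each query and the database rows in idx.

    queries:  (nq, dim) float array
    database: (nb, dim) float array
    idx:      (nq, k) integer array of row positions, as returned by a search
    """
    # Normalize copies so the caller's arrays are left untouched.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    # db[idx] has shape (nq, k, dim); take the dot product over the last axis.
    return np.einsum('qd,qkd->qk', q, db[idx])

rng = np.random.default_rng(228)
vectors = rng.random((1000, 512), dtype=np.float32)  # uniform in [0, 1), like the repro
query = vectors[:5].copy()

# Stand-in for the ids returned by index.search(query, 1): each query finds itself.
idx = np.arange(5).reshape(-1, 1)
exact = exact_cosine(query, vectors, idx)
print(exact)

# For comparison: similarity of each query to an unrelated vector.
other = exact_cosine(query, vectors, idx + 100)
print(other)
```

With the same `idx`, `exact` is what I would expect `dist` to be close to. For uniform random data like this, even unrelated pairs (`other`) should come out at roughly 0.75, so the reported 0.1 is far below both.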