facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

Cosine similarity is too small #2469

jump155 closed this issue 2 years ago

jump155 commented 2 years ago

Summary

Hi! I want to get the cosine similarity between vectors. I expect the distances of the found vectors to be close to 1 (something like 0.99), but I get about 0.1. Here are the code and the output. The ids are right, but the distances are small.

Platform

OS: Windows 11

Faiss version: 1.7.2

Installed from: pip

Faiss compilation options:

Running on:

Interface:

Reproduction instructions

    import numpy as np
    import faiss
    from faiss import normalize_L2

    dim = 512  # dimension
    nb = 65536  # size of dataset
    np.random.seed(228)
    vectors = np.random.random((nb, dim)).astype('float32')
    query = vectors[:5]
    ids = np.array(range(0, nb)).astype(np.int64)

    M = 64
    D = M * 4
    clusters = 4096  # ~16*math.sqrt(nb)
    vector_size = D * 4 + M * 2 * 4
    total_size_gb = round(vector_size * nb / (1024**3), 2)
    factory = f"IDMap,OPQ{M}_{D},IVF{clusters}_HNSW32,PQ{M}"
    print(f"factory: {factory}, {vector_size} bytes per vector, {total_size_gb} gb total")

    faiss.omp_set_num_threads(10)
    index = faiss.index_factory(dim, factory, faiss.METRIC_INNER_PRODUCT)
    normalize_L2(vectors)
    index.train(vectors)
    print(f'Index trained')
    index.add_with_ids(vectors, ids)
    print(f'{index.ntotal} vectors have been added to index')

    k = 1
    nprobe = 1
    normalize_L2(query)
    index.nprobe = nprobe
    dist, idx = index.search(query, k)
    print(idx)
    print(dist)

OUTPUT:

    factory: IDMap,OPQ64_256,IVF4096_HNSW32,PQ64, 1536 bytes per vector, 0.09 gb total
    Index trained
    65536 vectors have been added to index
    [[0]
     [1]
     [2]
     [3]
     [4]]
    [[0.11132257]
     [0.13959643]
     [0.13129388]
     [0.12439864]
     [0.1243098 ]]
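One way to sanity-check a result like this is to recompute the exact cosine similarity for the ids the index returned, outside of Faiss. The sketch below uses NumPy only. The helper name `exact_cosine` is hypothetical, and the `idx` array is a stand-in for what `index.search` returns in the report (each query matched to itself), not actual Faiss output.

```python
import numpy as np

def exact_cosine(queries, database, ids):
    """Exact cosine similarity between each query and the rows named in ids.

    queries:  (n_queries, dim) array
    database: (n_vectors, dim) array
    ids:      (n_queries, k) integer array, e.g. the ids from a search call
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    rows = database[ids]  # shape (n_queries, k, dim); fancy indexing copies
    rows = rows / np.linalg.norm(rows, axis=-1, keepdims=True)
    # Dot product of query i with each of its k returned rows.
    return np.einsum("qd,qkd->qk", q, rows)

# Small stand-in data set (the report uses 65536 x 512).
rng = np.random.default_rng(228)
vectors = rng.random((1000, 512), dtype=np.float32)
query = vectors[:5].copy()

# Assumed search result: every query finds itself, as in the reported ids.
idx = np.arange(5).reshape(-1, 1)

exact = exact_cosine(query, vectors, idx)
print(exact)
```

If the exact values are close to 1 while the index reports something much smaller, the gap comes from how the index approximates distances rather than from wrong neighbours.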

mdouze commented 2 years ago

This is normal, as the distances are approximate. If you increase M (the number of PQ sub-quantizers) or use SQ compression, the accuracy will improve.
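To get a feel for how lossy compression changes an inner product, here is a toy NumPy sketch of uniform per-dimension scalar quantization at several bit widths. It is not Faiss's implementation and does not reproduce the OPQ/IVF/PQ pipeline from the report. The function `scalar_quantize` and its `bits` parameter are hypothetical names used only for illustration.

```python
import numpy as np

def scalar_quantize(x, lo, hi, bits):
    """Quantize x per dimension to 2**bits uniform levels, then decode it."""
    levels = 2 ** bits - 1
    scale = np.where(hi > lo, hi - lo, 1.0)  # guard against zero range
    codes = np.clip(np.round((x - lo) / scale * levels), 0, levels)
    return codes / levels * scale + lo

rng = np.random.default_rng(228)
vectors = rng.random((2000, 512)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit length
query = vectors[:5]

# Per-dimension range, "trained" on the whole data set.
lo, hi = vectors.min(axis=0), vectors.max(axis=0)

errors = {}
for bits in (1, 2, 4, 8):
    decoded = scalar_quantize(vectors[:5], lo, hi, bits)
    # Inner product of each exact query with its own decoded (lossy) copy.
    sims = np.sum(query * decoded, axis=1)
    errors[bits] = float(np.abs(sims - 1.0).max())
    print(f"{bits} bits: max |similarity - 1| = {errors[bits]:.6f}")
```

The deviation from 1 shrinks as more bits are spent per dimension. In Faiss itself, trying SQ compression would presumably mean replacing the `PQ64` suffix of the factory string with a scalar quantizer such as `SQ8`. That reading of the maintainer's suggestion is an assumption, not something stated in the thread.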

jump155 commented 2 years ago

Thank you