criteo / autofaiss

Automatically create Faiss knn indices with optimal similarity search parameters.
https://criteo.github.io/autofaiss/
Apache License 2.0

Is my low recall reasonable? #142

Closed jasperhyp closed 1 year ago

jasperhyp commented 1 year ago

Hi! Thank you for the great library, it has helped me a lot. I wanted to pick your brain and see if my recall is reasonable. I have a training set of ~1M embeddings, and I set the max query time limit to 10 ms (because I need to query the index 200k times during model training). I also set RAM to 20GB, though I have slightly more memory available (but no more than 100GB). The recall@20 I'm seeing is incredibly low, only ~0.1! Did I do anything wrong?

My code for testing is:

from autofaiss import build_index
import numpy as np
import os
import shutil
import timeit
import faiss
from sklearn.preprocessing import normalize  # L2-normalize rows; missing from the original snippet

max_index_query_time_ms = 10 #@param {type: "number"}
max_index_memory_usage = "20GB" #@param
metric_type = "ip" #@param ['ip', 'l2']
D = 480
data_path = "./"  # undefined in the original snippet; set to your working directory

# Create embeddings
embeddings = normalize(np.float32(np.random.rand(100000, D)))

# Create a new folder
embeddings_dir = data_path + "/embeddings_folder"
if os.path.exists(embeddings_dir):
    shutil.rmtree(embeddings_dir)
os.makedirs(embeddings_dir)

# Save your embeddings
# You can split your embeddings into several files if the array is too big
# The data will be read in lexicographical order of the filenames
np.save(f"{embeddings_dir}/corpus_embeddings.npy", embeddings)

os.makedirs(data_path+"my_index_folder", exist_ok=True)

build_index(embeddings=embeddings_dir, index_path=data_path+"knn.index", 
            index_infos_path=data_path+"infos.json", 
            metric_type=metric_type, 
            max_index_query_time_ms=max_index_query_time_ms,
            max_index_memory_usage=max_index_memory_usage, 
            make_direct_map=False, use_gpu=True)

temp1 = np.random.randn(1024, D).astype(np.float32)
temp2 = embeddings

index = faiss.read_index(str(data_path+"knn.index"), faiss.IO_FLAG_MMAP | faiss.IO_FLAG_READ_ONLY)
# index.nprobe=64
start = timeit.default_timer()
values, neighbors_q = index.search(normalize(temp1), 20)
end = timeit.default_timer()
print(end - start)
print(sorted(neighbors_q[0]))

temp = normalize(temp1, axis=1) @ normalize(embeddings, axis=1).T
topk_indices_normalize = np.argpartition(temp, kth=temp.shape[1]-20, axis=1)[:, -20:]
print(sorted(topk_indices_normalize[0]))
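
The snippet prints both neighbor lists but never computes recall itself. Recall@k is just the overlap between the approximate and exact top-k lists, averaged over queries; a minimal self-contained sketch (the helper name and toy arrays are illustrative, not from this thread):

```python
import numpy as np

def recall_at_k(approx: np.ndarray, exact: np.ndarray) -> float:
    """Fraction of exact top-k neighbors recovered by the approximate search.

    Both arrays have shape (n_queries, k); order within a row is ignored.
    """
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx, exact))
    return hits / exact.size

# Toy check: 3 of the 4 exact neighbors were recovered for the single query.
approx = np.array([[1, 2, 3, 9]])
exact = np.array([[1, 2, 3, 4]])
print(recall_at_k(approx, exact))  # 0.75
```

Applied to the snippet above, `approx` would be `neighbors_q` and `exact` would be `topk_indices_normalize`.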

rom1504 commented 1 year ago

Random embeddings are not appropriate for testing the accuracy of approximate knn methods.

I advise you to compute metrics on some real embeddings.
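
A quick way to see why (a self-contained sketch with made-up sizes, not code from this thread): for uniform random vectors in high dimension, cosine similarities against any query concentrate in a narrow band, so the exact top-k is barely separated from everything else, and an approximate index has almost no structure to exploit. Real embeddings are clustered, which is what these indices rely on.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 480
X = rng.random((10000, D)).astype(np.float32)
X /= np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize rows

q = X[0]
sims = X[1:] @ q  # inner products (= cosines) of one query vs. the rest

# For uniform random data the whole distribution sits in a narrow band:
# the best match is only marginally closer than the average one.
print(f"mean={sims.mean():.3f}  max={sims.max():.3f}  "
      f"separation={sims.max() - sims.mean():.3f}")
```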

jasperhyp commented 1 year ago

Thanks Romain! I haven't gotten the real data yet, but do you think it will give reasonable recall when set to {10 ms, 20GB} for about 10M 480-dim embeddings? I probably can't afford more latency :(

rom1504 commented 1 year ago

10M 480-dim embeddings is 10*10^6 * 480 * 4 / 10^9 = 19.2GB in float32, so you wouldn't even need quantization; HNSW alone can work for you (and get you > 80% recall)

but I still recommend you use autofaiss: quantization will reduce the space needed, and the recall will probably still be pretty good

jasperhyp commented 1 year ago

Thank you!