Closed jasperhyp closed 1 year ago
Random embeddings are not appropriate for testing the accuracy of approximate kNN methods. I advise you to compute metrics on some real embeddings.
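To see why random vectors make a misleading benchmark, here is a small NumPy sketch (the point counts and dimension are my own illustrative choices, mirroring the snippet below): in high dimensions, cosine similarities between uniform random points concentrate in a narrow band, so there are no well-separated "true neighbors" for an approximate index to find.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 1000, 480

# Uniform random points, L2-normalized (as in the test snippet below)
X = rng.random((n, D)).astype(np.float32)
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Cosine similarity of one query point against every other point
sims = X[1:] @ X[0]
print(f"mean={sims.mean():.3f}, std={sims.std():.4f}")
# All similarities cluster tightly around one value: the nearest
# neighbors are barely closer than arbitrary points, so IVF/PQ/HNSW
# indexes have almost no structure to exploit and recall looks bad.
```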
On Sun, Dec 4, 2022, 18:29 Yepeng @.***> wrote:
Hi! Thank you for the great library, it helped me a lot. I am so ignorant but I just wanted to pick your brain and see if my recall is reasonable. I have a training set of ~1M embeddings and I set the max query time limit to 10ms (because I would need to query it 200k times during my model training). I also set RAM to 20GB, though I have slightly more memory available (but no more than 100GB). The recall@20 I'm seeing now is incredibly low, only ~0.1! Did I do anything wrong?
My code for testing is:
```python
from autofaiss import build_index
from sklearn.preprocessing import normalize  # `normalize` was undefined in the original snippet; assumed to be sklearn's
import numpy as np
import os
import shutil
import timeit
import faiss

max_index_query_time_ms = 10
max_index_memory_usage = "20GB"
metric_type = "ip"  # one of ['ip', 'l2']
D = 480

# Create embeddings
embeddings = normalize(np.float32(np.random.rand(100000, D)))

# Create a new folder (data_path is assumed to be defined elsewhere)
embeddings_dir = data_path + "/embeddings_folder"
if os.path.exists(embeddings_dir):
    shutil.rmtree(embeddings_dir)
os.makedirs(embeddings_dir)

# Save your embeddings
# You can split your embeddings into several parts if they are too big
# The data will be read in the lexicographical order of the filenames
np.save(f"{embeddings_dir}/corpus_embeddings.npy", embeddings)

os.makedirs(data_path + "my_index_folder", exist_ok=True)

build_index(
    embeddings=embeddings_dir,
    index_path=data_path + "knn.index",
    index_infos_path=data_path + "infos.json",
    metric_type=metric_type,
    max_index_query_time_ms=max_index_query_time_ms,
    max_index_memory_usage=max_index_memory_usage,
    make_direct_map=False,
    use_gpu=True,
)

temp1 = np.random.randn(1024, D).astype(np.float32)
temp2 = embeddings

index = faiss.read_index(str(data_path + "knn.index"), faiss.IO_FLAG_MMAP | faiss.IO_FLAG_READ_ONLY)
# index.nprobe = 64
start = timeit.default_timer()
values, neighbors_q = index.search(normalize(temp1), 20)
end = timeit.default_timer()
print(end - start)
print(sorted(neighbors_q[0]))

# Exact top-20 by brute force, for comparison
temp = normalize(temp1, axis=1) @ normalize(embeddings, axis=1).T
topk_indices_normalize = np.argpartition(temp, kth=temp.shape[1] - 20, axis=1)[:, -20:]
print(sorted(topk_indices_normalize[0]))
```
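A note on measuring recall: comparing `sorted(...)` prints by eye is error-prone. Here is a minimal, self-contained sketch of a recall@k computation between the approximate and exact neighbor lists (the function name and toy arrays are my own, not from autofaiss):

```python
import numpy as np

def recall_at_k(approx: np.ndarray, exact: np.ndarray) -> float:
    """Fraction of exact top-k neighbors recovered, averaged over queries."""
    hits = [len(set(a) & set(e)) / len(e) for a, e in zip(approx, exact)]
    return float(np.mean(hits))

# Toy example: 2 queries, k=3; each query recovers 2 of its 3 true neighbors
approx = np.array([[0, 1, 2], [5, 6, 9]])
exact = np.array([[0, 1, 3], [5, 6, 7]])
print(round(recall_at_k(approx, exact), 3))  # 0.667
```

In the snippet above, `neighbors_q` and `topk_indices_normalize` would play the roles of `approx` and `exact`.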
— Reply to this email directly, view it on GitHub https://github.com/criteo/autofaiss/issues/142, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAR437VRSBYQOZAU3KQ36GLWLTIG5ANCNFSM6AAAAAASTPDCHQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>
Thanks Romain! I haven't gotten the real data yet, but do you think it will give reasonable recall scores when setting {10ms, 20GB} for about 10M 480-dim embeddings? I probably can't afford more time lag :(
10M 480-dim embeddings is 10*10^6 * 480 * 4 / 10^9 = 19.2GB in float32, so you wouldn't even need quantization; HNSW alone can work for you (and get you > 80% recall)
but I still recommend you use autofaiss: quantization will reduce the space needed, and the recall will probably still be pretty good
Thank you!