facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License
30.56k stars 3.56k forks source link

some problem about fastscan #3792

Closed Tickdack closed 2 weeks ago

Tickdack commented 3 weeks ago

Summary

i make three index with faiss Cpp to find the most accurate index, to be as quickly as possible, i use python to test the index accuracy, I use flat as the baseline and ivfpq, fastscan as the be tested, but i find the fastscan accuracy is very very low

Platform

Running on:

Interface:

Reproduction instructions

Here is my test code:

    #  baseline res use flat
    b_idx = read_idx(baseline_idx_path)
    D, I = b_idx.search(query_vecs, recall_num)
    b_rets_ids = I.tolist()

    # generate flat res id-map
    baseline_map = {}
    for i in range(len(b_rets_ids)):
        baseline_map[i] = {}
        for ret_id in b_rets_ids[i]:
            baseline_map[i][ret_id] = 1

    # use ivfpq & scann search and print acc
    print("Top{} recall acc rate:".format(recall_num))
    for idx_path in idx_paths:
        m_idx = read_idx(idx_path)
        start = time.time()
        m_idx.nprobe = 20
        D, I = m_idx.search(query_vecs, recall_num)
        end = time.time()
        print("time: {}".format(end -start))
        m_rets_ids = I.tolist()

        recall_contain = 0
        sum_recall = len(I) * len(I[0])
        for i in range(len(m_rets_ids)):
            for m_ret_id in m_rets_ids[i]:
                if m_ret_id in baseline_map[i]:
                    recall_contain += 1

        print(idx_path + ':' + str(float(recall_contain) / sum_recall * 100) + '%')

here is my test ret:

Top10000 recall acc rate:
time: 0.14550495147705078
384_IVF20480_PQ96x8_IP.index:79.19808510638298%
time: 0.01808476448059082
384_IVF20480_PQ96x4fs_IP.index:0.7340425531914894%
alexanderguzhva commented 3 weeks ago

in practice, ivfpqfastscan must be used with refining

Tickdack commented 3 weeks ago

in practice, ivfpqfastscan must be used with refining

thank you very much

mdouze commented 3 weeks ago

note also that Fast scan in this example uses 2x less memory (96x4 bits vs. 96x8 bits) so for a fair comparison, the fast-scan version should be built with PQ192x4fs

WyLmYYA commented 3 weeks ago

May I ask if you are willing to share the adjusted results and whether fastscan will perform better?

Tickdack commented 3 weeks ago

note also that Fast scan in this example uses 2x less memory (96x4 bits vs. 96x8 bits) so for a fair comparison, the fast-scan version should be built with PQ192x4fs

thank you! with your suggestions, I have improved its performance a lot. How can I further improve its accuracy?

Top10000 recall acc rate:
time: 0.0404963493347168
384_IVF20480_PQ96x8_IP.index:65.33666666666666%
time: 0.09140467643737793
384_IVF30720_PQ192X4fs_IP.index:65.2388888888889%
Tickdack commented 2 weeks ago

May I ask if you are willing to share the adjusted results and whether fastscan will perform better?

yes, you can speak chinese, my folk

Tickdack commented 2 weeks ago

note also that Fast scan in this example uses 2x less memory (96x4 bits vs. 96x8 bits) so for a fair comparison, the fast-scan version should be built with PQ192x4fs

Besides performance, I have another question, when i use fastscan, i find it return some repeating ids, and some time i find it returns -1 as result, i have adjusted some parameters, as nprobe and implem, but it not work, can you give me some suggestions? thank you!

mdouze commented 2 weeks ago

-1 is possible, see https://github.com/facebookresearch/faiss/wiki/FAQ#what-does-it-mean-when-a-search-returns--1-ids repeating ids should not happen. If you encounter them, please fill a bug report.

Tickdack commented 2 weeks ago

-1 is possible, see https://github.com/facebookresearch/faiss/wiki/FAQ#what-does-it-mean-when-a-search-returns--1-ids repeating ids should not happen. If you encounter them, please fill a bug report.

thank you, sir