facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License
31.34k stars 3.64k forks source link

Have max_codes consider only subset entries in IndexIVF search #2649

Open wro-ableton opened 1 year ago

wro-ableton commented 1 year ago

Summary

Hey! Nice work with V1.7.3! I have a feature request.

Is there a way to have the max_codes stop criteria in IndexIVF searches only consider those entries that actually belong to a subset if one is specified? Wit the current implementation, to my understanding, when the number of scanned entries reaches max_codes, the search is stopped. However, for subset searches, this might happen before we actually scanned max_codes entries in the subset as even entries not in the subset count towards this limit.

Specifically, just as a proof of concept, all that would be necessary is to have scan_one_list not return list_size but instead return the number returned by scanner->scan_codes(list_size, codes, ids, simi, idxi, k); a few lines above. See here.

Obviously, that's just a quick hack only for the IVF index. 🙃 I assume in order to not break the current behavior, this would need to be controlled via an additional search parameter for all indices that have the same behavior currently.

Faiss version: 1.7.3 (19f7696deedc93615c3ee0ff4de22284b53e0243)

Running on:

Interface:

mdouze commented 1 year ago

max_codes is intended to limit the number of distance computations performed, not the number of results.

wro-ableton commented 1 year ago

I get that but currently, max_codes even considers entries outside the subset for which no distances are computed. So when we try to keep with this intended use, shouldn't max_codes ignore entries for which we don't compute distances?

But I see why my suggestion wouldn't be sufficient. scan_codes doesn't return the number of visited entries but the number of updates made to the results. So even with my suggestion, we would need to have scan_codes report how many distances were actually computed.

Actually, now looking at the implementation, I think this behaviour is inconsistent between generic IDSelectors and IDSelectorRange.

To my understanding, for IDSelectorRange, max_codes would only consider entries that were visited if they actually belong to that range selector. In this line, we update the list_size to only include entries which belong to the range subset. However, in the same lambda, for any other selector, which would take effect in scan_codes, we do not update list_size to only include subset entries. So for a range selector, we would have max_codes consider only distance computations within the subset, for any other selector, we would have it consider all entries.

Regardless, I'm not arguing that there has to be a change to how max_codes work. But with the added subset search feature, the API basically doesn't provide a efficient way to ensure that all entries in the subset are scanned. The only way to achieve this currently, to my understanding, is to neither restrict max_codes nor nprobe, in which case we would run an exhaustive search that would not stop even after all subset entries were visited.

wro-ableton commented 1 year ago

Hey @mdouze, could you give this ticket another look? I'd really appreciate it! If you think this is not a bug but expected behavior, maybe we could talk about extending the library to provide a way of limiting the number of visited entries in a subset search.