Open wro-ableton opened 1 year ago
max_codes
is intended to limit the number of distance computations performed, not the number of results.
I get that but currently, max_codes
even considers entries outside the subset for which no distances are computed. So when we try to keep with this intended use, shouldn't max_codes
ignore entries for which we don't compute distances?
But I see why my suggestion wouldn't be sufficient. scan_codes
doesn't return the number of visited entries but the number of updates made to the results. So even with my suggestion, we would need to have scan_codes
report how many distances were actually computed.
Actually, now looking at the implementation, I think this behaviour is inconsistent between generic IDSelector
s and IDSelectorRange
.
To my understanding, for IDSelectorRange
, max_codes
would only consider entries that were visited if they actually belong to that range selector. In this line, we update the list_size
to only include entries which belong to the range subset. However, in the same lambda, for any other selector, which would take effect in scan_codes
, we do not update list_size
to only include subset entries. So for a range selector, we would have max_codes
consider only distance computations within the subset, for any other selector, we would have it consider all entries.
Regardless, I'm not arguing that there has to be a change to how max_codes
work. But with the added subset search feature, the API basically doesn't provide a efficient way to ensure that all entries in the subset are scanned. The only way to achieve this currently, to my understanding, is to neither restrict max_codes
nor nprobe
, in which case we would run an exhaustive search that would not stop even after all subset entries were visited.
Hey @mdouze, could you give this ticket another look? I'd really appreciate it! If you think this is not a bug but expected behavior, maybe we could talk about extending the library to provide a way of limiting the number of visited entries in a subset search.
Summary
Hey! Nice work with V1.7.3! I have a feature request.
Is there a way to have the
max_codes
stop criteria inIndexIVF
searches only consider those entries that actually belong to a subset if one is specified? Wit the current implementation, to my understanding, when the number of scanned entries reachesmax_codes
, the search is stopped. However, for subset searches, this might happen before we actually scannedmax_codes
entries in the subset as even entries not in the subset count towards this limit.Specifically, just as a proof of concept, all that would be necessary is to have
scan_one_list
not returnlist_size
but instead return the number returned byscanner->scan_codes(list_size, codes, ids, simi, idxi, k);
a few lines above. See here.Obviously, that's just a quick hack only for the IVF index. 🙃 I assume in order to not break the current behavior, this would need to be controlled via an additional search parameter for all indices that have the same behavior currently.
Faiss version: 1.7.3 (19f7696deedc93615c3ee0ff4de22284b53e0243)
Running on:
Interface: