Closed — timgrein closed this 1 month ago
@timgrein could you determine whether the scores are the same or not? I wonder if we are getting tripped up by doc IDs being the tie breaker for equal scores.
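(For context on the tie-breaker question: when two hits have equal scores, Lucene orders them by ascending doc ID. A minimal self-contained sketch of that ordering, using a hypothetical ScoredDoc record rather than Lucene's actual ScoreDoc class:)

```java
import java.util.*;

public class TieBreakDemo {
    // Hypothetical minimal stand-in for Lucene's ScoreDoc.
    record ScoredDoc(int docId, float score) {}

    // Sort by descending score; equal scores fall back to ascending
    // doc ID, mirroring how Lucene breaks ties between hits.
    static List<ScoredDoc> rank(List<ScoredDoc> hits) {
        List<ScoredDoc> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator
                .comparingDouble((ScoredDoc d) -> -d.score())
                .thenComparingInt(ScoredDoc::docId));
        return sorted;
    }

    public static void main(String[] args) {
        List<ScoredDoc> ranked = rank(List.of(
                new ScoredDoc(119, 7.37e-5f),
                new ScoredDoc(193, 9.60e-5f), // same score as doc 23
                new ScoredDoc(23, 9.60e-5f)));
        // Doc 23 ranks above 193 only because of its lower doc ID.
        ranked.forEach(d -> System.out.println(d.docId() + " " + d.score()));
    }
}
```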
@benwtrent
Without increasing k, we get the following for the failing test instance:
TOP 1 docs:
Document<stored<id:23>> 9.601536E-5
Document<stored<id:119>> 7.3713694E-5
Document<stored<id:163>> 7.087675E-5
Document<stored<id:148>> 7.051192E-5
Document<stored<id:51>> 6.879472E-5
TOP 2 docs:
Document<stored<id:23>> 9.601536E-5
Document<stored<id:193>> 8.53898E-5
Document<stored<id:119>> 7.3713694E-5
Document<stored<id:163>> 7.087675E-5
Document<stored<id:148>> 7.051192E-5
(So it seems like the first/unsorted index doesn't find document 193 when k is too small (60); I guess due to the different index structure?)
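(To illustrate why one graph layout can miss a document that another finds: with a small candidate queue, a greedy graph search can prune the path that leads to a high-scoring node. The toy below is purely illustrative, not Lucene's HNSW implementation; the graph, scores, and beam-pruning strategy are all made up for the example:)

```java
import java.util.*;

public class BeamSearchDemo {
    // Greedy best-first search that keeps at most `beam` candidates
    // per round, loosely analogous to HNSW's bounded candidate queue.
    static int search(int[][] neighbors, double[] score, int beam) {
        PriorityQueue<Integer> candidates =
                new PriorityQueue<>(Comparator.comparingDouble(n -> -score[n]));
        candidates.add(0); // node 0 is the entry point
        boolean[] visited = new boolean[score.length];
        visited[0] = true;
        int best = 0;
        while (!candidates.isEmpty()) {
            // Keep only the top `beam` candidates; the rest are pruned.
            List<Integer> frontier = new ArrayList<>();
            for (int i = 0; i < beam && !candidates.isEmpty(); i++) {
                frontier.add(candidates.poll());
            }
            candidates.clear();
            for (int node : frontier) {
                if (score[node] > score[best]) best = node;
                for (int nb : neighbors[node]) {
                    if (!visited[nb]) {
                        visited[nb] = true;
                        candidates.add(nb);
                    }
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy proximity graph: node 4 (score 0.9) is the true best,
        // but it is only reachable through the worse-looking node 2.
        int[][] neighbors = { {1, 2}, {3}, {4}, {}, {} };
        double[] score = { 0.1, 0.5, 0.4, 0.6, 0.9 };
        System.out.println("beam=1 best: " + search(neighbors, score, 1));
        System.out.println("beam=2 best: " + search(neighbors, score, 2));
    }
}
```

With beam = 1 the search follows 0 → 1 → 3 and prunes node 2, so it never reaches node 4; with beam = 2 both branches survive and node 4 is found. The same effect explains how two differently organized graphs can disagree until the search explores more candidates.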
Increasing k to 80 leads to the following results for the previously failing test instance:
TOP 1 docs:
Document<stored<id:23>> 9.601536E-5
Document<stored<id:193>> 8.53898E-5
Document<stored<id:119>> 7.3713694E-5
Document<stored<id:163>> 7.087675E-5
Document<stored<id:148>> 7.051192E-5
TOP 2 docs:
Document<stored<id:23>> 9.601536E-5
Document<stored<id:193>> 8.53898E-5
Document<stored<id:119>> 7.3713694E-5
Document<stored<id:163>> 7.087675E-5
Document<stored<id:148>> 7.051192E-5
@timgrein what is the beamWidth set to in the failing case?
We may want to increase beamWidth just to make the test more consistent.
int beamWidth = random().nextInt(10) + 10; // from the previous base of 5
@benwtrent The beam width for the failing test case was the smallest possible value, 5. I increased the minimum to 10 per your suggestion (which also fixes the test without increasing k). Do we still want to keep the increased k? Increasing it didn't seem to affect execution time much, at least for this test instance, and it should reduce the likelihood of a flaky failure even further.
Do we still want to keep the increased k?
I would rather not, we keep bumping it up, eventually we are going to stop searching in the graph altogether and just brute force, which ruins the reason for the test.
eventually we are going to stop searching in the graph altogether and just brute force, which ruins the reason for the test
Makes sense, decreased k back to 60 👍
Closes https://github.com/apache/lucene/issues/13210
Description
The following test failed because it produced two different lists of ids for a sorted and an unsorted HNSW byte vector graph: one graph didn't find a higher-scoring doc that the other one found:
gradlew test --tests TestHnswByteVectorGraph.testSortedAndUnsortedIndicesReturnSameResults -Dtests.seed=B41BEC5619361A16 -Dtests.locale=hi-IN -Dtests.timezone=Atlantic/Stanley -Dtests.asserts=true -Dtests.file.encoding=UTF-8
Considering that the graphs of the two indices are organized differently, we need to explore many candidates to ensure that both searchers find the same docs. Increasing beamWidth (the number of nearest-neighbor candidates to track while searching the graph for each newly inserted node) from 5 to 10 fixes the test.
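(For readers outside this test: beamWidth is an HNSW construction-time parameter. In recent Lucene 9.x it can be set per field by overriding the codec's vectors format; this is a hedged, non-runnable sketch that requires the Lucene core jar on the classpath, and the exact codec class name, here Lucene99Codec/Lucene99HnswVectorsFormat, varies by Lucene version:)

```java
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99Codec;
import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;
import org.apache.lucene.index.IndexWriterConfig;

// Sketch only: configure HNSW parameters for all vector fields.
// Constructor args are (maxConn, beamWidth); 16 and 100 are the defaults.
IndexWriterConfig config = new IndexWriterConfig();
config.setCodec(new Lucene99Codec() {
    @Override
    public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
        return new Lucene99HnswVectorsFormat(16, 100);
    }
});
```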