apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.51k stars 995 forks source link

Add a mock/test codec that redirects all doc values formats to BINARY [LUCENE-9205] #10245

Open asfimport opened 4 years ago

asfimport commented 4 years ago

We have some exciting proposed changes that will use BINARY doc values, e.g. using block compression in the default codec, supporting approximate nearest-neighbor search using HNSW or IVFFLAT, but I think our testing of BINARY doc values could be improved.

A possibly simple way to do that would be to make a mock/test codec that wraps the default codec but implements all doc values formats (NUMERIC, SORTED, etc.) on top of the wrapped codec's BINARY format. This way the wrapped codec only ever sees BINARY, and then all Lucene tests that test the other doc values codecs, would be testing BINARY.


Migrated from LUCENE-9205 by Michael McCandless (@mikemccand)

asfimport commented 4 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I agree the idea can work, but I'm not sure how much fun this would be to debug. If the bug is an encoding corner-case, it might not be that horrible, but if its trickier such as a misuse of some api, the additional indirection might not be fun?

On the other hand if I look at BaseDocValuesFormatTestCase, I don't feel like BinaryDV is severely under-represented? There are a lot of tests for binary dv here... we should add more if we can think of interesting corner cases, because these simple unit tests are the easiest to debug. I think accumulating these has the highest value.

IMO if we want to wrap a codec with anything, the easiest win would be to fix AssertingCodec to work with more than just the default codec. This thing tries to be simple to debug and will find api misuses and so on. Unfortunately today Asserting* is wired to only wrap the default codec and not anything else. And if you override some methods (such as docvalues) to test a different implementation, it returns that different implementation bare (without asserts), and just the rest of the index with asserts.

The tests in org.apache.lucene.codecs.asserting work around this situation, and test the default codec all over again, just with this wrapper enabled on what they are testing. But maybe we should delete these tests, and just always wrap all index structures with asserts, so that more than just the default codec are getting the checks?