apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.64k stars 1.03k forks source link

Implement random access seeks in IndexedDISI (DocValues) [LUCENE-9051] #10093

Open asfimport opened 4 years ago

asfimport commented 4 years ago

In #10047 we have a use case for random-access seeking in DocValues, which currently only support forward-only iteration (with efficient skipping). One idea there was to write an entirely new format to cover these cases. While looking into that, I noticed that our current DocValues addressing implementation, IndexedDISI, already has a pretty good basis for providing random accesses. I worked up a patch that does that; we already have the ability to jump to a block, thanks to the jump-tables added last year by @tokee; the patch uses that, and/or rewinds the iteration within current block as needed.

I did a very simple performance test, comparing forward-only iteration with random seeks, and in my test I saw no difference, but that can't be right, so I wonder if we have a more thorough performance test of DocValues somwhere that I could repurpose. Probably I'll go back and dig into the issue where we added the jump tables - I seem to recall some testing was done then.

Aside from performance testing the implementation, there is the question should we alter our API guarantees in this way. This might be controversial, I don't know the history or all the reasoning behind the way it is today. We provide advanceExact and some implementations support docids going backwards, others don't.  AssertingNumericDocValues.advanceExact does  enforce forward-iteration (in tests); what would the consequence be of relaxing that? We'd then open ourselves up to requiring all DV impls to support random access. Are there other impls to worry about though? I'm not sure. I'd appreciate y'all's input on this one.


Migrated from LUCENE-9051 by Michael Sokolov (@msokolov), updated Nov 19 2019

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I'm assuming we started by looking at doc-value formats in #10047 because that was convenient, but eventually we'd have a separate XXXFormat abstraction instead? Then you could fork IndexedDisi to not implement DocIdSetIterator anymore (which is the reason why it enforces sequential access) and support backward access too and use it as an implementation detail of XXXFormat?

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Sure, but I'd like to understand the rationale for forking. It seems like we'd end up with a lot of unneccessary code duplication. Why do we see implementing DocIdSetIterator as preventing a class from also implementing random access, as DocValuesIterator seems to offer with its advanceExact?

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Code duplication might look unnecessary, but I also see benefits in having independent forks so that they can evolve according to their own constraints. For instance today's implementation of Lucene80's IndexedDISI might be close to your needs, but if we find a way to make it better for the access pattern that is typical to doc values, it would be a shame that it would slow down nearest-neighbor search or vice-versa. One could make the argument that we could delay the decision to fork until it's needed, but then it's an incentive against simple changes, e.g. reordering some loops or replacing a binary search with an exponential search would make the diff very large because of the need to duplicate IndexedDISI.

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

I guess it is a judgment call. Forking means that any bug fixes and other improvements that could be shared will also need to be duplicated. I can see it makes sense to fork if you think use cases will diverge, but I wonder whether we will find that is so. At least some optimizations for forward-only (jump tables) are also enabling fast random access.