jpountz closed this 1 month ago
Can we do better than blindly prefetching skip data? Currently, skip data is not used until we advance() past the first block's docs.
Should we instead try to preload skip data the first time advance() is actually called on the enum? This still happens before the skipper might get used, because we might fulfill a few advance() calls from the first block's docs, but I like that we then have evidence that the skipper will most likely be needed. E.g. a one-time invocation before this "if": https://github.com/apache/lucene/blob/83db8dfba30447dc154efbf98a46048ebe5743ef/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99PostingsReader.java#L502-L504
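A self-contained toy model of that idea (not actual Lucene99PostingsReader code; all names are illustrative): the prefetch hint is issued exactly once, on the first advance() call, rather than eagerly when the enum is pulled.

```java
// Toy model (not actual Lucene code) of lazily prefetching skip data
// the first time advance() is called, instead of eagerly at enum creation.
class LazyPrefetchEnum {
  private boolean prefetched = false;
  int prefetchCalls = 0; // observable for this example only
  private final int[] docs;
  private int idx = -1;

  LazyPrefetchEnum(int[] docs) {
    this.docs = docs;
  }

  private void prefetchSkipData() {
    // In Lucene this would be an IndexInput#prefetch call on the skip data.
    prefetchCalls++;
  }

  int advance(int target) {
    if (prefetched == false) {
      // One-time hint: we now have evidence the skipper will likely be needed.
      prefetched = true;
      prefetchSkipData();
    }
    while (++idx < docs.length) {
      if (docs[idx] >= target) {
        return docs[idx];
      }
    }
    return Integer.MAX_VALUE; // stands in for NO_MORE_DOCS
  }
}
```

Repeated advance() calls fulfilled from the first block would still only trigger the prefetch once.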
We might (eventually) consider an API change when pulling postings that expresses that the caller intends to call advance()? Or, maybe simpler, expressing that the caller will not use advance(); not sure.
This is cool! In the hot case, do we expect prefetch to be a no-op? So we are hoping for "first do no harm" in that case? (I haven't looked at MMapDirectory's impl yet.)

But in the cold case, this could cause multiple prefetch calls within one segment if the query has multiple terms. And if cross-segment concurrency is enabled, multiple such calls across slices (thread work units) too.

But in the cold case this may be a big win?
> This is cool! In the hot case, do we expect prefetch to be a no-op? So we are hoping for "first do no harm" in that case?
Yes, mostly. The benchmark I ran at https://github.com/apache/lucene/pull/13337#issuecomment-2095430556 suggests that there is a 1-2us overhead for prefetching when data fits in the page cache. (Possibly more under concurrent search, I'd need to benchmark this as well.) So with 50 segments, 4 clauses and assuming 3 prefetch calls per clause (one in the terms dict, one for postings, one for skip data), this would give a total overhead of 3*2*50*4 = 1200us = 1.2ms, which looks ok to me.
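As a sanity check on that arithmetic (a toy helper, not part of the patch): 3 prefetches per clause at the worst-case 2us each, across 4 clauses and 50 segments.

```java
// Back-of-the-envelope estimate (toy helper, not Lucene code) of the
// worst-case prefetch overhead when all data is already in the page cache.
class PrefetchCost {
  static int hotOverheadMicros(
      int segments, int clauses, int prefetchesPerClause, int microsPerCall) {
    return segments * clauses * prefetchesPerClause * microsPerCall;
  }
}
```

50 segments * 4 clauses * 3 prefetches * 2us each comes to 1200us, i.e. about 1.2ms of worst-case overhead in the fully hot case.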
> But in the cold case, this could cause multiple prefetch calls within one segment if the query has multiple terms. And if cross-segment concurrency is enabled, multiple such calls across slices (thread work units) too.
This is correct. We could also look into doing inter-segment I/O concurrency in the same thread in the future, e.g. by fetching all scorers first, and then evaluating these scorers against their segments, but I need to think more about the trade-off.
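A hypothetical sketch of that idea (not an actual Lucene API): instead of "create scorer, then score" one segment at a time, create all scorers up front, so each creation can issue its prefetch hints before any evaluation starts.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch (hypothetical, not an actual Lucene API) of single-threaded
// inter-segment I/O concurrency via two phases.
class TwoPhase {
  interface Segment { Scorer scorer(); } // creation is where prefetch hints go
  interface Scorer { int score(); }      // evaluation actually reads the data

  static List<Integer> searchAll(List<Segment> segments) {
    List<Scorer> scorers = new ArrayList<>();
    for (Segment s : segments) {
      // Phase 1: pull every segment's scorer, putting all prefetches in flight.
      scorers.add(s.scorer());
    }
    List<Integer> results = new ArrayList<>();
    for (Scorer sc : scorers) {
      // Phase 2: evaluate; by now earlier prefetches have had time to complete.
      results.add(sc.score());
    }
    return results;
  }
}
```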
> But in the cold case this may be a big win?
Yes! Hopefully with a few more changes I'll be able to confirm this by running a benchmark that runs actual Lucene queries. Early experiments suggest that there is a big win indeed.
This uses the IndexInput#prefetch API for postings. This relies on heuristics, as we don't know ahead of time what data we will need from a postings list:

- Positions, offsets and payloads are never prefetched.
Putting the IndexInput#prefetch call in TermsEnum#postings and TermsEnum#impacts works well because BooleanQuery will first create postings/impacts enums for all clauses before it starts unioning/intersecting them. This allows the prefetching logic to run in parallel across all clauses of the same query on the same segment.
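A toy illustration of why that placement helps (not actual Lucene code; newPostings stands in for TermsEnum#postings): every clause's prefetch hint is issued during enum creation, before the first intersection step touches any postings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model (not actual Lucene code): creating postings enums for all
// clauses up front means all prefetch hints are issued, and their I/O can
// overlap, before unioning/intersecting begins.
class ConjunctionSketch {
  static int prefetchHints = 0;

  static int[] newPostings(int[] docs) {
    prefetchHints++; // stands in for IndexInput#prefetch in TermsEnum#postings
    return docs;
  }

  static List<Integer> intersect(List<int[]> clauses) {
    List<int[]> enums = new ArrayList<>();
    for (int[] c : clauses) {
      // All prefetch hints happen here, before intersection starts.
      enums.add(newPostings(c));
    }
    // Naive intersection of sorted doc-ID lists.
    List<Integer> result = new ArrayList<>();
    outer:
    for (int d : enums.get(0)) {
      for (int i = 1; i < enums.size(); i++) {
        if (Arrays.binarySearch(enums.get(i), d) < 0) {
          continue outer;
        }
      }
      result.add(d);
    }
    return result;
  }
}
```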