elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Optimize early termination for sorted indices with terminate_after or similar #82995

Open EricMCornelius opened 2 years ago

EricMCornelius commented 2 years ago

Use case: optimally show the most recent data from an index sorted by primary timestamp, where precise total counts do not matter

Currently, terminate_after is a shard-level count but has no guarantees about being applied across all segments in a shard. This means that, when fully merged to a single segment, we get the desired behavior above, but not while the index is being actively written to.

See discussion here with @jimczi - https://discuss.elastic.co/t/index-sorting-and-terminate-after-combination/198233

So, in the situation where, say, a user wants to show the 100 most recent documents, terminate_after usage can provide inconsistent results, regardless of the threshold setting.

In situations where terminate_after is necessary (e.g. for protection against expensive aggregations executed by a customer) - it would be useful to have an additional setting that utilizes the same early termination logic as here: https://www.elastic.co/guide/en/elasticsearch/reference/master/index-modules-index-sorting.html#early-terminate

I'm not sure of the best approach for implementation, but it seems like a broadly useful feature when dealing with time-sorted indices: being able to cap the total document count for aggregations while still returning the most recent records within that capped set of results.
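For concreteness, the request shape in question looks roughly like this (the index name `logs` and field `@timestamp` are placeholders):

```
GET /logs/_search
{
  "size": 100,
  "sort": [{ "@timestamp": "desc" }],
  "terminate_after": 10000,
  "query": { "match_all": {} }
}
```

With the index sorted on `@timestamp` descending, the hope is that the first 10000 documents collected per shard are also the most recent ones, which holds for a fully merged shard but not for one with multiple segments.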

elasticmachine commented 2 years ago

Pinging @elastic/es-search (Team:Search)

nguyenvietyen commented 2 years ago

I believe I'm investigating the same desired effect.

I aim to achieve early termination of a large query in conjunctive normal form. I'm currently on ES 8.3.2, have enabled index sorting, and followed the guidance on using lower-cardinality sorted fields in https://www.elastic.co/guide/en/elasticsearch/reference/master/index-modules-index-sorting.html#early-terminate. When I profile the search requests (using "profile": true), I do not see any indication of early termination happening. It could be that I'm misinterpreting the profiling output. I am also not able to reproduce time savings in benchmarks.
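For reference, a minimal request for this kind of check might look like the following (index and field names are illustrative):

```
GET /my-index/_search
{
  "profile": true,
  "track_total_hits": false,
  "size": 10,
  "sort": [{ "timestamp": "asc" }],
  "query": { "match_all": {} }
}
```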

Questions (for @EricMCornelius and/or the Elasticsearch team):

EgbertW commented 1 year ago

Are there any plans to pick this up in the near future? I'm facing the same issue: terminate_after surprisingly enforces a limit per shard but only respects the index sort within each segment. Either the limit should be per segment, or the document selection should be spread across the shards in a way that preserves the index sorting. I'm guessing that a limit per segment would be the easiest to implement and with some simple math upfront you can determine what the count should be to achieve your intended goals.

Force merging to 1 segment works but takes way too long on huge indices.

EricMCornelius commented 1 year ago

@jimczi - has there been any internal discussion about this limitation? Just curious if we should be looking to implement it ourselves.

jimczi commented 1 year ago

The discussion you're referring to is quite old. Early termination for sorted indices was added in version 7. See the section about early termination here. So you don't need to use terminate_after; you can use track_total_hits as an indicator that your search is eligible for early termination.
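A sketch of what that looks like, assuming an index sorted on a hypothetical `timestamp` field:

```
GET /my-index/_search
{
  "track_total_hits": false,
  "size": 100,
  "sort": [{ "timestamp": "desc" }],
  "query": { "match_all": {} }
}
```

When the search sort matches the index sort and hit counting is capped (or disabled), the query can stop collecting once enough documents are found; with a numeric track_total_hits, `hits.total.relation` comes back as `"gte"` when counting was cut off early.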

EgbertW commented 8 months ago

@jimczi I just checked up on this discussion and see it hasn't been addressed yet. Your remark is only partially true.

If you use track_total_hits on a sorted index, it only terminates early IF you sort on the same field the index is sorted on. If you sort on _score it doesn't work, as it will calculate the score for all documents before throwing out the remainder.

We have a field called popularity in our index. The index is sorted on this field (descending). When we do a search query, we match on keywords and boost the value by popularity. This is a big boost, so popularity actually is quite a decisive factor in the final ordering.

For a query using sort: [{"_score":"desc"}] it may take 1 or 2 seconds to reply when there's millions of matching documents, even with track_total_hits: 10000. This indicates that it is not terminating early. If I change it to sort: [{"popularity":"desc"}] I do get a response in 25ms, indicating it did terminate early, utilizing the sorted index. But the results are now only sorted on popularity, no longer by the keyword matching score. This is not desirable.

When setting terminate_after: 10000 instead (and track_total_hits: true), I do get results in 10-20 ms, even when sorting on _score. The limitation here is that terminate_after will visit segments in arbitrary order to collect documents, with no guarantee that all segments are visited. So even on a sorted index, there is a chance that a matching document with the highest value for popularity is not returned, unless a force merge is done to 1 segment.

So the behaviour of track_total_hits and terminate_after is comparable but not equivalent, and depending on the use case they may not yield the same search results or performance.

jimczi commented 8 months ago

Thanks for explaining, @EgbertW.

We have a field called popularity in our index. The index is sorted on this field (descending). When we do a search query, we match on keywords and boost the value by popularity. This is a big boost, so popularity actually is quite a decisive factor in the final ordering.

Using a non-static multiplicative factor disables the ability to skip documents once track_total_hits is reached. That's why we generally advise using additive scoring models: they unlock the potential of skipping documents based on the maximum possible score for the query. You could use the rank_feature query instead to add popularity to the overall score (you can set a high boost if you like, as long as it is static for the entire query). That's one option worth testing; your index doesn't even need to be sorted to get the speedup (but it might help).
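A sketch of the suggested rank_feature approach (field names are illustrative; note that `popularity` would need to be mapped as a `rank_feature` field rather than a plain numeric for this query to apply):

```
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "laptop" } }
      ],
      "should": [
        { "rank_feature": { "field": "popularity", "boost": 10 } }
      ]
    }
  },
  "track_total_hits": 10000
}
```

Because the rank_feature contribution is additive and bounded by its boost, the scorer can rule out non-competitive documents once the top hits are established.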

The other option, as you're alluding to, is to consider a version of terminate_after that is applied at the segment level. I'd consider it a new option rather than building on top of terminate_after, though, to avoid confusion.

EgbertW commented 8 months ago

@jimczi Thanks, those are some helpful insights. Based on your comments I investigated a bit further. Switching to a rank_feature query still doesn't help directly; I see the same query processing time. However, we also have a collapse part in the query to group documents based on a specific field. When I take the collapse part out of the query, I do see a similar effect when comparing smaller values for terminate_after and track_total_hits, so it does appear to work in that situation.

Of course, the collapse part is also a requirement in this setup so this doesn't help to get away from terminate_after unfortunately.
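For readers following along, the query shape being discussed is roughly this (field names are illustrative):

```
GET /products/_search
{
  "query": { "match": { "title": "laptop" } },
  "collapse": {
    "field": "product_group",
    "inner_hits": { "name": "variants", "size": 5 }
  },
  "track_total_hits": 10000
}
```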

benwtrent commented 3 months ago

terminate_after is "best effort" and an approximation. While I realize this is a disappointing response, it's the truth.

I will do my best to respond in-line to some of the previous comments from various parts of this discussion.

For a query using sort: [{"_score":"desc"}] it may take 1 or 2 seconds to reply when there's millions of matching documents, even with track_total_hits: 10000

To sort on _score, we have to score the document. Depending on the query, this can take time, and while we do our best to terminate early using Block-Max WAND, it is possible that we have to score more documents than you expect before we can terminate.

Sorting on the field that the index is sorted by is basically constant time, as things are already sorted; this is why it's so much faster. We can skip entire segments where the field value is too low to even be considered in the results.

So even on a sorted index, there is a chance that a matching document with the highest value for popularity is not returned, unless a force merge is done to 1 segment.

This is intended.

I'm guessing that a limit per segment would be the easiest to implement and with some simple math upfront you can determine what the count should be to achieve your intended goals.

This simply will not work on any index that is being written to, and if your index is never being written to, why not force-merge?

You don't have control over the number of segments, and they can vary in size. So any math you have calculated to determine what the terminate_after should be will be incorrect. Another “best effort” or approximation.

Of course, the collapse part is also a requirement in this setup so this doesn't help to get away from terminate_after unfortunately.

The runtime might be due to various factors. But with collapse, we do gather more docs. So it might just be the fact that more docs must be searched and scored.

All in all, we could add a new parameter to ensure a limit per segment, but I don’t know how this would even work with segment parallelism and such.

EricMCornelius commented 3 months ago

I believe the conversation is drifting away from the initial aim I had in filing this issue, which is to efficiently return the most recent time-sorted documents from an index when we're already utilizing index-time sorting. This seems like a common use case with log data and metrics streams.

Force merging is not useful here, given we're just trying to provide a quick diagnostic to end users of whether data is coming in successfully through a pipeline: the equivalent of a live tail over a chunk of data, retrieved in the most efficient manner.
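The request in question is essentially a live tail, something like the following (index pattern, field name, and time window are placeholders):

```
GET /logs-*/_search
{
  "size": 100,
  "sort": [{ "@timestamp": "desc" }],
  "track_total_hits": false,
  "query": { "range": { "@timestamp": { "gte": "now-15m" } } }
}
```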

EgbertW commented 2 months ago

@benwtrent thanks for your response.

This simply will not work on any index that is being written to and if your index is never being written to, why not force-merge?

Because it takes long. Really long. A force merge to 4 segments is about 6 times faster than a force merge to just 1 segment in my situation. This is the difference between 1 hour of force merging and 10 minutes. Since this is a frequent situation in the setup, the time savings would be more than welcome.
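For reference, the two variants being compared (index name is a placeholder):

```
# ~6x faster in my measurements, but leaves terminate_after inconsistent:
POST /my-index/_forcemerge?max_num_segments=4

# Consistent terminate_after results, but very slow:
POST /my-index/_forcemerge?max_num_segments=1
```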

But in my experiments I've noticed that even with just 2 segments, it can very well happen that the top-scoring document is not returned when using terminate_after, because it fills the response with results from the other segment. If instead it would round-robin starting at the top of each segment, or even better, track the highest score across all shards, it could fill the list of results more intelligently.

The runtime might be due to various factors. But with collapse, we do gather more docs. So it might just be the fact that more docs must be searched and scored.

Well, if we set inner_hits.size to 5 and it keeps going until it has found 5 documents that it can collapse together, then I can imagine this will delay the process, as many of the documents do not have 5 matching other documents.

If I set track_total_hits: 10000 or something, I would expect the server to stop collecting additional documents after 10000 and not look further, also not for documents to collapse. The resulting value for total is not entirely accurate in this case anyway: documents are collapsed, so the total number of outer hits is lower than this value, and without doing a cardinality aggregation there is no way to get an accurate count. With this knowledge it would be feasible to find a proper tradeoff value for track_total_hits that gives enough accuracy and performance.

elasticsearchmachine commented 1 month ago

Pinging @elastic/es-search-foundations (Team:Search Foundations)