apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.6k stars 1.01k forks source link

count aggregation optimization inside one segment in log scenario [LUCENE-10425] #11461

Open asfimport opened 2 years ago

asfimport commented 2 years ago

In log scenario, we usually want to know the doc count of documents between every time intervals. One possible optimized method is to sort the docuemt in ascend order according to @timestamp field in one segment. then we can use    this pr https://github.com/apache/lucene/pull/687 to find out the min/max docId in on time interval.

If there is no other filter query, the doc count of one time interval is (max docId- min docId +1)

if there is only one another term filter query, we can use this pr https://github.com/apache/lucene/pull/688 to get the diff value of index, when we call advance(minId) and advance(maxId), the diff value is also the doc count of one time interval


Migrated from LUCENE-10425 by jianping weng (@wjp719), updated Mar 29 2022

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I'm not sure #687 actually helps compared to what we are already doing.

I like the idea of being able to count the number of hits of conjunctions efficiently thanks to index sorting but I think we'll need to expose this using a better API.

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

This would require a new API on PostingsEnum so I'll write what I think the applications of this change are to make sure I get the benefits correctly: If the field is sorted by a numeric field, Lucene could efficiently compute range facets (and special forms of range facets like histograms) for this numeric field as long as there are no deletions and the query has a single term. For instance, an index containing logs and sorted by timestamp could very efficiently compute an histogram of the timestamp field given any term query. To use an example from a different use-case, an index of an e-commerce catalog sorted by price could compute a histogram of prices very efficiently for any term query.

This feels quite powerful. The main thing that annoys me a bit is that it only works on the primary sort field, so we'd be adding an API for PostingsEnum for something that requires a very careful setup of the index as their can be a single primary sort field. I wonder if #11432 could help this optimization more often applicable, e.g. to logs indices sorted by host then timestamp, or to e-commerce indices sorted by category then price. Having this optimization more generally applicable would make me feel better about increasing the surface area of PostingsEnum. At first sight, it feels like this should work? Maybe this use-case would also help figure out what the API should be on #11432.

asfimport commented 2 years ago

Robert Muir (@rmuir) (migrated from JIRA)

seems far too specialized, similar to #11432.

asfimport commented 2 years ago

jianping weng (@wjp719) (migrated from JIRA)

> I'm not sure #687 actually helps compared to what we are already doing.

@jpountz Hi, is #687 ok now? it can speed up getting min/max doc Id when index sort ascend enabled instead of using docValue binary search

asfimport commented 2 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

> Thinking through this a bit more, I imagine that this is the reason why you made this API optional.

Once you implement this API, if it is indeed useful, someone starts relying on it. Then you can no longer just go ahead and make a new implementation that doesn't support it (or makes it much slower), at least not without a fight: it's harder to remove something than not to add it in the first place. I don't think the optional idea is real.

We need to make a case for this on the merits - if it enables powerful optimizations across a wide range of applications, then so be it, perhaps it's worth the future constraints. I'm not sure though