**gashutos** opened this issue 1 year ago
Lazily heapifying sounds interesting, and thanks for sharing performance numbers when data occurs in random order. Do you also have performance numbers for the case when the index sort is the opposite order compared to the query sort? I'm curious how much this optimization can save in that case since this is what you're trying to optimize.
> We don't have benchmarks for numeric sort in Lucene itself
Did you look at this task on nightly benchmarks? http://people.apache.org/~mikemccand/lucenebench/TermDTSort.html
You might also be interested in checking out this paper where Tencent describes optimizations that they made for a similar problem in section 4.5.2: they configure an index sort by ascending timestamp on their data, but still want to be able to perform both queries by ascending timestamp and descending timestamp. To handle the case when the index sort and the query sort are opposite, they query on exponentially growing windows of documents that match the end of the doc ID space.
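For reference, a minimal sketch of that windowing idea follows. The `Searcher` interface and `search(from, to, k)` helper are hypothetical stand-ins for a doc-ID-range-restricted search, not a real Lucene API:

```java
import java.util.List;

// Sketch of the "exponentially growing windows" technique described above
// (Tencent paper, section 4.5.2). Assumes the index sort is by ascending
// timestamp, so the largest timestamps sit at the end of the doc ID space.
interface Searcher {
    // Hypothetical helper: collect up to k hits among docs in [fromDocId, toDocId).
    List<String> search(int fromDocId, int toDocId, int k);
}

class ReverseSortWindows {
    static List<String> topKDescending(Searcher searcher, int maxDoc, int k) {
        for (int window = k; ; window *= 2) {
            int from = Math.max(0, maxDoc - window);
            // Query only the tail of the doc ID space, which holds the
            // largest timestamps under the ascending index sort.
            List<String> hits = searcher.search(from, maxDoc, k);
            if (hits.size() >= k || from == 0) {
                return hits; // enough hits collected, or whole index covered
            }
            // Otherwise grow the window exponentially and retry.
        }
    }
}
```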
Thanks @jpountz for looking at this!
> Lazily heapifying sounds interesting, and thanks for sharing performance numbers when data occurs in random order. Do you also have performance numbers for the case when the index sort is the opposite order compared to the query sort? I'm curious how much this optimization can save in that case since this is what you're trying to optimize.
For this, I have inserted 36 million documents with the schema below, keeping the `@timestamp` long field almost in `asc` order.
```json
{
  "@timestamp": 898459201,
  "clientip": "211.11.9.0",
  "request": "GET /english/index.html HTTP/1.0",
  "status": 304,
  "size": 0
}
```
Below are the results (units in ms) when I query `top(K)` in descending order, for K = 5, 100, 1000.
Before my changes:

| top(K) | pq time | total time |
|---|---|---|
| 5 hits | 4 | 58 |
| 100 hits | 70 | 130 |
| 1000 hits | 95 | 197 |
After my changes:

| top(K) | pq time | total time |
|---|---|---|
| 5 hits | 0 | 47 |
| 100 hits | 1 | 59 |
| 1000 hits | 2 | 101 |
The higher the value of K, the more time heapification takes.
> You might also be interested in checking out this paper where Tencent describes optimizations that they made for a similar problem in section 4.5.2: they configure an index sort by ascending timestamp on their data, but still want to be able to perform both queries by ascending timestamp and descending timestamp. To handle the case when the index sort and the query sort are opposite, they query on exponentially growing windows of documents that match the end of the doc ID space.
Yeah, I have read this paper :) The main bottleneck I see there is that they use an index sort on the timestamp field, which leads to higher indexing latencies; that's a trade-off. I like our current implementation without an index sort, using BKD points to skip non-competitive hits. If we add this lazy heapification as proposed here, it will give a good advantage on top of the current implementation itself.
I have updated the original issue description with the performance gain from this optimization.
Thinking a bit more about this optimization, I wonder if it would still work well under concurrent indexing. If I understand the optimization correctly, it relies on the fact that the n-th collected document would generally have a more competitive value than the (n-k)-th collected document to keep inserting into the circular buffer. But this wouldn't be true, e.g. under concurrent indexing if flushing segments that have (k+1) docs or more?
For instance, assume two indexing threads that index 10 documents each between two consecutive refreshes. The first segment could have timestamps 0, 2, 4, ..., 18 and the second segment could have timestamps 1, 3, 5, ..., 19. Then when they get merged, this would create a segment whose timestamps would be 0, 2, 4, ..., 18, 1, 3, 5, ..., 19. Now if you collect the top-5 hits by descending timestamp, the optimization would automatically disable itself when it has timestamps [10, 12, 14, 16, 18] in the queue and sees timestamp 1, since 1 < 10?
> Thinking a bit more about this optimization, I wonder if it would still work well under concurrent indexing. If I understand the optimization correctly, it relies on the fact that the n-th collected document would generally have a more competitive value than the (n-k)-th collected document to keep inserting into the circular buffer. But this wouldn't be true, e.g. under concurrent indexing if flushing segments that have (k+1) docs or more?
>
> For instance, assume two indexing threads that index 10 documents each between two consecutive refreshes. The first segment could have timestamps 0, 2, 4, ..., 18 and the second segment could have timestamps 1, 3, 5, ..., 19. Then when they get merged, this would create a segment whose timestamps would be 0, 2, 4, ..., 18, 1, 3, 5, ..., 19. Now if you collect the top-5 hits by descending timestamp, the optimization would automatically disable itself when it has timestamps [10, 12, 14, 16, 18] in the queue and sees timestamp 1, since 1 < 10?
Yes, this won't optimize the scenario where a concurrent flush is invoked with very few documents, say 10 documents per flush. Reading this article on concurrent flush (it might be an old article without the latest details on concurrent flushing, and I hope this is what you mean by concurrent indexing), it looks like a single flush would still contain thousands of documents if we are talking about millions. This optimization would still work very well as long as the number of documents per single flush is high and top(K) is small. I.e., if a single flush has 1000 documents and we need the top(100) in descending order, we will still be able to skip 900 documents from going through the heapification process.
> For instance, assume two indexing threads that index 10 documents each between two consecutive refreshes. The first segment could have timestamps 0, 2, 4, ..., 18 and the second segment could have timestamps 1, 3, 5, ..., 19. Then when they get merged, this would create a segment whose timestamps would be 0, 2, 4, ..., 18, 1, 3, 5, ..., 19. Now if you collect the top-5 hits by descending timestamp, the optimization would automatically disable itself when it has timestamps [10, 12, 14, 16, 18] in the queue and sees timestamp 1, since 1 < 10?
Like in the example you have given, let's modify it slightly:

> assume two indexing threads that index 100 documents each between two consecutive refreshes. The first segment could have timestamps 0, 2, 4, ..., 198 and the second segment could have timestamps 1, 3, 5, ..., 199. Then when they get merged, this would create a segment whose timestamps would be 0, 2, 4, ..., 198, 1, 3, 5, ..., 199. Now if you collect the top-5 hits by descending timestamp
In this case, the circular buffer absorbs the entire first run 0, 2, ..., 198 without any heapification: each new timestamp is larger than the previous one, so it is appended and the oldest entry evicted, leaving [190, 192, 194, 196, 198] in the buffer after 100 documents. Heapification only kicks in once the order breaks at timestamp 1: the five buffered values are pushed into the queue, the hits 1, 3, ..., 189 are non-competitive against the queue minimum of 190, and only 191, 193, ..., 199 still go through heap insertion.

So we ended up skipping 190 hits out of 200 from going through the heapification process.
Just for the explanation I didn't take the skipping logic into consideration; otherwise, at that point, all the hits from 1 to 189 would be marked non-competitive by the BKD-point-based competitive iterator.
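To make the arithmetic concrete, here is a small standalone simulation of the buffering behavior on exactly this merged-segment example. It is a simplification of the proposal (buffering is abandoned permanently once the order breaks), not the actual POC code:

```java
import java.util.ArrayDeque;
import java.util.PriorityQueue;

// Simulates top-5 by descending timestamp over the merged segment
// 0,2,...,198 followed by 1,3,...,199, counting heap insertions.
public class LazyHeapifySim {
    public static void main(String[] args) {
        int k = 5;
        long[] timestamps = new long[200];
        for (int i = 0; i < 100; i++) timestamps[i] = 2L * i;           // 0,2,...,198
        for (int i = 0; i < 100; i++) timestamps[100 + i] = 2L * i + 1; // 1,3,...,199

        ArrayDeque<Long> buffer = new ArrayDeque<>(k);   // circular buffer of size k
        PriorityQueue<Long> pq = new PriorityQueue<>(k); // min-heap holding the top-k
        boolean buffering = true;
        int heapified = 0;

        for (long t : timestamps) {
            if (buffering) {
                if (buffer.isEmpty() || t >= buffer.peekLast()) {
                    if (buffer.size() == k) buffer.pollFirst(); // evict oldest
                    buffer.addLast(t);                          // no heap work
                    continue;
                }
                // Order broke: heapify the buffered elements, then fall back.
                for (long b : buffer) { pq.offer(b); heapified++; }
                buffering = false;
            }
            if (pq.size() < k) { pq.offer(t); heapified++; }
            else if (t > pq.peek()) { pq.poll(); pq.offer(t); heapified++; }
            // else: non-competitive, skips heapification entirely
        }
        if (buffering) { for (long b : buffer) { pq.offer(b); heapified++; } }

        // Prints: heapified 10 of 200  -> 190 hits skipped heapification
        System.out.println("heapified " + heapified + " of " + timestamps.length);
    }
}
```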
Thank you for reading this long explanation; I hope it clears up your doubt.
I know it might not look like a very clean solution, but descending sort order on a timestamp field, especially for logs/metrics scenarios, is very common, and this gives a sizable improvement to those queries.
@gashutos - Slightly orthogonal to the proposal: given that the optimization is primarily based on the order of the documents in the segment, would it make sense to reverse the problem? Could the documents be sorted and stored (based on input from the user), so that at retrieval time we can skip documents without heapifying (or read in reverse order), since the order is then guaranteed? What do you think?
@backslasht The overhead of index sort is very high. I ingested the above 36 million documents with and without an index sort on `@timestamp`, and the difference is at least 20%. Refer to the numbers below.

Without index sort on `@timestamp`:
| Metric | Task | Value | Unit |
|---|---|---|---|
| Min Throughput | index-append | 182029 | docs/s |
| Mean Throughput | index-append | 197051 | docs/s |
| Median Throughput | index-append | 195665 | docs/s |
| Max Throughput | index-append | 210468 | docs/s |
| 50th percentile latency | index-append | 165.423 | ms |
| 90th percentile latency | index-append | 231.446 | ms |
| 99th percentile latency | index-append | 908.578 | ms |
| 99.9th percentile latency | index-append | 8934.86 | ms |
| 99.99th percentile latency | index-append | 10348.3 | ms |
| 100th percentile latency | index-append | 10806.6 | ms |
With index sort on `@timestamp` in ascending order:
| Metric | Task | Value | Unit |
|---|---|---|---|
| Min Throughput | index-append | 141237 | docs/s |
| Mean Throughput | index-append | 149861 | docs/s |
| Median Throughput | index-append | 146907 | docs/s |
| Max Throughput | index-append | 167086 | docs/s |
| 50th percentile latency | index-append | 210.367 | ms |
| 90th percentile latency | index-append | 315.659 | ms |
| 99th percentile latency | index-append | 1458.79 | ms |
| 99.9th percentile latency | index-append | 10476.3 | ms |
| 99.99th percentile latency | index-append | 10963.3 | ms |
| 100th percentile latency | index-append | 10994.6 | ms |
## Problem statement
Currently in `TopFieldCollector`, we use a PriorityQueue (a binary min-heap implementation) to find the top-K elements in `asc` or `desc` order. Whenever a newly iterated document is competitive, we insert it into the PriorityQueue, and if the queue is full, we remove the `min` competitive document from the PriorityQueue and replace it with the newly competitive one. The time complexity for this scenario is `O(n log(K))` in the worst case, where `n` is the current number of competitive documents. Now, what if `n` is pretty high and every document we try to insert into the PriorityQueue takes exactly `O(log(K))`? We would end up spending a lot of time in the PriorityQueue heapification process itself. We have skipping logic to reduce the iterator cost `n` whenever we update the PriorityQueue, but it doesn't work well in some scenarios.
For example, consider time-series logs/metrics data where the sort query is performed on a `timestamp` field. This field is almost always ever-increasing (nearly sorted), and it has very high cardinality as well! For `desc` order queries, the skipping logic doesn't optimize much, and the PriorityQueue will take `O(log(K))` for almost every insertion because incoming documents arrive in increasing order.

Imagine a scenario where we insert logical timestamps in increasing order 1, 2, 3, 4, ....: the insertion of 16 will, in the worst case, sift all the way down to a leaf node, and the same is true for 17, 18, 19, ..... (Imagine we need the top 15 hits, so the PQ size is 15.)

In this scenario the skipping logic doesn't help to find the top K in `desc` order either. But we can do better on the priority queue side: since we know incoming documents arrive in `asc` order (nearly sorted), we can skip heapifying them.
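To make the cost concrete, here is a tiny self-contained illustration of that worst case, using `java.util.PriorityQueue` as a stand-in for Lucene's priority queue (the bounded top-K loop below is illustrative, not Lucene's actual code):

```java
import java.util.PriorityQueue;

// With ascending input and a bounded min-heap of size K, every document
// beyond the first K is competitive, evicts the current minimum, and
// sifts all the way down -- the O(log(K)) worst case on every insertion.
public class AscendingWorstCase {
    public static void main(String[] args) {
        int k = 15;
        PriorityQueue<Integer> pq = new PriorityQueue<>(k);
        int heapUpdates = 0;
        for (int timestamp = 1; timestamp <= 1_000_000; timestamp++) {
            if (pq.size() < k) {
                pq.offer(timestamp);
            } else if (timestamp > pq.peek()) { // always true for ascending input
                pq.poll();                      // sift-down, every single time
                pq.offer(timestamp);
                heapUpdates++;
            }
        }
        // Every document beyond the first 15 paid the heap cost: 999985.
        System.out.println("heap updates: " + heapUpdates);
    }
}
```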
## Solution proposed

The crux of the proposed solution is: **we don't need to heapify elements if the incoming order is already sorted in the desired order**. The idea is to keep incoming competitive documents in a temporary circular array before inserting them into the priority queue and triggering heapification.

## Algorithm
- Maintain a circular array of size K: insert elements at `last` and remove elements from `first` in this circular array implementation.
- As competitive documents are found in `TopFieldCollector.checkThreshold()`, keep inserting them into the circular array until the order breaks, i.e. until the next document is smaller for a descending-order traversal (or vice versa for ascending order). Remove the first element from the circular array if the queue is full while inserting.
- Once the order breaks, heapify the buffered documents into the PriorityQueue and continue with regular insertion.

With this, we will be able to skip millions of hits from going through the heapification process; see the sketch below.
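A minimal sketch of this collect path for a descending sort (a hypothetical class, not the POC code; the POC below uses a LinkedList rather than a circular array, and buffering here is abandoned permanently once the order breaks):

```java
import java.util.PriorityQueue;

// Buffers values while they arrive in ascending order; heapification
// happens only once the order breaks.
final class LazyTopK {
    private final long[] buffer;          // circular array of size K
    private int first = 0, size = 0;      // remove at `first`, insert at `last`
    private final PriorityQueue<Long> pq; // min-heap holding the top-K
    private boolean buffering = true;

    LazyTopK(int k) {
        buffer = new long[k];
        pq = new PriorityQueue<>(k);
    }

    void collect(long value) {
        if (buffering) {
            if (size == 0 || value >= last()) {          // order still ascending
                if (size == buffer.length) {
                    first = (first + 1) % buffer.length; // evict least competitive
                    size--;
                }
                buffer[(first + size) % buffer.length] = value; // insert at `last`
                size++;
                return;                                  // no heapification here
            }
            drainBufferIntoHeap();                       // order broke: heapify once
        }
        // Regular top-K insertion once lazy buffering is abandoned.
        if (pq.size() < buffer.length) {
            pq.offer(value);
        } else if (value > pq.peek()) {
            pq.poll();
            pq.offer(value);
        }
    }

    // Call after collection; buffered values may still hold the top-K.
    PriorityQueue<Long> finish() {
        if (buffering) drainBufferIntoHeap();
        return pq;
    }

    private long last() {
        return buffer[(first + size - 1) % buffer.length];
    }

    private void drainBufferIntoHeap() {
        for (int i = 0; i < size; i++) {
            pq.offer(buffer[(first + i) % buffer.length]);
        }
        size = 0;
        buffering = false;
    }
}
```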
POC: POC code. I implemented the above POC for reference with a LinkedList instead of a circular array. It also needs some additional work to handle cases like duplicate elements, where doc ID is given preference (but those should be minor changes).
## Caveat of this implementation

The solution proposed above works pretty well for ever-increasing or ever-decreasing ordered fields like time-series data. But it could add overhead if the data is completely random, since it incurs an additional comparison to add each element to the circular array before inserting it into the actual heap.
We have two ways to implement this:

1. Enable the optimization unconditionally for all sorted queries.
2. Guard it behind a `boolean` flag that can be set `true`/`false` at the collector level, similar to what we have for the `PointBasedOptimization` flag.

Approach 2 sounds very safe, but it adds configuration overhead for users. Approach 1 can introduce overhead, but with random data the skipping logic based on BKD points works pretty well, so there won't be many insertions into the priority queue and we don't see much overhead.
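If approach 2 is chosen, the wiring could look roughly like this; `TopFieldCollector.create(...)` and the `Sort`/`SortField` usage are real Lucene APIs, but `setLazyHeapification` is a made-up setter for illustration only:

```java
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopFieldCollector;

// Descending sort on the timestamp field, top 1000 hits.
Sort sort = new Sort(new SortField("@timestamp", SortField.Type.LONG, true));
TopFieldCollector collector = TopFieldCollector.create(sort, 1000, 1000);
// collector.setLazyHeapification(true); // hypothetical opt-in, default false
```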
I have tested this with 13 million documents in random order on a numeric field (long), and with approach 1 we didn't see any regression in latency, since the skipping logic already skipped many hits. We don't have benchmarks for numeric sort in Lucene itself, but I indexed a LongPoint in random order with 13 million points/documents: I see a latency of `140 ms` for the plain sort versus `147 ms` with the additional circular queue, on `r6g.2xlarge` EC2 instances (for finding the desc-order top 1000).

## Suggestions from community
I would like to hear suggestions and thoughts from the community on the proposed change and better ways to implement it. The time-series `desc` order use case is pretty widely used, so this could cover a large slice of our users and give them more optimized sort queries on such workloads.

## Edit
### Performance gain with this optimization

For this, I have inserted 36 million documents with the schema below, keeping the `@timestamp` long field almost in `asc` order.

```json
{
  "@timestamp": 898459201,
  "clientip": "211.11.9.0",
  "request": "GET /english/index.html HTTP/1.0",
  "status": 304,
  "size": 0
}
```

Below are the results (units in ms) when I query `top(K)` in descending order, for K = 5, 100, 1000.

Before my changes:

| top(K) | pq time | total time |
|---|---|---|
| 5 hits | 4 | 58 |
| 100 hits | 70 | 130 |
| 1000 hits | 95 | 197 |

After my changes:

| top(K) | pq time | total time |
|---|---|---|
| 5 hits | 0 | 47 |
| 100 hits | 1 | 59 |
| 1000 hits | 2 | 101 |

The higher the value of K, the more time heapification takes.