Closed rcrezende closed 6 years ago
The docs indicate that this should be accurate for small data sets. (Also, the compression
parameter doesn't seem to be accepted by this agg).
@polyfractal any ideas what is going on here?
Related to #10216
Yes, this appears (in part) to be related to #10216. ES v1.6 is still on T-Digest 3.0, which interpolates the last centroid in a non-ideal manner. Effectively this means the upper end of the ranks can become heavily skewed. See the linked bug-fix in #10216 for more details.
Regarding the lower end, I think this may just be the nature of the approximation. My understanding is that T-Digest maintains a list of weighted centroids, and computes ranks by finding the two closest centroids for a given rank and performing a weighted interpolation between them.
In the case of the lower bound, there is no centroid "on the left" of the lowest value. So T-Digest approximates the interpolation by taking half the distance to its nearest neighbor on the right and applying that same delta to the left as a best guess.
In this example, you have a centroid at 2.7 and one at 11.9, so the half-distance is (11.9 - 2.7) / 2 == 4.6. This value is then used on the "left" of the lowest centroid, meaning T-Digest will continue to interpolate as low as 2.7 - 4.6 == -1.9:
"percent_rank": {
"values": {
"-2.0": 0,
"-1.9": 0,
"-1.8": 0.27173913043478104,
"-1.0": 2.445652173913042,
"0.0": 5.163043478260868,
"1.0": 7.8804347826086945,
"2.0": 10.597826086956522
}
}
Which is obviously not ideal :(
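To make the arithmetic concrete, here is a tiny standalone sketch (plain Java, not the actual T-Digest code) that just walks through the numbers above:

// Purely illustrative arithmetic: the half-distance to the right neighbor is
// applied on the left of the lowest centroid as a best guess.
public class LeftEdgeExtrapolation {
    public static void main(String[] args) {
        double lowestCentroid = 2.7;   // smallest centroid in the example
        double rightNeighbor = 11.9;   // its nearest neighbor on the right

        double delta = (rightNeighbor - lowestCentroid) / 2;   // ~4.6
        double effectiveLowerBound = lowestCentroid - delta;   // ~-1.9

        // Queries between -1.9 and 2.7 therefore receive small non-zero percent
        // ranks even though no indexed value is anywhere near them.
        System.out.println("delta ~ " + delta);
        System.out.println("interpolation reaches down to ~ " + effectiveLowerBound);
    }
}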
@jpountz I wonder if we (ES or T-Digest) could track the min/max of the entire range and simply cut off interpolation at the extreme ends? The CDF would no longer be smooth, but it would behave in a more logical manner. Alternatively, I suppose you could just linearly interpolate to the min/max instead of "guessing" with the nearest centroid's delta.
But it makes me wonder... if the lowest centroid has a weight of one (e.g. it represents the true lowest value), why does the algorithm try to interpolate to the "left" at all? Shouldn't it start interpolation from the centroid itself and go up?
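For what it's worth, a minimal sketch of the cutoff idea, using a hypothetical wrapper around whatever produces the raw interpolated rank (none of these names exist in ES or T-Digest):

// Hypothetical wrapper, for illustration only: track the true min/max next to
// the sketch and clamp the interpolated percent rank at the observed extremes.
public class ClampedPercentRanks {
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    public void record(double value) {
        min = Math.min(min, value);
        max = Math.max(max, value);
    }

    public double clamp(double value, double rawPercentRank) {
        if (value < min) {
            return 0.0;    // no indexed value can rank below the true minimum
        }
        if (value >= max) {
            return 100.0;  // every indexed value is <= the true maximum
        }
        return rawPercentRank;  // inside the observed range, keep the estimate
    }
}

The CDF stops being smooth at the two clamp points, as noted above, but a query below the minimum could no longer receive a non-zero rank.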
Finally, would it make sense to implement a "++" version of this, à la HyperLogLog++? We could maintain a list of raw values under a certain threshold. If there are fewer than n values, the list is sorted and used to calculate percentiles/ranks exactly. If it breaches n, the array is replayed into a T-Digest sketch and everything proceeds as normal.
/cc @colings86 since he's worked with all the percentile ranks stuff a bunch too
When you say "ES v1.6 is still on T-Digest 3.0", does that imply there is an alternative implementation in 2.0 from which we might get better precision?
Ah, sorry, that was poor phrasing. ES 2.0 is also still on T-Digest 3.0 (as per https://github.com/elastic/elasticsearch/issues/10216). I'm not sure of the current status; it looks like there are still test failures due to the upstream bugfix.
Just as a note: in this case, the accuracy should improve if you put some more data into the percentiles. It is providing a poor estimation because there are only a few centroids (due to only having a handful of docs), which skews the T-Digest interpolation on both the high and low ends.
E.g. just adding a few more docs improves the ranks significantly:
POST percenttest/doc/_bulk
{"index":{}}
{"i":2.7}
{"index":{}}
{"i":85.9}
{"index":{}}
{"i":11.9}
{"index":{}}
{"i":51.5}
{"index":{}}
{"i":3}
{"index":{}}
{"i":15}
{"index":{}}
{"i":66.8}
{"index":{}}
{"i":32.1}
{"index":{}}
{"i":54.2}
{"index":{}}
{"i":71.3}
{"index":{}}
{"i":8.8}
{"index":{}}
{"i":72.9}
{
  "took": 4,
  "timed_out": false,
  "_shards": {...},
  "hits": {...},
  "aggregations": {
    "percent_rank": {
      "values": {
        "1.0": 0,
        "2.0": 0,
        "3.0": 8.743169398907105,
        "4.0": 11.475409836065573,
        "11.0": 26.747311827956988,
        "12.0": 29.435483870967744,
        "50.0": 56.18401206636501,
        "52.0": 57.6923076923077,
        "70.0": 77.59562841530055,
        "85.0": 86.92307692307692,
        "86.0": 87.56410256410257,
        "87.0": 88.2051282051282
      }
    }
  }
}
Regarding the suggestion that accuracy improves with more data: since any aggregation can be the result of a filter, I'd still prefer to obtain a confidence interval.
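For illustration, one naive way to attach an interval would be to treat the reported rank as a binomial proportion over the matching doc count and compute, say, a Wilson score interval. This ignores the T-Digest approximation error itself, so it's only a sketch of the idea, not a concrete proposal:

// Illustration only: Wilson score interval around a percent rank, treating it
// as a sample proportion over n docs. The digest's own approximation error is
// not modeled here.
public class RankInterval {
    /** Returns {low%, high%} for an observed percent rank over n docs. */
    public static double[] wilson(double percentRank, long n, double z) {
        double p = percentRank / 100.0;
        double z2 = z * z;
        double center = (p + z2 / (2.0 * n)) / (1 + z2 / n);
        double half = (z / (1 + z2 / n)) * Math.sqrt(p * (1 - p) / n + z2 / (4.0 * n * n));
        return new double[] { 100.0 * (center - half), 100.0 * (center + half) };
    }

    public static void main(String[] args) {
        // Hypothetical numbers: a rank of 10.6% computed over only 8 docs
        // comes with a very wide 95% interval.
        double[] ci = wilson(10.6, 8, 1.96);
        System.out.printf("95%% interval: [%.1f%%, %.1f%%]%n", ci[0], ci[1]);
    }
}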
Another alternative is to use the same approach OpenJDK uses for sorting arrays: a meta-algorithm. Instead of always using the same dual-pivot algorithm, it has thresholds on the array size to decide when to use insertion sort, quicksort, or counting sort. See here: http://www.docjar.com/html/api/java/util/DualPivotQuicksort.java.html
So maybe there is a threshold T such that, for sequences shorter than T, the approximation algorithm is neither worth it (precision is bad) nor efficient (time).
Here is another example with more data where the bug still happens.
Note that the percentile rank for 90 is greater than the one for 95, which is wrong (see the quick monotonicity check after the response below).
PUT percenttest_2
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
POST percenttest_2/doc/0
{"i": 47.600000}
POST percenttest_2/doc/1
{"i": 29.600000}
POST percenttest_2/doc/2
{"i": 44.700000}
POST percenttest_2/doc/3
{"i": 41.200000}
POST percenttest_2/doc/4
{"i": 60.900000}
POST percenttest_2/doc/5
{"i": 45.200000}
POST percenttest_2/doc/6
{"i": 49.700000}
POST percenttest_2/doc/7
{"i": 49.800000}
POST percenttest_2/doc/8
{"i": 59.400000}
POST percenttest_2/doc/9
{"i": 78.200000}
POST percenttest_2/doc/10
{"i": 100.000000}
POST percenttest_2/doc/11
{"i": 58.300000}
POST percenttest_2/doc/12
{"i": 54.200000}
POST percenttest_2/doc/13
{"i": 56.600000}
POST percenttest_2/doc/14
{"i": 51.900000}
POST percenttest_2/doc/15
{"i": 72.000000}
POST percenttest_2/doc/16
{"i": 87.500000}
POST percenttest_2/doc/17
{"i": 48.500000}
POST percenttest_2/doc/18
{"i": 58.300000}
POST percenttest_2/doc/19
{"i": 60.000000}
POST percenttest_2/doc/20
{"i": 48.200000}
POST percenttest_2/doc/21
{"i": 40.400000}
POST percenttest_2/doc/22
{"i": 33.200000}
POST percenttest_2/doc/23
{"i": 49.000000}
POST percenttest_2/doc/24
{"i": 35.700000}
POST percenttest_2/doc/25
{"i": 39.600000}
POST percenttest_2/doc/26
{"i": 39.600000}
POST percenttest_2/doc/27
{"i": 44.900000}
POST percenttest_2/doc/28
{"i": 39.000000}
POST percenttest_2/doc/29
{"i": 54.200000}
POST percenttest_2/doc/30
{"i": 61.000000}
POST percenttest_2/doc/31
{"i": 60.200000}
POST percenttest_2/doc/32
{"i": 36.700000}
POST percenttest_2/doc/33
{"i": 40.200000}
POST percenttest_2/doc/34
{"i": 58.600000}
POST percenttest_2/doc/35
{"i": 41.200000}
POST percenttest_2/doc/36
{"i": 27.600000}
POST percenttest_2/doc/37
{"i": 35.900000}
POST percenttest_2/doc/38
{"i": 42.700000}
POST percenttest_2/doc/39
{"i": 72.700000}
POST percenttest_2/doc/40
{"i": 40.700000}
POST percenttest_2/doc/41
{"i": 50.500000}
POST percenttest_2/doc/42
{"i": 53.200000}
POST percenttest_2/doc/43
{"i": 100.000000}
POST percenttest_2/doc/44
{"i": 42.700000}
POST percenttest_2/doc/45
{"i": 33.100000}
POST percenttest_2/doc/46
{"i": 45.400000}
GET percenttest_2/_search?search_type=count
{
  "aggs": {
    "percent_rank": {
      "percentile_ranks": {
        "field": "i",
        "values": [90, 95, 100]
      }
    }
  }
}
Returns
{
  "took": 47,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 47,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "percent_rank": {
      "values": {
        "90.0": 95.01268787819637,
        "95.0": 0,
        "100.0": 100
      }
    }
  }
}
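As a sanity check on the response above: an exact computation of percent ranks can never produce rank(90) > rank(95), since the set of values <= 90 is always a subset of the values <= 95. A quick standalone check (plain Java, using a short subset of the indexed values above):

import java.util.Arrays;

// Reference check, not part of ES: the exact percent rank is the fraction of
// values <= x, so it is non-decreasing in x. rank(90) > rank(95) can only
// come from the approximation.
public class ExactRankCheck {
    static double percentRank(double[] values, double x) {
        long count = Arrays.stream(values).filter(v -> v <= x).count();
        return 100.0 * count / values.length;
    }

    public static void main(String[] args) {
        double[] values = { 27.6, 39.6, 48.5, 58.3, 72.0, 87.5, 100.0 };
        System.out.println(percentRank(values, 90));  // 6/7 => ~85.7
        System.out.println(percentRank(values, 95));  // 6/7 => ~85.7, never smaller
    }
}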
Tested this locally; it has since been fixed, probably when we upgraded to T-Digest 3.2: #28305
I'm getting unexpected results from the percentile ranks aggregation:
Request:
Response:
But there is no element < 2, yet I'm getting 10.6% for that value. Also, 100% of the values are < 87, but I'm getting 65.3%.
Using ES 1.6