elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Wrong Percentile Rank calculation #14851

Closed rcrezende closed 6 years ago

rcrezende commented 8 years ago

I'm getting unexpected results with percentile rank statistic:

Request:

PUT percenttest
{
  "settings": {
    "number_of_shards": 1, 
    "number_of_replicas": 0
  }
}
POST percenttest/doc/1
{
  "i": 2.7
}
POST percenttest/doc/2
{
  "i": 85.9
}
POST percenttest/doc/3
{
  "i": 11.9
}
POST percenttest/doc/4
{
  "i": 51.5
}
GET percenttest/_search?search_type=count
{
  "aggs": {
    "percent_rank": {
      "percentile_ranks": {
        "field": "i",
        "values": [
          1,2,3,4,11,12,50,52,70,85,86,87
        ]
      }
    }
  }
}

Response:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 4,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "percent_rank": {
         "values": {
            "1.0": 7.8804347826086945,
            "2.0": 10.597826086956522,
            "3.0": 13.31521739130435,
            "4.0": 16.032608695652172,
            "11.0": 28.790983606557376,
            "12.0": 29.815573770491806,
            "50.0": 36.40988372093024,
            "52.0": 37.86337209302326,
            "70.0": 50.94476744186046,
            "85.0": 61.84593023255813,
            "86.0": 62.57267441860465,
            "87.0": 63.29941860465116
         }
      }
   }
}

But there is no element < 2, yet I'm getting 10.6% for that. Also, 100% of the values are < 87, but I'm getting 63.3%.

Using ES 1.6
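
For reference, here is a minimal sketch (plain Python, not part of ES) of the exact ranks I would expect for this small data set, using the simple definition rank(v) = 100 * count(x < v) / N:

# Exact percentile ranks for a tiny in-memory data set; the strict "<"
# definition here is just for illustration, not necessarily how ES defines it.
def exact_percentile_rank(data, v):
    return 100.0 * sum(1 for x in data if x < v) / len(data)

data = [2.7, 85.9, 11.9, 51.5]
for v in [1, 2, 3, 4, 11, 12, 50, 52, 70, 85, 86, 87]:
    print(v, exact_percentile_rank(data, v))
# rank(2) == 0.0 (no element is < 2) and rank(87) == 100.0 (all four are < 87)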

clintongormley commented 8 years ago

The docs indicate that this should be accurate for small data sets. (Also, the compression parameter doesn't seem to be accepted by this agg).

@polyfractal any ideas what is going on here?

clintongormley commented 8 years ago

Related to #10216

polyfractal commented 8 years ago

Yes, this appears (in part) to be related to #10216. ES v1.6 is still on T-Digest 3.0, which interpolates the last centroid in a non-ideal manner. Effectively this means the upper end of the ranks can become heavily skewed. See the linked bug-fix in #10216 for more details.

Regarding the lower end, I think this may just be the nature of the approximation. My understanding is that T-Digest maintains a list of weighted centroids, and computes the rank of a value by finding the two centroids closest to that value and performing a weighted interpolation between them.

In the case of the lower bound, there is no centroid "on the left" of the lowest value. So T-Digest approximates this interpolation by taking half the distance between the lowest centroid and its nearest neighbor on the right, and then applying that same delta to the left as a best guess.

In this example, you have a centroid at 2.7 and one at 11.9, so the half-distance is (11.9 - 2.7) / 2 == 4.6. This delta is then applied to the "left" of the lowest centroid, meaning T-Digest will continue to interpolate as low as 2.7 - 4.6 = -1.9:

"percent_rank": {
  "values": {
    "-2.0": 0,
    "-1.9": 0,
    "-1.8": 0.27173913043478104,
    "-1.0": 2.445652173913042,
    "0.0": 5.163043478260868,
    "1.0": 7.8804347826086945,
    "2.0": 10.597826086956522
  }
}

Which is obviously not ideal :(
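
To make this concrete, here is a toy, standalone sketch (not the real T-Digest code) of that left-extension behaviour, assuming singleton centroids at the four data points and a synthetic left edge one half-distance below the lowest centroid:

# Toy model of the low-end behaviour described above, NOT the real T-Digest.
# Centroids are the raw points (weight 1 each); ranks are interpolated
# linearly between them, and the lowest centroid is extended to the left by
# half the distance to its right neighbour. The high-end skew is not modelled.
def toy_rank(centroids, v):
    n = len(centroids)
    left_pad = (centroids[1] - centroids[0]) / 2.0   # 4.6 for this data
    lo = centroids[0] - left_pad                     # -1.9 for this data
    if v <= lo:
        return 0.0
    if v >= centroids[-1]:
        return 100.0
    points = [lo] + centroids
    ranks = [0.0] + [100.0 * (i + 0.5) / n for i in range(n)]
    for i in range(len(points) - 1):
        if points[i] <= v <= points[i + 1]:
            frac = (v - points[i]) / (points[i + 1] - points[i])
            return ranks[i] + frac * (ranks[i + 1] - ranks[i])

centroids = sorted([2.7, 85.9, 11.9, 51.5])
print(toy_rank(centroids, 2.0))    # ~10.6 even though nothing is below 2
print(toy_rank(centroids, -1.0))   # ~2.4, matching the "-1.0" entry above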

@jpountz I wonder if we (ES or T-Digest) could track the min/max of the entire range and simply cut off interpolation at the extreme ends? The CDF would no longer be smooth, but it would behave in a more logical manner. Alternatively, I suppose you could just linearly interpolate to the min/max instead of "guessing" with the nearest centroid's delta.
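
As a rough illustration of the min/max idea (a sketch only, not a real T-Digest or ES API; the wrapped digest and its add()/rank() methods are placeholders):

# Hypothetical wrapper, not part of T-Digest or ES: record the observed
# min/max and cut interpolation off at the extremes of the real data.
class ClampedRanks:
    def __init__(self, digest):
        self.digest = digest        # any sketch exposing add(x) and rank(v)
        self.lo = float("inf")
        self.hi = float("-inf")

    def add(self, x):
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)
        self.digest.add(x)

    def rank(self, v):
        if v < self.lo:
            return 0.0              # nothing can be below the observed minimum
        if v > self.hi:
            return 100.0            # everything is below a value above the maximum
        return self.digest.rank(v)  # otherwise defer to the underlying sketch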

But it makes me wonder: if the lowest centroid has a weight of one (i.e. it represents the true lowest value), why does the algorithm try to interpolate to the "left" at all? Shouldn't interpolation start from the centroid itself and go up?

Finally, would it make sense to implement a "++" version of this, à la HyperLogLog++? We could maintain a list of raw values up to a certain threshold. If there are fewer than n values, the list is sorted and used to calculate exact percentiles/ranks. If it breaches n, the list is replayed into a T-Digest sketch and everything proceeds as normal.
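
A rough sketch of that "++" idea (illustrative only; the threshold, the exact-rank definition, and the wrapped digest are placeholders, not anything that exists today):

# Sketch of a hybrid exact/approximate percentile-rank structure, not a real
# implementation: keep raw values until a threshold is crossed, then replay
# them into the approximate sketch and use it from there on.
class HybridRanks:
    def __init__(self, digest, threshold=1000):
        self.digest = digest
        self.threshold = threshold
        self.values = []                # exact values while the data set is small

    def add(self, x):
        if self.values is not None:
            self.values.append(x)
            if len(self.values) >= self.threshold:
                for v in self.values:   # breached: replay everything into the digest
                    self.digest.add(v)
                self.values = None      # from now on, go straight to the digest
        else:
            self.digest.add(x)

    def rank(self, v):
        if self.values is not None:     # small data set: exact answer
            if not self.values:
                return 0.0
            return 100.0 * sum(1 for x in self.values if x < v) / len(self.values)
        return self.digest.rank(v)      # large data set: approximate answer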

/cc @colings86 since he's worked with all the percentile ranks stuff a bunch too

rcrezende commented 8 years ago

When you say "ES v1.6 is still on T-Digest 3.0", does that imply there is a newer implementation in 2.0 from which we might get better precision?

polyfractal commented 8 years ago

Ah, sorry, that was poor phrasing. ES 2.0 is also still on T-Digest 3.0 (as per https://github.com/elastic/elasticsearch/issues/10216). I'm not sure of the status; it looks like there are still test failures due to the upstream bugfix.

polyfractal commented 8 years ago

Just as a note: in this case, the accuracy should improve if you put some more data into the percentiles. It is providing a poor estimation because there are only a few centroids (due to only having a handful of docs), which skews the T-Digest interpolation on both the high and low end.

E.g. just adding a few more docs improves the ranks significantly:

POST percenttest/doc/_bulk
{"index":{}}
{"i":2.7}
{"index":{}}
{"i":85.9}
{"index":{}}
{"i":11.9}
{"index":{}}
{"i":51.5}
{"index":{}}
{"i":3}
{"index":{}}
{"i":15}
{"index":{}}
{"i":66.8}
{"index":{}}
{"i":32.1}
{"index":{}}
{"i":54.2}
{"index":{}}
{"i":71.3}
{"index":{}}
{"i":8.8}
{"index":{}}
{"i":72.9}

Response:

{
   "took": 4,
   "timed_out": false,
   "_shards": {...},
   "hits": {...},
   "aggregations": {
      "percent_rank": {
         "values": {
            "1.0": 0,
            "2.0": 0,
            "3.0": 8.743169398907105,
            "4.0": 11.475409836065573,
            "11.0": 26.747311827956988,
            "12.0": 29.435483870967744,
            "50.0": 56.18401206636501,
            "52.0": 57.6923076923077,
            "70.0": 77.59562841530055,
            "85.0": 86.92307692307692,
            "86.0": 87.56410256410257,
            "87.0": 88.2051282051282
         }
      }
   }
}
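
For comparison, the exact ranks for these 12 values (computed outside ES with the same simple "count below" definition as earlier) show the estimate is now much closer, though still approximate:

# Exact ranks for the 12 bulk-indexed values, for comparison with the agg
# output above; this is a standalone check, not ES code.
data = [2.7, 85.9, 11.9, 51.5, 3, 15, 66.8, 32.1, 54.2, 71.3, 8.8, 72.9]
for v in [1, 2, 3, 4, 11, 12, 50, 52, 70, 85, 86, 87]:
    exact = 100.0 * sum(1 for x in data if x < v) / len(data)
    print(v, exact)
# e.g. exact rank(50) == 50.0 vs. the estimated 56.18 above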

rcrezende commented 8 years ago

Regarding the suggestion that accuracy improves with more data: since any aggregation can be the result of a filter (and may therefore run over only a few documents), I'd still prefer to get a confidence interval.

rcrezende commented 8 years ago

Another alternative is to use the same approach OpenJDK uses for sorting arrays: a meta-algorithm. Instead of always using the same dual-pivot algorithm, it has thresholds on the array size to decide when to use insertion sort, quicksort, or counting sort. See here: http://www.docjar.com/html/api/java/util/DualPivotQuicksort.java.html

So maybe there is a threshold T such that, for sequences shorter than T, the approximation algorithm is not worthwhile: precision is bad and there is no efficiency (time) gain.

rcrezende commented 8 years ago

Here is another example, with more data, where the bug still happens.

Note that the percentile rank for 90 is greater than the one for 95, which is wrong (ranks can never decrease as the value grows); see the check after the response below.

PUT percenttest_2
{
    "settings":
    {
        "number_of_shards": 1,
        "number_of_replicas": 0
    }
}
POST percenttest_2/doc/0
{"i": 47.600000}
POST percenttest_2/doc/1
{"i": 29.600000}
POST percenttest_2/doc/2
{"i": 44.700000}
POST percenttest_2/doc/3
{"i": 41.200000}
POST percenttest_2/doc/4
{"i": 60.900000}
POST percenttest_2/doc/5
{"i": 45.200000}
POST percenttest_2/doc/6
{"i": 49.700000}
POST percenttest_2/doc/7
{"i": 49.800000}
POST percenttest_2/doc/8
{"i": 59.400000}
POST percenttest_2/doc/9
{"i": 78.200000}
POST percenttest_2/doc/10
{"i": 100.000000}
POST percenttest_2/doc/11
{"i": 58.300000}
POST percenttest_2/doc/12
{"i": 54.200000}
POST percenttest_2/doc/13
{"i": 56.600000}
POST percenttest_2/doc/14
{"i": 51.900000}
POST percenttest_2/doc/15
{"i": 72.000000}
POST percenttest_2/doc/16
{"i": 87.500000}
POST percenttest_2/doc/17
{"i": 48.500000}
POST percenttest_2/doc/18
{"i": 58.300000}
POST percenttest_2/doc/19
{"i": 60.000000}
POST percenttest_2/doc/20
{"i": 48.200000}
POST percenttest_2/doc/21
{"i": 40.400000}
POST percenttest_2/doc/22
{"i": 33.200000}
POST percenttest_2/doc/23
{"i": 49.000000}
POST percenttest_2/doc/24
{"i": 35.700000}
POST percenttest_2/doc/25
{"i": 39.600000}
POST percenttest_2/doc/26
{"i": 39.600000}
POST percenttest_2/doc/27
{"i": 44.900000}
POST percenttest_2/doc/28
{"i": 39.000000}
POST percenttest_2/doc/29
{"i": 54.200000}
POST percenttest_2/doc/30
{"i": 61.000000}
POST percenttest_2/doc/31
{"i": 60.200000}
POST percenttest_2/doc/32
{"i": 36.700000}
POST percenttest_2/doc/33
{"i": 40.200000}
POST percenttest_2/doc/34
{"i": 58.600000}
POST percenttest_2/doc/35
{"i": 41.200000}
POST percenttest_2/doc/36
{"i": 27.600000}
POST percenttest_2/doc/37
{"i": 35.900000}
POST percenttest_2/doc/38
{"i": 42.700000}
POST percenttest_2/doc/39
{"i": 72.700000}
POST percenttest_2/doc/40
{"i": 40.700000}
POST percenttest_2/doc/41
{"i": 50.500000}
POST percenttest_2/doc/42
{"i": 53.200000}
POST percenttest_2/doc/43
{"i": 100.000000}
POST percenttest_2/doc/44
{"i": 42.700000}
POST percenttest_2/doc/45
{"i": 33.100000}
POST percenttest_2/doc/46
{"i": 45.400000}

GET percenttest_2/_search?search_type=count
{
  "aggs": {
    "percent_rank": {
      "percentile_ranks": {
        "field": "i",
        "values": [
          90,95,100
        ]
      }
    }
  }
}

Returns

{
   "took": 47,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 47,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "percent_rank": {
         "values": {
            "90.0": 95.01268787819637,
            "95.0": 0,
            "100.0": 100
         }
      }
   }
}
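
A quick sanity check that flags this class of bug (a sketch, not part of ES): percentile ranks must be non-decreasing as the queried value grows, and the response above violates that.

# Monotonicity check over the agg output above; the dict is copied from the
# response, not produced by this code.
values = {"90.0": 95.01268787819637, "95.0": 0, "100.0": 100}
pairs = sorted((float(k), rank) for k, rank in values.items())
is_monotonic = all(r1 <= r2 for (_, r1), (_, r2) in zip(pairs, pairs[1:]))
print(is_monotonic)  # False: rank(95) < rank(90), impossible for a true CDF
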
polyfractal commented 6 years ago

Tested this locally; this has since been fixed, probably when we upgraded to T-Digest 3.2: #28305