elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.87k stars 24.72k forks source link

Allow percentiles_bucket use interpolation #51691

Open gbarba opened 4 years ago

gbarba commented 4 years ago

Change the behavior or add a param to make the percentiles_bucket pipeline aggregation interpolate the result instead of return the nearest value.

The current behaviour is expected as it's properly documented and it was also discussed in the PR that introduced this pipeline (https://github.com/elastic/elasticsearch/pull/13186#discussion_r38412622).

But I find it annoying that the percentils metrics aggregation interpolates the value so both are not consistent.

elasticmachine commented 4 years ago

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

alpparkan commented 4 years ago

Hi, if there is an agreed solution within the team I can work on this.

alpparkan commented 4 years ago

Hi @polyfractal, I was making the related changes to calculate percentiles with interpolation but there are couple of questions I want to ask before. Should I change the default behavior or add a new parameter to use interpolation? Also, is it ok to use percentiles from Apache Commons Math? If It's ok to use Commons but you want to use Excel style interpolation, we need to update the Apache Commons Math version from 3.2 to 3.4(min) to use Excel style computation or we can implement our own.

polyfractal commented 4 years ago

Thanks for the ping, missed your prior message :)

We try to avoid including dependencies unless we really need them, especially to the core server module. So in this case I think we should just implement the interpolation method of choice as they are relatively small.

We probably don't need the fancy interpolations though, or at least not all of them (or in the first PR). I was thinking a simple linear interpolation between adjacent values if the requested percentile doesn't land squarely on a value. We can certainly add more, but I think that's the most important / most used variation (especially since this agg is dealing with a complete "population" of values, not a sample)

Should I change the default behavior or add a new parameter to use interpolation?

Hmm, let's add a new parameter, that way we can add more interpolation types in the future if we want. Perhaps interpolation: none is the default, and then we can add interpolation: linear, etc for new types

Rockstar5645 commented 4 years ago

Is @alpparkan still working on this? Can I try it?

alpparkan commented 4 years ago

Hi @Rockstar5645, I just finished the implementation and I'll send a PR in a couple days.

timothy-choi commented 1 year ago

Hello is this issue still open? I would like to work on it.