elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Reducers - Post processing of aggregation results #8110

Closed colings86 closed 9 years ago

colings86 commented 9 years ago

Overview

This feature would introduce a new 'reducers' module, allowing for the processing and transformation of aggregation results. The current aggregation module is very powerful and can compute varied and complex analytics, but it is limited when calculating analytics which depend on numerous independently calculated aggregated metrics. Currently, the only way to calculate these kinds of complex metrics is to write client-side logic. The reducer module will add a new phase to the search request in which the coordinating node can pass the reduced aggregations on for further processing.

Key Concepts & Terminology

The following snippet captures the basic structure of reducers:

"reducers" : [
    {
        "<reducer_name>" : {
            "<reducer_type>" : {
                <reducer_body>
            },
            "reducers" : [ 
                {
                    <sub_reducer_1>
                },
                {
                    <sub_reducer_2>
                },
                ...
           ] 
        }
    },
    {
        "<reducer_name_2>" : { ... }
    },    ...
]

Some possible implementations of reducers are detailed below:

Bucket Reducers

As an example, imagine that you have sales data in an index and you would like to calculate the derivative of the price field over time. This is useful for determining trends in the data, such as whether sales are increasing linearly over time.

You might start with the following aggregation:

{
  "aggs": {
    "months": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

The results of these aggregations might look something like the following:

"aggregations": {
  "months": {
     "buckets": [
        {
           "key_as_string": "2014/02/01 00:00:00",
           "key": 1391212800000,
           "doc_count": 6,
           "avg_price": {
              "value": 11
           }
        },
        {
           "key_as_string": "2014/03/01 00:00:00",
           "key": 1391212800000,
           "doc_count": 4,
           "avg_price": {
              "value": 19
           }
        },
        {
           "key_as_string": "2014/04/01 00:00:00",
           "key": 1391212800000,
           "doc_count": 8,
           "avg_price": {
              "value": 31
           }
        }
     ]
  }
}

The sliding window reducer could then be used to request time windows of two months. A nested delta reducer inside the sliding window reducer could calculate the rate of change of price for each two-month selection.

"reducers": [
  {
    "two_months": {
      "sliding_window": {
        "buckets": "months",
        "window": 2
      },
      "reducers": [
        {
          "price_derivative": {
            "gradient": {
              "field": "avg_price.value"
            }
          }
        }
      ]
    }
  }
]

Which would produce the following response (derivative value is shown in price per month):

{
  "reducers": {
    "two_months": {
      "selections": [
        {
          "buckets": [
            {
              "key_as_string": "2014/02/01 00:00:00",
              "key": 1391212800000,
              "doc_count": 6,
              "avg_price": {
                "value": 11
              }
            },
            {
              "key_as_string": "2014/03/01 00:00:00",
              "key": 1391212800000,
              "doc_count": 4,
              "avg_price": {
                "value": 19
              }
            }
          ],
          "price_derivative": {
            "value": 8
          }
        },
        {
          "buckets": [
            {
              "key_as_string": "2014/03/01 00:00:00",
              "key": 1391212800000,
               "doc_count": 4,
               "avg_price": {
                "value": 19
               }
            },
            {
              "key_as_string": "2014/04/01 00:00:00",
              "key": 1391212800000,
              "doc_count": 8,
              "avg_price": {
                "value": 31
              }
            }
          ],
          "price_derivative": {
            "value": 12
          }
        }
      ]
    }
  }
}

s1monw commented 9 years ago

+1

roytmana commented 9 years ago

Hi @colings86

Will it allow for expressions against aggregated values? Say I need to calculate a metric: number of docs[where field1=value1] divided by number of docs[where field2=value2]

Would transformation merging (adding new and replacing existing) buckets from one agg into another be possible? My use cases for it are:

  1. I need a terms agg where several terms MUST be present even if they would be cut off by the agg's size/ordering. I currently do this by aggregating twice, once normally and once with a filter, and adding the filtered results (at which point I can choose to add them to the top/bottom, or sort properly while handling duplicates).
  2. I need to sub-aggregate only chosen parent agg buckets, not all of them. I achieve this by running two aggs, one for all terms without sub-aggs and one for selected terms with sub-aggs, and merging the second into the first.

Thanks alex

clintongormley commented 9 years ago

Also see #4983 for other use cases.

martijnvg commented 9 years ago

+1 :)

colings86 commented 9 years ago

@roytmana the implementations of the reducers are still being discussed, but the design does allow for creating new buckets (which could be a merge of existing buckets), though not for modifying existing ones.

abhijitiitr commented 9 years ago

+1

roytmana commented 9 years ago

Thank you @colings86. Is replacement of an existing bucket considered a modification? I would not want to modify an existing bucket but rather replace it altogether.

I think combining two aggregation results is a hugely useful use case (I outlined just a couple of cases in my previous post which I, and others who build generic interactive solutions based on ES, would be interested in).

colings86 commented 9 years ago

I am trying to implement the Union Reducer and have hit a dilemma which I want some opinions on. Basically reducers need to be able to access and use multiple aggregations from anywhere in the aggregation tree. Thinking for the moment only about top-level reducers, I think it would be good if the logic to extract the aggregations was done in the Reducer class so that the implementations don't have to worry about doing it. That would mean the doReduce method would have a list of aggregations passed to it rather than a single aggregation.

The problem is that this makes reducers which only care about one aggregation a bit clunky, since each has to check that the list contains only one aggregation and then get it. It's not a lot of code, but it will be repeated in a lot of places. I think the options are:

  1. Deal with the list in each implementation and just accept the repetition.
  2. Create a SingleAggregationBucketReducer and a SingleAggregationMetricReducer, but now we have 4 types of reducer instead of two.
  3. Suggestion from @polyfractal: have the implementations receive a list of aggregations but provide a helper method which checks that there is only one aggregation in the list and returns it (throwing an exception if there is more than one), to be used by the reducers which only want a single aggregation as input.

Interested to hear people's thoughts on this.

brwe commented 9 years ago

I am actually not sure we need to cater for reducers that only take one aggregation. Instead we might just implement all the "single" reducers so that they can also operate on an arbitrary number of aggregations and return a result for each of the curves. For example, a reducer running on the following aggregation could return one or several results depending on whether there are one or more terms in the terms aggregation:

{
  "aggs": {
    "terms": {
      "terms": {
        "field": "label",
        "size": 10
      },
      "aggs": {
        "histo": {
          "histogram": {
            "field": "number",
            "interval": 1
          }
        }
      }
    }
  }
}

Grauen commented 9 years ago

+1

mewwts commented 9 years ago

+1. Would love the ability to do a simple metric aggregation on bucket fields. For example, give me the average "doc_count" etc.
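For reference, this kind of metric over bucket fields eventually became possible with pipeline aggregations in Elasticsearch 2.0+. A sketch of the avg_bucket form for averaging doc_count, reusing the months date_histogram from earlier in this thread (the date field name is an assumption from that example):

```json
{
  "size": 0,
  "aggs": {
    "months": {
      "date_histogram": { "field": "date", "interval": "month" }
    },
    "avg_monthly_doc_count": {
      "avg_bucket": { "buckets_path": "months._count" }
    }
  }
}
```

The `_count` path segment selects each bucket's doc_count, and avg_bucket runs as a sibling of the histogram it reads from.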

NicolasTr commented 9 years ago

:+1:

mrtheb commented 9 years ago

:+1:

My use case is the same as #6419

rbnacharya commented 9 years ago

:+1:

tigra commented 9 years ago

It would be great to be able to "search" within aggregation bucket names in some way, e.g. run a terms aggregation over some query and filter the bucket names matching a given string (e.g. by query string query rules, with a given fuzziness).

E.g.

"reducers": { "matchingBucketz": { "query_string": { "query": "cell phones", "fuzziness": 1 } } }

dan-dr commented 9 years ago

yes. yes yes yes.

colings86 commented 9 years ago

Having started to implement bucket reducers using the style of API above, we have found the API to be overly complex and very verbose even for straightforward tasks (such as a simple derivative of a metric). A lot of this is because the user has to carefully build their aggregations and then define multiple reducers and link them to the aggregations. We found that the only way to build reducer requests with any confidence was to first run the aggregations and then inspect the response to work out what the reducers request should look like. This is not what we want. We would like an API where reducers can be defined at the same time as the aggregations, with a structure which is intuitive from the aggregation request structure (i.e. you shouldn't need to see the aggregations response to be able to write your reducers request).

So, we have rethought the API, going back to a more native design with the existing aggregations framework so that reducers look and feel like just another aggregation.

Now there is a new type of aggregation which takes the output of an aggregation tree and computes a new aggregation tree based on these results. The original sub-aggregation tree is destroyed in the process. This can be used to create functionality such as derivatives and moving average where the original data is not needed in the final output. With this new aggregation functionality we can have even more aggregation implementations. The first implementation of this will be the derivative (https://github.com/elasticsearch/elasticsearch/issues/9293).
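In the reworked style, the avg-price derivative example from earlier in this thread might look roughly like the following sketch (the exact syntax was not final in this issue; this mirrors what the derivative work in #9293 eventually shipped as, where the derivative sits in the aggregation tree like any other agg):

```json
{
  "size": 0,
  "aggs": {
    "months": {
      "date_histogram": { "field": "date", "interval": "month" },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } },
        "price_deriv": { "derivative": { "buckets_path": "avg_price" } }
      }
    }
  }
}
```

Here buckets_path points at the sibling avg_price metric, so no separate reducers section or cross-referencing of a prior response is needed.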

jayhilden commented 9 years ago

:+1:

jasalmeron commented 9 years ago

So there is no option to directly get the division between two aggregations? I have this search query, and I wonder if it's possible to obtain the division between the balanceIncrement sums after the nested aggregations are done. Ideally I want to calculate the payout:

sum of balanceIncrement where 'type' = "prizes", divided by sum of balanceIncrement where 'type' in ("play","ext"), grouped by room and then grouped by day. The summation is done, but I can't find a way to then divide the two sums to get the payout. This is my query:

/fb/balance_trasnfer/_search?pretty

{   
    "query" : {
        "filtered": {
            "filter": {
               "range" : {
                    "transferDate" : {
                        "from" : "${BAL_TRANSFER_DATE_INIT}",
                        "to" : "${BAL_TRANSFER_DATE_PLUS_7_DAYS}"
                    }
                }
            }
        }
    },
    "size": 0,
    "aggs" : {
        "transfers_over_day" : {
            "date_histogram" : {
                "field" : "transferDate",
                "interval" : "day"
            },
            "aggs": {
                "group_by_room": {
                    "terms": {
                        "field": "roomName"
                    },                
                    "aggs" : {
                        "wins-and-bets" : {
                             "filters" : {
                                "filters" : {
                                  "win" :   { "term" : { "type" : "prize"}},
                                  "bet" : { "term" : { "type" : ["play","ext"]}}
                                }
                              },
                             "aggs" : {
                                 "sum_balance": {
                                   "sum" : {"field":"balanceIncrement"}
                                 }
                            }                    
                        }               
                    }
                }
            }
        }        
    }
}
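A ratio like this later became expressible with the bucket_script pipeline aggregation (Elasticsearch 2.0+). A hedged sketch: a sibling of the wins-and-bets agg above, inside group_by_room, using the documented buckets_path key syntax for filters buckets (the payout agg name is illustrative, and script syntax varies by version):

```json
"payout": {
  "bucket_script": {
    "buckets_path": {
      "wins": "wins-and-bets['win']>sum_balance",
      "bets": "wins-and-bets['bet']>sum_balance"
    },
    "script": "params.wins / params.bets"
  }
}
```

The `['win']` and `['bet']` selectors pick the named buckets of the filters agg, and `>sum_balance` descends to the metric inside each.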
tcucchietti commented 9 years ago

:+1:

d-chabrovsky commented 9 years ago

+1

{
    "aggregations": {
        "currentMessageID": {
            "buckets": [
                {
                    "key": "5fbdfcb6-e6e9-446f-be59-e9527c006656",
                    "doc_count": 6,
                    "status": {
                        "buckets": [ {
                            "key": "FAILURE",
                            "doc_count": 6
                        } ]
                    },
                    "operationDateTm": { "value": 1422343776347 }
                },
                {
                    "key": "25a561c3-d45f-4f17-9ea9-e8b7a7cc14e8",
                    "doc_count": 6,
                    "status": {
                        "buckets": [ {
                            "key": "SUCCESS",
                            "doc_count": 1
                        } ]
                    },
                    "operationDateTm": { "value": 1422343964242 }
                }
            ]
        }
    }
}

Would be great to be able to filter out top-level buckets having "SUCCESS" in the "status" sub-aggregation.

neilvarnas-zz commented 9 years ago

Why has this issue been closed ?

colings86 commented 9 years ago

As described in the comment above (https://github.com/elasticsearch/elasticsearch/issues/8110#issuecomment-69891827), the approach and API proposed in this issue turned out to be too complex and not user friendly. Work is continuing to provide this kind of functionality with a better, more intuitive API, starting with the implementation of derivatives in #9293.

neilvarnas-zz commented 9 years ago

Thank You, managed to miss that comment.

brupm commented 9 years ago

:+1:

lakshmi-guruparan commented 9 years ago

Hi All,

I have a requirement to aggregate on 3 fields. I need to perform a range operation on the doc_count value of the last aggregation. Is there a way I can accomplish this? Let us say my query looks like the below:

"aggregations" : { "g1" : { "terms" : { "field" : "TECHNOLOGY", "size" : 100000 }, "aggregations" : { "g2" : { "terms" : { "field" : "FEATURE_TITLE", "size" : 100000 }, "aggregations" : { "g3" : { "terms" : { "field" : "SOFTWARE_TYPE", "size" : 100000,
} } } } } } }

I want to explore the possibilities of filtering the results based on the doc_count value of the last aggregation (g3). Below is the result:

g1: { buckets: [ {
  key: "LAN Switching", doc_count: 271,
  g2: { buckets: [ {
    key: "EtherChannel", doc_count: 8,
    g3: { buckets: [ { key: "IOS", doc_count: 8 } ] }
  } ] }
} ] }

Thanks.

OPtoss commented 9 years ago

While I see why reducers can create an overly complex API, I don't think the specialized route is any better. There are many actions I may want to perform on the resulting buckets from an aggregation besides the derivative. The derivative approach here (#9293) is simple and easy, but very specific. How do you plan to cover all the potential use cases for this?

For instance, I may want the entire filter API to filter the buckets returned from an aggregation. In my particular case, I want to only find the buckets with "doc_count" equal to 1. But I can see many more complex use cases for these "meta-filters".

Edit: I just found this #9876, which might answer how you're planning for other use cases. But I don't see an answer for my particular use case. Those seem to be aggregating aggregation results, not filtering them. Any idea how I can resolve my use case?

clintongormley commented 9 years ago

For instance, I may want the entire filter API to filter the buckets returned from an aggregation. In my particular case, I want to only find the buckets with "doc_count" equal to 1.

We plan to add a having agg which allows you to do exactly what you describe.

But I can see many more complex use cases for these "meta-filters".

Anything that these post-processing aggs can do, can also be done on the client. So the major benefit of adding these is convenience for the user. We'll keep adding aggs that are commonly useful, but if something is really specific, then it can still be done client side.
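The having-style filtering mentioned above eventually shipped as the bucket_selector pipeline aggregation. A sketch for the doc_count == 1 case described by @OPtoss (the field name some_field and agg names are placeholders; script syntax varies by version):

```json
{
  "size": 0,
  "aggs": {
    "by_value": {
      "terms": { "field": "some_field" },
      "aggs": {
        "only_singletons": {
          "bucket_selector": {
            "buckets_path": { "count": "_count" },
            "script": "params.count == 1"
          }
        }
      }
    }
  }
}
```

Buckets for which the script evaluates to false are removed from the response.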

acarstoiu commented 9 years ago

Anything that these post-processing aggs can do, can also be done on the client. So the major benefit of adding these is convenience for the user. We'll keep adding aggs that are commonly useful, but if something is really specific, then it can still be done client side.

Actually no, not really. Of course one can theoretically crunch the data on the client side following any deterministic recipe, but that's not the idea! Going down that line, you'll end up saying that databases are optional, as you can always read all data in a huge (in-memory) structure and perform on them endless transformations :-1: I'm using Elasticsearch precisely because I do not want to bother with the feasibility of such a medieval approach :v:

So, no, I don't want to transfer tens of megabytes of aggregated data from Elasticsearch into my application just to be able to return to my consumer a few figures obtainable by further aggregating the aggregations. No, I would very much like the aggregations of aggregations to take place on the Elasticsearch coordinating node, as the application delivers high-level statistics.

Some post-aggregations applicable on the aggregations would cover more than 90% of the needs, I reckon. There's no need for fancier aggregations, the existing ones are enough; it's just that they cannot be used on their own results. And they should :wink:

clintongormley commented 9 years ago

@acarstoiu thank you for your essay. If, instead of venting your spleen, you'd followed some of the links above, you would have discovered the wonder that is pipeline aggregations: https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-pipeline.html

Please do remember that there are real humans here, with real feelings, writing software which you are using.

acarstoiu commented 9 years ago

I'm aware of #9876, I just wanted to point out that your recommendation is evolutionary wrong. That has nothing to do with my using the software your team wrote, nor with my spleen (thanks for asking, by the way).

I hope this time was short enough for you (and your feelings). Last, but not least, thank you for the link, you're a good writer too!

jason2055141 commented 8 years ago

Hi!

Why does an Elasticsearch sum agg over a timestamp range in the API show a different result from the Kibana interface? Thanks in advance.

colings86 commented 8 years ago

Please ask questions like this on the https://discuss.elastic.co/ forums; this issues list is used for tracking bugs and feature requests. Also, so people can help you with your problem, you will need to include a lot more detail about the problem you are seeing. Please read the following guidance on how best to ask questions on the forum: https://www.elastic.co/help

nishasingla1224 commented 8 years ago

I am using elasticsearch-1.5.1 and I want to apply a filter over derived results. I tried using reducers, but I guess this version doesn't support them. Please suggest a solution.

polyfractal commented 8 years ago

@nishasingla1224 The "bucket reducer" functionality was added to Elasticsearch 2.0 under the name "Pipeline Aggregations". You can read the docs here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline.html

You will need to upgrade to 2.0+ to use them, however.

shuangshui commented 8 years ago

@acarstoiu Is there now a way to return just the size of the aggregation bucket, without the bucket content? In my case the bucket is usually very big and not useful here.

jrj-d commented 8 years ago

Hi everyone, +1 for @shuangshui's question

sourcec0de commented 7 years ago

I am also interested in the answer to @shuangshui's question. @polyfractal any ideas?

polyfractal commented 7 years ago

@shuangshui @jrj-d @sourcec0de You can potentially use response filtering via the filter_path parameter to cut down on the size of the response. This allows you to select which parts of the JSON tree you want to see. But it only selects trees/subtrees, not parts of individual elements. So you won't be able to select just the count; you'll have to grab the relevant subtree.

Still, it can help to cut down on very large responses, especially when you're just interested in some kind of metric calculated over a bunch of histograms, for example.
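To make the filter_path suggestion concrete: it is a URL query parameter on the search request, not a request-body option. A sketch, reusing the months aggregation from earlier in this thread (the sales index name is a placeholder):

```
GET /sales/_search?size=0&filter_path=aggregations.months.buckets.doc_count
```

This returns only the doc_count of each bucket, dropping keys, metrics, and other sub-aggregation payloads from the response.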