Aggregation Module - Phase 1 - Functional Design

uboness commented 11 years ago

_NOTE: at this point we're focusing more on the functional design aspect rather than performance. Once we get this nailed down, we'll see how far we can push and optimize._

Background

The new aggregations module is due for the elasticsearch 1.0 release, and aims to serve as the next generation replacement for the functionality we currently refer to as "faceting". Facets currently provide a great way to aggregate data within a document set context. This context is defined by the executed query in combination with the different levels of filters that are defined (filtered queries, top level filters, and facet level filters). Although powerful as is, the current facets implementation was not designed from the ground up to support complex aggregations and is thus limited. The main problem with the current implementation stems from the fact that facets are hard coded to work on one level and that the different types of facets (which account for the different types of aggregations we support) cannot be mixed and matched dynamically at query time. It is not possible to compose facets out of other facets, and the user is effectively bound to the top level aggregations that we defined and nothing more than that.

The goal with the new aggregations module is to break the barriers the current facet implementation puts in place. The new name ("Aggregations") also indicates the intention here - a generic yet extremely powerful framework for defining aggregations - any type of aggregation. The idea here is to have each aggregation defined as a "standalone" aggregation that can perform its task within any context (as a top level aggregation or embedded within other aggregations that can potentially narrow its computation scope). We would like to take all the knowledge and experience we've gained over the years working with facets and apply it when building the new framework.

Before we dive into the meaty part, it's important to set some key concepts and terminology first.

Key Concepts & Terminology

Aggregation - An aggregation is the result of an aggregation :). There are many types of aggregations; some look similar, others have their own unique structure (all depending on the nature of the aggregation). For example, a terms aggregation holds a list of objects (buckets), each holding information about a unique term, while an avg aggregation just holds the average aggregated over all values of a specific field (or fields) within a well defined set of documents.
Aggregator - An aggregator is the computation unit in elasticsearch which generates aggregations. It is effectively responsible for aggregating the data during the query phase and, at the end of this phase, creating the output aggregation. Each aggregation type has a dedicated aggregator which knows how to compute and generate it.

There are two types of aggregators/aggregations:

Bucket - A family of aggregators whose main responsibility is to take the current document set context and split it into buckets, where each bucket defines a well defined document set context. Typically, all aggregators of this type will also return the document count in each bucket. This aggregator is composable, meaning one can define other aggregations under it. It will then perform these defined aggregations for each of the buckets it builds. It is therefore possible to create buckets within buckets within buckets... up to any level of hierarchy one desires. For example, one can define a filter bucket that holds all the "active" users (for example, if the documents represent website users/visitors), under which she'll define a range bucket that builds 3 buckets to represent different user age groups, and under each age group she'll define a terms bucket to narrow down the most common tags each age group is using on the website. As you can see, creating hierarchies of buckets can be extremely powerful and can immensely help when slicing & dicing your data.
Calc - A family of aggregators whose sole responsibility is to perform computation and calculate numbers. It always operates in a well defined scope of a document set. This document set scope is either the top most level one - the scope defined by the search query - or otherwise defined by a higher level bucket aggregator (as discussed above). The calc aggregators typically work on field values, therefore utilizing the field data from which they extract these values. But one can also utilize scripts to compute custom values which will be aggregated in different ways (depending on the specific calc aggregator that is used). When combining (mixing & matching) the different types of aggregators, bucket aggregators can be placed anywhere in the aggregation definition "tree", while calc aggregators are always "leaves" on the tree as (unlike bucket aggregators) they cannot contain other aggregators.

Structuring Aggregations

The following snippet captures the basic structure of aggregations:

"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : { 
            <aggregation_body>
        },
        ["aggregations" : { [<sub_aggregation>]* } ]
    }
    [,"<aggregation_name_2>" : { ... } ]*

}

The aggregations object (can also be aggs for short) in the json holds the aggregations you'd like to be computed. Each aggregation is associated with a logical name that the user defines (e.g. if the aggregation computes the average price, then it'll make sense to call it avg_price). These logical names also uniquely identify the aggregations you define (you'll use the same names/keys to identify the aggregations in the response). Each aggregation has a specific type (<aggregation_type> in the above snippet), which is typically the first key within the named aggregation body. Each type of aggregation defines its own body, depending on the nature of the aggregation (e.g. the avg aggregation will define the field on which the avg will be calculated). At the same level as the aggregation type definition, one can optionally define a set of additional aggregations, but this only makes sense if the aggregation you defined is a bucketing aggregation. In this scenario, the aggregations you define on the bucketing aggregation level will be computed for all the buckets built by the bucketing aggregation. For example, if you define a set of aggregations under the range aggregation, these aggregations will be computed for each of the range buckets that are defined.

In this manner, you can mix & match bucketing and calculating aggregations any way you'd like, create any set of complex hierarchies by embedding aggregations (of type bucket or calc) within other bucket aggregations. To better grasp how they can all work together, please refer to the examples section below.
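As a rough illustration of this structure, the filter, range and terms hierarchy described in the terminology section could be assembled as follows. This is only a sketch: the field names (active, age, tags), the aggregation names, the age boundaries and the depth helper are assumptions made for the example and are not part of the design.

```python
import json

# Hypothetical field names ("active", "age", "tags") and age boundaries,
# chosen for illustration only.
request = {
    "aggs": {
        "active_users": {
            # bucket aggregator: narrows the context to "active" users
            "filter": {"term": {"active": True}},
            "aggs": {
                "age_groups": {
                    # bucket aggregator: splits active users into 3 age groups
                    "range": {
                        "field": "age",
                        "ranges": [
                            {"to": 20},
                            {"from": 20, "to": 40},
                            {"from": 40},
                        ],
                    },
                    "aggs": {
                        # bucket aggregator: most common tags per age group
                        "top_tags": {"terms": {"field": "tags", "size": 10}}
                    },
                }
            },
        }
    }
}


def depth(aggs):
    """Return how many levels of aggregations are nested under an "aggs" object."""
    nested = (depth(a["aggs"]) for a in aggs.values() if "aggs" in a)
    return 1 + max(nested, default=0)


body = json.dumps(request, indent=2)  # the JSON that would be sent with the search
levels = depth(request["aggs"])
```

Each named aggregation carries exactly one type key plus an optional aggs key, which is what lets the same shape repeat at every level.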

Calc Aggregators

This section provides an overview of all calc aggregations available to date.

All the calc aggregators we have today belong to the same family, which we like to call stats. All the aggregators in this family are based on values that can either come from the field data or from a script that the user defines.

These aggregators operate on the following context: { D, FV } where D is the set of documents from which the field values are extracted, and FV is the set of values that should be aggregated. The aggregators take all those field values and calculate statistical values (some only calculate one value - they're called single value stats aggregators, while others generate a set of values - these are called multi-value stats aggregators).

Here are all currently available stats aggregators:

Avg

Single Value Aggregator - Will return the average over all field values in the aggregation context, or whatever values the script generates.

"aggs" : {
    "avg_price" : { "avg" : { "field" : "price" } }
}

"aggs" : {
    "avg_price" : { "avg" : { "script" : "doc['price']" } }
}

"aggs" : {
    "avg_price" : { "avg" : { "field" : "price", "script" : "_value" } }
}

_NOTE: when field and script are both specified, the script will be called for every value of the field in the context, and within the script you can access this value using the reserved variable `_value`._

Output:

"avg_price" : {
    "value" : 10
}

Min

Single Value Aggregator - Will return the minimum value among all field values in the aggregation context, or whatever values the script generates.

"aggs" : {
    "min_price" : { "min" : { "field" : "price" } }
}

"aggs" : {
    "min_price" : { "min" : { "script" : "doc['price']" } }
}

"aggs" : {
    "min_price" : { "min" : { "field" : "price", "script" : "_value" } }
}

Output:

"min_price" : {
    "value" : 1
}

Max

Single Value Aggregator - Will return the maximum value among all field values in the aggregation context, or whatever values the script generates.

"aggs" : {
    "max_price" : { "max" : { "field" : "price" } }
}

"aggs" : {
    "max_price" : { "max" : { "script" : "doc['price']" } }
}

"aggs" : {
    "max_price" : { "max" : { "field" : "price", "script" : "_value" } }
}

Output:

"max_price" : {
    "value" : 100
}

Sum

Single Value Aggregator - Will return the sum of all field values in the aggregation context, or whatever values the script generates.

"aggs" : {
    "sum_price" : { "sum" : { "field" : "price" } }
}

"aggs" : {
    "sum_price" : { "sum" : { "script" : "doc['price']" } }
}

"aggs" : {
    "sum_price" : { "sum" : { "field" : "price", "script" : "_value" } }
}

Output:

"sum_price" : {
    "value" : 350
}

Count

Single Value Aggregator - Will return the number of field values in the aggregation context, or whatever values the script generates.

"aggs" : {
    "prices_count" : { "count" : { "field" : "price" } }
}

"aggs" : {
    "prices_count" : { "count" : { "script" : "doc['price']" } }
}

"aggs" : {
    "prices_count" : { "count" : { "field" : "price", "script" : "_value" } }
}

Output:

"prices_count" : {
    "value" : 400
}

Stats

Multi Value Aggregator - Will return the following stats aggregated over the field values in the aggregation context, or whatever values the script generates:

avg
min
max
count
sum

"aggs" : {
    "prices_stats" : { "stats" : { "field" : "price" } }
}

"aggs" : {
    "prices_stats" : { "stats" : { "script" : "doc['price']" } }
}

"aggs" : {
    "prices_stats" : { "stats" : { "field" : "price", "script" : "_value" } }
}

Output:

"prices_stats" : {
    "min" : 1,
    "max" : 10,
    "avg" : 5.5,
    "sum" : 55,
    "count" : 10
}

Extended Stats

Multi Value Aggregator - An extended version of the Stats aggregation above, where in addition to its aggregated statistics the following will also be aggregated:

sum_of_squares
variance
std_deviation

"aggs" : {
    "prices_stats" : { "extended_stats" : { "field" : "price" } }
}

"aggs" : {
    "prices_stats" : { "extended_stats" : { "script" : "doc['price']" } }
}

"aggs" : {
    "prices_stats" : { "extended_stats" : { "field" : "price", "script" : "_value" } }
}

Output:

"prices_stats": {
    "count": 10,
    "min": 1.0,
    "max": 10.0,
    "avg": 5.5,
    "sum": 55.0,
    "sum_of_squares": 385.0,
    "variance": 8.25,
    "std_deviation": 2.8722813232690143
}
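To make the relationship between these numbers concrete, here is a minimal sketch of how the extended stats could be derived from a set of values. It assumes the values 1 through 10 and a population variance (sum_of_squares / count - avg^2); both assumptions are inferred from the sample output above rather than stated by the design. The extended_stats function name is hypothetical.

```python
import math


def extended_stats(values):
    """Compute the stats and extended stats for a non-empty list of numbers."""
    count = len(values)
    total = sum(values)
    sum_of_squares = sum(v * v for v in values)
    avg = total / count
    # Assumption: population variance, which matches the sample output above.
    variance = sum_of_squares / count - avg * avg
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": math.sqrt(variance),
    }


stats = extended_stats([float(v) for v in range(1, 11)])
```

Only count, sum, sum_of_squares, min and max need to be tracked while collecting values; the remaining numbers can be derived at the end.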

Bucket Aggregators

Bucket aggregators don't calculate values over fields like the calc aggregators do, but instead, they create buckets of documents. Each bucket defines a criterion (depending on the aggregation type) that determines whether or not a document in the current context "falls" in it. In other words, the buckets effectively define document sets (a.k.a docsets) on which the sub-aggregations run.

There are different bucket aggregators, each with a different "bucketing" strategy. Some define a single bucket, some define a fixed number of multiple buckets, and others dynamically create the buckets while evaluating the docs.

The following describes the currently supported bucket aggregators.

Global

Defines a single bucket of all the documents within the search execution context. This context is defined by the indices and the document types you're searching on, but is not influenced by the search query itself.

Note, global aggregators can only be placed as top level aggregators (it makes no sense to embed a global aggregator within another bucket aggregator).

"aggs" : {
    "global_stats" : {
        "global" : {}, // global has an empty body
        "aggs" : {
            "avg_price" : { "avg" : { "field" : "price" } }
        }
    }
}

Output

"aggs" : {
    "global_stats" : {
        "doc_count" : 100,
        "avg_price" : { "value" : 56.3 }
    }
}

Filter

Defines a single bucket of all the documents in the current docset context which match a specified filter. Often this will be used to narrow down the current aggregation context to a specific set of documents.

"aggs" : {
    "active_items" : {
        "filter" : { "term" : { "active" : true } },
        "aggs" : {
            "avg_price" : { "avg" : { "field" : "price" } }
        }
    }
}

Output

"aggs" : {
    "active_items" : {
        "doc_count" : 100,
        "avg_price" : { "value" : 56.3 }
    }
}

Missing

A field data based single bucket aggregator that creates a bucket of all documents in the current docset context that are missing a field value. This aggregator will often be used in conjunction with other field data bucket aggregators (such as ranges) to return information for all the documents that could not be placed in any of the other buckets due to missing field data values. (The examples below show how well the range and the missing aggregators play together.)

"aggs" : {
    "missing_price" : {
        "missing" : { "field" : "price" }
    }
}

Output

"aggs" : {
    "missing_price" : {
        "doc_count" : 10
    }
}

Terms

A field data based multi-bucket aggregator where buckets are dynamically built - one per unique value (term) of a specific field. For each such bucket the document count will be aggregated (accounting for all the documents in the current docset context that have that term for the specified field). This aggregator is very similar to how the terms facet works except that it is an aggregator just like any other aggregator, meaning it can be embedded in other bucket aggregators and it can also hold any types of sub-aggregators itself.

"aggs" : {
    "genders" : {
        "terms" : { "field" : "gender" },
        "aggs" : {
            "avg_height" : { "avg" : { "field" : "height" } }
        }
    }
}

Output

"aggs" : {
    "genders" : {
        "terms" : [
            {
                "term" : "male",
                "doc_count" : 10,
                "avg_height" : 178.5
            },
            {
                "term" : "female",
                "doc_count" : 10,
                "avg_height" : 165
            }
        ]
    }
}

TODO: do we want to get rid of the "terms" level in the response and directly put the terms array under the aggregation name? (we do that in range aggregation)

Options

Name	Default	Required	Description
field	-	yes/no	the name of the field from which the terms will be taken. It is required if there is no other field data based aggregator in the current aggregation context and the script option is also not set
size	10	no	Only the top n terms will be returned, the size determines what this n is
order	count desc	no	the order in which the term buckets will be sorted, see below for possible values
script	-	no	one can choose to let a script generate the terms instead of extracting them verbatim from the field data. If the script is defined along with the field, then this script will be executed for every term/value of the field data with a special variable _value which will provide access to that value from within the script (this is as opposed to specifying only the script, without the field, in which case the script will execute once per document in the aggregation context)

About order

One can define the order in which the term buckets will be sorted and therefore returned in the response. There are 4 fixed/pre-defined order types and one more dynamic:

Order by term (alphabetically) ascending/descending:

"aggs" : {
    "genders" : {
        "terms" : { "field" : "gender", "order": { "_term" : "desc" } }
    }
}

Order by count ascending/descending:

"aggs" : {
    "genders" : {
        "terms" : { "field" : "gender", "order": { "_count" : "asc" } }
    }
}

Order by direct embedded calc aggregation, ascending/descending. For single value calc aggregation:

"aggs" : {
    "genders" : {
        "terms" : { "field" : "gender", "order": { "avg_price" : "asc" } },
        "aggs" : {
            "avg_price" : { "avg" : { "field" : "price" } }
        }
    }
}

Or, for multi-value calc aggregation:

"aggs" : {
    "genders" : {
        "terms" : { "field" : "gender", "order": { "price_stats.avg" : "desc" } },
        "aggs" : {
            "price_stats" : { "stats" : { "field" : "price" } }
        }
    }
}
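The ordering options above can be pictured as choosing a sort key per bucket. The sketch below is only an illustration of that idea: the order_buckets helper, the in-memory bucket layout (single-value sub-aggregations stored as { "value": ... }) and the sample numbers are all assumptions, not the module's actual implementation.

```python
def order_buckets(buckets, path, direction="asc"):
    """Sort term buckets by "_term", "_count", "<agg>" or "<agg>.<value>"."""
    agg_name, _, value_name = path.partition(".")

    def sort_key(bucket):
        if path == "_term":
            return bucket["term"]
        if path == "_count":
            return bucket["doc_count"]
        agg = bucket[agg_name]
        # multi-value calc aggregation: pick the named value;
        # single-value calc aggregation: use its only value
        return agg[value_name] if value_name else agg["value"]

    return sorted(buckets, key=sort_key, reverse=(direction == "desc"))


# Hypothetical buckets, for illustration only.
buckets = [
    {"term": "male", "doc_count": 10,
     "avg_price": {"value": 20.0}, "price_stats": {"avg": 20.0, "max": 50.0}},
    {"term": "female", "doc_count": 12,
     "avg_price": {"value": 25.0}, "price_stats": {"avg": 25.0, "max": 40.0}},
]

by_avg_desc = order_buckets(buckets, "price_stats.avg", "desc")
by_count_asc = order_buckets(buckets, "_count", "asc")
```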

Range

A field data bucket aggregation that enables the user to define a field on which the bucketing will work and a set of ranges. The aggregator will check each field data value in the current docset context against each bucket range and "bucket" the relevant document & values if they match. Note that here, not only are we bucketing by document, we're also bucketing by value. For example, let's say we're bucketing on a multi-value field, and document D has values [1, 2, 3, 4, 5] for the field. In addition, there is a range bucket [ x < 4 ]. When evaluating document D, it falls right in this range bucket, but it does so due to field values [1, 2, 3], not because of values [4, 5]. Now, if this bucket also has sub-aggregators associated with it (say, a sum aggregator), the system will make sure to only aggregate values [1, 2, 3], excluding [4, 5] (as 4 and 5, as values, don't really belong to this bucket). This is quite different from the other bucket aggregators we've seen until now, which mainly focused on whether the document falls in the bucket or not. Here we also keep track of the values belonging to each bucket.

"aggs" : {
    "age_groups" : {
        "range" : { 
            "field" : "age",
            "ranges" : [
                { "to" : 5 },
                { "from" : 5, "to" : 10 },
                { "from" : 10, "to" : 15 },
                { "from" : 15}
            ]
        },
        "aggs" : {
            "avg_height" : { "avg" : { "field" : "height" } }
        }
    }
}

Output

"aggregations" : {
    "age_groups" : [
        {
            "to" : 5.0,
            "doc_count" : 10,
            "avg_height" : 95
        },
        {
            "from" : 5.0,
            "to" : 10.0,
            "doc_count" : 5,
            "avg_height" : 130
        },
        {
            "from" : 10.0,
            "to" : 15.0,
            "doc_count" : 4,
            "avg_height" : 160
        },
        {
            "from" : 15.0,
            "doc_count" : 10,
            "avg_height" : 175.5
        }
    ]
}
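The value-level bucketing described at the start of this section (document D with values [1, 2, 3, 4, 5] and a range bucket [ x < 4 ]) can be sketched as below. The range_bucket_sum helper is hypothetical, and treating from as inclusive and to as exclusive is an assumption; the text above only implies an exclusive upper bound.

```python
def range_bucket_sum(docs, lower=None, upper=None):
    """Return (doc_count, sum) for one range bucket.

    docs is a list of documents, each given as the list of its field values.
    Assumption: lower bound inclusive, upper bound exclusive.
    """
    doc_count = 0
    total = 0
    for values in docs:
        matching = [
            v for v in values
            if (lower is None or v >= lower) and (upper is None or v < upper)
        ]
        if matching:
            doc_count += 1          # the document falls in the bucket...
            total += sum(matching)  # ...but only its matching values are aggregated
    return doc_count, total


doc_d = [1, 2, 3, 4, 5]
result = range_bucket_sum([doc_d], upper=4)
```

Here the document is counted once in the bucket, while the sum sub-aggregation only sees 1, 2 and 3.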

Of course, you normally don't want to store the age as a field, but store the birthdate instead. We can use scripts to generate the age:

"aggs" : {
    "age_groups" : {
        "range" : { 
            "script" : "DateTime.now().year - doc['birthdate'].date.year",
            "ranges" : [
                { "to" : 5 },
                { "from" : 5, "to" : 10 },
                { "from" : 10, "to" : 15 },
                { "from" : 15}
            ]
        },
        "aggs" : {
            "avg_height" : { "avg" : { "field" : "height" } }
        }
    }
}

As with all other aggregations, leaving out the field from a calc aggregator makes it fall back on the field by which the range bucketing is done:

"aggs" : {
    "age_groups" : {
        "range" : { 
            "field" : "age",
            "ranges" : [
                { "to" : 5 },
                { "from" : 5, "to" : 10 },
                { "from" : 10, "to" : 15 },
                { "from" : 15}
            ]
        },
        "aggs" : {
            "min" : { "min" : { } },
            "max" : { "max" : { } }
        }
    }
}

Output

"aggregations" : {
    "age_groups" : [
        {
            "to" : 5.0,
            "doc_count" : 10,
            "min" : 4.0,
            "max" : 5.0
        },
        {
            "from" : 5.0,
            "to" : 10.0,
            "doc_count" : 5,
            "min" : 5.0,
            "max" : 8.0
        },
        {
            "from" : 10.0,
            "to" : 15.0,
            "doc_count" : 4,
            "min" : 11.0,
            "max" : 13.0
        },
        {
            "from" : 15.0,
            "doc_count" : 10,
            "min" : 15.0,
            "max" : 22.0
        }
    ]
}

Furthermore, you can also define a value script which will serve as a transformation of the field data value:

"aggs" : {
    "count_ranges" : {
        "range" : {
            "field" : "count",
            "script" : "_value - 3",
            "ranges" : [
                { "to" : 6 },
                { "from" : 6 }
            ]
        },
        "aggs" : {
            "min" : { "min" : {} },
            "min_count" : { "min" : { "field" : "count" } }
        }
    }
}

Output

"aggregations": {
    "count_ranges": [
      {
        "to": 6.0,
        "doc_count": 8,
        "min": {
          "value": -2.0
        },
        "min_count": {
          "value": 1.0
        }
      },
      {
        "from": 6.0,
        "doc_count": 2,
        "min": {
          "value": 6.0
        },
        "min_count": {
          "value": 9.0
        }
      }
    ]
  }

Notice that the min aggregation above acts on the actual values that were used for the bucketing (after the transformation by the script), while the min_count aggregation acts on the values of the count field that fall within their bucket.

Date Range

A range aggregation that is dedicated to date values. The main difference between this date range aggregation and the normal range aggregation is that the from and to values can be expressed in Date Math expressions, and it is also possible to specify a date format by which the from and to json fields will be returned in the response:

"aggs": {
    "range": {
        "date_range": {
            "field": "date",
            "format": "MM-yyyy",
            "ranges": [
                {
                    "to": "now-10M/M"
                },
                {
                    "from": "now-10M/M"
                }
            ]
        }
    }
}

In the example above, we created two range buckets:

the first will bucket all documents dated prior to 10 months ago
the second will bucket all documents dated since 10 months ago

"aggregations": {
    "range": [
        {
            "to": 1.3437792E+12,
            "to_as_string": "08-2012",
            "doc_count": 7
        },
        {
            "from": 1.3437792E+12,
            "from_as_string": "08-2012",
            "doc_count": 2
        }
    ]
}

IP Range

Just like the dedicated date range aggregation, there is also a dedicated range aggregation for IPv4 typed fields:

"aggs" : {
    "ip_ranges" : {
        "ip_range" : {
            "field" : "ip",
            "ranges" : [
                { "to" : "10.0.0.5" },
                { "from" : "10.0.0.5" }
            ]
        }
    }
}

Output:

"aggregations": {
    "ip_ranges": [
        {
            "to": 167772165,
            "to_as_string": "10.0.0.5",
            "doc_count": 4
        },
        {
            "from": 167772165,
            "from_as_string": "10.0.0.5",
            "doc_count": 6
        }
    ]
}

IP ranges can also be defined as CIDR masks:

"aggs" : {
    "ip_ranges" : {
        "ip_range" : {
            "field" : "ip",
            "ranges" : [
                { "mask" : "10.0.0.0/25" },
                { "mask" : "10.0.0.127/25" }
            ]
        }
    }
}

Output:

"aggregations": {
    "ip_ranges": [
      {
        "key": "10.0.0.0/25",
        "from": 1.6777216E+8,
        "from_as_string": "10.0.0.0",
        "to": 167772287,
        "to_as_string": "10.0.0.127",
        "doc_count": 127
      },
      {
        "key": "10.0.0.127/25",
        "from": 1.6777216E+8,
        "from_as_string": "10.0.0.0",
        "to": 167772287,
        "to_as_string": "10.0.0.127",
        "doc_count": 127
      }
    ]
}
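The numeric from and to values in these outputs are the 32-bit integer form of the IPv4 addresses. The sketch below, using Python's standard ipaddress module, shows that conversion and why the two masks above produce identical bounds (both describe the same /25 network). The ip_to_long and mask_bounds helpers are hypothetical, and reporting the last address of the network as the upper bound simply mirrors the sample output; whether to is meant to be inclusive is not stated here.

```python
import ipaddress


def ip_to_long(ip):
    """Convert a dotted IPv4 string to its integer representation."""
    return int(ipaddress.IPv4Address(ip))


def mask_bounds(mask):
    """Return (first, last) address of a CIDR mask as integers.

    strict=False lets host bits be set, as in "10.0.0.127/25".
    """
    network = ipaddress.ip_network(mask, strict=False)
    return int(network.network_address), int(network.broadcast_address)


boundary = ip_to_long("10.0.0.5")
first_mask = mask_bounds("10.0.0.0/25")
second_mask = mask_bounds("10.0.0.127/25")
```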

Histogram

An aggregation that can be applied to numeric fields, and dynamically builds fixed size (a.k.a. interval) buckets over all the values of the document fields in the docset context. For example, if the documents have a field that holds a price (numeric), we can ask this aggregator to dynamically build buckets with interval 5 (in case of price it may represent $5). When the aggregation executes, the price field of every document within the aggregation context will be evaluated and rounded down to its closest bucket - for example, if the price is 32 and the bucket size is 5 then the rounding will yield 30 and thus the document will "fall" into the bucket that is associated with the key 30. To make this more formal, here is the rounding function that is used:

bucket_key = value - value % interval
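As a minimal sketch, the rounding function can be applied to a handful of values to see which buckets they land in. The histogram helper and the sample values are assumptions made for illustration. Note that Python's % operator always returns a non-negative result for a positive interval, so negative values round down here; languages whose remainder keeps the sign of the dividend would behave differently for negative values.

```python
from collections import Counter


def bucket_key(value, interval):
    """Round a value down to the key of the bucket it falls into."""
    return value - value % interval


def histogram(values, interval):
    """Count how many values fall into each bucket, sorted by key."""
    counts = Counter(bucket_key(v, interval) for v in values)
    return sorted(counts.items())


key = bucket_key(32, 5)                    # the price example from the text
buckets = histogram([1, 2, 3, 4, 7], 3)    # hypothetical values, interval 3
```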

A basic histogram aggregation on a single numeric field (which may be a single or multi valued field)

"aggs" : {
    "value_histo" : {
            "histogram" : {
                    "field" : "value",
                    "interval" : 3
            }
    }
}

A histogram aggregation on multiple fields

"aggs" : {
    "value_histo" : {
            "histogram" : {
                    "field" : [ "value", "values" ],
                    "interval" : 3
            }
    }
}

The output of the histogram is an array of the buckets, where each bucket holds its key and the number of documents that fall in it. This array can be sorted based on different attributes in an ascending or descending order:

_key - The buckets will be sorted by their key
_count - The buckets will be sorted by the number of documents that fall in them
aggName - Buckets may hold other aggregations that will be applied to the documents that fall in them. It is possible to sort the buckets based on direct single-valued calc aggregations that they hold
aggName & valueName - It is also possible to sort buckets based on direct multi-valued calc aggregations that they hold

Sorting by bucket key descending

"aggs" : {
    "histo" : {
        "histogram" : {
            "field" : "value",
            "interval" : 3,
            "order" : { "_key" : "desc" }
        }
    }
}

Sorting by document count ascending

"aggs" : {
    "histo" : {
        "histogram" : {
            "field" : "value",
            "interval" : 3,
            "order" : { "_count" : "asc" }
        }
    }
}

Adding a sum aggregation (which is a single valued calc aggregation) to the buckets and sorting by it

"aggs" : {
    "histo" : {
        "histogram" : {
            "field" : "value",
            "interval" : 3,
            "order" : { "value_sum" : "asc" }
        },
        "aggs" : {
            "value_sum" : { "sum" : {} }
        }
    }
}

Adding a stats aggregation (which is a multi-valued calc aggregation) to the buckets and sorting by the avg

"aggs" : {
    "histo" : {
        "histogram" : {
            "field" : "value",
            "interval" : 3,
            "order" : { "value_stats.avg" : "desc" }
        },
        "aggs" : {
            "value_stats" : { "stats" : {} }
        }
    }
}

Using value scripts to "preprocess" the values before the bucketing

"aggs" : {
    "histo" : {
        "histogram" : {
            "field" : "value",
            "script" : "_value * 4",
            "interval" : 3,
            "order" : { "sum" : "desc"}
        },
        "aggregations" : {
            "sum" : { "sum" : {} }
        }
    }
}

It's also possible to use document level scripts to compute the value by which the documents will be "bucketed"

"aggs" : {
    "histo" : {
        "histogram" : {
            "script" : "doc['value'].value + doc['value2'].value",
            "interval" : 3,
            "order" : { "stats.sum" : "desc" }
        },
        "aggregations" : {
            "stats" : { "stats" : {} }
        }
    }
}

Output:

"aggregations": {
  "histo": [
    {
      "key": 21,
      "doc_count": 2,
      "stats": {
        "count": 2,
        "min": 8.0,
        "max": 9.0,
        "avg": 8.5,
        "sum": 17.0
      }
    },
    {
      "key": 15,
      "doc_count": 2,
      "stats": {
        "count": 2,
        "min": 5.0,
        "max": 6.0,
        "avg": 5.5,
        "sum": 11.0
      }
    },
    {
      "key": 24,
      "doc_count": 1,
      "stats": {
        "count": 1,
        "min": 10.0,
        "max": 10.0,
        "avg": 10.0,
        "sum": 10.0
      }
    },
    {
      "key": 18,
      "doc_count": 1,
      "stats": {
        "count": 1,
        "min": 7.0,
        "max": 7.0,
        "avg": 7.0,
        "sum": 7.0
      }
    },
    {
      "key": 9,
      "doc_count": 2,
      "stats": {
        "count": 2,
        "min": 2.0,
        "max": 3.0,
        "avg": 2.5,
        "sum": 5.0
      }
    },
    {
      "key": 12,
      "doc_count": 1,
      "stats": {
        "count": 1,
        "min": 4.0,
        "max": 4.0,
        "avg": 4.0,
        "sum": 4.0
      }
    },
    {
      "key": 6,
      "doc_count": 1,
      "stats": {
        "count": 1,
        "min": 1.0,
        "max": 1.0,
        "avg": 1.0,
        "sum": 1.0
      }
    }
  ]
}

Date Histogram

Date histogram is a similar aggregation to the normal histogram (as described above) except that it can only work on date fields. Since dates are indexed internally as long values, it's possible to use the normal histogram on dates as well. The problem, though, stems from the fact that time based intervals are not fixed (think of leap years and of the number of days in a month). For this reason, we need special support for time based data. From a functionality perspective, this histogram supports the same features as the normal histogram. The main difference is that the interval can be specified by time expressions.

Building month length bucket intervals

"aggs" : {
    "histo" : {
        "date_histogram" : {
            "field" : "date",
            "interval" : "month"
        }
    }
}

or based on 1.5 months

"aggs" : {
    "histo" : {
        "date_histogram" : {
            "field" : "date",
            "interval" : "1.5M"
        }
    }
}

Other available expressions for interval: year, quarter, week, day, hour, minute, second

Since internally, dates are represented as 64bit numbers, these numbers are returned as the bucket keys (each key representing a date). For this reason, it is also possible to define a date format, which will result in returning the dates as formatted strings next to the numeric key values:

"aggs" : {
    "histo" : {
        "date_histogram" : {
            "field" : "date",
            "interval" : "1M",
            "format" : "yyyy-MM-dd"
        }
    }
}

Output:

"aggregations": {
    "histo": [
        {
          "key_as_string": "2012-02-02",
          "key": 1328140800000,
          "doc_count": 1
        },
        {
          "key_as_string": "2012-03-02",
          "key": 1330646400000,
          "doc_count": 2
        },
        ...
    ]
}

Timezones are also supported, enabling the user to define by which timezone they'd like to bucket the documents (this support is very similar to the TZ support in the DateHistogram facet).

Similar to the current date histogram facet, pre_offset & post_offset are also supported, for offsets applied pre rounding and post rounding. The values are time values with a possible - sign. For example, to offset a week rounding to start on Sunday instead of Monday, one can pass pre_offset of -1d to decrease a day before doing the week (monday based) rounding, and then have post_offset set to -1d to actually set the return value to be Sunday, and not Monday.

Like with the normal histogram, both document level scripts and value scripts are supported. It is possible to control the order of the buckets that are returned and, of course, to nest other aggregations within the buckets.

Both the normal histogram and the date_histogram now support computing/returning empty buckets. This can be controlled by setting the compute_empty_buckets parameter to true (defaults to false).
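As a rough sketch of what returning empty buckets means for a numeric histogram, the Python function below rounds each value down to a multiple of the interval and, when asked, fills the gaps with zero-count buckets. The function is hypothetical and only models the behaviour. In particular, filling gaps only between the lowest and highest non-empty bucket is an assumption, since the description above does not spell out the bounds.

```python
def histogram(values, interval, compute_empty_buckets=False):
    """Bucket numeric values by rounding each down to a multiple of interval."""
    counts = {}
    for value in values:
        key = (value // interval) * interval
        counts[key] = counts.get(key, 0) + 1
    if compute_empty_buckets and counts:
        # Assumption: only gaps between the lowest and highest
        # non-empty bucket are filled.
        lowest, highest = min(counts), max(counts)
        key = lowest
        while key < highest:
            counts.setdefault(key, 0)
            key += interval
    return [{"key": key, "doc_count": counts[key]} for key in sorted(counts)]

print(histogram([1, 2, 12, 31], 10))
print(histogram([1, 2, 12, 31], 10, compute_empty_buckets=True))
```

With the flag off, the bucket for 20 is simply absent. With it on, that bucket appears with a doc_count of 0, which is convenient for charts that expect evenly spaced points.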

Geo Distance

An aggregation that works on geo_point fields. Conceptually, it works very similarly to the range aggregation. The user defines a point of origin and a set of distance range buckets. The aggregation evaluates the distance of each document from the origin point and determines the bucket it belongs to based on the ranges (a document belongs to a bucket if the distance between the document and the origin falls within the distance range of the bucket).

"aggs" : {
    "rings" : {
        "geo_distance" : {
            "field" : "location",
            "origin" : "52.3760, 4.894",
            "ranges" : [
                { "to" : 100 },
                { "from" : 100, "to" : 300 },
                { "from" : 300 }
            ]
        }
    }
}

Output

"aggregations": {
  "rings": [
    {
      "unit": "km",
      "to": 100.0,
      "doc_count": 3
    },
    {
      "unit": "km",
      "from": 100.0,
      "to": 300.0,
      "doc_count": 1
    },
    {
      "unit": "km",
      "from": 300.0,
      "doc_count": 7
    }
  ]
}
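The bucketing logic can be sketched in a few lines of Python. This is only a model of the idea, not the actual implementation. The function names are invented, the haversine formula with a mean Earth radius of 6371 km stands in for the arc calculation, and treating "from" as inclusive and "to" as exclusive is an assumption.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

def arc_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two lat/lon points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def geo_distance_buckets(origin, points, ranges):
    """Count points per distance range ('from' inclusive, 'to' exclusive, assumed)."""
    buckets = [dict(r, unit="km", doc_count=0) for r in ranges]
    for lat, lon in points:
        distance = arc_distance_km(origin[0], origin[1], lat, lon)
        for bucket in buckets:
            if bucket.get("from", 0.0) <= distance < bucket.get("to", math.inf):
                bucket["doc_count"] += 1
    return buckets

origin = (52.3760, 4.894)  # the origin used in the request above
points = [
    (52.3760, 4.894),   # the origin itself
    (51.9244, 4.4777),  # Rotterdam, under 100 km away
    (50.8503, 4.3517),  # Brussels, between 100 and 300 km away
    (48.8566, 2.3522),  # Paris, over 300 km away
]
ranges = [{"to": 100}, {"from": 100, "to": 300}, {"from": 300}]
for bucket in geo_distance_buckets(origin, points, ranges):
    print(bucket)
```

Because every range is checked for every point, overlapping ranges would simply count a document in more than one bucket.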

The specified field must be of type geo_point (which can only be set explicitly in the mappings). It can also hold an array of geo_point values, in which case all of them will be taken into account during aggregation. The origin point accepts all the formats that geo_point supports:

Object format: { "lat" : 52.3760, "lon" : 4.894 } - this is the safest format as it's the most explicit about the lat & lon values
String format: "52.3760, 4.894" - where the first number is the lat and the second is the lon
Array format: [4.894, 52.3760] - which is based on the GeoJson standard and where the first number is the lon and the second one is the lat
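The lat/lon ordering differs between the string and array formats, which is easy to get wrong. The following Python sketch normalises all three formats to a (lat, lon) pair. parse_origin is a hypothetical helper written for this illustration, not part of any API.

```python
def parse_origin(origin):
    """Return (lat, lon) from any of the three origin formats listed above."""
    if isinstance(origin, dict):
        # Object format: explicit keys, no ordering ambiguity.
        return float(origin["lat"]), float(origin["lon"])
    if isinstance(origin, str):
        # String format: "lat, lon".
        lat, lon = (part.strip() for part in origin.split(","))
        return float(lat), float(lon)
    if isinstance(origin, (list, tuple)) and len(origin) == 2:
        # Array format follows GeoJSON ordering: [lon, lat].
        lon, lat = origin
        return float(lat), float(lon)
    raise ValueError(f"unsupported origin format: {origin!r}")

print(parse_origin({"lat": 52.3760, "lon": 4.894}))
print(parse_origin("52.3760, 4.894"))
print(parse_origin([4.894, 52.3760]))
```

All three calls describe the same point, even though the array lists the longitude first.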

By default, the distance unit is km but it can also accept: mi (miles), in (inch), yd (yards), m (meters), cm (centimeters), mm (millimeters).

"aggs" : {
    "rings" : {
        "geo_distance" : {
            "field" : "location",
            "origin" : "52.3760, 4.894",
            "unit" : "mi",
            "ranges" : [
                { "to" : 100 },
                { "from" : 100, "to" : 300 },
                { "from" : 300 }
            ]
        }
    }
}

There are two distance calculation modes: arc (the default) and plane. The arc calculation is the most accurate one but also the more expensive one in terms of performance. The plane calculation is faster but less accurate. Consider using plane when your search context is narrow and spans smaller areas (like cities or even countries). plane may return higher error margins for searches across very large areas (e.g. a cross-Atlantic search).

"aggs" : {
    "rings" : {
        "geo_distance" : {
            "field" : "location",
            "origin" : "52.3760, 4.894",
            "distance_type" : "plane",
            "ranges" : [
                { "to" : 100 },
                { "from" : 100, "to" : 300 },
                { "from" : 300 }
            ]
        }
    }
}
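To get a feel for the accuracy trade-off, the Python sketch below compares a great-circle distance with a flat approximation. Both formulas are stand-ins chosen for this illustration: haversine for arc, and an equirectangular projection for plane. The formulas actually used by the two modes are not stated here, so treat the numbers as indicative only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch

def arc_km(a, b):
    """Great-circle distance via the haversine formula."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def plane_km(a, b):
    """Flat approximation: scale longitude by cos(mean latitude), then Pythagoras."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return EARTH_RADIUS_KM * math.hypot(x, y)

def relative_error(a, b):
    """Relative difference of the flat approximation against the arc distance."""
    exact = arc_km(a, b)
    return abs(plane_km(a, b) - exact) / exact

amsterdam = (52.3760, 4.894)
rotterdam = (51.9244, 4.4777)
new_york = (40.7128, -74.0060)
print("short hop:", relative_error(amsterdam, rotterdam))
print("cross-Atlantic:", relative_error(amsterdam, new_york))
```

Under these assumptions the short hop differs by a tiny fraction of a percent, while the cross-Atlantic pair is off by several percent, which matches the guidance above.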

Nested

A special single bucket aggregation which enables aggregating nested documents.

Assuming the following mapping:

"type" : {
    "properties" : {
        "nested" : { "type" : "nested" }
    }
}

Here's how a nested aggregation can be defined:

"aggs" : {
    "nested_value_stats" : {
        "nested" : {
            "path" : "nested"
        },
        "aggs" : {
            "stats" : {
                "stats" : { "field" : "nested.value" }
            }
        }
    }
}

As you can see above, the nested aggregation requires the path of the nested documents within the top level documents. Then one can define any type of aggregation over these nested documents.

Output:

"aggregations": {
    "nested_value_stats": {
        "doc_count": 25,
        "stats": {
            "count": 25,
            "min": 1.0,
            "max": 9.0,
            "avg": 5.0,
            "sum": 125.0
        }
    }
}
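Conceptually, the nested aggregation switches the scope from the top level documents to the nested documents found under the path, and the sub-aggregations then run over those. A minimal in-memory Python sketch of that idea follows. nested_stats is a hypothetical helper and the sample documents are invented.

```python
def nested_stats(docs, path, field):
    """Collect nested documents under `path` and compute stats over `field`."""
    # Switch scope: one entry per nested document, across all top level docs.
    nested_docs = [nested for doc in docs for nested in doc.get(path, [])]
    values = [nested[field] for nested in nested_docs if field in nested]
    stats = {"count": len(values)}
    if values:
        stats.update(
            min=min(values),
            max=max(values),
            avg=sum(values) / len(values),
            sum=sum(values),
        )
    return {"doc_count": len(nested_docs), "stats": stats}

docs = [
    {"title": "a", "nested": [{"value": 1.0}, {"value": 9.0}]},
    {"title": "b", "nested": [{"value": 5.0}]},
    {"title": "c"},  # no nested documents at all
]
print(nested_stats(docs, "nested", "value"))
```

Note that doc_count counts nested documents, not top level ones, which is why it can exceed the number of hits.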

Examples

Filter + Range + Missing + Stats

Analyse the web access logs of the online product catalog. The following aggregation only aggregates the logs from yesterday (the filter aggregation) and provides information for different price ranges (the range aggregation): for each price range we'll return the price stats and the total page views of the documents in that range. We're also interested in finding all the bloopers, i.e. products that for some reason don't have prices associated with them and yet are still exposed to users and being accessed and viewed.

"aggs" : {
    "yesterday" : {
        "filter" : { "range" : { "date" : { "gt" : "now-1d/d", "lt" : "now/d" } } },
        "aggs" : {
            "missing_price" : {
                "missing" : { "field" : "price" },
                "aggs" : {
                    "total_page_views" : { "sum" : { "field" : "page_views" } }
                }
            },
            "prices" : {
                "range" : {
                    "field" : "price",
                    "ranges" : [
                        { "to" : 100 },
                        { "from" : 100, "to" : 200 },
                        { "from" : 200, "to" : 300 },
                        { "from" : 300 }
                    ]
                },
                "aggs" : {
                    "price_stats" : { "stats" : {} },
                    "total_page_views" : { "sum" : { "field" : "page_views" } }
                }
            }
        }
    }
}
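To make the composition concrete, here is a toy in-memory version of the same request in Python. It is a sketch only: price_report and the sample logs are invented, a plain date comparison stands in for the "now-1d/d" date math, and ranges are assumed to be "from" inclusive and "to" exclusive.

```python
from datetime import date

def price_report(logs, day, ranges):
    """Toy, in-memory version of the filter + missing + range + stats request."""
    # "filter" aggregation: keep only the logs from the requested day.
    scoped = [log for log in logs if log["date"] == day]

    # "missing" aggregation: documents without a price.
    missing = [log for log in scoped if log.get("price") is None]
    report = {
        "doc_count": len(scoped),
        "missing_price": {
            "doc_count": len(missing),
            "total_page_views": sum(log["page_views"] for log in missing),
        },
        "prices": [],
    }

    # "range" aggregation: 'from' inclusive, 'to' exclusive (an assumption).
    priced = [log for log in scoped if log.get("price") is not None]
    for r in ranges:
        low, high = r.get("from", float("-inf")), r.get("to", float("inf"))
        bucket_logs = [log for log in priced if low <= log["price"] < high]
        prices = [log["price"] for log in bucket_logs]
        bucket = dict(r, doc_count=len(bucket_logs),
                      total_page_views=sum(log["page_views"] for log in bucket_logs))
        if prices:
            # "stats" sub-aggregation over the prices in this bucket.
            bucket["price_stats"] = {
                "count": len(prices), "min": min(prices), "max": max(prices),
                "avg": sum(prices) / len(prices), "sum": sum(prices),
            }
        report["prices"].append(bucket)
    return report

yesterday = date(2013, 6, 1)  # stand-in for the "now-1d/d" date math
logs = [
    {"date": yesterday, "price": 50, "page_views": 3},
    {"date": yesterday, "price": 150, "page_views": 5},
    {"date": yesterday, "price": 180, "page_views": 2},
    {"date": yesterday, "page_views": 7},                       # a "blooper"
    {"date": date(2013, 5, 1), "price": 50, "page_views": 9},   # filtered out
]
ranges = [{"to": 100}, {"from": 100, "to": 200}, {"from": 200, "to": 300}, {"from": 300}]
report = price_report(logs, yesterday, ranges)
print(report["missing_price"])
print(report["prices"][1])
```

Each level only ever sees the documents that its parent let through, which is the whole point of nesting aggregations.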

Aggregating Hierarchical Data

Quite often you'd like to get aggregations on location in a hierarchical manner. For example, show all countries and how many documents fall within each country, and for each country show a breakdown by city. Here's a simple way to do it using hierarchical terms aggregations:

"aggs" : {
    "country" : {
        "terms" : { "field" : "country" },
        "aggs" : {
            "city" : {
                "terms" : { "field" : "city" }
            }
        }
    }
}
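The shape of the result is easiest to see with a small in-memory model. The Python sketch below counts documents per country and, within each country, per city. The function name and the output layout are made up for this illustration and only approximate what the response would look like.

```python
from collections import Counter, defaultdict

def country_city_breakdown(docs):
    """Count documents per country, with a per-city breakdown inside each."""
    countries = Counter()
    cities = defaultdict(Counter)
    for doc in docs:
        countries[doc["country"]] += 1
        cities[doc["country"]][doc["city"]] += 1
    return [
        {
            "term": country,
            "doc_count": count,
            "city": [
                {"term": city, "doc_count": n}
                for city, n in cities[country].most_common()
            ],
        }
        for country, count in countries.most_common()
    ]

docs = [
    {"country": "NL", "city": "Amsterdam"},
    {"country": "NL", "city": "Amsterdam"},
    {"country": "NL", "city": "Rotterdam"},
    {"country": "FR", "city": "Paris"},
]
for bucket in country_city_breakdown(docs):
    print(bucket)
```

The city counts inside a country always add up to that country's doc_count here, because each sample document has exactly one city.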

brusic commented 11 years ago

Definitely watching this thread. Too much to digest right now, but good work. I assumed that the long rumored facet refactoring would be at the implementation level and not at this higher level.

First question that pops into mind is what is the target release for this issue? The issue is tagged v1.0.0.Beta1, so will 1.0 be held off until this issue is released? So far, there are no new major issues tagged 1.0 with the exception of this one. Once again, very exciting work.

mattweber commented 11 years ago

Great work, these are so powerful!

uboness commented 11 years ago

@brusic

I assumed that the long rumored facet refactoring would be at the implementation level and not at this higher level.

So there are two reasons why we took this path:

we don't want to break/modify/change anything in the current facets while we're working on the aggregations
It is a big change, not only from implementation perspective, but also from a functional perspective - the way one approaches data aggregations in elasticsearch changes quite a bit, to the extent we believe it deserves its own module and set of apis.

First question that pops into mind is what is the target release for this issue? The issue is tagged v1.0.0.Beta1, so will 1.0 be held off until this issue is released? So far, there are no new major issues tagged 1.0 with the exception of this one. Once again, very exciting work.

the target release is indeed 1.0, and rest assured that more work will join this one :). As for 1.0 release timelines... we always planned to have this functionality in 1.0 and we always took it into account, so it doesn't really put a delay on things (just part of the work that needs to be done)

itsadok commented 11 years ago

You didn't mention it in Date Histogram, but are you going to add pre/post offset, like in #1599? Or should something like this be done manually, with a script?

jprante commented 11 years ago

Many thanks for the summary! I assume this is the facet refactoring we all wait for.

I would love to see term labels, collations for sorting, and pattern based formatting for the values in the terms aggregation. ("asc" and "desc" for an "order" is definitely not sufficient)

uboness commented 11 years ago

@itsadok definitely! added to the above description

pecke01 commented 11 years ago

@uboness Great write up. Thanks for sharing this with the community.

One thing that I have found useful in other aggregation engines I have used is what they call buckets. Instead of defining ranges like 1 to 5, 6 to 10, etc., it is possible to get an even spread by defining the number of buckets. For example, requesting 3 buckets for price I would be able to get something like this: 1 - 12$ (10), 13 - 51$ (10), 51 - 120 (10)

Filtering will then adjust the buckets but will still be an even spread in 3 of them. This is ofc possible by doing several queries today but if it would be possible to get in there auto-magically I would like it. This might be possible from what you described above but I might have just not understood it.

uboness commented 11 years ago

@jprante this is just the initial phase, where we take all existing functionality of the facets and support it. In a later stage we definitely plan to enhance it further

I would love to see term labels

Can you elaborate on that?

pattern based formatting for the values

We have this currently supported for date values only... won't be hard to add it to numeric as well.

"asc" and "desc" for an "order" is definitely not sufficient

Why?

markharwood commented 11 years ago

I think I may have another category of faceting/summarisation to consider.

I'm interested in representing the strength of relationships between different buckets - i.e. the results are represented as a weighted graph not neatly contained hierarchies. Nodes are categories, edges represent varying degrees of association between categories.

One example might be showing how various IT skills in job ads are related (web vs database vs unix vs java vs search etc).

In my example implementation I instrumented various branches of my query with a number of "tagging wrapper" queries whose only purpose is to mark a point in the query tree that produces results of a certain type e.g.

tag=web
- query=css OR html OR jquery ..
tag=database
- query=oracle OR mysql...
tag=search
- query=lucene OR solr or "elastic search" or elasticsearch...

A special facetter can then find these TaggingScorers in the executing query tree and observe the doc ids firing from each of the Scorer streams and record strengths of association between pairs of tags as they are seen to match the same doc. Of course the categories (web vs java) etc could be defined as part of the facet info, acting only as a perspective onto a query's results, but I can imagine there are scenarios where, as in my example, they are defined as an integral part of the query selection criteria and faceting can usefully "listen in" on the various parts of the query to extract these categories.

There are a couple of concepts in here that have me thinking:

Is there generally a useful role for adding metadata to query clauses (e.g. for faceting or highlighting/explain introspection)
Weighted concept graphs are an interesting form of summary that can be derived quickly from large amounts of data as a by-product of Lucene scoring

There's a lot happening with facets design right now so I'm keen to throw these ideas into the mix.

aparo commented 11 years ago

This is a good way to improve facet/aggregation. Just a hint: in the output the _type of aggregation is missing. I often use it in postprocessing of facet results and to check for errors. Probably it's missing because it's a second level detail.

julianhille commented 11 years ago

Hi,

my 2 cents: I like the idea of the whole new way of aggregation, it will be, if speed is good, way more flexible.

My ideas what id like to see:

i strongly support the automagically "range sizes" like pecke01 said. We're asked about it a lot.
An "all facets" would be nice. On terms i often have to hard code a limit like 1000 to get all of them. But i dont know when the length of the terms reach this limit. Thus a "give me all facets" would be nice.
we use a lot of filters and also a lot of facets, but nearly every filter I have to put in more than twice. As an example, if we have two books with different authors and different prices, I could filter for the price and get only one book (as expected). But the facet for author needs every facet filter besides its own.

{
  "query": { "match_all": {} },
  "filter": {
    "and": [
      { "term": { "price": 10 } },
      { "term": { "author": "someauthor" } }
    ]
  },
  "facets": {
    "author": {
      "facet_filter": {
        "term": { "price": 10 }
      },
      "terms": {
        "field": "author",
        "size": 10
      }
    }
  }
}

This should be solved differently, e.g. with an exclude filter for a field option or something like that. Sorry, at this point I can't come up with an idea to solve it.

fredbenenson commented 11 years ago

Very excited about this new direction for facets & aggregations.

I wanted to confirm one thing, and had a question about another.

First: a specific example (related to my reference to the elasticfacets plugin), where we're aggregating by a date histogram but also doing an aggregation of terms within each bucket.

Everything I've read in this issue indicates this is possible, so I'm just asking for confirmation. Here's how I think it'd look:

"aggs" : {
    "histo" : {
        "date_histogram" : {
            "field" : "date",
            "interval" : "month"
        }
    },
    "aggregations" : {
        "genders" : {
            "terms" : { "field" : "gender", "order": { "_term" : "desc" } }
        }
    }
}

The output would then show the count of documents per-gender per-month based on their date field.

Second: will 1.0 be backwards compatible with old-school (e.g. 0.20.1) faceting?

Thanks!

mattweber commented 11 years ago

Hey Fred,

I can confirm this will work. You need to move your genders aggregation up into the histo object like this though:

"aggs" : {
    "histo" : {
        "date_histogram" : {
            "field" : "date",
            "interval" : "month"
        },
        "aggregations" : {
            "genders" : {
                "terms" : { "field" : "gender", "order": { "_term" : "desc" } }
            }
        }
    }
}

jrick1977 commented 11 years ago

This is looking really good. One thing I cannot seem to find is an example of a nested aggregation, there is an example of a query but no example of the results. The type of query I would be interested in seeing would look like this:

    "aggs": {
        "genders": {
            "terms": {
                "field": "gender"
            },
            "aggs": {
                "age_groups" : {
                    "range" : {
                        "field" : "age",
                        "ranges" : [
                            { "to" : 5 },
                            { "from" : 5, "to" : 10 },
                            { "from" : 10, "to" : 15 },
                            { "from" : 15}
                        ]
                    },
                    "aggs" : {
                        "avg_height" : { "avg" : { "field" : "height" } }
                    }
                }
            }
        }
    }

I believe this should return an aggregation by gender and age.

uboness commented 11 years ago

@fredbenenson

will 1.0 be backwards compatible with old-school (e.g. 0.20.1) faceting?

We're definitely keeping the facet module for time being... the aggregations is just an additional separated api

uboness commented 11 years ago

@jrick1977

you'd get back something like this:

"aggregations": {
  "genders": {
    "terms": [
      {
        "term": "female",
        "doc_count": 4,
        "age_groups": [
          {
            "to": 20.0,
            "doc_count": 1,
            "avg_height": {
              "value": 160.0
            }
          },
          {
            "from": 20.0,
            "to": 25.0,
            "doc_count": 0,
            "avg_height": {
              "value": null
            }
          },
          {
            "from": 25.0,
            "to": 30.0,
            "doc_count": 2,
            "avg_height": {
              "value": 160.0
            }
          },
          {
            "from": 30.0,
            "doc_count": 1,
            "avg_height": {
              "value": 173.0
            }
          }
        ]
      },
      {
        "term": "male",
        "doc_count": 3,
        "age_groups": [
          {
            "to": 20.0,
            "doc_count": 0,
            "avg_height": {
              "value": null
            }
          },
          {
            "from": 20.0,
            "to": 25.0,
            "doc_count": 1,
            "avg_height": {
              "value": 175.0
            }
          },
          {
            "from": 25.0,
            "to": 30.0,
            "doc_count": 0,
            "avg_height": {
              "value": null
            }
          },
          {
            "from": 30.0,
            "doc_count": 2,
            "avg_height": {
              "value": 178.5
            }
          }
        ]
      }
    ]
  }
}

jprante commented 11 years ago

@uboness

With facet labels, a caller might be able to pass a map of codes and string values (the labels). If the aggregation completes and should list the values in the entries, they are matched against the codes to obtain a label. So, codes as field values could drive e.g. language dependent visualization, without extra lookup loop by the caller.

The reason why order asc/desc is not enough is because it always assume Unicode canonical sorting order. For multilingual texts localized to an environment, this does not suffice. For example, I need german phone book sorting order not only in sorting fields but also in facet entries. It would be nice to have Unicode locale- and collation-aware sorting of entries. Explained here for Java http://docs.oracle.com/javase/tutorial/i18n/text/collationintro.html ICU has much more sophisticated collations http://userguide.icu-project.org/collation/concepts

Another example to have custom sorting is natural sort order.

See also my pull requests, for ICU facets

https://github.com/elasticsearch/elasticsearch-analysis-icu/pull/7

and for collation-based sort keys

https://github.com/elasticsearch/elasticsearch/pull/2338

Thanks, and keep up the good work!

jrick1977 commented 11 years ago

@uboness Perfect!

uboness commented 11 years ago

@jprante

With facet labels, a caller might be able to pass a map of codes and string values (the labels). If the aggregation completes and should list the values in the entries, they are matched against the codes to obtain a label. So, codes as field values could drive e.g. language dependent visualization, without extra lookup loop by the caller.

gotcha

The reason why order asc/desc is not enough...

sure... I guess I misunderstood you, from order direction point of view asc/desc is enough... it's just that we need to be able to make the order object extensible to support things like collation... ie:

"order" : {
   "by" : "name",
   "direction" : "asc",
   ...
}

I think it makes sense to support the above as well (and the form we have today for simplicity).

lukas-vlcek commented 11 years ago

Looking pretty nice!

If I may one thing: At the beginning I was confused by the terminology a bit. After some time I realized that one can think of this in terms of relational-algebra operations used in traditional SQL.

Bucket Aggregator -> Grouping
Calc Aggregator -> [vanilla] Aggregator

Nice definition of these operations in context of Map-Reduce can be found in Chapter 2. (p.32.), Mining of Massive Datasets. Maybe using terms Grouping and Aggregation would make it sound more familiar to people with "traditional / old-fashioned" background?

Also it seems to me that the terms Aggregation and Aggregator are somehow interchangeable? At least from the end user perspective it would not hurt to get rid of one of them? At least I do not see what role the Aggregator (as an computational unit) plays in this for now. May be it will make more sense from the Java API perspective later?

In the end some questions:

Do the buckets have to be distinct or can they overlap?
Would it make sense for a single bucket to have more then a one calc leave associated with it?

jprante commented 11 years ago

+1 for "Grouping" and "Aggregation" terms

uboness commented 11 years ago

@lukas-vlcek @jrick1977

thanks for this feedback!

Reg. changing terminology... one thing to note here is that both from impl. & user perspectives, both the grouping action and the aggregating actions (as you refer to them) are a type of an aggregation. Wherever you can define an aggregation, you can either put a calc or a bucket aggregation there. For this reason we need to have one name to refer to both, and we believed aggregations or aggs fit best. It's a framework that supports different types of aggregations. Changing the name "bucket" to "group" is fine... if the feedback we get is that it's a more fitting name, we'll do that, but we still need to refer to them as aggregations (just of different kind). Making this distinction in the terminology will require changing the API to reflect that, and by that will most likely make the API more verbose/complex.

btw, if you have a better name of calc aggregations feel free to suggest that (we're kinda on the fence reg. this name).

Also it seems to me that the terms Aggregation and Aggregator are somehow interchangeable? At least from the end user perspective it would not hurt to get rid of one of them? At least I do not see what role the Aggregator (as an computational unit) plays in this for now. May be it will make more sense from the Java API perspective later?

Aggregation & Aggregator are two different things. From the user perspective, you don't need to know aggregator, just aggregation. The way you can look at it - an aggregator is the dynamic runtime representation of the aggregation. It does the actual aggregation job, and its output is the corresponding aggregation - which can be seen as a basic "static" data structure that holds the result of the aggregation. (if you compare it to facetings, aggregator is like facet executor and aggregation is like facet).

I agree that the user documentation should probably not mention aggregators at all... we'll fix that once we have formal docs for it.

Do the buckets have to be distinct or can they overlap?

Yes... for example you can have multiple ranges that overlap each other... no restriction from impl perspective.

Would it make sense for a single bucket to have more then a one calc leave associated with it?

It could, yeah... a simple example would be to get 2 different stats aggregation on 2 different fields

rmattler commented 11 years ago

Could you please comment if aggregation will solve my use case?

If I have documents with these values.

_source: {date: 01.01.2013 desc: XXX value: 100}
_source: {date: 02.01.2012 desc: XXX value: 200}
_source: {date: 03.01.2011 desc: XXX value: 300}
_source: {date: 04.01.2011 desc: YYY value: 400}
_source: {date: 05.01.2011 desc: YYY value: 500}

I need to produce:

Desc   Last Date    Last Value
XXX    01.01.2013   100
YYY    05.01.2011   500

I need to get the documents with the most recent date for that desc so I can pull the value off of it. If aggregation can not give me the document would it be possible to give the document id with the most recent date for that desc? And I could use an id filter to get the documents.

Thanks for your time.

mattweber commented 11 years ago

@rmattler This should be asked on the mailing list. Ask it there and myself and others will actually respond.

roytmana commented 11 years ago

Very exciting - just what we need. One question I have is about the consistency of the results given the distributed nature of the calculations. Now that ES is moving into the territory of BI it is even more important. The current facet implementation does not guarantee correct counts when ordered by count, because of the distributed calculation and subsequent collation of the results. Will the aggregation framework make any such guarantees? Without them BI applications will suffer greatly, as business users must have exact and not approximate results. The total count or sum must stay the same no matter how we aggregate inside a given bucket.

Thanks,
Alex

markharwood commented 11 years ago

BI applications will suffer greatly as business users must have exact and not approximate results

Some additional thoughts about approaches for building user faith in numbers (dealing with fuzziness and the need for context):

As we move into BI a worry of mine is that search engines are not databases and they are designed to produce fuzzy sets (elements belong to the result set to varying degrees). In our search apps we frequently fail to explain to end users that the results can vary massively in match quality so they should not always put too much stock in any exact numbers we show. I think the first ES facet I wrote was the one that "buckets" the match scores of docs so you can draw a quality distribution for all of the search results. It was largely for my own benefit to view the "long tail" of low-quality matches for various query types (more like this, fuzzy etc). I can think of a number of approaches to coping-with-fuzzy-sets, none of which are ideal:

a) Offer clients tools to first "trim" the long tail of crap e.g. only facet on the top N matches
b) Facet summaries can have the option to aggregate quality scores rather than absolute doc counts
c) "Fuzzy" criteria/result sets are automatically spotted and facets not offered or suitable accuracy warnings displayed alongside any aggregations
d) Do nothing - educate users about interpreting results

Users also have to consider any natural skews in the data and it is often useful to put the numbers (inaccurately matched as they may be) into some sort of context:

The ability to express geo coverage as % of a background distribution: ( for why, see http://xkcd.com/1138/ )
"Top" term selection criteria for term-based facets need not always be "most popular" - Mike McCandless and I got into that on his Jira search project: http://goo.gl/vU73gc
Time-based buckets should be able to diff against corpus stats due to possible fluctuations in corpus size e.g. indeed.com plot skills as a % of all job ads: http://www.indeed.com/jobtrends?q=elasticsearch%2C+solr&l=

In all these scenarios, some background source of stats is required alongside the query result counts as context. The source of this data is typically found looking into the whole corpus.

For performance or API complexity reasons we may choose not to factor any support for these concerns into the core ES design, but it is at least worth acknowledging these issues exist while we are in the design phase.

Cheers Mark

roytmana commented 11 years ago

Mark,

One of the ways I deal with it now is by calculating two extra totals per, say, terms stats facet using stat facets - one total for the query and the other total for "missing", for example if the facet field is blank. With these two totals I can produce at least a consistent grand total, missing and other counts/sums: even if the returned top facets are not 100% correct, the other bucket captures the diff from the correct grand total. Then the user can request more facet values to be fetched and the grand total will stay correct

I wish terms stats facet got missing and other totals consistent with terms facet to make it easier to deal with it but that of course does not solve consistency issue it just makes it less noticeable.

I understand performance concerns of the consistency but ideally it would be left to the implementor to choose slower over fuzzy if only you supported consistency guarantees

fterrier commented 11 years ago

+1 for the automagically "range sizes" like @pecke01 said

btiernay commented 11 years ago

I'm curious if this feature will support my use case which can be described as "nested query aggregation on a per hit basis". The difference here is that the aggregation context would be a hit and its direct and indirect sub-documents and not a top level doc set. A key difference is that the aggregations could be used to compute a score which can in turn be globally sorted upon. This is a departure from the Aggregations Module which appears to not have the concept of a hit.

In SQL terms, this would be the equivalent of a subquery in both a select list projection and order by clause. Although this type of aggregation is currently supported in a limited fashion using scripting and custom scoring, some types of aggregation are simply not possible in an efficient manner (e.g. number of grandchildren with a unique field value).

@martijnvg I'd be curious to get your insight here since you seem to be the "nested guy" :)

netconstructor commented 11 years ago

A+ guys... can't wait for these enhancements to get finalized!

bobrik commented 11 years ago

@uboness, I tried to use this (features/aggregations branch) with actual data and got weird results.

{
  "took" : 3818,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1560639,
    "max_score" : 1.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "countries" : {
      "terms" : [ {
        "term" : "RU",
        "doc_count" : 32763895
      }, {
        "term" : "UA",
        "doc_count" : 10192620
      }, {
        "term" : "BY",
        "doc_count" : 2970523
      }, {
        "term" : "TR",
        "doc_count" : 1982124
      }, {
        "term" : "US",
        "doc_count" : 1045321
      } ]
    }
  }
}

Why total hits is 1560639, but doc_count with term RU is 32763895?

Simplified query looks like this:

{
    "size": 0,
    "filter": {
        "term": {
            "@key": "whatever"
        }
    },
    "aggs": {
        "countries": {
            "terms": {
                "field": "cnt",
                "size": 5
            }
        }
    }
}

Putting filter inside of countries doesn't help. Am I missing something?

uboness commented 11 years ago

@bobrik with aggs we removed the option to put a filter under each agg. instead, we introduced the filter agg.

in the example above, you're using a top level filter, which (by its nature) only applies to the query hits you get back, not to the facet. you can either use a filtered query or if you wish you can nest the countries agg inside a filter agg as in:

{
    "aggs": {
        "whatever_countries" : {
            "filter" : { "term" : { "@key" : "whatever" }},
            "aggs" : {
                "countries": {
                    "terms": {
                        "field": "cnt",
                        "size": 5
                    }
                }
            }
        }
    }
}
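A toy Python model of these scoping rules may help. It is not how the engine is implemented. The search function and its parameters are invented here purely to show which documents each part of the request gets to see.

```python
from collections import Counter

def search(docs, query=None, top_level_filter=None, agg_filter=None):
    """Toy model of the scoping rules described above (not the real engine)."""
    matched = [d for d in docs if query is None or query(d)]
    # The top level filter narrows the returned hits only.
    hits = [d for d in matched if top_level_filter is None or top_level_filter(d)]
    # Aggregations see everything the query matched,
    # unless they are nested inside a filter aggregation.
    agg_scope = [d for d in matched if agg_filter is None or agg_filter(d)]
    countries = Counter(d["cnt"] for d in agg_scope)
    return {"hits_total": len(hits), "countries": dict(countries)}

def is_whatever(doc):
    return doc["@key"] == "whatever"

docs = [
    {"@key": "whatever", "cnt": "RU"},
    {"@key": "whatever", "cnt": "UA"},
    {"@key": "other", "cnt": "RU"},
    {"@key": "other", "cnt": "RU"},
]
print(search(docs, top_level_filter=is_whatever))  # counts ignore the filter
print(search(docs, agg_filter=is_whatever))        # counts respect the filter
print(search(docs, query=is_whatever))             # filtered query: both narrowed
```

The first call reproduces the surprise from the report above: fewer hits than the largest doc_count, because the counts were computed over the unfiltered query scope.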

bobrik commented 11 years ago

@uboness I tried putting filter inside of countries aggregation.

{
    "size": 0,
    "aggs": {
        "countries": {
            "filter": {
                "term": {
                    "@key": "whatever"
                }
            },
            "terms": {
                "field": "cnt",
                "size": 5
            }
        }
    }
}

But even if I specify non-existing term, I get same results. Looks like filter is not applied at all:

{
  "took" : 3736,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 65751499,
    "max_score" : 1.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "countries" : {
      "terms" : [ {
        "term" : "RU",
        "doc_count" : 32763895
      }, {
        "term" : "UA",
        "doc_count" : 10192620
      }, {
        "term" : "BY",
        "doc_count" : 2970523
      }, {
        "term" : "TR",
        "doc_count" : 1982124
      }, {
        "term" : "US",
        "doc_count" : 1045321
      } ]
    }
  }
}

Filtered query works as expected, btw.

I also noticed that there's no way to get a terms_stats-like aggregation. In my case I want to select the 10 terms for which some field has the maximum sum across events. This is terms_stats ordered by total. Is there a reason why this was not ported to aggregations?

brusic commented 11 years ago

@bobrik, I have not tried it yet, but what Uri mentioned is to embed your countries agg inside a filter agg.

uboness commented 11 years ago

@bobrik regarding the filters, read my response above (and @brusic's comment)

as for terms stats, with the new aggs framework there's no need for special aggs (like the terms_stats facet) as you can build composite aggs. For the case you describe above, you have the terms agg which can have a sum agg as its sub-aggregation.

{
    "aggs" : {
        "countries" : {
            "terms" : {
                "field": "cnt",
                "size": 5,
                "order" : { "sum_of_my_field" : "desc" }
            },
            "aggs" : {
                "sum_of_my_field" : { "sum" : { "field" : "my_field" } }
            }
        }
    }
}

I really recommend reading the content of the ticket above; it'll give you enough background on the aggregation framework to understand how to use it properly

bobrik commented 11 years ago

My bad with filter aggregation — now it works.

About sub-aggregation: with terms + sum aggregation I cannot get only top N terms, I need to get them all in order to find out which ones are top N.

Imagine a kibana topN query where topN is determined by the sum of some field. Now it needs two requests: one to get the terms (terms_stats facet) and another to get a date_histogram for each term. It looks like with aggregations it will need the same two requests. I wondered if I could collapse the two requests into one. When you need to query 30 days of data with 7 GB of data per day, it probably matters. It seems that I need to get all terms and do some extra work on the client to fit everything into a single request.

Let's look at the second request. Now it contains as many facets as the number of terms you want. With aggregations it could be collapsed into one aggregation with a filter like field:(val1 OR val2 OR val3). If I want to get topN + rest, I will need a second aggregation where the filter looks like field:(NOT (val1 OR val2 OR val3)). Am I right? Maybe (maybe not) it would be faster to omit the two filters, because we actually process all documents. In this case we need to specify the terms list explicitly, but we'll get "other" for free.
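
To make the "topN + rest" idea concrete, here is a rough Python sketch that builds such a request body. It is only an illustration under assumptions: the `terms` and `bool`/`must_not` filters follow the filter DSL of that time, the `@timestamp` field and `day` interval are placeholders not taken from this thread, and the helper name is invented.

```python
import json


def top_and_rest_request(field, top_terms, date_field="@timestamp", interval="day"):
    """Build one request with two filter aggs: the listed terms and the rest.

    Assumptions (not from the thread): `date_field` and `interval` are
    hypothetical placeholders; `top_terms` comes from an earlier request.
    """
    histogram = {"date_histogram": {"field": date_field, "interval": interval}}
    # field:(val1 OR val2 OR val3) expressed as a terms filter
    in_top = {"terms": {field: list(top_terms)}}
    return {
        "size": 0,
        "aggs": {
            "top": {
                "filter": in_top,
                "aggs": {
                    # one bucket per listed term, each with its own histogram
                    "per_term": {
                        "terms": {"field": field, "size": len(top_terms)},
                        "aggs": {"over_time": histogram},
                    }
                },
            },
            "rest": {
                # field:(NOT (val1 OR val2 OR val3))
                "filter": {"bool": {"must_not": in_top}},
                "aggs": {"over_time": histogram},
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(top_and_rest_request("cnt", ["RU", "UA", "BY"]), indent=4))
```

This still needs the list of top terms up front, so it replaces only the second request, not the first.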

After writing all of this, I think it's easier to have a single request with all terms.

Is there a way to disable the size limit for the terms aggregation? The terms stats facet had all_terms to do this.

nilsga commented 11 years ago

Is this available in the current beta release?

clintongormley commented 11 years ago

@nilsga not yet, but coming soon :)

brusic commented 11 years ago

@nilsga, you can use the aggregations branch, but you would have to build it yourself: https://github.com/elasticsearch/elasticsearch/tree/features/aggregations

It is only slightly behind the master branch.

nilsga commented 11 years ago

Thanks, @brusic and @clintongormley. I built the aggregation feature branch, and it seems to be just what I need. I was happy to see that it supports multiple levels of sub aggregations which allows for a very flexible hierarchy of bucketing and aggregations! I'm really looking forward to this feature, and hope it will be ready for production usage very soon! :)

lmenezes commented 11 years ago

@uboness awesome :) will give it a go!

alexsv commented 10 years ago

I'm trying the aggregations framework; I built it yesterday. According to the documentation, the count aggregation name is 'count', but according to the code I should write 'value_count'. Which one will be correct in the future?

uboness commented 10 years ago

hehe.. good catch.. actually the catch here is that we forgot to put in the documentation for value_count altogether. This issue was initially created as a guideline... not as the official docs; the official docs are part of the source. In any case, it's changed to value_count (we figured it better expresses what we're actually counting :))
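
For reference, a request using the renamed aggregation might look like the sketch below. Only the aggregation type `value_count` comes from this thread; the field `grade` and the aggregation name `field_count` are hypothetical, and the snippet merely builds the request body in Python.

```python
import json


def value_count_request(field, name="field_count"):
    """Build a request body with a single value_count aggregation.

    `name` and the field passed in are illustrative placeholders; only
    the aggregation type `value_count` is taken from the discussion.
    """
    return {"size": 0, "aggs": {name: {"value_count": {"field": field}}}}


if __name__ == "__main__":
    print(json.dumps(value_count_request("grade"), indent=4))
```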

uboness commented 10 years ago

@revendless-team

Typically you'd go with option 1 as that's the typical nature of a document structure, e.g.:

{
    "name" : "John Doe",
    "address" :  {
        "district" : "Brooklyn",
        "city" :  "New York",
        "country" : "USA"
    }
}

You got the aggregation structure right for option 1:

{
    "aggs" : {
        "countries" : {
            "terms" : { "field" : "country" },
            "aggs" : {
                "cities" : {
                    "terms" : { "field" : "city" },
                    "aggs" : {
                        "districts" : {
                            "terms" : { "field" : "district" }
                        }
                    }
                }
            }
        }
    }
}
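
Since every level follows the same pattern, the nesting can also be generated. The following is a small Python sketch (the helper name `nested_terms_aggs` is invented for illustration) that builds the same structure as the request above from a list of (name, field) pairs.

```python
import json


def nested_terms_aggs(levels):
    """Nest one terms agg per (name, field) pair, outermost level first."""
    aggs = None
    # Build from the innermost level outwards so each level can embed
    # the already-built deeper levels as its sub-aggregations.
    for name, field in reversed(levels):
        node = {"terms": {"field": field}}
        if aggs is not None:
            node["aggs"] = aggs
        aggs = {name: node}
    return {"aggs": aggs} if aggs else {}


if __name__ == "__main__":
    levels = [("countries", "country"), ("cities", "city"), ("districts", "district")]
    print(json.dumps(nested_terms_aggs(levels), indent=4))
```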

You could apply option 2 using the same logic (with this format, the children field is a single object and not an array), though it makes the structure unnecessarily complex:

{
    "country" : {
        "name" : "USA",
        "city" : {
            "name" : "New York",
            "district" : {
                "name" : "Brooklyn"
            }
        }
    }
}

Aggregation:

{
    "aggs" : {
        "countries" : {
            "terms" : { "field" : "country.name" },
            "aggs" : {
                "cities" : {
                    "terms" : { "field" : "country.city.name" },
                    "aggs" : {
                        "districts" : {
                            "terms" : { "field" : "country.city.district.name" }
                        }
                    }
                }
            }
        }
    }
}

Regarding option 3, you could also aggregate that using the nested aggregation. This, though, is a really unnatural document structure (not sure why you think it's the cleanest one of the three), and the aggregations become very heavy and very complex. Just to give an idea of the aggregation complexity... here's how you'd extract the first level only:

{
    "aggs" : {
        "locations" : {
            "nested" : {
                "path" : "location"
            },
            "aggs" : {
                "countries" : {
                    "filter" : { "term" : { "level" : 1 }},
                    "aggs": {
                        "names" : {
                            "terms" : { "field" : "location.name" },
                            "aggs" : {
                                // aggregations on more levels in a similar structure to this one
                            }
                        }
                    }
                }
            }
        }
    }
}

Re Beta2, hopefully in the coming week

revendless-team commented 10 years ago

@uboness thanks a lot, this is really helpful! :)

revendless-team commented 10 years ago

How is it possible to sub-aggregate document objects like this?

{
    "name" : "John Doe",
    "address" :  {
        "district" : {
            "id": "3",
            "title": "Brooklyn",
            "url": "/brooklyn"
        },
        "city" : {
            "id": "2",
            "title": "New York",
            "url": "/brooklyn-new-york"
        },
        "country" : {
            "id": "1",
            "title": "USA",
            "url": "/brooklyn-new-york-usa"
        }
    }
}

We want to get the id, term, url and count combined for each aggregated facet, not just the "title" term. (See https://github.com/elasticsearch/elasticsearch/issues/256)

We already tried this query (but it doesn't work):

{
    "aggs" : {
        "level0" : {
            "terms" : { "fields" : [ "address.district.id", "address.district.title", "address.district.url" ] },
            "aggs" : {
                "level1" : {
                    "terms" : { "fields" : [ "address.city.id", "address.city.title", "address.city.url" ] },
                    "aggs" : {
                        "level2" : {
                            "terms" : { "fields" : [ "address.country.id", "address.country.title", "address.country.url" ] }
                        }
                    }
                }
            }
        }
    }
}

jrick1977 commented 10 years ago

One additional question on aggregations. Do you allow me to filter out root documents based on the value of the aggregation? In SQL we would think of this as a HAVING clause.

dominiek commented 10 years ago

Great job on this. I can't wait for this to become available in the stable version. It's mindblowingly awesome.

Question: are there plans to add a 'from' option? For example, this query aggregates the top 5 countries:

{
    "size": 0,
    "aggs": {
        "countries": {
            "terms": {
                "field": "cnt",
                "size": 5
            }
        }
    }
}

But what if I just wanted the 'next 5'? Normally you would supply a 'from' parameter:

{
    "size": 0,
    "aggs": {
        "countries": {
            "terms": {
                "field": "cnt",
                "from": 5,
                "size": 5
            }
        }
    }
}

Unfortunately this doesn't seem to work. Having something like this will help a lot in making these aggregations available for standard user interfaces.
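
Until something like that exists, one possible client-side workaround (not something proposed in this thread, and only reasonable for small offsets) is to request `from + size` terms and drop the leading ones. A minimal Python sketch, with invented helper names and assuming the buckets come back as an ordered list:

```python
def paged_terms_request(field, offset, size):
    """Ask for offset + size terms; the caller discards the first `offset`."""
    return {
        "size": 0,
        "aggs": {
            "countries": {"terms": {"field": field, "size": offset + size}}
        },
    }


def take_page(buckets, offset, size):
    """Slice the ordered bucket list down to the requested page."""
    return buckets[offset:offset + size]


if __name__ == "__main__":
    # "Next 5" after the top 5: fetch 10 terms, keep entries 5..9.
    print(paged_terms_request("cnt", offset=5, size=5))
```

The cost grows with the offset, since every page re-computes all the terms before it.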

uboness commented 10 years ago

@jrick1977 no, we don't have support for that

@dominiek https://github.com/elasticsearch/elasticsearch/issues/4294

jrick1977 commented 10 years ago

@uboness I am wondering if there is a technical reason it is not supported or is it just a question of priority? This is something we are really in need of.
