[ML] Counts of missing terms from aggregations should be available

elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine

Original comment by @sophiec20:

When analysing using pre-aggregated data (terms aggregations), only the top terms are returned.

Depending upon the cardinality of the data (and the size configured) this could mean that some terms are being dropped on the floor.

Terms aggregations return missing term counts, i.e. sum_other_doc_count and doc_count_error_upper_bound. A meaningful representation of these values should be returned with datafeeds/{datafeed_id}/_stats.
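To illustrate which values are meant, here is a minimal sketch that totals the two fields over an aggregated search response. It assumes a date histogram with a terms aggregation nested in each bucket; the aggregation names buckets_by_time and by_airline, the sample data, and the helper missing_term_counts are hypothetical, and ignoring negative error bounds is a defensive assumption rather than documented datafeed behaviour.

```python
from typing import Any, Dict


def missing_term_counts(
    aggs: Dict[str, Any], histogram_name: str, terms_name: str
) -> Dict[str, int]:
    """Total the 'missing term' fields of a terms agg nested in a histogram."""
    dropped_docs = 0
    error_bound = 0
    for bucket in aggs[histogram_name]["buckets"]:
        terms = bucket[terms_name]
        # Documents that fell outside the returned top terms.
        dropped_docs += terms.get("sum_other_doc_count", 0)
        # Assumption: treat a negative (unknown) error bound as zero.
        error_bound += max(terms.get("doc_count_error_upper_bound", 0), 0)
    return {
        "sum_other_doc_count": dropped_docs,
        "doc_count_error_upper_bound": error_bound,
    }


# Hypothetical "aggregations" section of a search response.
sample = {
    "buckets_by_time": {
        "buckets": [
            {
                "key": 0,
                "doc_count": 120,
                "by_airline": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 20,
                    "buckets": [
                        {"key": "AAL", "doc_count": 60},
                        {"key": "BAW", "doc_count": 40},
                    ],
                },
            },
            {
                "key": 60000,
                "doc_count": 90,
                "by_airline": {
                    "doc_count_error_upper_bound": 3,
                    "sum_other_doc_count": 15,
                    "buckets": [
                        {"key": "AAL", "doc_count": 50},
                        {"key": "BAW", "doc_count": 25},
                    ],
                },
            },
        ]
    }
}

totals = missing_term_counts(sample, "buckets_by_time", "by_airline")
print(totals)
```

A non-zero sum_other_doc_count here would mean that some documents were never seen by the job because their terms did not make the top terms for that time bucket.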

Note: We intend to use aggs by default for Single Timeseries, and the end-user can already manually configure them in the Advanced Job Config. We may include a check box in the Advanced Config to generate the aggs syntax. Therefore missing values are likely, which would lead to poor quality of results.

Note: Discussion still pending as to whether pre-aggregated data is an experimental feature.

cc @colings86

There are different possible approaches here:

1. A running datafeed keeps track of the relevant stats. The datafeed stats endpoint returns only the stats from the currently running datafeed. Nothing extra is persisted to an index and when you stop the datafeed you lose the stats.
2. As 1., but additionally a notification is written to .ml-notifications if the counts of dropped terms exceed an acceptable level.
3. Introduce a new type of result, similar to data counts but for datafeeds, that persists the stats so that they are available even after stopping the datafeed.

Obviously 3. is by far the most complicated option. A couple of points to note are:

Datafeeds do not currently have an associated index. They could indirectly obtain one, by looking at the results index of their associated job. But this is binding datafeeds and jobs ever more tightly together, which will increase the complexity of ever trying to use datafeeds to feed data to something else in the future.
Would we really want to create and manage a new index to store these datafeed stats, when in the happy-day scenario they're all zero?
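The first two approaches could be sketched roughly as follows. This is only an illustration under stated assumptions: the class DatafeedTermStats, its fields, and the dropped_doc_threshold parameter are hypothetical names, not part of the existing datafeed code, and the returned message merely stands in for whatever would be written to .ml-notifications.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatafeedTermStats:
    """In-memory counters that exist only while the datafeed is running."""

    dropped_doc_threshold: int = 0  # hypothetical "acceptable level"
    searches: int = 0
    sum_other_doc_count: int = 0
    doc_count_error_upper_bound: int = 0

    def record(self, other_docs: int, error_bound: int) -> Optional[str]:
        """Add the counts from one search; return a message if over threshold."""
        self.searches += 1
        self.sum_other_doc_count += other_docs
        self.doc_count_error_upper_bound += error_bound
        if self.sum_other_doc_count > self.dropped_doc_threshold:
            # Approach 2: the caller would turn this into a notification.
            return (
                f"terms aggregation left out {self.sum_other_doc_count} documents "
                f"(threshold {self.dropped_doc_threshold})"
            )
        return None


stats = DatafeedTermStats(dropped_doc_threshold=30)
first = stats.record(other_docs=20, error_bound=0)   # below threshold
second = stats.record(other_docs=15, error_bound=3)  # cumulative 35, above threshold
print(first, second)
```

Because the counters live only in the object, they disappear when the datafeed stops, which is exactly the limitation that the third approach would address by persisting them.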


[ML] Counts of missing terms from aggregations should be available #29761