elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[ML] Counts of missing terms from aggregations should be available #29761

Open elasticmachine opened 7 years ago

elasticmachine commented 7 years ago

Original comment by @sophiec20:

When analysing using pre-aggregated data (terms aggregations), only the top terms are returned.

Depending upon the cardinality of the data (and the size configured) this could mean that some terms are being dropped on the floor.

Terms aggregations return counts of the missing terms, i.e. sum_other_doc_count and doc_count_error_upper_bound. A meaningful representation of these values should be returned with datafeeds/{datafeed_id}/_stats.
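As a rough illustration of what a "meaningful representation" could be derived from, the sketch below reads the two fields out of a terms aggregation response body. The sample response, the helper name summarize_dropped_terms and the dropped_fraction measure are assumptions made for illustration; they are not part of the datafeed stats API.

```python
from typing import Any, Dict


def summarize_dropped_terms(terms_agg: Dict[str, Any]) -> Dict[str, float]:
    """Summarize how many documents fell outside the returned top terms.

    `terms_agg` is the body of a single terms aggregation as it appears in a
    search response. The function name and the output keys other than the two
    Elasticsearch fields are hypothetical.
    """
    buckets = terms_agg.get("buckets", [])
    # Documents covered by the buckets that were actually returned.
    returned = sum(bucket["doc_count"] for bucket in buckets)
    # Documents belonging to terms that did not make it into the response.
    other = terms_agg.get("sum_other_doc_count", 0)
    # Upper bound on the doc count error reported by the aggregation.
    error_bound = terms_agg.get("doc_count_error_upper_bound", 0)
    total = returned + other
    return {
        "returned_doc_count": returned,
        "sum_other_doc_count": other,
        "doc_count_error_upper_bound": error_bound,
        "dropped_fraction": other / total if total else 0.0,
    }


# Made-up sample response for a terms aggregation with size 2.
sample = {
    "doc_count_error_upper_bound": 5,
    "sum_other_doc_count": 250,
    "buckets": [
        {"key": "airline_a", "doc_count": 500},
        {"key": "airline_b", "doc_count": 250},
    ],
}
summary = summarize_dropped_terms(sample)
```

Here a quarter of the documents belong to terms that were not returned, which is the kind of figure an end user would need in order to judge result quality.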

Note: We intend to use aggs by default for Single Timeseries, and the end-user can already manually configure them in the Advanced Job Config. We may include a check box in the Advanced Config to generate the aggs syntax. Therefore missing values are likely, which would lead to poor quality of results.

Note: Discussion still pending as to whether pre-aggregated data is an experimental feature.

cc @colings86

droberts195 commented 6 years ago

There are different possible approaches here:

  1. A running datafeed keeps track of the relevant stats. The datafeed stats endpoint returns only the stats from the currently running datafeed. Nothing extra is persisted to an index and when you stop the datafeed you lose the stats.
  2. As 1., but additionally a notification is written to .ml-notifications if the counts of dropped terms exceed an acceptable level.
  3. Introduce a new type of result, similar to data counts but for datafeeds, that persists the stats so that they are available even after stopping the datafeed.
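To make approaches 1 and 2 concrete, here is a minimal sketch of an in-memory tracker that accumulates counts while a datafeed runs and records a single notification once the dropped fraction passes a threshold. The class name, the threshold value and the notification list are all hypothetical; a real implementation would live in the datafeed code and write to .ml-notifications rather than to a Python list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DroppedTermsTracker:
    """Hypothetical in-memory stats for one running datafeed (approach 1)."""

    threshold: float = 0.1  # assumed acceptable fraction of dropped documents
    returned_docs: int = 0
    other_docs: int = 0
    notified: bool = False
    notifications: List[str] = field(default_factory=list)

    def dropped_fraction(self) -> float:
        total = self.returned_docs + self.other_docs
        return self.other_docs / total if total else 0.0

    def record(self, returned: int, other: int) -> None:
        """Add the counts from one search response and check the threshold."""
        self.returned_docs += returned
        self.other_docs += other
        # Approach 2: emit one notification when the level becomes unacceptable.
        if not self.notified and self.dropped_fraction() > self.threshold:
            self.notified = True
            self.notifications.append(
                f"dropped fraction {self.dropped_fraction():.2f} "
                f"exceeds threshold {self.threshold:.2f}"
            )


tracker = DroppedTermsTracker(threshold=0.1)
tracker.record(returned=900, other=50)   # about 5% dropped, below threshold
tracker.record(returned=100, other=200)  # cumulative 20% dropped, above threshold
tracker.record(returned=100, other=200)  # still above, but no second notification
```

Because the tracker holds everything in memory, discarding it when the datafeed stops loses the stats, which is exactly the limitation that approach 3 would address by persisting them.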

Obviously 3. is by far the most complicated option. A couple of points to note are: