elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana
Other
19.48k stars 8.04k forks source link

[Lens] Can't create a Lens visualization that produces accurate count(distinct) on field values #179934

Open bradquarry opened 3 months ago

bradquarry commented 3 months ago

I’m trying to figure out how I can run a count(distinct) query in Lens and generate precise results. Our Unique Count aggregation in Lens is producing wrong results vs an external deterministic unique count using python. I imagine it’s because we are using the imprecise cardinality aggregation in Lens, which chooses between two estimation algorithms for counts and does not use a deterministic approach. If true, this isn't good as financial services customers need to use Lens for precise high cardinality counts (millions of accounts) for daily reporting needs and you can't have inaccurate counts.

Anyway, to get an accurate distinct count I have to run a terms aggregation followed by a sum_bucket agg like so. The partitioning will be required as the counts are in the millions and I don't want to hit circuit breakers due to memory constraints.

How can I execute the type of query below in Lens to guarantee accurate unique count output every time from very high cardinality unique count aggregations? What if I had a billion unique values? Auto partitioning would be great here.

GET test/_search
{
  “aggs”: {
    “counts”: {
      “terms”: { “field”: “field.keyword”,
      “include”: {
               “partition”: 0,
               “num_partitions”: 2
            }
      }
    },“sum_buckets”: {
      “sum_bucket”: {
        “buckets_path”: “counts>_count”
      }
    }
  }
}

GET test/_search
{
  “aggs”: {
    “counts”: {
      “terms”: { “field”: “field.keyword”,
      “include”: {
               “partition”: 1,
               “num_partitions”: 2
            }
      }
    },“sum_buckets”: {
      “sum_bucket”: {
        “buckets_path”: “counts>_count”
      }
    }
  }
}
elasticmachine commented 3 months ago

Pinging @elastic/kibana-visualizations (Team:Visualizations)

markov00 commented 3 months ago

Thanks @bradquarry, I've found the original issue, I will link this there so we can have also your suggestion/pow https://github.com/elastic/kibana/issues/179934

bradquarry commented 3 months ago

@markov00 I don't think my issue is directly related to the issue you linked to. No matter how you use the cardinality aggregation in Elastic it does does not guarantee deterministic results, it simply chooses between a "more accurate or less accurate guesstimation". Even if you enable the parameter suggested you are still using guestimation algorithms. At least, this is my understanding of the cardinality aggregation algorithms.

This issue focuses on providing customers with the ability to reliable and deterministic results for any unique count in Lens using a different aggregation strategy. Also, no matter what the data cardinality.

Some customers don't trust our product in financial services due to the use of the cardinality agg in Lens. Their unique counts that must be exact aren't exact and we need to change this perception to grow in this market.

dej611 commented 3 months ago

Lens is using available aggregations from Elasticsearch, as that cannot be computed client-side (unless loading the whole values dataset in client memory...). I've found a related issue on Elasticsearch: https://github.com/elastic/elasticsearch/issues/15876

If the Elasticsearch decides to address it then we could consider to support it in Kibana Lens as well.

bradquarry commented 3 months ago

I provided an example of how to do this above using a term's aggregation and a sum bucket. Use the above plus msearch and some automatic or manual partitioning then sum the results of msearch on the client side. It works as a workaround for my customer, but they can't use Kibana. I'm experimenting with Vega to help, but it's a learning experience.

dej611 commented 3 months ago

As far as I know terms aggregation is approximate as well: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#terms-agg-doc-count-error