JosephTucci opened 3 years ago
@iverase, as an HLL contributor, I'm curious about your thoughts on this proposal.
Pinging @elastic/es-analytics-geo (Team:Analytics)
Hey @JosephTucci,
Sorry for the late reply but I was OOO last week. I understand what you are looking for, but I don't think the way to achieve it is by adding a new parameter. I think this is more related to https://github.com/elastic/elasticsearch/issues/64777, where you want to sample the documents before aggregating.
The difference is that in your proposal you only consider the first N documents, whereas in the sampling proposal you consider N random documents.
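For reference, a rough sketch of what sampling before aggregating can look like today with the existing `sampler` aggregation (it keeps the top-scoring `shard_size` documents per shard rather than a truly random subset; the index and field names here are illustrative):

```
POST /student-teacher-index/_search
{
  "size": 0,
  "aggs": {
    "sample": {
      // keep at most 100k top-scoring documents per shard
      "sampler": { "shard_size": 100000 },
      "aggs": {
        // HLL-based distinct count over the sampled documents only
        "distinct_students": {
          "cardinality": { "field": "student_id" }
        }
      }
    }
  }
}
```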
@iverase Great! I am happy to see there is a proposal to address this issue across more aggregations.
@iverase your suggestion worked perfectly for our indices that have distinct entities. Thank you!
We have another, larger set of indices that represents the relationship between a student and a teacher. When performing any terms aggregation, I must add a cardinality sub-aggregation to get the distinct counts. For large sets of matching documents (1M+) I see performance start to scale linearly with the number of matching documents.
Can you recommend any other approaches to investigate?
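For context, the queries in question look roughly like this (index and field names assumed from the description above):

```
POST /student-teacher-index/_search
{
  "size": 0,
  "aggs": {
    "by_grade": {
      "terms": { "field": "grade" },
      "aggs": {
        // per-bucket distinct count; this is the part whose cost
        // grows with the number of matching documents
        "distinct_students": {
          "cardinality": { "field": "student_id" }
        }
      }
    }
  }
}
```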
I want to compute, on the fly, the cardinality over potentially 50M+ documents for a vast number of user-generated filters:

- How many students in grades {9, 10, 11} have an average above 80? And how many teachers taught those children?
- How many male students studied Physics in 8th grade? And how many teachers taught those children?
- ...

Transforming the documents into one per distinct student does not work, because I would have to bind all of a student's teachers, potentially hundreds, to a single document in order to filter on teacher statistics as well:

- How many students were taught by a teacher without a Masters degree in 2018? And how many teachers taught those children?
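Each of these questions maps to a filtered query with two cardinality aggregations, one per side of the relationship. A sketch for the first question (field names are assumed):

```
POST /student-teacher-index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "terms": { "grade": [9, 10, 11] } },
        { "range": { "average": { "gt": 80 } } }
      ]
    }
  },
  "aggs": {
    // both distinct counts come back from a single request
    "distinct_students": { "cardinality": { "field": "student_id" } },
    "distinct_teachers": { "cardinality": { "field": "teacher_id" } }
  }
}
```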
I've tried having separate student and teacher indices and joining on the client side. This was too slow, as it can mean pulling down millions of documents when I only want a count.
I will always return the number of distinct students and the number of distinct teachers. I could see this applying to other complex models that push the bounds of Elasticsearch: doctor-patient, salesRep-customer, and other A-B relationships.
Is there an appetite for the following: when computing a cardinality aggregation, allow specifying an optional maximum number of documents to sample. When that maximum is reached, the underlying algorithm runs over only the sampled documents, and the final estimate is obtained by multiplying the sampled distinct count by the ratio of total matching documents to sampled documents. This would trade accuracy for response time.
For example: assume an index named student-teacher-index containing a document for each student for each teacher that taught that student, with each student and teacher having their own id inside the document.

If the index currently contains 1 billion documents and a distinct count of students is requested with a sample maximum of 1M, Elasticsearch performs HLL on the first 1M documents, yielding, say, 200k distinct students. The estimate is then:

1 billion / 1 million = 1,000, so 200k distinct students * 1,000 = ~200M distinct students.

This speeds up compute time, as Elasticsearch only has to hash 1M documents instead of 1B.
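A sketch of what such a request might look like (the `max_sample_documents` parameter is hypothetical and does not exist in Elasticsearch; the name is only illustrative):

```
POST /student-teacher-index/_search
{
  "size": 0,
  "aggs": {
    "distinct_students": {
      "cardinality": {
        "field": "student_id",
        // hypothetical: stop hashing after 1M documents and
        // scale the result by matching_docs / sampled_docs
        "max_sample_documents": 1000000
      }
    }
  }
}
```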