JosephTucci opened 3 years ago
@iverase, as an HLL contributor, I'm curious about your thoughts on this proposal.
Pinging @elastic/es-analytics-geo (Team:Analytics)
Hey @JosephTucci,
Sorry for the late reply but I was OOO last week. I understand what you are looking for, but I don't think the way to achieve it is by adding a new parameter. I think this is more related to https://github.com/elastic/elasticsearch/issues/64777, where you want to sample the documents before aggregating.
The difference is that in your proposal you only consider the first N documents, whereas in the sampling proposal you consider N random documents.
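For reference, a rough sketch of what sampling before aggregating can look like today with the existing `sampler` aggregation (it keeps the top-scoring `shard_size` documents per shard rather than a truly random subset; the index and field names here are illustrative):

```
POST /student-teacher-index/_search
{
  "size": 0,
  "aggs": {
    "sample": {
      // keep at most 100k top-scoring documents per shard
      "sampler": { "shard_size": 100000 },
      "aggs": {
        // HLL-based distinct count over the sampled documents only
        "distinct_students": {
          "cardinality": { "field": "student_id" }
        }
      }
    }
  }
}
```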
@iverase Great! I am happy to see there is a proposal to address this issue across more aggregations.
@iverase your suggestion worked perfectly for our indices that have distinct entities. Thank you!
We have another, larger set of indices that represents the relationship between a student and a teacher. When performing any terms aggregation, I must add a cardinality sub-aggregation to get the distinct counts. For large sets of matching documents (1M+) I see performance start to scale linearly with the number of matching documents.
Can you recommend any other approaches to investigate?
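For context, the queries in question look roughly like this (index and field names assumed from the description above):

```
POST /student-teacher-index/_search
{
  "size": 0,
  "aggs": {
    "by_grade": {
      "terms": { "field": "grade" },
      "aggs": {
        // per-bucket distinct count; this is the part whose cost
        // grows with the number of matching documents
        "distinct_students": {
          "cardinality": { "field": "student_id" }
        }
      }
    }
  }
}
```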
I want to compute, on the fly, the cardinality over potentially 50M+ documents for a vast number of user-generated filters:

- How many students in grades {9, 10, 11} have an average above 80? And how many teachers taught those children?
- How many male students studied Physics in 8th grade? And how many teachers taught those children?
- ...

Transforming the documents into one per distinct student does not work, because I would have to bind all of a student's teachers, potentially hundreds, to a single document in order to filter on teacher statistics as well:

- How many students were taught by a teacher without a Masters degree in 2018? And how many teachers taught those children?
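Each of these questions maps to a filtered query with two cardinality aggregations, one per side of the relationship. A sketch for the first question (field names are assumed):

```
POST /student-teacher-index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "terms": { "grade": [9, 10, 11] } },
        { "range": { "average": { "gt": 80 } } }
      ]
    }
  },
  "aggs": {
    // both distinct counts come back from a single request
    "distinct_students": { "cardinality": { "field": "student_id" } },
    "distinct_teachers": { "cardinality": { "field": "teacher_id" } }
  }
}
```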
I've tried having separate student and teacher indices and joining on the client side. This was too slow, as it can mean pulling down millions of documents when I only want a count.
I will always return the number of distinct students and the number of distinct teachers. I could see this applying to other complex models that push the bounds of Elasticsearch: doctor-patient, salesRep-customer, and other A-B relationships.
Is there an appetite for the following: when computing a cardinality aggregation, allow specifying an optional maximum number of documents to sample. When that maximum is reached, the underlying algorithm runs over only the sampled documents, and the final estimate is obtained by multiplying the sampled distinct count by the ratio of total matching documents to sampled documents. This would trade accuracy for response time.
For example: assume an index named student-teacher-index containing a document for each student for each teacher that taught that student, with each student and teacher having their own id inside the document.

If the index currently contains 1 billion documents and a distinct count of students is requested with a sample maximum of 1M, Elasticsearch performs HLL on the first 1M documents, yielding, say, 200k distinct students. The estimate is then:

1 billion / 1 million = 1,000, so 200k distinct students * 1,000 = ~200M distinct students.

This speeds up compute time, as Elasticsearch only has to hash 1M documents instead of 1B.
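A sketch of what such a request might look like (the `max_sample_documents` parameter is hypothetical and does not exist in Elasticsearch; the name is only illustrative):

```
POST /student-teacher-index/_search
{
  "size": 0,
  "aggs": {
    "distinct_students": {
      "cardinality": {
        "field": "student_id",
        // hypothetical: stop hashing after 1M documents and
        // scale the result by matching_docs / sampled_docs
        "max_sample_documents": 1000000
      }
    }
  }
}
```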