Open jtaylor-sfdc opened 11 years ago
Hi Jesse, when you finish an rc for this ticket, could you inform me?
Thanks Tony
Do you think we could optimize queries better if we had cardinality information for the table? If so, HyperLogLog might be a good choice.
Wow, that HyperLogLog is pretty interesting - thanks for the pointer. For stats, we're calculating them at major compaction, where a full pass is made through the data anyway, so I don't think it'll help there. But for COUNT DISTINCT and SELECT DISTINCT, it could definitely be useful.
It only gives the cardinality, not the unique values themselves. I'm wondering whether we could combine HyperLogLog and a Bloom filter on the column values to decide the strategy for aggregating the data. If so, that would be awesome.
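To make the "cardinality only" point concrete, here is a minimal HyperLogLog sketch in Python. It is an illustration of the general technique, not anything from this codebase: the class name, the choice of SHA-1 as the hash, and the default of 12 index bits are all assumptions made for the example. The sketch keeps one small integer per register and never stores the values it has seen, which is why it can answer an approximate COUNT DISTINCT but not a SELECT DISTINCT.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog: estimates a distinct count, stores no values."""

    def __init__(self, p=12):
        self.p = p                      # number of index bits (assumed default)
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant; this form is valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def _hash64(self, value):
        # SHA-1 is an arbitrary choice here; any well-mixed 64-bit hash works.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value):
        h = self._hash64(value)
        idx = h >> (64 - self.p)                 # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # The sketch of a union is the element-wise max of the registers.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"row-{i % 50_000}")    # 50,000 distinct values, each seen twice
estimate = hll.count()              # close to 50,000, within a few percent
```

With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), about 1.6%. Because `merge` is just an element-wise max, partial sketches built independently (for example one per region, as a hypothetical layout) can be combined without a second pass over the data.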
Our current stats gathering is way too simplistic - it's only keeping a cache per client connection to a cluster for the min and max key for a table. Instead, we should: