UBOdin / mimir

Data-ish exploration through SQL+Uncertainty
http://mimirdb.info
Apache License 2.0

Switch to HyperLogLog for domain tests #369

Open okennedy opened 4 years ago

okennedy commented 4 years ago

The shape watcher lens currently runs a Count Distinct query during the training phase to discover categorical attributes. An exact distinct count is expensive on large datasets. Fortunately, we don't care about the actual number of distinct values, only whether it falls below some threshold. An approximate HyperLogLog count would be a much more efficient way to achieve the same goal.
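To make the idea concrete, here is a minimal HyperLogLog sketch in plain Python. It is an illustration only, not Mimir's code: the class `HyperLogLog`, the helper `looks_categorical`, the precision `p=12`, and the use of SHA-1 as the hash are all assumptions chosen for the example. It keeps a fixed array of small registers regardless of input size and compares the resulting estimate against the threshold.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch (hypothetical helper, not Mimir's API)."""

    def __init__(self, p=12):
        # p index bits -> m = 2^p registers; assumes p >= 7 so the
        # standard bias constant for m >= 128 applies.
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        # Derive a 64-bit hash from the value's string form.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                 # top p bits pick a register
        rest = h & ((1 << width) - 1)    # remaining bits
        # Rank = 1-based position of the leftmost 1-bit in `rest`.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        # Harmonic-mean estimator over all registers.
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting.
        if raw <= 2.5 * self.m and zeros > 0:
            return self.m * math.log(self.m / zeros)
        return raw


def looks_categorical(values, threshold, p=12):
    """Return True if the estimated distinct count is at most `threshold`."""
    hll = HyperLogLog(p)
    for v in values:
        hll.add(v)
    return hll.estimate() <= threshold
```

For example, `looks_categorical(column_values, 20)` would flag a column whose estimated distinct count is at most 20. With 2^12 registers the standard error is roughly 1.6%, so a column whose true distinct count sits very close to the threshold can land on either side. That seems acceptable here, since the threshold itself is a heuristic.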