getsentry / snuba


Enumerate possible new schemas to evaluate in ClickHouse #5995

Closed mcannizz closed 1 month ago

mcannizz commented 1 month ago
### Tasks
- [ ] https://github.com/getsentry/tsdb-evaluation-sandbox/pull/23
nikhars commented 1 month ago

Since the percentiles column uses the most storage space, one option would be to deploy all of the quantile variants available in ClickHouse: quantileGK, quantileTiming, quantileDD, quantileBFloat16, and quantileTDigest. Create a separate column for each and fill all of them from the same input. Then, when it's time to compare the options, we can check the on-disk size of each column as well as the performance of each variant.
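
As a rough sketch (table and column names here are hypothetical, not Snuba's actual schema), the side-by-side evaluation table could look something like this, with one aggregate-state column per variant and a `system.parts_columns` query for the size comparison:

```sql
-- Hypothetical evaluation table: one state column per quantile variant,
-- all filled from the same input, so their sizes can be compared directly.
CREATE TABLE eval_quantile_variants
(
    timestamp  DateTime,
    metric_id  UInt64,
    p_tdigest  AggregateFunction(quantileTDigest(0.95), Float64),
    p_timing   AggregateFunction(quantileTiming(0.95), Float64),
    p_bfloat16 AggregateFunction(quantileBFloat16(0.95), Float64),
    p_gk       AggregateFunction(quantileGK(100, 0.95), Float64),
    p_dd       AggregateFunction(quantileDD(0.01, 0.95), Float64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (metric_id, timestamp);

-- Compressed on-disk size per column, for the storage comparison:
SELECT column,
       formatReadableSize(sum(column_data_compressed_bytes)) AS compressed
FROM system.parts_columns
WHERE table = 'eval_quantile_variants' AND active
GROUP BY column
ORDER BY sum(column_data_compressed_bytes) DESC;
```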

nikhars commented 1 month ago

Another option to add to the mix is to skip aggregating the distribution values entirely and store the raw values themselves. Week over week, the ClickHouse graphs suggest that the partition sizes of the raw tables and the aggregated tables are approximately the same on the distributions cluster. If we could compress and encode the raw values well enough, storing them could become more storage-efficient than storing the aggregated values.
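
A sketch of what that raw-value variant might look like (again, names are hypothetical), leaning on specialised column codecs for the compression, with percentiles computed at query time as the trade-off:

```sql
-- Hypothetical raw-values table: the distribution is kept as an array per
-- row, with float-oriented codecs instead of a pre-aggregated sketch.
-- Sorting the values before insert should help the codecs considerably.
CREATE TABLE eval_raw_values
(
    timestamp  DateTime CODEC(DoubleDelta, ZSTD(1)),
    metric_id  UInt64   CODEC(Delta, ZSTD(1)),
    raw_values Array(Float64) CODEC(Gorilla, ZSTD(3))
)
ENGINE = MergeTree()
ORDER BY (metric_id, timestamp);

-- Percentiles are then computed from the raw arrays at query time:
SELECT metric_id,
       quantileExact(0.95)(value) AS p95
FROM eval_raw_values
ARRAY JOIN raw_values AS value
GROUP BY metric_id;
```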