GreptimeTeam / greptimedb

An open-source, cloud-native, unified time series database for metrics, logs and events with SQL/PromQL supported. Available on GreptimeCloud.
https://greptime.com/
Apache License 2.0

Reduce `__sequence` field size in parquet files #5010

Open WenyXu opened 4 days ago

What type of enhancement is this?

Refactor

What does the enhancement do?

Analysis of one of our Parquet files shows that the `__sequence` column takes up a disproportionate share of the file, approximately 67% of the total column data size. This wastes storage and may become a performance bottleneck.

File: 9bc23ce8-7046-4ff8-a209-1245827a7a89.parquet

| Column Name | Size (Bytes) | Size (Ratio) |
| --- | ---: | ---: |
| `__op_type` | 54,825 | 0.00016 (0.016%) |
| `greptime_value` | 39,894,514 | 0.117 (11.75%) |
| `__sequence` | 228,302,552 | 0.672 (67.23%) |
| `__primary_key` | 18,000,415 | 0.053 (5.30%) |
| `greptime_timestamp` | 53,318,216 | 0.157 (15.70%) |

The `__sequence` column clearly dominates the file size: it is more than four times the size of `greptime_timestamp` and more than five times the size of `greptime_value`, the columns that carry the actual data.
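For anyone who wants to reproduce this kind of breakdown on another file, here is a minimal sketch in Python. It is not the tool used for the analysis above, which the issue does not name. It assumes the sizes are compressed column-chunk sizes summed across row groups and read from the Parquet footer with `pyarrow`. That is a guess about how the numbers were produced. The reported ratios are consistent with dividing each column size by the sum of the five column sizes, which is what `column_size_ratios` does. The helper names `column_size_ratios` and `parquet_column_sizes` are hypothetical.

```python
from collections import defaultdict


def column_size_ratios(sizes):
    """Map each column name to its fraction of the summed column sizes."""
    total = sum(sizes.values())
    if total == 0:
        return {name: 0.0 for name in sizes}
    return {name: size / total for name, size in sizes.items()}


def parquet_column_sizes(path):
    """Sum compressed column-chunk sizes per column across all row groups.

    Assumption: compressed size is the metric of interest. pyarrow is
    imported lazily so the ratio helper above works without it installed.
    """
    import pyarrow.parquet as pq

    metadata = pq.ParquetFile(path).metadata
    sizes = defaultdict(int)
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            chunk = row_group.column(col_index)
            sizes[chunk.path_in_schema] += chunk.total_compressed_size
    return dict(sizes)


# Sizes in bytes, copied from the table in this issue.
reported = {
    "__op_type": 54_825,
    "greptime_value": 39_894_514,
    "__sequence": 228_302_552,
    "__primary_key": 18_000_415,
    "greptime_timestamp": 53_318_216,
}

ratios = column_size_ratios(reported)

# Print columns from largest to smallest share.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {reported[name]:>12,} {ratio:8.2%}")
```

To inspect a real file, pass its path to `parquet_column_sizes` and feed the result to `column_size_ratios`, for example `column_size_ratios(parquet_column_sizes("9bc23ce8-7046-4ff8-a209-1245827a7a89.parquet"))`.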

Implementation challenges

No response