databendlabs / databend

๐——๐—ฎ๐˜๐—ฎ, ๐—”๐—ป๐—ฎ๐—น๐˜†๐˜๐—ถ๐—ฐ๐˜€ & ๐—”๐—œ. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com

feat(query): Support use parquet format when spilling #16612

Closed: forsaken628 closed this 4 weeks ago

forsaken628 commented 1 month ago

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Support using the Parquet format when spilling; you can switch back to Arrow IPC via set spilling_file_format = 'arrow'.



forsaken628 commented 1 month ago

Benchmark:

dataset: tpch sf100

settings:

set max_memory_usage = 16*1024*1024*1024;
set window_partition_spilling_memory_ratio = 30;
set window_partition_spilling_to_disk_bytes_limit = 30*1024*1024*1024;

SQL:

EXPLAIN ANALYZE SELECT
    l_orderkey,
    l_partkey,
    l_quantity,
    l_extendedprice,
    l_shipinstruct,
    l_shipmode,
    ROW_NUMBER() OVER (PARTITION BY l_orderkey ORDER BY l_extendedprice DESC) AS row_num,
    RANK() OVER (PARTITION BY l_orderkey ORDER BY l_extendedprice DESC) AS rank_num
FROM
    lineitem ignore_result;
set spilling_use_parquet = 0; 

        โ”œโ”€โ”€ estimated rows: 600037902.00
        โ”œโ”€โ”€ cpu time: 651.285131424s
        โ”œโ”€โ”€ wait time: 168.630024827s
        โ”œโ”€โ”€ output rows: 600.04 million
        โ”œโ”€โ”€ output bytes: 44.87 GiB

        โ”œโ”€โ”€ numbers local spilled by write: 208
        โ”œโ”€โ”€ bytes local spilled by write: 15.06 GiB
        โ”œโ”€โ”€ local spilled time by write: 136.856s

        โ”œโ”€โ”€ numbers local spilled by read: 3072
        โ”œโ”€โ”€ bytes local spilled by read: 15.06 GiB
        โ”œโ”€โ”€ local spilled time by read: 31.933s
set spilling_use_parquet = 1; 

        โ”œโ”€โ”€ estimated rows: 600037902.00
        โ”œโ”€โ”€ cpu time: 848.406496078s
        โ”œโ”€โ”€ wait time: 73.858260885s
        โ”œโ”€โ”€ output rows: 600.04 million
        โ”œโ”€โ”€ output bytes: 44.87 GiB

        โ”œโ”€โ”€ numbers local spilled by write: 208
        โ”œโ”€โ”€ bytes local spilled by write: 9.56 GiB
        โ”œโ”€โ”€ local spilled time by write: 55.665s

        โ”œโ”€โ”€ numbers local spilled by read: 3072
        โ”œโ”€โ”€ bytes local spilled by read: 9.56 GiB
        โ”œโ”€โ”€ local spilled time by read: 17.512s

Compared with Arrow IPC, Parquet's smaller spill files mainly come from dictionary encoding, at the cost of noticeably higher CPU usage. For highly discrete (high-cardinality) data there is no significant size advantage.
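The trade-off described above can be sketched with a toy size model (this is illustrative Python, not Databend or Parquet code; the byte-accounting functions are assumptions for the example): dictionary encoding stores each distinct value once plus a small index per row, so it wins on low-cardinality columns like l_shipmode and loses when nearly every value is distinct.

```python
# Toy model of plain vs dictionary encoding sizes (illustrative only;
# real Parquet adds further compression, page headers, etc.).

def plain_size(values):
    # Plain encoding: each value stored as a 4-byte length prefix + UTF-8 bytes.
    return sum(4 + len(v.encode()) for v in values)

def dict_size(values):
    # Dictionary encoding: each distinct value stored once, plus a
    # 4-byte dictionary index per row.
    uniques = set(values)
    return sum(4 + len(v.encode()) for v in uniques) + 4 * len(values)

# Low-cardinality column (like l_shipmode): few distinct values, many rows.
low_card = ["TRUCK", "MAIL", "SHIP", "AIR"] * 250_000
# Highly discrete column: every value distinct, so the dictionary
# duplicates the data and only adds index overhead.
high_card = [f"order-{i}" for i in range(1_000_000)]

print(plain_size(low_card), dict_size(low_card))    # dictionary much smaller
print(plain_size(high_card), dict_size(high_card))  # dictionary larger
```

This matches the benchmark: the window-partition spill columns repeat heavily, so Parquet writes 9.56 GiB where Arrow IPC writes 15.06 GiB, while the extra encode/decode work shows up as higher CPU time.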

github-actions[bot] commented 1 month ago

Docker Image for PR

Note: this image tag is only available for internal use; please check the internal doc for more details.

sundy-li commented 4 weeks ago

LGTM, needs a rebase.