databendlabs / databend

๐——๐—ฎ๐˜๐—ฎ, ๐—”๐—ป๐—ฎ๐—น๐˜†๐˜๐—ถ๐—ฐ๐˜€ & ๐—”๐—œ. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com

feat(query): window partition by spill to disk #16441

Closed · forsaken628 closed 1 month ago

forsaken628 commented 2 months ago

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

part of #15328

Changes:

  1. Direct I/O reading and writing for local files
  2. Implement TempDirManager to provide local temporary file management, including usage control and cleanup (a rough sketch follows this list)
  3. Add a new configuration item spill
  4. The window operator supports spilling to local disk
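
As a rough illustration of changes 1 and 2, the sketch below shows how a per-query temp-dir manager with a disk-usage budget and direct-I/O writes could look in Rust. It is not the code in this PR: the name SpillTempDir, the 4 KiB alignment constant, and the libc dependency are assumptions for the example, and it is Linux-only.

```rust
// Hypothetical sketch only; not databend's TempDirManager API.
use std::alloc::{alloc_zeroed, dealloc, Layout};
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::os::unix::fs::OpenOptionsExt; // for custom_flags (Linux)
use std::path::PathBuf;
use std::sync::atomic::{AtomicU64, Ordering};

const ALIGN: usize = 4096; // assumed block size; O_DIRECT needs aligned, block-sized buffers

struct SpillTempDir {
    root: PathBuf,
    limit_bytes: u64,      // e.g. window_partition_spilling_to_disk_bytes_limit
    used_bytes: AtomicU64, // bytes already reserved by this query
    next_id: AtomicU64,
}

impl SpillTempDir {
    fn new(root: impl Into<PathBuf>, limit_bytes: u64) -> io::Result<Self> {
        let root = root.into();
        std::fs::create_dir_all(&root)?;
        Ok(Self {
            root,
            limit_bytes,
            used_bytes: AtomicU64::new(0),
            next_id: AtomicU64::new(0),
        })
    }

    /// Try to reserve `size` bytes of the local budget; the caller falls back
    /// to remote spill when this returns false.
    fn try_reserve(&self, size: u64) -> bool {
        let prev = self.used_bytes.fetch_add(size, Ordering::SeqCst);
        if prev + size > self.limit_bytes {
            self.used_bytes.fetch_sub(size, Ordering::SeqCst);
            return false;
        }
        true
    }

    /// Write one spill block with O_DIRECT. The data is copied into an
    /// ALIGN-aligned buffer padded to a multiple of ALIGN, as direct I/O
    /// requires; the real data length would be tracked separately.
    fn write_spill(&self, data: &[u8]) -> io::Result<PathBuf> {
        let id = self.next_id.fetch_add(1, Ordering::SeqCst);
        let path = self.root.join(format!("spill_{id}.bin"));
        let padded = ((data.len().max(1) + ALIGN - 1) / ALIGN) * ALIGN;
        let layout = Layout::from_size_align(padded, ALIGN).expect("valid layout");
        unsafe {
            let buf = alloc_zeroed(layout);
            assert!(!buf.is_null(), "allocation failed");
            std::ptr::copy_nonoverlapping(data.as_ptr(), buf, data.len());
            let res = OpenOptions::new()
                .write(true)
                .create(true)
                .custom_flags(libc::O_DIRECT)
                .open(&path)
                .and_then(|mut f| f.write_all(std::slice::from_raw_parts(buf, padded)));
            dealloc(buf, layout);
            res?;
        }
        Ok(path)
    }

    /// Remove all temporary files when the query finishes.
    fn cleanup(&self) -> io::Result<()> {
        std::fs::remove_dir_all(&self.root)
    }
}
```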

How to enable:

  1. Check that the configuration item spill is set up correctly
  2. set window_partition_spilling_to_disk_bytes_limit = 10 * 1024 * 1024 * 1024 (10 GiB); this is the maximum disk space allowed for each query. When the spill conditions are met, the local disk is used in preference to remote storage (see the sketch after this list).
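
To make the "used in preference" behavior concrete, here is a small hypothetical routing helper built on the SpillTempDir sketch above. The names and the exact fallback logic are assumptions for illustration, not this PR's actual code path.

```rust
// Hypothetical sketch: prefer the local disk until the per-query byte budget
// is exhausted, then fall back to the existing remote (object storage) spiller.
enum SpillTarget {
    Local(std::path::PathBuf),
    Remote, // handled by the remote spiller, unchanged by this example
}

fn route_spill(dir: &SpillTempDir, block: &[u8]) -> std::io::Result<SpillTarget> {
    if dir.try_reserve(block.len() as u64) {
        // Budget still available: spill to the local temp dir with direct I/O.
        Ok(SpillTarget::Local(dir.write_spill(block)?))
    } else {
        // Budget exhausted (or the limit is 0): spill to remote storage instead.
        Ok(SpillTarget::Remote)
    }
}
```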


BohuTANG commented 2 months ago

Are there any benchmark numbers for this PR?

forsaken628 commented 2 months ago

> Are there any benchmark numbers for this PR?

@BohuTANG This PR mostly just swaps the storage layer. Where can I find a suitable environment to run a benchmark? If the test environment is not representative, the numbers are meaningless.

BohuTANG commented 2 months ago

> > Are there any benchmark numbers for this PR?
>
> @BohuTANG This PR mostly just swaps the storage layer. Where can I find a suitable environment to run a benchmark? If the test environment is not representative, the numbers are meaningless.

I think @Dousir9 has some benchmark cases, since #16448 also needs them.

Dousir9 commented 1 month ago

@forsaken628 Could you provide a benchmark comparing the query execution time with and without local spill? You can set window_partition_spilling_memory_ratio to 0, 30, and 60 for testing.

forsaken628 commented 1 month ago

benchmark:

settings:

set max_memory_usage = 16*1024*1024*1024;
set window_partition_spilling_memory_ratio = 30;

sql:

EXPLAIN ANALYZE SELECT
    l_orderkey,
    l_partkey,
    l_quantity,
    l_extendedprice,
    ROW_NUMBER() OVER (PARTITION BY l_orderkey ORDER BY l_extendedprice DESC) AS row_num,
    RANK() OVER (PARTITION BY l_orderkey ORDER BY l_extendedprice DESC) AS rank_num
FROM
    lineitem ignore_result;

result:

set window_partition_spilling_to_disk_bytes_limit = 0;
        ├── hash keys: [l_orderkey]
        ├── estimated rows: 600037902.00
        ├── cpu time: 474.358152575s
        ├── wait time: 197.201014451s
        ├── output rows: 600.04 million
        ├── output bytes: 26.82 GiB
        ├── numbers remote spilled by write: 112
        ├── bytes remote spilled by write: 26.69 GiB
        ├── remote spilled time by write: 181.29s
        ├── numbers remote spilled by read: 1535
        ├── bytes remote spilled by read: 26.69 GiB
        ├── remote spilled time by read: 102.371s
46 rows explain in 49.291 sec. Processed 0 rows, 0 B (0 row/s, 0 B/s)
set window_partition_spilling_to_disk_bytes_limit = 10*1024*1024*1024;
        ├── hash keys: [l_orderkey]
        ├── estimated rows: 600037902.00
        ├── cpu time: 972.806314767s
        ├── wait time: 277.454963277s
        ├── output rows: 600.04 million
        ├── output bytes: 26.82 GiB
        ├── numbers remote spilled by write: 84
        ├── bytes remote spilled by write: 16.72 GiB
        ├── remote spilled time by write: 681.639s
        ├── numbers remote spilled by read: 1328
        ├── bytes remote spilled by read: 16.72 GiB
        ├── remote spilled time by read: 53.13s
        ├── numbers local spilled by write: 282
        ├── bytes local spilled by write: 10.00 GiB
        ├── local spilled time by write: 78.319s
        ├── numbers local spilled by read: 4160
        ├── bytes local spilled by read: 10.00 GiB
        ├── local spilled time by read: 32.624s
52 rows explain in 102.438 sec. Processed 0 rows, 0 B (0 row/s, 0 B/s)
A second run with the same setting:
        ├── hash keys: [l_orderkey]
        ├── estimated rows: 600037902.00
        ├── cpu time: 896.504639969s
        ├── wait time: 279.228803569s
        ├── output rows: 600.04 million
        ├── output bytes: 26.82 GiB
        ├── numbers remote spilled by write: 95
        ├── bytes remote spilled by write: 16.75 GiB
        ├── remote spilled time by write: 535.451s
        ├── numbers remote spilled by read: 1497
        ├── bytes remote spilled by read: 16.75 GiB
        ├── remote spilled time by read: 44.231s
        ├── numbers local spilled by write: 430
        ├── bytes local spilled by write: 10.00 GiB
        ├── local spilled time by write: 64.566s
        ├── numbers local spilled by read: 6315
        ├── bytes local spilled by read: 10.00 GiB
        ├── local spilled time by read: 21.033s
52 rows explain in 91.103 sec. Processed 0 rows, 0 B (0 row/s, 0 B/s)
set window_partition_spilling_to_disk_bytes_limit = 30*1024*1024*1024;
        ├── hash keys: [l_orderkey]
        ├── estimated rows: 600037902.00
        ├── cpu time: 421.353882878s
        ├── wait time: 200.590871919s
        ├── output rows: 600.04 million
        ├── output bytes: 26.82 GiB
        ├── numbers local spilled by write: 1137
        ├── bytes local spilled by write: 26.74 GiB
        ├── local spilled time by write: 142.206s
        ├── numbers local spilled by read: 17646
        ├── bytes local spilled by read: 26.74 GiB
        ├── local spilled time by read: 56.713s
46 rows explain in 48.832 sec. Processed 0 rows, 0 B (0 row/s, 0 B/s)

The results are rather strange: using only remote spill or only local spill gives relatively normal times, but mixing them causes a big jump in CPU time rather than in wait time. The local test environment has only one hard disk, so the longer time may be caused by I/O contention.

Dousir9 commented 1 month ago

@forsaken628 Thanks for the benchmark results. We can try reading and writing local spill and remote spill on different I/O devices to determine whether it is I/O contention.

forsaken628 commented 1 month ago

The big jump in CPU time when mixing local and remote spill has disappeared for some reason. For now it can only be explained as an environmental issue, so we will keep watching.

        ├── hash keys: [l_orderkey]
        ├── estimated rows: 600037902.00
        ├── cpu time: 509.153733888s
        ├── wait time: 173.043153894s
        ├── output rows: 600.04 million
        ├── output bytes: 26.82 GiB
        ├── numbers remote spilled by write: 216
        ├── bytes remote spilled by write: 16.75 GiB
        ├── remote spilled time by write: 153.346s
        ├── numbers remote spilled by read: 3440
        ├── bytes remote spilled by read: 16.75 GiB
        ├── remote spilled time by read: 49.122s
        ├── numbers local spilled by write: 622
        ├── bytes local spilled by write: 10.00 GiB
        ├── local spilled time by write: 55.315s
        ├── numbers local spilled by read: 9712
        ├── bytes local spilled by read: 10.00 GiB
        ├── local spilled time by read: 26.595s

github-actions[bot] commented 1 month ago

Docker Image for PR

note: this image tag is only available for internal use, please check the internal doc for more details.

Dousir9 commented 1 month ago

After testing, we found that local spill, both on a local machine and in the cloud, suffers from memory not being released in time, which leads to performance degradation.

BohuTANG commented 1 month ago

> After testing, we found that local spill, both on a local machine and in the cloud, suffers from memory not being released in time, which leads to performance degradation.

If so, this PR should be reverted from the main branch.