apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[VL] Poor Shuffle Read Perf in TPCH #4750

Open aqluheng opened 4 months ago

aqluheng commented 4 months ago

Description

Backend

Velox

Perf description

When switching from TPCH 1T to TPCH 3T, the Shuffle Read time of q21 increases significantly, and the system's disk/network resources are not fully utilized. In the DAG visualization, the time taken by the InputIterator on the 3T dataset is 60 times that of the 1T dataset.

[DAG visualization screenshots: 1T vs. 3T]

The Timeline also shows an unreasonably long ShuffleRead time.

[Timeline screenshot]

I observed the system's disk and network usage and found that utilization is not high. I captured a flame graph that includes sleep time, and it shows that the time spent in LowCopyFileSegmentJniByteInputStream.read has increased significantly.

[Flame graphs: 1T vs. 3T]
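For anyone who wants to quantify the same symptom from task metrics rather than screenshots, here is a minimal sketch (not part of the original report) that logs per-stage shuffle-read fetch wait time with a SparkListener. The listener class name and the logging format are hypothetical; the metrics API is standard Spark.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Minimal sketch: accumulate shuffle-read fetch wait time per stage so the
// "slow InputIterator" symptom can be confirmed from task metrics instead of
// UI screenshots. Class name is hypothetical; the metrics API is Spark's own.
class ShuffleReadWaitListener extends SparkListener {
  private val waitMsByStage =
    scala.collection.mutable.Map[Int, Long]().withDefaultValue(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      val sr = metrics.shuffleReadMetrics
      // fetchWaitTime = time the task spent blocked waiting for shuffle blocks
      waitMsByStage(taskEnd.stageId) += sr.fetchWaitTime
      println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
        s"fetchWaitMs=${sr.fetchWaitTime} remoteBytes=${sr.remoteBytesRead} " +
        s"localBytes=${sr.localBytesRead}")
    }
  }
}

// Usage: register before running the query.
// spark.sparkContext.addSparkListener(new ShuffleReadWaitListener)
```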

Spark version

Spark-3.3.2

Spark configurations

spark.executor.cores 4
spark.executor.memory 2G
spark.executor.memoryOverhead 1g
spark.memory.offHeap.size 10g
spark.gluten.sql.columnar.backend.velox.maxSpillFileSize 1073741824
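For reference, a minimal sketch of the same settings applied on a SparkSession builder. In practice executor sizing is usually supplied via spark-defaults.conf or spark-submit; the app name is hypothetical, and spark.memory.offHeap.enabled is added on the assumption it was also set, since the off-heap size only takes effect when off-heap memory is enabled.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: values mirror the configuration listed above.
val spark = SparkSession.builder()
  .appName("tpch-q21-repro") // hypothetical app name
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "2g")
  .config("spark.executor.memoryOverhead", "1g")
  .config("spark.memory.offHeap.enabled", "true") // assumption: required for offHeap.size
  .config("spark.memory.offHeap.size", "10g")
  .config("spark.gluten.sql.columnar.backend.velox.maxSpillFileSize", "1073741824")
  .getOrCreate()
```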

zhouyuan commented 4 months ago

CC: @marin-ma, who is looking at the shuffle read code path.

marin-ma commented 4 months ago

Thanks for reporting this issue. Could you provide the Gluten version/commit? How many worker nodes are used in this test?

@zhouyuan This is a new issue. I will look into it.

aqluheng commented 4 months ago

Thanks for reporting this issue. Could you provide the Gluten version/commit? How many worker nodes are used in this test?

@zhouyuan This is a new issue. I will look into it.

I am using Gluten commit a1f83cdfb807209bb0258d5b08ecd2dec6a1492b and Velox commit 07c9c46c69d32ba75f8f3edf172dc6236a448dc0, on 3 nodes with 8 executors each. I found that when running that stage, node1 initially hit a disk bottleneck while the disk utilization on node3 was very low; after that, node3 hit a disk bottleneck while the disk utilization on node1 was very low.

marin-ma commented 4 months ago

[Task timeline chart across the 3 nodes]

This looks abnormal. Both node 2 and node 3 are stragglers compared with node 1, but there is no data skew in TPCH queries. Does vanilla Spark also have this issue? It's more likely a misconfiguration problem on the test machines. Could you double-check the disk configuration on your test machines? How many disks are in use per node?

  1. If possible, please also check that all disks are in use during runtime. A simple way is to observe the output of sar -d 1 on each node to make sure all disks are busy.
  2. It looks like spark.sql.shuffle.partitions is set to 200 in this chart (200 tasks in total). To get the best performance, it's suggested to set it to 2x/3x/4x the total vcores. Assuming there are 32 vcores per node, 192/288/384 would be better (see the sketch after this list).
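A minimal sketch of the sizing arithmetic in item 2, assuming 3 nodes with 32 vcores each (both numbers come from this comment, not from the reporter's cluster spec):

```scala
// Partition sizing sketch: 2x/3x/4x the total vcores in the cluster.
// Node count and vcores-per-node are assumptions taken from the comment above.
val nodes = 3
val vcoresPerNode = 32
val totalVcores = nodes * vcoresPerNode            // 96

val candidates = Seq(2, 3, 4).map(_ * totalVcores) // 192, 288, 384
println(candidates.mkString(", "))

// Applied to a running session (spark is an existing SparkSession):
// spark.conf.set("spark.sql.shuffle.partitions", candidates.head.toString)
```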

I also tested TPCH SF6T on a 3-node cluster, but could not reproduce this issue.

[Screenshot of the SF6T test result]