aqluheng opened 4 months ago
CC: @marin-ma is looking into the shuffle read code path
Thanks for reporting this issue. Could you provide the Gluten version/commit? How many worker nodes are used in this test?
@zhouyuan This is a new issue. I will look into it.
I'm using Gluten commit a1f83cdfb807209bb0258d5b08ecd2dec6a1492b and Velox commit 07c9c46c69d32ba75f8f3edf172dc6236a448dc0, on 3 nodes with 8 executors each. I found that while running that stage, node1 initially hit a disk bottleneck while the disk utilization on node3 was very low; afterwards, node3 hit a disk bottleneck while the disk utilization on node1 was very low.
This looks abnormal. Both node 2 and node 3 are stragglers compared with node 1, but there is no data skew in the TPCH queries. Does vanilla Spark also have this issue? It's more likely a misconfiguration problem on the test machines. Could you double-check the disk configuration? How many disks are in use per node? Please run

`sar -d 1`

on each node to make sure all disks are busy. I also tested TPCH SF6T on a 3-node cluster, but cannot reproduce this issue.
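For cross-checking the `sar -d 1` output across nodes, a small helper like the one below can flag devices that stay idle. This is a hypothetical illustration, not part of Gluten: the function name `idle_disks` and the threshold are made up, and it assumes sysstat's default `sar -d` layout where the device name is the third column and `%util` is the last.

```python
def idle_disks(sar_output: str, util_threshold: float = 10.0):
    """Return (device, %util) pairs whose %util falls below the threshold.

    Assumes sysstat's default `sar -d` layout: timestamp, AM/PM marker,
    DEV name (e.g. "dev8-0") in the 3rd column, and %util as the last column.
    """
    idle = []
    for line in sar_output.splitlines():
        fields = line.split()
        # Data rows look like: "12:00:02 AM dev8-0 120.00 ... 97.50"
        if len(fields) >= 4 and fields[2].startswith("dev"):
            try:
                util = float(fields[-1])
            except ValueError:
                continue  # skip malformed rows
            if util < util_threshold:
                idle.append((fields[2], util))
    return idle


sample = """12:00:01 AM       DEV       tps     rkB/s     wkB/s   areq-sz    aqu-sz     await     %util
12:00:02 AM    dev8-0    120.00   5000.00   8000.00     60.00      1.20      5.00     97.50
12:00:02 AM   dev8-16      2.00     10.00     20.00     15.00      0.01      0.50      1.20
"""
print(idle_disks(sample))  # -> [('dev8-16', 1.2)]
```

Running this over the `sar` capture from each node would quickly show whether one node's disks sit idle while another's are saturated, which is the asymmetry described above.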
Description
Backend
Velox
Perf description
When switching from TPCH 1T to TPCH 3T, the Shuffle Read time of q21 increases significantly, while the system's disk/network resources are not fully utilized. The DAG visualization shows that the time taken by the InputIterator on the 3T dataset is 60 times that on the 1T dataset.
The Timeline also shows an unreasonably long ShuffleRead time.
I observed the system's disk and network usage and found that utilization is not high. I generated a flame graph that includes sleep time; it shows that the time spent in the LowCopyFileSegmentJniByteInputStream.read function has increased significantly.
1T:
3T:
![image](https://github.com/oap-project/gluten/assets/29010345/21c60dbf-6eec-4a5a-ab6c-54533b341f3d)
Spark version
Spark-3.3.2
Spark configurations
spark.executor.cores 4
spark.executor.memory 2G
spark.executor.memoryOverhead 1g
spark.memory.offHeap.size 10g
spark.gluten.sql.columnar.backend.velox.maxSpillFileSize 1073741824