apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

Driver for cluster gets stuck at "TaskSchedulerImpl: Adding task set 0.0 with ..." #470

Open · theosib-amazon opened this issue 1 year ago

theosib-amazon commented 1 year ago

I have set up two clusters of six r5d.8xlarge nodes on EC2 and manually configured Spark 3.2.2 with Hadoop 3.2, which I downloaded from the Apache Spark website.

On the baseline cluster, I configured vanilla Spark, and I'm able to run all of the TPC-DS queries. The main things I had to add to get that working were a few jars to enable access to S3:

hadoop-aws-3.3.1.jar
hadoop-common-3.3.1.jar
aws-java-sdk-bundle-1.11.901.jar

And I put this in the spark-defaults.conf file for every node:

spark.driver.memory                5g
spark.executor.memory              240g
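
For completeness, here is a minimal sketch of the S3A settings that typically go with those jars; the credentials provider shown here is an assumption about an EC2 instance-profile setup, not copied from my actual configuration:

spark.hadoop.fs.s3a.impl                       org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.aws.credentials.provider   com.amazonaws.auth.InstanceProfileCredentialsProvider

With the instance-profile provider, no access or secret keys need to appear in spark-defaults.conf.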

The second cluster is basically the same, but I added the following jars to spark-3.2.2-bin-hadoop3.2/jars on all nodes:

hadoop-aws-3.3.1.jar
hadoop-common-3.3.1.jar
aws-java-sdk-bundle-1.11.901.jar
gluten-1.0.0-SNAPSHOT-jar-with-dependencies.jar
protobuf-java-3.19.4.jar

And I had to install some system libraries manually. This is from my script that configures all the nodes:

run_command_all "sudo apt install -y maven build-essential cmake libssl-dev libre2-dev libcurl4-openssl-dev clang lldb lld libz-dev git ninja-build uuid-dev"
run_command_all "sudo apt-get -y install build-essential g++ python-dev autotools-dev libicu-dev libbz2-dev libboost-all-dev"
run_command_all "sudo apt install -y libdouble-conversion-dev"
run_command_all "sudo apt install -y g++ cmake ccache ninja-build checkinstall git libssl-dev libboost-all-dev libdouble-conversion-dev libgoogle-glog-dev libbz2-dev libgflags-dev libgtest-dev libgmock-dev libevent-dev liblz4-dev libzstd-dev libre2-dev libsnappy-dev liblzo2-dev bison flex tzdata wget"

And finally, the spark-defaults.conf file is different:

spark.driver.memory                              5g
spark.executor.memory                            120g
spark.plugins                                    io.glutenproject.GlutenPlugin
spark.gluten.sql.columnar.backend.lib            velox
spark.memory.offHeap.enabled                     true
spark.memory.offHeap.size                        120g
spark.gluten.sql.columnar.forceshuffledhashjoin  true
spark.shuffle.manager                            org.apache.spark.shuffle.sort.ColumnarShuffleManager
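
For reference, the same Gluten settings can also be passed per run via spark-submit instead of spark-defaults.conf; this is just a sketch, and the master URL, main class, and application jar are placeholders:

spark-submit \
  --master spark://<master-host>:7077 \
  --conf spark.plugins=io.glutenproject.GlutenPlugin \
  --conf spark.gluten.sql.columnar.backend.lib=velox \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=120g \
  --conf spark.gluten.sql.columnar.forceshuffledhashjoin=true \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --class <your.main.Class> <your-app.jar>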

After all this, I can get the master and worker nodes to start and find each other. However, when I try to run a query on the cluster, the driver hangs. None of the worker or master log files indicate any problems; the driver just stalls indefinitely. Here's the tail end of what the driver prints:

22/10/25 21:36:09 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
22/10/25 21:36:09 INFO SharedState: Warehouse path is 'file:/home/ubuntu/TpcdsApp/spark-warehouse'.
22/10/25 21:36:11 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.31.2.206:40206) with ID 2,  ResourceProfileId 0
22/10/25 21:36:11 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.31.7.85:58432) with ID 3,  ResourceProfileId 0
22/10/25 21:36:11 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.31.0.50:54604) with ID 4,  ResourceProfileId 0
22/10/25 21:36:11 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.31.12.41:56384) with ID 0,  ResourceProfileId 0
22/10/25 21:36:11 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.31.3.35:33206) with ID 1,  ResourceProfileId 0
22/10/25 21:36:11 INFO BlockManagerMasterEndpoint: Registering block manager 172.31.7.85:46691 with 183.8 GiB RAM, BlockManagerId(3, 172.31.7.85, 46691, None)
22/10/25 21:36:11 INFO BlockManagerMasterEndpoint: Registering block manager 172.31.2.206:41911 with 183.8 GiB RAM, BlockManagerId(2, 172.31.2.206, 41911, None)
22/10/25 21:36:11 INFO BlockManagerMasterEndpoint: Registering block manager 172.31.0.50:35843 with 183.8 GiB RAM, BlockManagerId(4, 172.31.0.50, 35843, None)
22/10/25 21:36:11 INFO BlockManagerMasterEndpoint: Registering block manager 172.31.12.41:46597 with 183.8 GiB RAM, BlockManagerId(0, 172.31.12.41, 46597, None)
22/10/25 21:36:11 INFO BlockManagerMasterEndpoint: Registering block manager 172.31.3.35:33087 with 183.8 GiB RAM, BlockManagerId(1, 172.31.3.35, 33087, None)
22/10/25 21:36:11 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
22/10/25 21:36:11 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
22/10/25 21:36:11 INFO MetricsSystemImpl: s3a-file-system metrics system started
22/10/25 21:36:13 INFO HadoopFSUtils: Listing leaf files and directories in parallel under 2004 paths. The first several paths are: s3a://theosib-v3-tpcds-hive/warehouse/theosib_tpcds_3000_hive/data/store_returns/sr_returned_date_sk=2450820, s3a://theosib-v3-tpcds-hive/warehouse/theosib_tpcds_3000_hive/data/store_returns/sr_returned_date_sk=2450821, s3a://theosib-v3-tpcds-hive/warehouse/theosib_tpcds_3000_hive/data/store_returns/sr_returned_date_sk=2450822, s3a://theosib-v3-tpcds-hive/warehouse/theosib_tpcds_3000_hive/data/store_returns/sr_returned_date_sk=2450823, s3a://theosib-v3-tpcds-hive/warehouse/theosib_tpcds_3000_hive/data/store_returns/sr_returned_date_sk=2450824, s3a://theosib-v3-tpcds-hive/warehouse/theosib_tpcds_3000_hive/data/store_returns/sr_returned_date_sk=2450825, s3a://theosib-v3-tpcds-hive/warehouse/theosib_tpcds_3000_hive/data/store_returns/sr_returned_date_sk=2450826, s3a://theosib-v3-tpcds-hive/warehouse/theosib_tpcds_3000_hive/data/store_returns/sr_returned_date_sk=2450827, s3a://theosib-v3-tpcds-hive/warehouse/theosib_tpcds_3000_hive/data/store_returns/sr_returned_date_sk=2450828, s3a://theosib-v3-tpcds-hive/warehouse/theosib_tpcds_3000_hive/data/store_returns/sr_returned_date_sk=2450829.
22/10/25 21:36:14 INFO SparkContext: Starting job: sql at TpcdsApp.scala:158
22/10/25 21:36:14 INFO DAGScheduler: Got job 0 (sql at TpcdsApp.scala:158) with 2004 output partitions
22/10/25 21:36:14 INFO DAGScheduler: Final stage: ResultStage 0 (sql at TpcdsApp.scala:158)
22/10/25 21:36:14 INFO DAGScheduler: Parents of final stage: List()
22/10/25 21:36:14 INFO DAGScheduler: Missing parents: List()
22/10/25 21:36:14 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at sql at TpcdsApp.scala:158), which has no missing parents
22/10/25 21:36:14 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 100.7 KiB, free 122.5 GiB)
22/10/25 21:36:14 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 36.0 KiB, free 122.5 GiB)
22/10/25 21:36:14 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-31-5-233.us-east-2.compute.internal:46845 (size: 36.0 KiB, free: 122.5 GiB)
22/10/25 21:36:14 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1478
22/10/25 21:36:14 INFO DAGScheduler: Submitting 2004 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at sql at TpcdsApp.scala:158) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
22/10/25 21:36:14 INFO TaskSchedulerImpl: Adding task set 0.0 with 2004 tasks resource profile 0
theosib-amazon commented 1 year ago

If I use a vanilla Spark driver, then I can run the tests to completion. I can see that the worker and master nodes are using CPU. So it seems like something about the Gluten plugin or the configuration breaks the driver.
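
One generic way to see where it is stuck (a standard JVM debugging sketch, not output from this run) is to take a thread dump of the driver process on the driver node:

# find the driver JVM; TpcdsApp is my application's main class
jps -lm | grep TpcdsApp
# dump all driver threads to a file for inspection
jstack <driver-pid> > driver-threads.txt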

FelixYBW commented 1 year ago

Deployment is a big issue right now: we can't put all of the dependency libraries into the jar, so for the moment the libraries have to be installed manually on each worker node. We are creating a script to install these libraries as a temporary solution. In the long term, I think a conda environment is the best solution: we would just need to copy the conda env to each worker node and set LD_LIBRARY_PATH.
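
As a rough sketch of that idea (the environment name and paths are hypothetical, and conda-pack is just one way to ship the env):

# pack the build environment once, using conda-pack
conda pack -n gluten-env -o gluten-env.tar.gz
# copy and unpack it on each worker node, e.g. under /opt/gluten-env
scp gluten-env.tar.gz worker:/opt/
ssh worker 'mkdir -p /opt/gluten-env && tar -xzf /opt/gluten-env.tar.gz -C /opt/gluten-env'
# then point the executors at its shared libraries in conf/spark-env.sh on each worker
export LD_LIBRARY_PATH=/opt/gluten-env/lib:$LD_LIBRARY_PATH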

Do you load data from S3? We haven't tested S3 internally; currently only local filesystem and HDFS are tested.

theosib-amazon commented 1 year ago

I installed all the necessary jars on the worker nodes, and I can access S3. I just realized that if I use a vanilla driver, it won't do the planning needed to use Velox, so we have to get past this hang in the driver.

FelixYBW commented 1 year ago

It looks like there are some issues with using S3. We will try to reproduce it and see where the issue is. @weiting-chen, can you try this?

Stove-hust commented 11 months ago

I had a similar problem, though different from his. I checked the executors' stderr and found that the executor failed to initialize libvelox.so. The error message is: libvelox.so: undefined symbol: _ZN6google21kLogSiteUninitializedE. From what I could find, this looks like a glog version problem: the variable kLogSiteUninitialized was removed in the latest version of glog. Although Gluten compiles Velox with the glog version pinned to 0.4.0, I turned on vcpkg and it seems to have downloaded the latest 0.6.0 version of glog for me. By the way, the problem is intermittent rather than deterministic.
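
To check which glog a given build actually resolves against, a generic diagnostic sketch (the library path is whatever your deployment uses):

# confirm the unresolved glog symbol referenced by the Velox library
nm -D --undefined-only /path/to/libvelox.so | grep kLogSiteUninitialized
# see which libglog the dynamic linker picks up for it
ldd /path/to/libvelox.so | grep glog
# list the glog versions visible to the linker on this node
ldconfig -p | grep glog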