NVIDIA / spark-rapids-ml

Spark RAPIDS MLlib – accelerate Apache Spark MLlib with GPUs
https://nvidia.github.io/spark-rapids-ml/
Apache License 2.0

[QST] Dealing with executors that are stuck in the barrier #453

Open an-ys opened 11 months ago

an-ys commented 11 months ago

I am trying to run the Linear Regression, KMeans, and PCA examples on a cluster of 2 nodes, each with 4 GPUs, but some of the executors in the examples always get stuck in the barrier when the cuML function is called (i.e., the stages stop at 6+2/8, 4+4/8, and 5+3/8, meaning 2, 4, and 3 executors are stuck in LinReg, KMeans, and PCA, respectively). I also tried running a KMeans application on a much larger dataset, so I do not think the problem is caused by the example dataset being small.
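My understanding is that the barrier only releases once every task in the stage arrives, so a few missing participants stall the rest indefinitely. As a plain-Python analogy (stdlib `threading.Barrier`, not Spark code):

```python
import threading

# An 8-party barrier, like an 8-task barrier stage: wait() only returns
# once all 8 participants have arrived.
barrier = threading.Barrier(8)
arrived = []

def task(i):
    arrived.append(i)
    try:
        # A real Spark barrier waits indefinitely; a timeout is used here
        # so the example terminates.
        barrier.wait(timeout=0.5)
    except threading.BrokenBarrierError:
        pass

# Only 6 of the 8 expected tasks ever start (the "6+2/8" situation).
threads = [threading.Thread(target=task, args=(i,)) for i in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The barrier breaks (here) or hangs (in Spark) instead of releasing.
print(len(arrived), barrier.broken)
```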

I checked the logs for an executor that successfully ran its task and one that got stuck. The executor that got stuck had initialized the cuML context before hanging. These logs are from running the LinReg example in the python directory of this repo. The stuck executors show RUNNING | NODE_LOCAL as their status, while the successful executors show SUCCESS | PROCESS_LOCAL.

I am using Spark RAPIDS ML branch-23.10 (commits daedfe56edae33c565af5e06179e992cf8fec93e and f651978a03d28ef7b3295129501da4a489709979), Spark 3.5.0 in standalone mode, and Hadoop 3.3.6 on a cluster of 2 nodes, each with 4 Titan V GPUs.

Successful Executor ``` 23/09/27 19:42:59 INFO TorrentBroadcast: Started reading broadcast variable 3 with 1 pieces (estimated total size 4.0 MiB) 23/09/27 19:42:59 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 9.3 KiB, free 47.8 GiB) 23/09/27 19:42:59 INFO TorrentBroadcast: Reading broadcast variable 3 took 13 ms 23/09/27 19:42:59 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 19.5 KiB, free 47.8 GiB) 23/09/27 19:43:00 INFO PythonRunner: Times: total = 192, boot = -749, init = 941, finish = 0 23/09/27 19:43:00 INFO PythonRunner: Times: total = 203, boot = -723, init = 926, finish = 0 23/09/27 19:43:00 INFO Executor: Finished task 4.0 in stage 4.0 (TID 389). 2047 bytes result sent to driver 23/09/27 19:43:00 INFO Executor: Finished task 36.0 in stage 4.0 (TID 421). 2047 bytes result sent to driver 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 440 23/09/27 19:43:00 INFO Executor: Running task 55.0 in stage 4.0 (TID 440) 23/09/27 19:43:00 INFO PythonRunner: Times: total = 220, boot = -692, init = 912, finish = 0 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 443 23/09/27 19:43:00 INFO Executor: Running task 58.0 in stage 4.0 (TID 443) 23/09/27 19:43:00 INFO Executor: Finished task 44.0 in stage 4.0 (TID 429). 2004 bytes result sent to driver 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 446 23/09/27 19:43:00 INFO Executor: Running task 61.0 in stage 4.0 (TID 446) 23/09/27 19:43:00 INFO PythonRunner: Times: total = 238, boot = -679, init = 917, finish = 0 23/09/27 19:43:00 INFO PythonRunner: Times: total = 239, boot = -767, init = 1006, finish = 0 23/09/27 19:43:00 INFO Executor: Finished task 12.0 in stage 4.0 (TID 397). 2047 bytes result sent to driver 23/09/27 19:43:00 INFO Executor: Finished task 20.0 in stage 4.0 (TID 405). 
2047 bytes result sent to driver 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 453 23/09/27 19:43:00 INFO Executor: Running task 68.0 in stage 4.0 (TID 453) 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 454 23/09/27 19:43:00 INFO Executor: Running task 69.0 in stage 4.0 (TID 454) 23/09/27 19:43:00 INFO PythonRunner: Times: total = 280, boot = -698, init = 978, finish = 0 23/09/27 19:43:00 INFO Executor: Finished task 28.0 in stage 4.0 (TID 413). 2047 bytes result sent to driver 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 466 23/09/27 19:43:00 INFO Executor: Running task 81.0 in stage 4.0 (TID 466) 23/09/27 19:43:00 INFO PythonRunner: Times: total = 159, boot = -7, init = 166, finish = 0 23/09/27 19:43:00 INFO PythonRunner: Times: total = 164, boot = -14, init = 178, finish = 0 23/09/27 19:43:00 INFO Executor: Finished task 55.0 in stage 4.0 (TID 440). 2047 bytes result sent to driver 23/09/27 19:43:00 INFO Executor: Finished task 58.0 in stage 4.0 (TID 443). 2004 bytes result sent to driver 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 473 23/09/27 19:43:00 INFO Executor: Running task 88.0 in stage 4.0 (TID 473) 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 474 23/09/27 19:43:00 INFO Executor: Running task 89.0 in stage 4.0 (TID 474) 23/09/27 19:43:00 INFO PythonRunner: Times: total = 173, boot = -3, init = 176, finish = 0 23/09/27 19:43:00 INFO Executor: Finished task 68.0 in stage 4.0 (TID 453). 2047 bytes result sent to driver 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 479 23/09/27 19:43:00 INFO Executor: Running task 94.0 in stage 4.0 (TID 479) 23/09/27 19:43:00 INFO PythonRunner: Times: total = 244, boot = -4, init = 248, finish = 0 23/09/27 19:43:00 INFO Executor: Finished task 61.0 in stage 4.0 (TID 446). 
2047 bytes result sent to driver 23/09/27 19:43:00 INFO PythonRunner: Times: total = 194, boot = 8, init = 186, finish = 0 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 489 23/09/27 19:43:00 INFO Executor: Finished task 81.0 in stage 4.0 (TID 466). 2047 bytes result sent to driver 23/09/27 19:43:00 INFO Executor: Running task 104.0 in stage 4.0 (TID 489) 23/09/27 19:43:00 INFO PythonRunner: Times: total = 249, boot = -5, init = 254, finish = 0 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 494 23/09/27 19:43:00 INFO Executor: Running task 109.0 in stage 4.0 (TID 494) 23/09/27 19:43:00 INFO Executor: Finished task 69.0 in stage 4.0 (TID 454). 2047 bytes result sent to driver 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 499 23/09/27 19:43:00 INFO Executor: Running task 114.0 in stage 4.0 (TID 499) 23/09/27 19:43:00 INFO PythonRunner: Times: total = 215, boot = 1, init = 214, finish = 0 23/09/27 19:43:00 INFO Executor: Finished task 89.0 in stage 4.0 (TID 474). 2047 bytes result sent to driver 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 507 23/09/27 19:43:00 INFO Executor: Running task 122.0 in stage 4.0 (TID 507) 23/09/27 19:43:00 INFO PythonRunner: Times: total = 272, boot = 15, init = 256, finish = 1 23/09/27 19:43:00 INFO Executor: Finished task 88.0 in stage 4.0 (TID 473). 2047 bytes result sent to driver 23/09/27 19:43:00 INFO PythonRunner: Times: total = 239, boot = 6, init = 233, finish = 0 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 515 23/09/27 19:43:00 INFO Executor: Running task 130.0 in stage 4.0 (TID 515) 23/09/27 19:43:00 INFO Executor: Finished task 94.0 in stage 4.0 (TID 479). 
2047 bytes result sent to driver 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 519 23/09/27 19:43:00 INFO Executor: Running task 134.0 in stage 4.0 (TID 519) 23/09/27 19:43:00 INFO PythonRunner: Times: total = 240, boot = -7, init = 247, finish = 0 23/09/27 19:43:00 INFO Executor: Finished task 114.0 in stage 4.0 (TID 499). 2047 bytes result sent to driver 23/09/27 19:43:00 INFO PythonRunner: Times: total = 274, boot = 0, init = 274, finish = 0 23/09/27 19:43:00 INFO PythonRunner: Times: total = 259, boot = -7, init = 266, finish = 0 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 536 23/09/27 19:43:00 INFO Executor: Running task 151.0 in stage 4.0 (TID 536) 23/09/27 19:43:00 INFO Executor: Finished task 104.0 in stage 4.0 (TID 489). 2047 bytes result sent to driver 23/09/27 19:43:00 INFO Executor: Finished task 109.0 in stage 4.0 (TID 494). 2047 bytes result sent to driver 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 537 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 538 23/09/27 19:43:00 INFO Executor: Running task 152.0 in stage 4.0 (TID 537) 23/09/27 19:43:00 INFO Executor: Running task 153.0 in stage 4.0 (TID 538) 23/09/27 19:43:00 INFO PythonRunner: Times: total = 269, boot = 9, init = 260, finish = 0 23/09/27 19:43:00 INFO Executor: Finished task 122.0 in stage 4.0 (TID 507). 2047 bytes result sent to driver 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 547 23/09/27 19:43:00 INFO Executor: Running task 162.0 in stage 4.0 (TID 547) 23/09/27 19:43:00 INFO PythonRunner: Times: total = 246, boot = -10, init = 256, finish = 0 23/09/27 19:43:00 INFO Executor: Finished task 134.0 in stage 4.0 (TID 519). 
2047 bytes result sent to driver 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 560 23/09/27 19:43:00 INFO Executor: Running task 175.0 in stage 4.0 (TID 560) 23/09/27 19:43:00 INFO PythonRunner: Times: total = 297, boot = 6, init = 290, finish = 1 23/09/27 19:43:00 INFO Executor: Finished task 130.0 in stage 4.0 (TID 515). 2047 bytes result sent to driver 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 568 23/09/27 19:43:00 INFO Executor: Running task 183.0 in stage 4.0 (TID 568) 23/09/27 19:43:00 INFO PythonRunner: Times: total = 241, boot = 3, init = 238, finish = 0 23/09/27 19:43:00 INFO Executor: Finished task 151.0 in stage 4.0 (TID 536). 2047 bytes result sent to driver 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 570 23/09/27 19:43:00 INFO Executor: Running task 185.0 in stage 4.0 (TID 570) 23/09/27 19:43:00 INFO PythonRunner: Times: total = 239, boot = 7, init = 232, finish = 0 23/09/27 19:43:00 INFO Executor: Finished task 152.0 in stage 4.0 (TID 537). 2047 bytes result sent to driver 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 571 23/09/27 19:43:00 INFO Executor: Running task 186.0 in stage 4.0 (TID 571) 23/09/27 19:43:00 INFO PythonRunner: Times: total = 258, boot = 14, init = 244, finish = 0 23/09/27 19:43:00 INFO Executor: Finished task 153.0 in stage 4.0 (TID 538). 2047 bytes result sent to driver 23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 574 23/09/27 19:43:01 INFO Executor: Running task 189.0 in stage 4.0 (TID 574) 23/09/27 19:43:01 INFO PythonRunner: Times: total = 215, boot = 15, init = 200, finish = 0 23/09/27 19:43:01 INFO Executor: Finished task 162.0 in stage 4.0 (TID 547). 2004 bytes result sent to driver 23/09/27 19:43:01 INFO PythonRunner: Times: total = 162, boot = -6, init = 168, finish = 0 23/09/27 19:43:01 INFO Executor: Finished task 185.0 in stage 4.0 (TID 570). 
2047 bytes result sent to driver 23/09/27 19:43:01 INFO PythonRunner: Times: total = 230, boot = -5, init = 235, finish = 0 23/09/27 19:43:01 INFO Executor: Finished task 175.0 in stage 4.0 (TID 560). 2047 bytes result sent to driver 23/09/27 19:43:01 INFO PythonRunner: Times: total = 154, boot = 0, init = 154, finish = 0 23/09/27 19:43:01 INFO Executor: Finished task 189.0 in stage 4.0 (TID 574). 2004 bytes result sent to driver 23/09/27 19:43:01 INFO PythonRunner: Times: total = 244, boot = 15, init = 229, finish = 0 23/09/27 19:43:01 INFO Executor: Finished task 183.0 in stage 4.0 (TID 568). 2047 bytes result sent to driver 23/09/27 19:43:01 INFO PythonRunner: Times: total = 219, boot = 7, init = 212, finish = 0 23/09/27 19:43:01 INFO Executor: Finished task 186.0 in stage 4.0 (TID 571). 2047 bytes result sent to driver 23/09/27 19:43:01 INFO UCX: UCX context created 23/09/27 19:43:01 INFO UCX: UCX Worker created 23/09/27 19:43:02 INFO UCX: Started UcpListener on /:57306 23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Registering executor BlockManagerId(1, , 32805, Some(rapids=57306)) with driver 23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(0, , 41505, Some(rapids=62205)) 23/09/27 19:43:02 INFO UCX: Creating connection for executorId 0 23/09/27 19:43:02 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@331ae0ff, peerExecutorId=0) started 23/09/27 19:43:02 INFO UCX: Got UcpListener request from /:46124 23/09/27 19:43:02 INFO UCX: Created ConnectionRequest endpoint UcpEndpoint(id=139640732815552, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/:46124) for /:46124 23/09/27 19:43:02 INFO UCX: Got UcpListener request from /:46148 23/09/27 19:43:02 INFO UCX: Created ConnectionRequest endpoint UcpEndpoint(id=139640732815616, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/:46148) 
for /:46148 23/09/27 19:43:02 INFO UCX: Got UcpListener request from /:46136 23/09/27 19:43:02 INFO UCX: Created ConnectionRequest endpoint UcpEndpoint(id=139640732815680, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/:46136) for /:46136 23/09/27 19:43:02 INFO UCX: Success sending handshake header! java.nio.DirectByteBuffer[pos=0 lim=281 cap=281] 23/09/27 19:43:02 INFO UCX: Success sending handshake header! java.nio.DirectByteBuffer[pos=0 lim=281 cap=281] 23/09/27 19:43:02 INFO UCX: Success sending handshake header! java.nio.DirectByteBuffer[pos=0 lim=281 cap=281] 23/09/27 19:43:02 INFO UCX: Established endpoint on ConnectionRequest for executor 3: UcpEndpoint(id=139640732815616, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/:46148) 23/09/27 19:43:02 INFO UCX: Established endpoint on ConnectionRequest for executor 2: UcpEndpoint(id=139640732815680, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/:46136) 23/09/27 19:43:02 INFO UCX: Established endpoint on ConnectionRequest for executor 0: UcpEndpoint(id=139640732815552, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/:46124) 23/09/27 19:43:02 INFO CoarseGrainedExecutorBackend: Got assigned task 577 23/09/27 19:43:02 INFO Executor: Running task 0.0 in stage 6.0 (TID 577) 23/09/27 19:43:02 INFO MapOutputTrackerWorker: Updating epoch to 4 and clearing cache 23/09/27 19:43:02 INFO TorrentBroadcast: Started reading broadcast variable 4 with 1 pieces (estimated total size 4.0 MiB) 23/09/27 19:43:02 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 13.2 KiB, free 47.8 GiB) 23/09/27 19:43:02 INFO TorrentBroadcast: Reading broadcast variable 4 took 7 ms 23/09/27 19:43:02 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 29.0 KiB, free 47.8 GiB) 23/09/27 19:43:04 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 2, 
fetching them 23/09/27 19:43:04 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@master:35063) 23/09/27 19:43:04 INFO MapOutputTrackerWorker: Got the map output locations 23/09/27 19:43:04 INFO ShuffleBlockFetcherIterator: Getting 2 (194.0 B) non-empty blocks including 0 (0.0 B) local and 1 (97.0 B) host-local and 0 (0.0 B) push-merged-local and 1 (97.0 B) remote blocks 23/09/27 19:43:04 INFO TransportClientFactory: Successfully created connection to /:38739 after 2 ms (0 ms spent in bootstraps) 23/09/27 19:43:04 INFO ShuffleBlockFetcherIterator: Started 1 remote fetches in 21 ms 23/09/27 19:43:04 INFO TransportClientFactory: Successfully created connection to /:36401 after 1 ms (0 ms spent in bootstraps) 23/09/27 19:43:04 INFO CodeGenerator: Code generated in 48.927407 ms 23/09/27 19:43:04 INFO CodeGenerator: Code generated in 19.687474 ms 23/09/27 19:43:04 INFO Executor: Finished task 0.0 in stage 6.0 (TID 577). 4021 bytes result sent to driver 23/09/27 19:43:05 INFO CoarseGrainedExecutorBackend: Got assigned task 584 23/09/27 19:43:05 INFO Executor: Running task 6.0 in stage 9.0 (TID 584) 23/09/27 19:43:05 INFO MapOutputTrackerWorker: Updating epoch to 5 and clearing cache 23/09/27 19:43:05 INFO TorrentBroadcast: Started reading broadcast variable 5 with 1 pieces (estimated total size 4.0 MiB) 23/09/27 19:43:05 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 28.7 KiB, free 47.8 GiB) 23/09/27 19:43:05 INFO TorrentBroadcast: Reading broadcast variable 5 took 17 ms 23/09/27 19:43:05 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 59.1 KiB, free 47.8 GiB) 23/09/27 19:43:05 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 3, fetching them 23/09/27 19:43:05 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@master:35063) 23/09/27 19:43:05 INFO MapOutputTrackerWorker: 
Got the map output locations 23/09/27 19:43:05 INFO ShuffleBlockFetcherIterator: Getting 0 (0.0 B) non-empty blocks including 0 (0.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks 23/09/27 19:43:05 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms 23/09/27 19:43:05 INFO CodeGenerator: Code generated in 14.846185 ms 23/09/27 19:43:05 INFO PythonRunner: Times: total = 139, boot = -4325, init = 4464, finish = 0 23/09/27 19:43:05 INFO Executor: Finished task 6.0 in stage 9.0 (TID 584). 6740 bytes result sent to driver 23/09/27 19:43:07 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(3, , 36401, Some(rapids=29068)) 23/09/27 19:43:07 INFO UCX: Creating connection for executorId 3 23/09/27 19:43:07 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@331ae0ff, peerExecutorId=3) started 23/09/27 19:43:07 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(2, , 36445, Some(rapids=50893)) 23/09/27 19:43:07 INFO UCX: Creating connection for executorId 2 23/09/27 19:43:07 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@331ae0ff, peerExecutorId=2) started 23/09/27 19:46:00 INFO RapidsShuffleInternalManager: Unregistering shuffle 1 from shuffle buffer catalog 23/09/27 19:46:00 WARN ShuffleBufferCatalog: Ignoring unregister of unknown shuffle 1 23/09/27 19:46:32 ERROR UCX: UcpListener detected an error for executorId 2: UCXError(-25,Connection reset by remote peer) 23/09/27 19:46:32 WARN UCX: Removing endpoint UcpEndpoint(id=139640732815680, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/:46136) for 2 23/09/27 19:46:32 WARN UCX: Removed stale client connection for 2 23/09/27 19:46:32 ERROR UCX: Error while closing ep. Ignoring. 
org.openucx.jucx.UcxException: Connection reset by remote peer at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingNative(Native Method) at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingFlush(UcpEndpoint.java:441) at com.nvidia.spark.rapids.shuffle.ucx.UCX$UcpEndpointManager.$anonfun$closeEndpointOnWorkerThread$1(UCX.scala:904) at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5(UCX.scala:188) at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5$adapted(UCX.scala:182) at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29) at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$2(UCX.scala:182) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at com.nvidia.spark.rapids.GpuDeviceManager$$anon$1.$anonfun$newThread$1(GpuDeviceManager.scala:490) at java.base/java.lang.Thread.run(Thread.java:833) 23/09/27 19:46:32 ERROR UCX: UcpListener detected an error for executorId 3: UCXError(-25,Connection reset by remote peer) 23/09/27 19:46:32 WARN UCX: Removing endpoint UcpEndpoint(id=139640732815616, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/:46148) for 3 23/09/27 19:46:32 WARN UCX: Removed stale client connection for 3 23/09/27 19:46:32 ERROR UCX: Error while closing ep. Ignoring. 
org.openucx.jucx.UcxException: Connection reset by remote peer at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingNative(Native Method) at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingFlush(UcpEndpoint.java:441) at com.nvidia.spark.rapids.shuffle.ucx.UCX$UcpEndpointManager.$anonfun$closeEndpointOnWorkerThread$1(UCX.scala:904) at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5(UCX.scala:188) at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5$adapted(UCX.scala:182) at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29) at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$2(UCX.scala:182) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at com.nvidia.spark.rapids.GpuDeviceManager$$anon$1.$anonfun$newThread$1(GpuDeviceManager.scala:490) at java.base/java.lang.Thread.run(Thread.java:833) 23/09/27 19:46:32 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown 23/09/27 19:46:32 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM tdown ```
Killed Executor ``` 23/09/27 19:42:59 INFO MapOutputTrackerWorker: Updating epoch to 2 and clearing cache 23/09/27 19:43:00 INFO TorrentBroadcast: Started reading broadcast variable 3 with 1 pieces (estimated total size 4.0 MiB) 23/09/27 19:43:00 INFO TransportClientFactory: Successfully created connection to /:32805 after 2 ms (0 ms spent in bootstraps) 23/09/27 19:43:00 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 9.3 KiB, free 47.8 GiB) 23/09/27 19:43:00 INFO TorrentBroadcast: Reading broadcast variable 3 took 163 ms 23/09/27 19:43:00 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 19.5 KiB, free 47.8 GiB) 23/09/27 19:43:01 INFO UCX: UCX context created 23/09/27 19:43:01 INFO UCX: UCX Worker created 23/09/27 19:43:02 INFO UCX: Started UcpListener on /:50893 23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Registering executor BlockManagerId(2, , 36445, Some(rapids=50893)) with driver 23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(1, , 32805, Some(rapids=57306)) 23/09/27 19:43:02 INFO UCX: Creating connection for executorId 1 23/09/27 19:43:02 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@6d246c2, peerExecutorId=1) started 23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(0, , 41505, Some(rapids=62205)) 23/09/27 19:43:02 INFO UCX: Creating connection for executorId 0 23/09/27 19:43:02 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@6d246c2, peerExecutorId=0) started 23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(3, , 36401, Some(rapids=29068)) 23/09/27 19:43:02 INFO UCX: Creating connection for executorId 3 23/09/27 19:43:02 INFO UCXClientConnection: UCX Client 
UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@6d246c2, peerExecutorId=3) started 23/09/27 19:43:02 INFO UCX: Got UcpListener request from /:57878 23/09/27 19:43:02 INFO UCX: Created ConnectionRequest endpoint UcpEndpoint(id=140137543848256, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/:57878) for /:57878 23/09/27 19:43:02 INFO CodeGenerator: Code generated in 370.253869 ms 23/09/27 19:43:02 INFO PythonRunner: Times: total = 1141, boot = 834, init = 307, finish = 0 23/09/27 19:43:02 INFO PythonRunner: Times: total = 1057, boot = 845, init = 212, finish = 0 23/09/27 19:43:02 INFO PythonRunner: Times: total = 1121, boot = 840, init = 281, finish = 0 23/09/27 19:43:02 INFO PythonRunner: Times: total = 1152, boot = 852, init = 300, finish = 0 23/09/27 19:43:02 INFO PythonRunner: Times: total = 1129, boot = 824, init = 305, finish = 0 23/09/27 19:43:02 INFO PythonRunner: Times: total = 1141, boot = 829, init = 312, finish = 0 23/09/27 19:43:02 INFO Executor: Finished task 29.0 in stage 4.0 (TID 414). 2090 bytes result sent to driver 23/09/27 19:43:02 INFO Executor: Finished task 13.0 in stage 4.0 (TID 398). 2090 bytes result sent to driver 23/09/27 19:43:02 INFO Executor: Finished task 5.0 in stage 4.0 (TID 390). 2090 bytes result sent to driver 23/09/27 19:43:02 INFO Executor: Finished task 37.0 in stage 4.0 (TID 422). 2090 bytes result sent to driver 23/09/27 19:43:02 INFO Executor: Finished task 45.0 in stage 4.0 (TID 430). 2090 bytes result sent to driver 23/09/27 19:43:02 INFO Executor: Finished task 21.0 in stage 4.0 (TID 406). 2090 bytes result sent to driver 23/09/27 19:43:02 INFO UCX: Success sending handshake header! 
java.nio.DirectByteBuffer[pos=0 lim=281 cap=281] 23/09/27 19:43:02 INFO UCX: Established endpoint on ConnectionRequest for executor 3: UcpEndpoint(id=140137543848256, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/:57878) 23/09/27 19:43:05 INFO CoarseGrainedExecutorBackend: Got assigned task 578 23/09/27 19:43:05 INFO Executor: Running task 2.0 in stage 9.0 (TID 578) 23/09/27 19:43:05 INFO MapOutputTrackerWorker: Updating epoch to 5 and clearing cache 23/09/27 19:43:05 INFO TorrentBroadcast: Started reading broadcast variable 5 with 1 pieces (estimated total size 4.0 MiB) 23/09/27 19:43:05 INFO TransportClientFactory: Successfully created connection to master/:33961 after 2 ms (0 ms spent in bootstraps) 23/09/27 19:43:05 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 28.7 KiB, free 47.8 GiB) 23/09/27 19:43:05 INFO TorrentBroadcast: Reading broadcast variable 5 took 20 ms 23/09/27 19:43:05 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 59.1 KiB, free 47.8 GiB) 23/09/27 19:43:07 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 3, fetching them 23/09/27 19:43:07 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@master:35063) 23/09/27 19:43:07 INFO MapOutputTrackerWorker: Got the map output locations 23/09/27 19:43:07 INFO ShuffleBlockFetcherIterator: Getting 1 (72.0 B) non-empty blocks including 0 (0.0 B) local and 1 (72.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks 23/09/27 19:43:07 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 15 ms 23/09/27 19:43:07 INFO CodeGenerator: Code generated in 19.422264 ms 2023-09-27 19:43:08,459 - spark_rapids_ml.regression.LinearRegression - INFO - Initializing cuml context 23/09/27 19:43:12 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) has entered the global sync, current barrier epoch is 0. 
23/09/27 19:44:12 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1695811392156, has been waiting for 60 seconds, current barrier epoch is 0. 23/09/27 19:45:12 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1695811392156, has been waiting for 120 seconds, current barrier epoch is 0. 23/09/27 19:46:00 INFO RapidsShuffleInternalManager: Unregistering shuffle 1 from shuffle buffer catalog 23/09/27 19:46:00 WARN ShuffleBufferCatalog: Ignoring unregister of unknown shuffle 1 23/09/27 19:46:12 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1695811392156, has been waiting for 180 seconds, current barrier epoch is 0. 23/09/27 19:46:31 INFO Executor: Executor is trying to kill task 2.0 in stage 9.0 (TID 578), reason: Stage cancelled: Job 5 cancelled as part of cancellation of all jobs 23/09/27 19:46:32 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[accept-connections,5,main] org.apache.spark.TaskKilledException at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:267) at org.apache.spark.BarrierTaskContext.$anonfun$runBarrier$3(BarrierTaskContext.scala:94) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at scala.util.Try$.apply(Try.scala:213) at org.apache.spark.BarrierTaskContext.runBarrier(BarrierTaskContext.scala:94) at org.apache.spark.BarrierTaskContext.allGather(BarrierTaskContext.scala:179) at org.apache.spark.api.python.BasePythonRunner$WriterThread.barrierAndServe(PythonRunner.scala:490) at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anon$1.run(PythonRunner.scala:321) 23/09/27 19:46:32 INFO Executor: Executor killed task 2.0 in stage 9.0 (TID 578), reason: Stage cancelled: Job 5 cancelled as part of cancellation of all jobs 23/09/27 19:46:32 INFO RapidsBufferCatalog: Closing storage 23/09/27 19:46:32 INFO UCXShuffleTransport: UCX 
transport closing 23/09/27 19:46:32 WARN UCX: UCX is shutting down 23/09/27 19:46:32 INFO UCX: De-registering UCX 3 memory buffers. 23/09/27 19:46:32 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown 23/09/27 19:46:32 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM tdown ```

Here is the spark.conf containing the related options. I tried disabling the options related to UDFs (Scala UDF, UDF compiler, etc.), but it did not help.

`spark.conf`

```
spark.master spark://master:7077

# Resource-related configs
spark.executor.instances 8
spark.executor.cores 6
spark.executor.memory 80G
spark.driver.memory 80G
spark.executor.memoryOverhead 1G

# Task-related
spark.default.parallelism 192
spark.sql.shuffle.partitions 192
spark.driver.maxResultSize 30G
spark.sql.files.maxPartitionBytes 4096m
# spark.sql.files.maxPartitionBytes 8192m
spark.sql.execution.sortBeforeRepartition false
spark.sql.adaptive.enabled true

# GPU-related Configs
spark.executor.resource.gpu.amount 1
spark.executor.resource.gpu.discoveryScript /usr/lib/spark/scripts/gpu/getGpusResources.sh
spark.executor.resources.discoveryPlugin com.nvidia.spark.ExclusiveModeGpuDiscoveryPlugin
spark.plugins com.nvidia.spark.SQLPlugin
spark.rapids.memory.gpu.debug STDOUT
spark.rapids.memory.gpu.pool NONE
spark.rapids.memory.pinnedPool.size 20G
spark.rapids.shuffle.multiThreaded.reader.threads 24
spark.rapids.shuffle.multiThreaded.writer.threads 24
spark.rapids.sql.concurrentGpuTasks 2
spark.rapids.sql.enabled true
spark.rapids.sql.exec.CollectLimitExec true
spark.rapids.sql.explain all
spark.rapids.sql.expression.ScalaUDF true
spark.rapids.sql.metrics.level DEBUG
spark.rapids.sql.rowBasedUDF.enabled true
spark.rapids.sql.udfCompiler.enabled true
spark.shuffle.manager com.nvidia.spark.rapids.spark350.RapidsShuffleManager
spark.task.resource.gpu.amount 0.166
spark.sql.cache.serializer com.nvidia.spark.ParquetCachedBatchSerializer
spark.rapids.shuffle.mode UCX
spark.shuffle.service.enabled false
spark.dynamicAllocation.enabled false
spark.executorEnv.UCX_ERROR_SIGNALS
spark.executorEnv.UCX_MEMTYPE_CACHE n
spark.executorEnv.UCX_IB_RX_QUEUE_LEN 1024
spark.executorEnv.UCX_TLS cuda_copy,cuda_ipc,rc,tcp
spark.executorEnv.UCX_RNDV_SCHEME put_zcopy
spark.executorEnv.UCX_MAX_RNDV_RAILS 1
spark.executorEnv.UCX_IB_GPU_DIRECT_RDMA n
```
abellina commented 11 months ago

@an-ys

I am sorry your application isn't making progress. I'll chime in from the spark-rapids side, specifically around the shuffle (for now).

I see that you have configured the UCX shuffle. Do you have RDMA-capable networking, or several GPUs in one box? If you have neither, we recommend the default MULTITHREADED shuffle mode. Note also that these configs only apply to the MULTITHREADED shuffle:

spark.rapids.shuffle.multiThreaded.reader.threads   24
spark.rapids.shuffle.multiThreaded.writer.threads   24

I would start by removing the RapidsShuffleManager to see if that unblocks your job. If it does, then the hang is likely either an incompatibility between the UCX shuffle and your hardware/OS, or a bug in the UCX shuffle itself.
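Alternatively, if you want to keep the RapidsShuffleManager and only drop UCX, a sketch of the relevant conf changes, based on the conf you posted (MULTITHREADED is the default value of `spark.rapids.shuffle.mode`):

```
spark.shuffle.manager     com.nvidia.spark.rapids.spark350.RapidsShuffleManager
spark.rapids.shuffle.mode MULTITHREADED

# with MULTITHREADED mode, these thread-count configs now take effect
spark.rapids.shuffle.multiThreaded.reader.threads 24
spark.rapids.shuffle.multiThreaded.writer.threads 24
```

The `spark.executorEnv.UCX_*` environment variables can be removed in that setup.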

Also, if you do get another hang, a jstack of the executors would be useful. You can also view a stack trace in the Spark UI under the Executors tab (click "Thread Dump").
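If the UI isn't reachable, jstack from the worker node works too. A rough sketch (assumes the JDK tools `jps` and `jstack` are on the PATH of the worker):

```shell
# Dump thread stacks for every Spark executor JVM on this node.
# CoarseGrainedExecutorBackend is the executor's main class.
for pid in $(jps -l 2>/dev/null | awk '/CoarseGrainedExecutorBackend/ {print $1}'); do
  jstack "$pid" > "executor_${pid}_threads.txt"
done
```

In a hang like yours, the interesting part is which threads are parked in the barrier/allGather calls.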

eordentlich commented 11 months ago

Also, looks like you may have pasted the successful executor log twice as the logs look identical. Would be great to see the bad one when you get a chance.

an-ys commented 11 months ago

Thanks for your reply. I have fixed the executor logs.

I don't remember setting up RDMA, but the RDMA packages are installed and I can run ucx_perftest with the example from the docs: `CUDA_VISIBLE_DEVICES=3 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda`. (I don't think it's related, but when I tried disabling UCX_IB_GPU_DIRECT_RDMA for ucx_perftest, I got unused environment variable: UCX_IB_GPU_DIRECT_RDMA (maybe: UCX_IB_GPU_DIRECT_RDMA?).) The nodes are connected via a switch over regular Ethernet (no InfiniBand).

~Removing the multithreaded configs gave me "java.lang.IllegalStateException: The ShuffleBufferCatalog is not initialized but the RapidsShuffleManager is configured". When I switched from UCX mode to MULTITHREADED, that error went away, but the job still hung like before.~ When I later changed back to UCX, it worked without the multithreaded configs.

I tried running the LinReg and KMeans examples again with an updated configuration where I commented out most of the shuffle-related settings. Here is the new configuration, omitting the configs for the history server, Python/Java, and timeout values.

New Config ``` spark.master spark://master:7077 # Resource-related configs spark.executor.instances 8 spark.executor.cores 6 spark.executor.memory 80G spark.driver.memory 80G spark.executor.memoryOverhead 1G # Task-related spark.default.parallelism 192 spark.sql.shuffle.partitions 192 spark.driver.maxResultSize 30G spark.sql.files.maxPartitionBytes 4096m spark.sql.execution.sortBeforeRepartition false spark.sql.adaptive.enabled true # GPU-related Configs spark.executor.resource.gpu.amount 1 spark.executor.resource.gpu.discoveryScript /usr/lib/spark/scripts/gpu/getGpusResources.sh spark.executor.resources.discoveryPlugin com.nvidia.spark.ExclusiveModeGpuDiscoveryPlugin spark.plugins com.nvidia.spark.SQLPlugin spark.rapids.memory.gpu.debug STDOUT spark.rapids.memory.gpu.pool NONE spark.rapids.memory.pinnedPool.size 20G spark.rapids.sql.concurrentGpuTasks 2 spark.rapids.sql.enabled true spark.rapids.sql.explain all spark.rapids.sql.expression.ScalaUDF true spark.rapids.sql.metrics.level DEBUG spark.rapids.sql.rowBasedUDF.enabled true spark.rapids.sql.udfCompiler.enabled true spark.task.resource.gpu.amount 0.166 ```

In the KMeans example, no executors are killed, but 4 executors, all on the same node, hang. Also, there are two pending stages: stage 7 with 192 tasks and stage 8 with 1 task, both with the same description (javaToPython at NativeMethodAccessorImpl.java:0).

``` org.apache.spark.sql.Dataset.javaToPython(Dataset.scala:4151) java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) java.base/java.lang.reflect.Method.invoke(Method.java:568) py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374) py4j.Gateway.invoke(Gateway.java:282) py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) py4j.commands.CallCommand.execute(CallCommand.java:79) py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) py4j.ClientServerConnection.run(ClientServerConnection.java:106) java.base/java.lang.Thread.run(Thread.java:833) ```

The status for stage 9, the barrier stage, is (4 + 4) / 8 on the CLI but "3/8 (5 running)" on the Spark UI. It's the same for linear regression, with stages 7 and 8 pending and stage 9 active at (6 + 2) / 8. The executors were not killed, so there is no thread dump. Interestingly, the hanging executors always have the same indices on the Spark UI (2 and 3).
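The "waiting under the global sync" log lines below reflect how barrier stages work: every task in the stage must reach the sync before any can continue, so if some tasks never enter the barrier (e.g. they never launch, or stall before the sync), the ones that did enter just wait. A pure-Python analogy using `threading.Barrier` (not Spark code, only an illustration of the semantics):

```python
import threading

# 3 "tasks" must all reach the barrier; task 2 never enters the sync,
# so the other two wait until their timeout breaks the barrier.
barrier = threading.Barrier(parties=3)
results = []

def task(i, enters_sync):
    if not enters_sync:
        return  # analogous to a stuck executor that never calls the barrier
    try:
        barrier.wait(timeout=0.5)
        results.append((i, "passed"))
    except threading.BrokenBarrierError:
        results.append((i, "timed out"))

threads = [threading.Thread(target=task, args=(i, i != 2)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # both entrants time out; no task "passed"
```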

Executor with SUCCESS state ``` 23/10/03 04:52:20 INFO Executor: Running task 6.0 in stage 9.0 (TID 584) 23/10/03 04:52:20 INFO MapOutputTrackerWorker: Updating epoch to 5 and clearing cache 23/10/03 04:52:21 INFO TorrentBroadcast: Started reading broadcast variable 5 with 1 pieces (estimated total size 4.0 MiB) 23/10/03 04:52:21 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 28.0 KiB, free 47.8 GiB) 23/10/03 04:52:21 INFO TorrentBroadcast: Reading broadcast variable 5 took 11 ms 23/10/03 04:52:21 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 57.3 KiB, free 47.8 GiB) 23/10/03 04:52:23 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 3, fetching them 23/10/03 04:52:23 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@master:42159) 23/10/03 04:52:23 INFO MapOutputTrackerWorker: Got the map output locations 23/10/03 04:52:23 INFO ShuffleBlockFetcherIterator: Getting 0 (0.0 B) non-empty blocks including 0 (0.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks 23/10/03 04:52:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 11 ms 23/10/03 04:52:23 INFO CodeGenerator: Code generated in 20.623421 ms 23/10/03 04:52:23 INFO PythonRunner: Times: total = 163, boot = -4851, init = 5014, finish = 0 23/10/03 04:52:23 INFO Executor: Finished task 6.0 in stage 9.0 (TID 584). 6826 bytes result sent to driver ```
Executor with RUNNING state ``` 23/10/03 04:52:20 INFO Executor: Running task 2.0 in stage 9.0 (TID 578) 23/10/03 04:52:21 INFO MapOutputTrackerWorker: Updating epoch to 5 and clearing cache 23/10/03 04:52:21 INFO TorrentBroadcast: Started reading broadcast variable 5 with 1 pieces (estimated total size 4.0 MiB) 23/10/03 04:52:21 INFO TransportClientFactory: Successfully created connection to /:35029 after 3 ms (0 ms spent in bootstraps) 23/10/03 04:52:21 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 28.0 KiB, free 47.8 GiB) 23/10/03 04:52:21 INFO TorrentBroadcast: Reading broadcast variable 5 took 39 ms 23/10/03 04:52:21 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 57.3 KiB, free 47.8 GiB) 23/10/03 04:52:23 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 3, fetching them 23/10/03 04:52:23 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@master:42159) 23/10/03 04:52:23 INFO MapOutputTrackerWorker: Got the map output locations 23/10/03 04:52:23 INFO ShuffleBlockFetcherIterator: Getting 1 (80.0 B) non-empty blocks including 0 (0.0 B) local and 1 (80.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks 23/10/03 04:52:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 12 ms 23/10/03 04:52:23 INFO TransportClientFactory: Successfully created connection to /:34293 after 2 ms (0 ms spent in bootstraps) 23/10/03 04:52:23 INFO CodeGenerator: Code generated in 19.602622 ms 2023-10-03 04:52:24,307 - spark_rapids_ml.regression.LinearRegression - INFO - Initializing cuml context 23/10/03 04:52:27 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) has entered the global sync, current barrier epoch is 0. 23/10/03 04:53:27 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696276347468, has been waiting for 60 seconds, current barrier epoch is 0. 
23/10/03 04:54:27 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696276347468, has been waiting for 120 seconds, current barrier epoch is 0. 23/10/03 04:55:27 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696276347468, has been waiting for 180 seconds, current barrier epoch is 0. 23/10/03 04:56:27 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696276347468, has been waiting for 240 seconds, current barrier epoch is 0. 23/10/03 04:57:27 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696276347468, has been waiting for 300 seconds, current barrier epoch is 0. 23/10/03 04:58:27 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696276347468, has been waiting for 360 seconds, current barrier epoch is 0. ```

I thought the executors were not killed because UCX was not in use, but when I switched back to UCX, the 2 executors were still not killed. I am not sure if this is because I updated the RAPIDS packages on both servers before rerunning. For linear regression, the two executors now hang on the other node, which did not happen before, and that node has its log4j level set to TRACE, so there is more information.

Executor with SUCCESS state ``` 23/10/03 05:30:04 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] FLUSH 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] READ: 182B +-------------------------------------------------+ | 0 1 2 3 4 5 6 7 8 9 a b c d e f | +--------+-------------------------------------------------+----------------+ |00000000| 00 00 00 00 00 00 00 b6 03 56 7f 26 69 be cb 8a |.........V.&i...| |00000010| a1 00 00 00 a1 01 00 06 6d 61 73 74 65 72 00 00 |........master..| |00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage| |00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr| |00000040| 00 3d 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.=org.apache.spa| |00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block| |00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$| |00000070| 52 65 6d 6f 76 65 42 72 6f 61 64 63 61 73 74 42 |RemoveBroadcastB| |00000080| 56 16 a3 29 3b 95 6d 02 00 02 4a 00 0b 62 72 6f |V..);.m...J..bro| |00000090| 61 64 63 61 73 74 49 64 5a 00 10 72 65 6d 6f 76 |adcastIdZ..remov| |000000a0| 65 46 72 6f 6d 44 72 69 76 65 72 78 70 00 00 00 |eFromDriverxp...| |000000b0| 00 00 00 00 02 01 |...... 
| +--------+-------------------------------------------------+----------------+ 23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=6232742644918487713,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 182, cap: 4096)]] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] READ COMPLETE 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing broadcast 2 23/10/03 05:33:20 DEBUG BlockManager: Removing broadcast 2 23/10/03 05:33:20 DEBUG BlockManager: Removing block broadcast_2_piece0 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to acquire write lock for broadcast_2_piece0 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 acquired write lock for broadcast_2_piece0 23/10/03 05:33:20 DEBUG MemoryStore: Block broadcast_2_piece0 of size 9556 dropped from memory (free 51350724480) 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to remove block broadcast_2_piece0 23/10/03 05:33:20 TRACE TransportClient: Sending RPC to master/:33881 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 181] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] FLUSH 23/10/03 05:33:20 TRACE TransportClient: Sending request RPC 5985864061522782634 to master/:33881 took 2 ms 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] READ: 68B +-------------------------------------------------+ | 0 1 2 3 4 5 6 7 8 9 a b c d e f | +--------+-------------------------------------------------+----------------+ |00000000| 00 00 00 00 00 00 00 44 04 53 12 0f a6 44 d7 a9 |.......D.S...D..| |00000010| aa 00 00 00 2f ac ed 00 05 73 72 00 11 6a 61 76 |..../....sr..jav| |00000020| 61 2e 6c 61 6e 67 2e 42 6f 6f 6c 65 61 6e cd 20 |a.lang.Boolean. 
| |00000030| 72 80 d5 9c fa ee 02 00 01 5a 00 05 76 61 6c 75 |r........Z..valu| |00000040| 65 78 70 01 |exp. | +--------+-------------------------------------------------+----------------+ 23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcResponse: RpcResponse[requestId=5985864061522782634,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 68, cap: 2048)]] 23/10/03 05:33:20 DEBUG BlockManagerMaster: Updated info of block broadcast_2_piece0 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] READ COMPLETE 23/10/03 05:33:20 DEBUG BlockManager: Told master about block broadcast_2_piece0 23/10/03 05:33:20 DEBUG BlockManager: Removing block broadcast_2 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to acquire write lock for broadcast_2 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 acquired write lock for broadcast_2 23/10/03 05:33:20 DEBUG MemoryStore: Block broadcast_2 of size 19984 dropped from memory (free 51350744464) 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to remove block broadcast_2 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing broadcast 2, response is 0 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: 0 to master:33881 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 81] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] FLUSH 23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=6232742644918487713,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]]] to client master/:33881 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] READ: 154B +-------------------------------------------------+ | 0 1 2 3 4 5 6 7 8 9 a b c d e f | +--------+-------------------------------------------------+----------------+ |00000000| 00 00 00 00 00 
00 00 9a 03 74 22 73 59 a3 e9 19 |.........t"sY...| |00000010| b8 00 00 00 85 01 00 06 6d 61 73 74 65 72 00 00 |........master..| |00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage| |00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr| |00000040| 00 3b 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.;org.apache.spa| |00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block| |00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$| |00000070| 52 65 6d 6f 76 65 53 68 75 66 66 6c 65 ac e0 54 |RemoveShuffle..T| |00000080| fc 85 72 57 70 02 00 01 49 00 09 73 68 75 66 66 |..rWp...I..shuff| |00000090| 6c 65 49 64 78 70 00 00 00 01 |leIdxp.... | +--------+-------------------------------------------------+----------------+ 23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=8368377886447049144,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 154, cap: 2048)]] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] READ COMPLETE 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] READ: 182B +-------------------------------------------------+ | 0 1 2 3 4 5 6 7 8 9 a b c d e f | +--------+-------------------------------------------------+----------------+ |00000000| 00 00 00 00 00 00 00 b6 03 40 01 9c 7e f4 09 bc |.........@..~...| |00000010| b3 00 00 00 a1 01 00 06 6d 61 73 74 65 72 00 00 |........master..| |00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage| |00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr| |00000040| 00 3d 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.=org.apache.spa| |00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block| |00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$| |00000070| 52 65 6d 6f 76 65 42 72 6f 61 64 63 61 73 74 42 |RemoveBroadcastB| |00000080| 
56 16 a3 29 3b 95 6d 02 00 02 4a 00 0b 62 72 6f |V..);.m...J..bro| |00000090| 61 64 63 61 73 74 49 64 5a 00 10 72 65 6d 6f 76 |adcastIdZ..remov| |000000a0| 65 46 72 6f 6d 44 72 69 76 65 72 78 70 00 00 00 |eFromDriverxp...| |000000b0| 00 00 00 00 04 01 |...... | +--------+-------------------------------------------------+----------------+ 23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=4612139562478189747,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 182, cap: 1024)]] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] READ COMPLETE 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing shuffle 1 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing broadcast 4 23/10/03 05:33:20 DEBUG BlockManager: Removing broadcast 4 23/10/03 05:33:20 INFO RapidsShuffleInternalManager: Unregistering shuffle 1 from shuffle buffer catalog 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing broadcast 4, response is 0 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: 0 to master:33881 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 81] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] FLUSH 23/10/03 05:33:20 WARN ShuffleBufferCatalog: Ignoring unregister of unknown shuffle 1 23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=4612139562478189747,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]]] to client master/:33881 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing shuffle 1, response is true 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: true to master:33881 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 47] 23/10/03 05:33:20 TRACE 
NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] FLUSH 23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=8368377886447049144,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=64]]] to client master/:33881 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] READ: 154B +-------------------------------------------------+ | 0 1 2 3 4 5 6 7 8 9 a b c d e f | +--------+-------------------------------------------------+----------------+ |00000000| 00 00 00 00 00 00 00 9a 03 68 c8 77 89 5a fa 62 |.........h.w.Z.b| |00000010| 24 00 00 00 85 01 00 06 6d 61 73 74 65 72 00 00 |$.......master..| |00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage| |00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr| |00000040| 00 3b 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.;org.apache.spa| |00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block| |00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$| |00000070| 52 65 6d 6f 76 65 53 68 75 66 66 6c 65 ac e0 54 |RemoveShuffle..T| |00000080| fc 85 72 57 70 02 00 01 49 00 09 73 68 75 66 66 |..rWp...I..shuff| |00000090| 6c 65 49 64 78 70 00 00 00 00 |leIdxp.... 
| +--------+-------------------------------------------------+----------------+ 23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=7550416207107220004,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 154, cap: 1024)]] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] READ COMPLETE 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] READ: 182B +-------------------------------------------------+ | 0 1 2 3 4 5 6 7 8 9 a b c d e f | +--------+-------------------------------------------------+----------------+ |00000000| 00 00 00 00 00 00 00 b6 03 4f 7d 27 9d e0 33 d3 |.........O}'..3.| |00000010| 28 00 00 00 a1 01 00 06 6d 61 73 74 65 72 00 00 |(.......master..| |00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage| |00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr| |00000040| 00 3d 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.=org.apache.spa| |00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block| |00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$| |00000070| 52 65 6d 6f 76 65 42 72 6f 61 64 63 61 73 74 42 |RemoveBroadcastB| |00000080| 56 16 a3 29 3b 95 6d 02 00 02 4a 00 0b 62 72 6f |V..);.m...J..bro| |00000090| 61 64 63 61 73 74 49 64 5a 00 10 72 65 6d 6f 76 |adcastIdZ..remov| |000000a0| 65 46 72 6f 6d 44 72 69 76 65 72 78 70 00 00 00 |eFromDriverxp...| |000000b0| 00 00 00 00 03 01 |...... 
| +--------+-------------------------------------------------+----------------+ 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing shuffle 0 23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=5727777860109980456,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 182, cap: 512)]] 23/10/03 05:33:20 INFO RapidsShuffleInternalManager: Unregistering shuffle 0 from shuffle buffer catalog 23/10/03 05:33:20 WARN ShuffleBufferCatalog: Ignoring unregister of unknown shuffle 0 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] READ COMPLETE 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing broadcast 3 23/10/03 05:33:20 DEBUG BlockManager: Removing broadcast 3 23/10/03 05:33:20 DEBUG BlockManager: Removing block broadcast_3 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to acquire write lock for broadcast_3 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 acquired write lock for broadcast_3 23/10/03 05:33:20 DEBUG MemoryStore: Block broadcast_3 of size 19984 dropped from memory (free 51350764448) 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to remove block broadcast_3 23/10/03 05:33:20 DEBUG BlockManager: Removing block broadcast_3_piece0 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing shuffle 0, response is true 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to acquire write lock for broadcast_3_piece0 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 acquired write lock for broadcast_3_piece0 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: true to master:33881 23/10/03 05:33:20 DEBUG MemoryStore: Block broadcast_3_piece0 of size 9556 dropped from memory (free 51350774004) 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to remove block broadcast_3_piece0 23/10/03 05:33:20 TRACE TransportClient: Sending RPC to master/:33881 23/10/03 05:33:20 TRACE NettyLogger: [id: 
0xded1e34d, L:/:35312 - R:master/:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 47] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] FLUSH 23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=7550416207107220004,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=64]]] to client master/:33881 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 181] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] FLUSH 23/10/03 05:33:20 TRACE TransportClient: Sending request RPC 8609448405931561762 to master/:33881 took 1 ms 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] READ: 68B +-------------------------------------------------+ | 0 1 2 3 4 5 6 7 8 9 a b c d e f | +--------+-------------------------------------------------+----------------+ |00000000| 00 00 00 00 00 00 00 44 04 77 7a e7 b5 db 9a 93 |.......D.wz.....| |00000010| 22 00 00 00 2f ac ed 00 05 73 72 00 11 6a 61 76 |".../....sr..jav| |00000020| 61 2e 6c 61 6e 67 2e 42 6f 6f 6c 65 61 6e cd 20 |a.lang.Boolean. | |00000030| 72 80 d5 9c fa ee 02 00 01 5a 00 05 76 61 6c 75 |r........Z..valu| |00000040| 65 78 70 01 |exp. 
| +--------+-------------------------------------------------+----------------+ 23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcResponse: RpcResponse[requestId=8609448405931561762,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 68, cap: 512)]] 23/10/03 05:33:20 DEBUG BlockManagerMaster: Updated info of block broadcast_3_piece0 23/10/03 05:33:20 DEBUG BlockManager: Told master about block broadcast_3_piece0 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] READ COMPLETE 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing broadcast 3, response is 0 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: 0 to master:33881 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 81] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0xded1e34d, L:/:35312 - R:master/:33881] FLUSH 23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=5727777860109980456,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]]] to client master/:33881 ```
Executor with RUNNING state ``` 2023-10-03 05:30:05,390 - spark_rapids_ml.regression.LinearRegression - INFO - Initializing cuml context 23/10/03 05:30:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) has entered the global sync, current barrier epoch is 0. 23/10/03 05:30:08 TRACE BarrierTaskContext: Current callSite: CallSite($anonfun$runBarrier$2 at Logging.scala:68,org.apache.spark.BarrierTaskContext.$anonfun$runBarrier$2(BarrierTaskContext.scala:61) org.apache.spark.internal.Logging.logTrace(Logging.scala:68) org.apache.spark.internal.Logging.logTrace$(Logging.scala:67) org.apache.spark.BarrierTaskContext.logTrace(BarrierTaskContext.scala:43) org.apache.spark.BarrierTaskContext.runBarrier(BarrierTaskContext.scala:61) org.apache.spark.BarrierTaskContext.allGather(BarrierTaskContext.scala:179) org.apache.spark.api.python.BasePythonRunner$WriterThread.barrierAndServe(PythonRunner.scala:490) org.apache.spark.api.python.BasePythonRunner$WriterThread$$anon$1.run(PythonRunner.scala:321)) 23/10/03 05:30:08 TRACE TransportClient: Sending RPC to master/:33881 23/10/03 05:30:08 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 1062] 23/10/03 05:30:08 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] FLUSH 23/10/03 05:30:08 TRACE TransportClient: Sending request RPC 8875612564211632373 to master/:33881 took 2 ms 23/10/03 05:31:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696278608670, has been waiting for 60 seconds, current barrier epoch is 0. 23/10/03 05:32:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696278608670, has been waiting for 120 seconds, current barrier epoch is 0. 23/10/03 05:33:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696278608670, has been waiting for 180 seconds, current barrier epoch is 0. 
23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] READ: 182B +-------------------------------------------------+ | 0 1 2 3 4 5 6 7 8 9 a b c d e f | +--------+-------------------------------------------------+----------------+ |00000000| 00 00 00 00 00 00 00 b6 03 7f 04 b1 1a 18 87 58 |...............X| |00000010| 0b 00 00 00 a1 01 00 06 6d 61 73 74 65 72 00 00 |........master..| |00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage| |00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr| |00000040| 00 3d 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.=org.apache.spa| |00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block| |00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$| |00000070| 52 65 6d 6f 76 65 42 72 6f 61 64 63 61 73 74 42 |RemoveBroadcastB| |00000080| 56 16 a3 29 3b 95 6d 02 00 02 4a 00 0b 62 72 6f |V..);.m...J..bro| |00000090| 61 64 63 61 73 74 49 64 5a 00 10 72 65 6d 6f 76 |adcastIdZ..remov| |000000a0| 65 46 72 6f 6d 44 72 69 76 65 72 78 70 00 00 00 |eFromDriverxp...| |000000b0| 00 00 00 00 02 01 |...... 
| +--------+-------------------------------------------------+----------------+ 23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=9152635068362479627,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 182, cap: 16384)]] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] READ COMPLETE 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing broadcast 2 23/10/03 05:33:20 DEBUG BlockManager: Removing broadcast 2 23/10/03 05:33:20 DEBUG BlockManager: Removing block broadcast_2_piece0 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to acquire write lock for broadcast_2_piece0 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 acquired write lock for broadcast_2_piece0 23/10/03 05:33:20 DEBUG MemoryStore: Block broadcast_2_piece0 of size 9556 dropped from memory (free 51350724480) 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to remove block broadcast_2_piece0 23/10/03 05:33:20 TRACE TransportClient: Sending RPC to master/:33881 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 181] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] FLUSH 23/10/03 05:33:20 TRACE TransportClient: Sending request RPC 7156121186595844841 to master/:33881 took 1 ms 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] READ: 68B +-------------------------------------------------+ | 0 1 2 3 4 5 6 7 8 9 a b c d e f | +--------+-------------------------------------------------+----------------+ |00000000| 00 00 00 00 00 00 00 44 04 63 4f a6 4d f0 b1 32 |.......D.cO.M..2| |00000010| e9 00 00 00 2f ac ed 00 05 73 72 00 11 6a 61 76 |..../....sr..jav| |00000020| 61 2e 6c 61 6e 67 2e 42 6f 6f 6c 65 61 6e cd 20 |a.lang.Boolean. 
| |00000030| 72 80 d5 9c fa ee 02 00 01 5a 00 05 76 61 6c 75 |r........Z..valu| |00000040| 65 78 70 01 |exp. | +--------+-------------------------------------------------+----------------+ 23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcResponse: RpcResponse[requestId=7156121186595844841,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 68, cap: 8192)]] 23/10/03 05:33:20 DEBUG BlockManagerMaster: Updated info of block broadcast_2_piece0 23/10/03 05:33:20 DEBUG BlockManager: Told master about block broadcast_2_piece0 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] READ COMPLETE 23/10/03 05:33:20 DEBUG BlockManager: Removing block broadcast_2 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to acquire write lock for broadcast_2 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 acquired write lock for broadcast_2 23/10/03 05:33:20 DEBUG MemoryStore: Block broadcast_2 of size 19984 dropped from memory (free 51350744464) 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to remove block broadcast_2 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing broadcast 2, response is 0 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: 0 to master:33881 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 81] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] FLUSH 23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=9152635068362479627,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]]] to client master/:33881 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] READ: 154B +-------------------------------------------------+ | 0 1 2 3 4 5 6 7 8 9 a b c d e f | +--------+-------------------------------------------------+----------------+ |00000000| 00 00 00 00 00 
00 00 9a 03 47 c0 63 47 8c a2 90 |.........G.cG...| |00000010| ee 00 00 00 85 01 00 06 6d 61 73 74 65 72 00 00 |........master..| |00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage| |00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr| |00000040| 00 3b 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.;org.apache.spa| |00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block| |00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$| |00000070| 52 65 6d 6f 76 65 53 68 75 66 66 6c 65 ac e0 54 |RemoveShuffle..T| |00000080| fc 85 72 57 70 02 00 01 49 00 09 73 68 75 66 66 |..rWp...I..shuff| |00000090| 6c 65 49 64 78 70 00 00 00 01 |leIdxp.... | +--------+-------------------------------------------------+----------------+ 23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=5170241531174621422,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 154, cap: 8192)]] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] READ COMPLETE 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] READ: 182B +-------------------------------------------------+ | 0 1 2 3 4 5 6 7 8 9 a b c d e f | +--------+-------------------------------------------------+----------------+ |00000000| 00 00 00 00 00 00 00 b6 03 7e 26 0d 61 e3 f2 d3 |.........~&.a...| |00000010| 0b 00 00 00 a1 01 00 06 6d 61 73 74 65 72 00 00 |........master..| |00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage| |00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr| |00000040| 00 3d 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.=org.apache.spa| |00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block| |00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$| |00000070| 52 65 6d 6f 76 65 42 72 6f 61 64 63 61 73 74 42 |RemoveBroadcastB| |00000080| 
56 16 a3 29 3b 95 6d 02 00 02 4a 00 0b 62 72 6f |V..);.m...J..bro| |00000090| 61 64 63 61 73 74 49 64 5a 00 10 72 65 6d 6f 76 |adcastIdZ..remov| |000000a0| 65 46 72 6f 6d 44 72 69 76 65 72 78 70 00 00 00 |eFromDriverxp...| |000000b0| 00 00 00 00 04 01 |...... | +--------+-------------------------------------------------+----------------+ 23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=9089967611981255435,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 182, cap: 4096)]] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] READ COMPLETE 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing shuffle 1 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing broadcast 4 23/10/03 05:33:20 DEBUG BlockManager: Removing broadcast 4 23/10/03 05:33:20 INFO RapidsShuffleInternalManager: Unregistering shuffle 1 from shuffle buffer catalog 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing broadcast 4, response is 0 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: 0 to master:33881 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 81] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] FLUSH 23/10/03 05:33:20 WARN ShuffleBufferCatalog: Ignoring unregister of unknown shuffle 1 23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=9089967611981255435,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]]] to client master/:33881 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing shuffle 1, response is true 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: true to master:33881 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 47] 23/10/03 05:33:20 TRACE 
NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] FLUSH 23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=5170241531174621422,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=64]]] to client master/:33881 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] READ: 154B +-------------------------------------------------+ | 0 1 2 3 4 5 6 7 8 9 a b c d e f | +--------+-------------------------------------------------+----------------+ |00000000| 00 00 00 00 00 00 00 9a 03 7e 30 ec 87 fe e7 e9 |.........~0.....| |00000010| b1 00 00 00 85 01 00 06 6d 61 73 74 65 72 00 00 |........master..| |00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage| |00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr| |00000040| 00 3b 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.;org.apache.spa| |00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block| |00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$| |00000070| 52 65 6d 6f 76 65 53 68 75 66 66 6c 65 ac e0 54 |RemoveShuffle..T| |00000080| fc 85 72 57 70 02 00 01 49 00 09 73 68 75 66 66 |..rWp...I..shuff| |00000090| 6c 65 49 64 78 70 00 00 00 00 |leIdxp.... 
| +--------+-------------------------------------------------+----------------+ 23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=9093027716502383025,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 154, cap: 4096)]] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] READ COMPLETE 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing shuffle 0 23/10/03 05:33:20 INFO RapidsShuffleInternalManager: Unregistering shuffle 0 from shuffle buffer catalog 23/10/03 05:33:20 WARN ShuffleBufferCatalog: Ignoring unregister of unknown shuffle 0 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] READ: 182B +-------------------------------------------------+ | 0 1 2 3 4 5 6 7 8 9 a b c d e f | +--------+-------------------------------------------------+----------------+ |00000000| 00 00 00 00 00 00 00 b6 03 7e bc 9a a7 9c 90 c2 |.........~......| |00000010| ea 00 00 00 a1 01 00 06 6d 61 73 74 65 72 00 00 |........master..| |00000020| 84 59 00 00 15 42 6c 6f 63 6b 4d 61 6e 61 67 65 |.Y...BlockManage| |00000030| 72 45 6e 64 70 6f 69 6e 74 31 ac ed 00 05 73 72 |rEndpoint1....sr| |00000040| 00 3d 6f 72 67 2e 61 70 61 63 68 65 2e 73 70 61 |.=org.apache.spa| |00000050| 72 6b 2e 73 74 6f 72 61 67 65 2e 42 6c 6f 63 6b |rk.storage.Block| |00000060| 4d 61 6e 61 67 65 72 4d 65 73 73 61 67 65 73 24 |ManagerMessages$| |00000070| 52 65 6d 6f 76 65 42 72 6f 61 64 63 61 73 74 42 |RemoveBroadcastB| |00000080| 56 16 a3 29 3b 95 6d 02 00 02 4a 00 0b 62 72 6f |V..);.m...J..bro| |00000090| 61 64 63 61 73 74 49 64 5a 00 10 72 65 6d 6f 76 |adcastIdZ..remov| |000000a0| 65 46 72 6f 6d 44 72 69 76 65 72 78 70 00 00 00 |eFromDriverxp...| |000000b0| 00 00 00 00 03 01 |...... 
| +--------+-------------------------------------------------+----------------+ 23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcRequest: RpcRequest[requestId=9132344189077471978,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 182, cap: 2048)]] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] READ COMPLETE 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: removing broadcast 3 23/10/03 05:33:20 DEBUG BlockManager: Removing broadcast 3 23/10/03 05:33:20 DEBUG BlockManager: Removing block broadcast_3 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to acquire write lock for broadcast_3 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 acquired write lock for broadcast_3 23/10/03 05:33:20 DEBUG MemoryStore: Block broadcast_3 of size 19984 dropped from memory (free 51350764448) 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to remove block broadcast_3 23/10/03 05:33:20 DEBUG BlockManager: Removing block broadcast_3_piece0 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to acquire write lock for broadcast_3_piece0 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 acquired write lock for broadcast_3_piece0 23/10/03 05:33:20 DEBUG MemoryStore: Block broadcast_3_piece0 of size 9556 dropped from memory (free 51350774004) 23/10/03 05:33:20 TRACE BlockInfoManager: Task -1024 trying to remove block broadcast_3_piece0 23/10/03 05:33:20 TRACE TransportClient: Sending RPC to master/:33881 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 181] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] FLUSH 23/10/03 05:33:20 TRACE TransportClient: Sending request RPC 9140343881297110183 to master/:33881 took 1 ms 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] READ: 68B +-------------------------------------------------+ | 0 1 
2 3 4 5 6 7 8 9 a b c d e f | +--------+-------------------------------------------------+----------------+ |00000000| 00 00 00 00 00 00 00 44 04 7e d9 06 55 19 a1 24 |.......D.~..U..$| |00000010| a7 00 00 00 2f ac ed 00 05 73 72 00 11 6a 61 76 |..../....sr..jav| |00000020| 61 2e 6c 61 6e 67 2e 42 6f 6f 6c 65 61 6e cd 20 |a.lang.Boolean. | |00000030| 72 80 d5 9c fa ee 02 00 01 5a 00 05 76 61 6c 75 |r........Z..valu| |00000040| 65 78 70 01 |exp. | +--------+-------------------------------------------------+----------------+ 23/10/03 05:33:20 TRACE MessageDecoder: Received message RpcResponse: RpcResponse[requestId=9140343881297110183,body=NettyManagedBuffer[buf=PooledUnsafeDirectByteBuf(ridx: 21, widx: 68, cap: 2048)]] 23/10/03 05:33:20 DEBUG BlockManagerMaster: Updated info of block broadcast_3_piece0 23/10/03 05:33:20 DEBUG BlockManager: Told master about block broadcast_3_piece0 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing shuffle 0, response is true 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] READ COMPLETE 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: true to master:33881 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Done removing broadcast 3, response is 0 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 47] 23/10/03 05:33:20 DEBUG BlockManagerStorageEndpoint: Sent response: 0 to master:33881 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] FLUSH 23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=9093027716502383025,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=64]]] to client master/:33881 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - R:master/:33881] WRITE: MessageWithHeader [headerLength: 21, bodyLength: 81] 23/10/03 05:33:20 TRACE NettyLogger: [id: 0x0b71ce21, L:/:35324 - 
R:master/:33881] FLUSH 23/10/03 05:33:20 TRACE TransportRequestHandler: Sent result RpcResponse[requestId=9132344189077471978,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=128]]] to client master/:33881 23/10/03 05:34:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696278608670, has been waiting for 240 seconds, current barrier epoch is 0. 23/10/03 05:35:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696278608670, has been waiting for 300 seconds, current barrier epoch is 0. 23/10/03 05:36:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696278608670, has been waiting for 360 seconds, current barrier epoch is 0. 23/10/03 05:37:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696278608670, has been waiting for 420 seconds, current barrier epoch is 0. 23/10/03 05:38:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696278608670, has been waiting for 480 seconds, current barrier epoch is 0. 23/10/03 05:39:08 INFO BarrierTaskContext: Task 578 from Stage 9(Attempt 0) waiting under the global sync since 1696278608670, has been waiting for 540 seconds, current barrier epoch is 0. ```

For some reason, the cuML context is only initialized on the running/killed executors.

an-ys commented 11 months ago

Hmm, the examples worked after increasing the size of the dataset, even with UCX. I'm not sure why my application, which uses a large dataset, did not work then; it might be a different issue. The problem with my application is that the status stays at "0/8" during the barrier stage, and I keep getting messages that there are zero slots available. I tried repartitioning to a smaller number of partitions, but it didn't help. I will post another comment once I get the application fixed, because it has stopped working altogether.
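The "zero slots available" message usually means the resource math doesn't add up for barrier scheduling, which requires a free slot for every task at once. A rough sketch of the slot calculation (simplified; real Spark scheduling also accounts for resource profiles and locality, and the names here are illustrative):

```python
def barrier_slots(executors, cores_per_exec, gpus_per_exec,
                  task_cpus=1, task_gpus=1):
    # A barrier stage is only scheduled when every one of its tasks can get
    # a slot at the same time; slots are capped by whichever resource
    # (CPU cores or GPUs) runs out first.
    cpu_slots = executors * (cores_per_exec // task_cpus)
    gpu_slots = executors * int(gpus_per_exec // task_gpus)
    return min(cpu_slots, gpu_slots)

# The cluster in this thread: 8 executors with 6 cores and 1 GPU each.
print(barrier_slots(8, 6, 1))  # -> 8, enough for an 8-task barrier stage
```

If the barrier stage has more tasks than this number, it cannot launch at all, which matches a stage sitting at "0/8" with no progress.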

eordentlich commented 11 months ago

Thanks for the additional updates, and glad there are some signs of it working. Looking back at your previous executor logs, it is actually the executors without the "Initializing cuml context" log statements that are problematic. Somehow they are completely bypassing execution of core spark-rapids-ml code while still completing their respective barrier tasks. I don't see how that could happen, and it would be very interesting for us to reproduce if possible.

When you get back to this, please also share the worker(s) and master startup configs. Looks like spark standalone mode.

an-ys commented 11 months ago

Sorry for the late reply. Yes, I am using Spark Standalone mode.

I have the following "spark-env.sh" on each node:

spark-env.sh

```
JAVA_HOME=/usr/lib/jvm/java-21-openjdk-amd64/  # Just updated this today
HADOOP_CONF_DIR=/home///fr/hadoop-3.3/etc/hadoop
SPARK_LOCAL_IP=master
unset SPARK_REMOTE
SPARK_MASTER_HOST=master
SPARK_WORKER_OPTS="-Dspark.worker.resource.gpu.amount=4 -Dspark.worker.resource.gpu.discoveryScript=/usr/lib/spark/scripts/gpu/getGpusResources.sh"
export PYSPARK_PYTHON=/home//mambaforge/envs/$CONDA_DEFAULT_ENV/bin/python3
export PYSPARK_DRIVER_PYTHON=/home//mambaforge/envs/$CONDA_DEFAULT_ENV/bin/python3
```

The initial configuration for each node is almost identical to my app's spark.conf:

spark-defaults.sh

```
spark.master                      spark://master:7077
spark.sql.adaptive.enabled        true
spark.executor.instances          8
spark.executor.cores              6
spark.default.parallelism         24
spark.sql.shuffle.partitions      24
spark.executor.memory             80G
spark.driver.memory               80G
spark.executor.memoryOverhead     1G
spark.sql.files.maxPartitionBytes 512m
spark.driver.maxResultSize        30G
spark.eventLog.enabled            true
spark.eventLog.dir                file:///tmp/spark-events
spark.history.provider            org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory     file:///tmp/spark-events
spark.history.fs.update.interval  10s
spark.history.ui.port             18080
spark.logConf                     true
spark.executor.heartbeatInterval  10000s
spark.network.timeout             10001s
spark.sql.broadcastTimeout        10000
```

I am not sure if it is intended behaviour to hang when the dataset is too small to reach every executor. I tried updating Spark RAPIDS ML to the newest version containing the #464 commit, but it still does not work. As mentioned before, if I use the LinReg example as-is from the Python README.md with `df = spark.createDataFrame([(1.0, 2.0, Vectors.dense(1.0, 0.0)), (0.0, 2.0, Vectors.dense(0.0, 1.0))], ["label", "weight", "features"])`, it hangs, but with `df = spark.createDataFrame([(1.0, 2.0, Vectors.dense(1.0, 0.0)), (0.0, 2.0, Vectors.dense(0.0, 1.0))]*4, ["label", "weight", "features"])`, the application runs.
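The hang with a 2-row DataFrame is consistent with empty input partitions: with fewer rows than barrier tasks, some tasks necessarily receive no data. A toy illustration of the pigeonhole effect (the helper below is illustrative; Spark's actual row-to-partition assignment differs):

```python
def rows_per_partition(n_rows, n_partitions):
    # Spread rows as evenly as possible; the point is purely arithmetic:
    # fewer rows than partitions guarantees some empty partitions.
    base, extra = divmod(n_rows, n_partitions)
    return [base + (1 if i < extra else 0) for i in range(n_partitions)]

print(rows_per_partition(2, 8))   # -> [1, 1, 0, 0, 0, 0, 0, 0]: 6 empty
print(rows_per_partition(8, 8))   # -> [1, 1, 1, 1, 1, 1, 1, 1]: none empty
```

This matches the observation that multiplying the toy dataset by 4 (8 rows for 8 tasks) makes the example run.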

For my applications, I noticed that I can get things to work now by using "multi-threaded mode" for the shuffle manager instead of UCX. Also, I am not sure if this is related, but I get java.io.InvalidClassException: com.nvidia.spark.rapids.GpuCast; local class incompatible: stream classdesc serialVersionUID = -3792917713274764821, local class serialVersionUID = 2642199456390263877 whenever I call `df.count()`, even before upgrading from JDK 17 to JDK 21 and with both nodes updated to the same commit of Spark RAPIDS. The application runs if I remove the `df.count()` line, so I don't think the barrier issue is related to this, but it does suggest that something is wrong with my RAPIDS environment.

Anyway, I noticed that when I run my PCA application with a small dataset (8.1 KB per hdfs dfs -du -s -h, though the input size reported in the Spark UI is much smaller), the application hangs at the barrier stage. With two larger datasets (97.9 KB and 7.7 MB), it gets past the barrier. However, the barrier stage sits at "(0 + 8) / 8" for about 25 s with the 97.9 KB dataset and about 15 s with the 7.7 MB dataset, and then the stage suddenly ends without any errors.

eordentlich commented 11 months ago

> I am not sure if it is intended behaviour to hang when the dataset is too small to be in every executor. I tried updating Spark RAPIDS ML to the newest version containing the #464 commit, but it still does not work. As mentioned before, if I used the LinReg example as is on the Python README.md with df as df = spark.createDataFrame([ (1.0, 2.0, Vectors.dense(1.0, 0.0)), (0.0, 2.0, Vectors.dense(0.0, 1.0))], ["label", "weight", "features"]), it hangs, but when I did df = spark.createDataFrame([ (1.0, 2.0, Vectors.dense(1.0, 0.0)), (0.0, 2.0, Vectors.dense(0.0, 1.0))]*4, ["label", "weight", "features"]), the application runs.

Currently the expected/intended behavior with empty partitions during training is for the tasks receiving no data to raise exceptions, which should fail the whole barrier stage doing the training. It is strange, and would be a bug, that this doesn't seem to be happening in your case; I'm not able to reproduce it for some reason. I'm also not able to reproduce barrier tasks not logging anything spark-rapids-ml related before exiting. That would mean the spark-rapids-ml UDF is not being invoked at all for that task's partition. I've attempted to test whether Spark might have an optimization that avoids invoking mapInPandas on partitions with no data, but so far I'm not able to trigger this on toy examples, if it is even the case.
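The intended behavior described above can be sketched as a guard inside the per-partition training function. The names below are illustrative, not the actual spark-rapids-ml code:

```python
def fit_partition(batches):
    # Sketch of the intended guard: a barrier task that receives no data
    # should fail loudly rather than silently completing the barrier.
    got_data = False
    for batch in batches:
        got_data = True
        # ... in the real code, this is where the batch would be fed to cuML ...
    if not got_data:
        raise RuntimeError("empty partition in barrier training stage")
    return "fitted"

print(fit_partition([[1.0, 2.0]]))  # -> fitted
try:
    fit_partition([])
except RuntimeError as e:
    print("raised:", e)
```

A raised exception in any barrier task fails the whole stage, which is the desired failure mode for empty partitions rather than a hang.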

> Also, I am not sure if this is related, but I am getting java.io.InvalidClassException: com.nvidia.spark.rapids.GpuCast; local class incompatible: stream classdesc serialVersionUID = -3792917713274764821, local class serialVersionUID = 2642199456390263877 if I call "df.count()" even before upgrading from JDK 17 to JDK 21 and having both nodes be updated at the same commit for Spark RAPIDS. The application can run if I remove the "df.count()" line, so I don't think the barrier issue is related to this, but it does imply that something seems to be wrong with my RAPIDS environment.

The spark-rapids plugin recommends JDK8. See https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#apache-spark-setup-for-gpu. @abellina is that still the case? In any case, it looks like you might be running into different JDK versions being invoked in different places within the application run.

> However, I noticed that the barrier stage gets stuck at "(0 + 8) / 8" for about 25s for the 97.9KB dataset and about 15s for the 7.7MB dataset, and then the stage suddenly ends without any errors.

This is actually normal behavior. The notation means that 8 barrier tasks are running, and it is during this time that the GPUs carry out the distributed computation and communicate directly with each other. If it is stuck at, say, "(6 + 2) / 8", then 6 barrier tasks exited for some reason without syncing and two are hanging. That is problematic and would be a bug, even with empty data partitions, as mentioned above.

abellina commented 11 months ago

> Also, I am not sure if this is related, but I am getting java.io.InvalidClassException: com.nvidia.spark.rapids.GpuCast; local class incompatible: stream classdesc serialVersionUID = -3792917713274764821, local class serialVersionUID = 2642199456390263877 if I call "df.count()" even before upgrading from JDK 17 to JDK 21 and having both nodes be updated at the same commit for Spark RAPIDS. The application can run if I remove the "df.count()" line, so I don't think the barrier issue is related to this, but it does imply that something seems to be wrong with my RAPIDS environment.

> The spark-rapids plugin recommends JDK8. See https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#apache-spark-setup-for-gpu. @abellina is that still the case? In any case, looks like you might be running into different JDK versions being invoked in different places within the application run.

The spark-rapids plugin is tested with JDK 8, 11, and 17.

I think the likely issue here is that the executors are seeing a different java than the driver is. Make sure the spark-rapids jar is exactly the same in all places.
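One quick way to follow this advice is to fingerprint the jar bytes on each node and compare digests; a serialVersionUID mismatch like the one above usually means the driver and an executor loaded different builds of the same class. The path in the comment is a placeholder:

```python
import hashlib

def jar_fingerprint(path):
    # Hash the exact bytes of the jar. The driver and every executor node
    # should report the same digest; a difference can surface later as
    # java.io.InvalidClassException / serialVersionUID mismatches.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Run on each node, e.g. (placeholder path):
# print(jar_fingerprint("/path/to/rapids-4-spark_2.12-23.12.0-SNAPSHOT-cuda12.jar"))
```

Any two nodes reporting different digests for the plugin jar (or different `java -version` output) are worth fixing before debugging further.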

an-ys commented 10 months ago

Thanks for your replies. I downgraded my JDK to 17, since I had compiled Spark RAPIDS with JDK 17 (it uses the Security Manager), and I was also facing a "Found an unexpected context classloader" issue when using Scala Spark.

After downgrading to JDK 17 and updating RAPIDS to 23.12.00a, I am not facing "java.io.InvalidClassException" anymore on PySpark.

For the original issue, my guess is that the problem comes from building Spark RAPIDS incorrectly. I tried the JARs from both mvn verify and mvn install, but I am using mvn install for now since I cannot get mvn verify to work (Maven tries to look up the snapshot version of spark-rapids-jni in a repository, and I have not built it yet). I also noticed that I can run the PySpark applications, but I cannot run spark-shell because it cannot find the OptimizerPlugin. I am not sure why it works for PySpark but no longer on Scala Spark, or whether this is related to the PySpark issue, since I do not face any other problems on PySpark aside from the empty partitions.

Full Error Log from launching spark-shell ``` Spark Executor Command: "/usr/lib/jvm/java-17-openjdk-amd64//bin/java" "-cp" "/home//.m2/repository/com/nvidia/rapids-4-spark_2.12/23.12.0-SNAPSHOT/rapids-4-spark_2.12-23.12.0-SNAPSHOT-cuda12.jar:/home//.m2/repository/org/apache/mahout/mahout-core/0.9/mahout-core-0.9.jar:/home//.m2/repository/org/apache/mahout/mahout-math/0.9/mahout-math-0.9.jar::/home//fr/spark-3.5.0/conf/:/home//fr/spark-3.5/assembly/target/scala-2.12/jars/*:/home//fr/hadoop-3.3/etc/hadoop/" "-Xmx81920M" "-Dspark.network.timeout=10001s" "-Dspark.history.ui.port=18080" "-Dspark.driver.port=40191" "-Djava.net.preferIPv6Addresses=false" "-XX:+IgnoreUnrecognizedVMOptions" "--add-opens=java.base/java.lang=ALL-UNNAMED" "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED" "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED" "--add-opens=java.base/java.io=ALL-UNNAMED" "--add-opens=java.base/java.net=ALL-UNNAMED" "--add-opens=java.base/java.nio=ALL-UNNAMED" "--add-opens=java.base/java.util=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED" "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED" "--add-opens=java.base/sun.security.action=ALL-UNNAMED" "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED" "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" "-Djdk.reflect.useDirectMethodHandle=false" "-Duser.timezone=UTC -Dai.rapids.cudf.nvtx.enabled=true -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:-UseGCOverheadLimit" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@master:40191" "--executor-id" "0" "--hostname" "" "--cores" "6" "--app-id" "app-20231020180216-0010" "--worker-url" "spark://Worker@:39051" "--resourceProfileId" "0" "--resourcesFile" "/home//fr/spark-3.5/work/app-20231020180216-0010/0/resource-executor-5513666133497122425.json" 
======================================== 23/10/20 18:02:18 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 61111@titanv3 23/10/20 18:02:18 INFO SignalUtils: Registering signal handler for TERM 23/10/20 18:02:18 INFO SignalUtils: Registering signal handler for HUP 23/10/20 18:02:18 INFO SignalUtils: Registering signal handler for INT 23/10/20 18:02:18 INFO SecurityManager: Changing view acls to: 23/10/20 18:02:18 INFO SecurityManager: Changing modify acls to: 23/10/20 18:02:18 INFO SecurityManager: Changing view acls groups to: 23/10/20 18:02:18 INFO SecurityManager: Changing modify acls groups to: 23/10/20 18:02:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: ; groups with view permissions: EMPTY; users with modify permissions: ; groups with modify permissions: EMPTY 23/10/20 18:02:19 INFO TransportClientFactory: Successfully created connection to master/:40191 after 93 ms (0 ms spent in bootstraps) 23/10/20 18:02:19 INFO SecurityManager: Changing view acls to: 23/10/20 18:02:19 INFO SecurityManager: Changing modify acls to: 23/10/20 18:02:19 INFO SecurityManager: Changing view acls groups to: 23/10/20 18:02:19 INFO SecurityManager: Changing modify acls groups to: 23/10/20 18:02:19 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: ; groups with view permissions: EMPTY; users with modify permissions: ; groups with modify permissions: EMPTY 23/10/20 18:02:19 INFO TransportClientFactory: Successfully created connection to master/:40191 after 2 ms (0 ms spent in bootstraps) 23/10/20 18:02:19 INFO DiskBlockManager: Created local directory at /tmp/spark-15c1c3d7-e779-414a-a3db-f397a5550c73/executor-ce33b3e1-b1ee-4ec2-b7e5-d2069a9e9731/blockmgr-4502e0c2-bf08-48d8-994d-e639c75137f8 23/10/20 18:02:19 INFO MemoryStore: MemoryStore started with capacity 47.8 GiB 23/10/20 18:02:20 INFO CoarseGrainedExecutorBackend: Connecting to 
driver: spark://CoarseGrainedScheduler@master:40191 23/10/20 18:02:20 INFO WorkerWatcher: Connecting to worker spark://Worker@:39051 23/10/20 18:02:20 INFO TransportClientFactory: Successfully created connection to /:39051 after 3 ms (0 ms spent in bootstraps) 23/10/20 18:02:20 INFO WorkerWatcher: Successfully connected to spark://Worker@:39051 23/10/20 18:02:20 INFO ResourceUtils: ============================================================== 23/10/20 18:02:20 INFO ResourceUtils: Custom resources for spark.executor: gpu -> [name: gpu, addresses: 0] 23/10/20 18:02:20 INFO ResourceUtils: ============================================================== 23/10/20 18:02:20 INFO CoarseGrainedExecutorBackend: Successfully registered with driver 23/10/20 18:02:20 INFO Executor: Starting executor ID 0 on host 23/10/20 18:02:20 INFO Executor: OS info Linux, 6.5.0-9-generic, amd64 23/10/20 18:02:20 INFO Executor: Java version 17.0.9-ea 23/10/20 18:02:20 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 42907. 23/10/20 18:02:20 INFO NettyBlockTransferService: Server created on :42907 23/10/20 18:02:20 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 23/10/20 18:02:20 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(0, , 42907, None) 23/10/20 18:02:20 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(0, , 42907, None) 23/10/20 18:02:20 INFO BlockManager: Initialized BlockManager: BlockManagerId(0, , 42907, None) 23/10/20 18:02:20 INFO Executor: Starting executor with user classpath (userClassPathFirst = false): '' 23/10/20 18:02:20 INFO Executor: Using REPL class URI: spark://master:40191/classes 23/10/20 18:02:20 INFO Executor: Created or updated repl class loader org.apache.spark.executor.ExecutorClassLoader@61296497 for default. 
23/10/20 18:02:20 INFO ShimLoader: Loading shim for Spark version: 3.5.0 23/10/20 18:02:20 INFO ShimLoader: Complete Spark build info: 3.5.0, , , , 2023-10-19T10:35:55Z 23/10/20 18:02:20 INFO ShimLoader: findURLClassLoader hit the Boostrap classloader org.apache.spark.executor.ExecutorClassLoader@61296497, failed to find a mutable classloader! 23/10/20 18:02:20 WARN ShimLoader: Found an unexpected context classloader org.apache.spark.executor.ExecutorClassLoader@61296497. We will try to recover from this, but it may cause class loading problems. 23/10/20 18:02:37 INFO RapidsPluginUtils: RAPIDS Accelerator build: {date=2023-10-20T08:25:10Z, cudf_version=23.12.0-SNAPSHOT, version=23.12.0-SNAPSHOT, user=, branch=branch-23.12, url=https://github.com/NVIDIA/spark-rapids.git, revision=fba1a4b199230fd56a62cc436e08cf3af8415c20} 23/10/20 18:02:37 INFO RapidsPluginUtils: RAPIDS Accelerator JNI build: {date=2023-10-19T06:58:21Z, version=23.12.0-SNAPSHOT, user=, branch=HEAD, url=https://github.com/NVIDIA/spark-rapids-jni.git, revision=b511a5900604ec000b5cdfb00a0a7516847f79a7} 23/10/20 18:02:37 INFO RapidsPluginUtils: cudf build: {date=2023-10-19T06:58:21Z, version=23.12.0-SNAPSHOT, user=, branch=HEAD, url=https://github.com/rapidsai/cudf.git, revision=7aa757959b5b597a0258c955fbeadb92cfb2e762} 23/10/20 18:02:37 WARN RapidsPluginUtils: RAPIDS Accelerator 23.12.0-SNAPSHOT using cudf 23.12.0-SNAPSHOT. 23/10/20 18:02:37 INFO TransportClientFactory: Successfully created connection to master/:40191 after 2 ms (0 ms spent in bootstraps) 23/10/20 18:02:37 ERROR RapidsExecutorPlugin: Exception in the executor plugin, shutting down! 
java.lang.ExceptionInInitializerError
    at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:363)
    at org.apache.spark.internal.plugin.ExecutorPluginContainer.$anonfun$executorPlugins$1(PluginContainer.scala:125)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
    at org.apache.spark.internal.plugin.ExecutorPluginContainer.<init>(PluginContainer.scala:113)
    at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:211)
    at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:199)
    at org.apache.spark.executor.Executor.$anonfun$plugins$1(Executor.scala:337)
    at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:178)
    at org.apache.spark.executor.Executor.<init>(Executor.scala:337)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:174)
    at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
    at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.ClassNotFoundException: com.nvidia.spark.rapids.OptimizerPlugin
    at org.apache.spark.executor.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:124)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
    at java.base/java.lang.Class.forName0(Native Method)
    at java.base/java.lang.Class.forName(Class.java:467)
    at org.apache.spark.util.SparkClassUtils.classForName(SparkClassUtils.scala:41)
    at org.apache.spark.util.SparkClassUtils.classForName$(SparkClassUtils.scala:36)
    at org.apache.spark.util.Utils$.classForName(Utils.scala:94)
    at org.apache.spark.sql.rapids.execution.TrampolineUtil$.classForName(TrampolineUtil.scala:171)
    at com.nvidia.spark.rapids.RapidsPluginUtils$.$anonfun$loadExtensions$1(Plugin.scala:235)
    at scala.collection.immutable.Stream.flatMap(Stream.scala:489)
    at com.nvidia.spark.rapids.RapidsPluginUtils$.loadExtensions(Plugin.scala:233)
    at com.nvidia.spark.rapids.RapidsPluginUtils$.getExtraPlugins(Plugin.scala:272)
    at com.nvidia.spark.rapids.RapidsPluginUtils$.<init>(Plugin.scala:85)
    at com.nvidia.spark.rapids.RapidsPluginUtils$.<clinit>(Plugin.scala)
    ... 24 more
Caused by: java.lang.ClassNotFoundException: com.nvidia.spark.rapids.OptimizerPlugin
    at java.base/java.lang.ClassLoader.findClass(ClassLoader.java:723)
    at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.java:35)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
    at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.java:40)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
    at org.apache.spark.executor.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:109)
    ... 38 more
23/10/20 18:02:37 INFO MemoryStore: MemoryStore cleared
23/10/20 18:02:37 INFO BlockManager: BlockManager stopped
23/10/20 18:03:07 WARN ShutdownHookManager: ShutdownHook '' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
    at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:204)
    at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
23/10/20 18:03:07 ERROR Utils: Uncaught exception in thread shutdown-hook-0
java.lang.InterruptedException
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679)
    at java.base/java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1464)
    at org.apache.spark.rpc.netty.MessageLoop.stop(MessageLoop.scala:60)
    at org.apache.spark.rpc.netty.Dispatcher.$anonfun$stop$1(Dispatcher.scala:199)
    at org.apache.spark.rpc.netty.Dispatcher.$anonfun$stop$1$adapted(Dispatcher.scala:196)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at org.apache.spark.rpc.netty.Dispatcher.stop(Dispatcher.scala:196)
    at org.apache.spark.rpc.netty.NettyRpcEnv.cleanup(NettyRpcEnv.scala:331)
    at org.apache.spark.rpc.netty.NettyRpcEnv.shutdown(NettyRpcEnv.scala:309)
    at org.apache.spark.SparkEnv.stop(SparkEnv.scala:99)
    at org.apache.spark.executor.Executor.stop(Executor.scala:443)
    at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:90)
    at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
    at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1928)
    at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.util.Try$.apply(Try.scala:213)
    at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
```
abellina commented 10 months ago

@an-ys Thanks for the report; we are looking into it (https://github.com/NVIDIA/spark-rapids/issues/9498). It appears to be an issue with our Spark 3.5.0 shim that is specific to spark-shell (the pyspark shell and spark-submit don't exhibit this behavior).

Please note that we don't recommend using 23.12 unless you are testing a cutting-edge feature, as it hasn't been released yet. 23.10 isn't fully released yet either.

If you want to try building it yourself, one option is to set `-DallowConventionalDistJar=true`. This will sidestep the issue while we get it fixed. I confirmed that adding this flag to my mvn command built a jar that was loaded successfully by Spark 3.5.0 (`mvn package -Dbuildver=350 -DskipTests -DallowConventionalDistJar=true`).
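The workaround above as a runnable fragment (the build flags come straight from the comment; the checkout directory and the dist jar path are assumptions that depend on your clone and the version you build):

```shell
# Assumed: you are in a clone of NVIDIA/spark-rapids.
# Build a single conventional (non-shimmed) dist jar for Spark 3.5.0,
# sidestepping the shim class-loading failure on spark-shell:
mvn package -Dbuildver=350 -DskipTests -DallowConventionalDistJar=true

# Then point Spark at the resulting jar, e.g. (exact jar name will vary):
# spark-submit --jars dist/target/rapids-4-spark_2.12-<version>.jar ...
```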

eordentlich commented 10 months ago

@an-ys The original hang is caused by a combination of empty partitions and an optimization in how the spark-rapids ETL plugin handles mapInPandas, which differs from baseline Spark (the configuration we had used to test empty partitions). See https://github.com/NVIDIA/spark-rapids/issues/9480. The spark-rapids version of mapInPandas does not execute the UDF on empty partitions, so those tasks never enter the spark-rapids-ml barrier, and the tasks that do have data are left hanging in it. Note that even after that issue is resolved so that spark-rapids mapInPandas matches baseline Spark's behavior on empty partitions (which would avoid the hang), spark-rapids-ml would currently still raise an exception when it encounters empty partitions.
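The failure mode described above can be sketched in plain Python. This is a hypothetical simulation using `threading.Barrier`, not spark-rapids-ml code: the barrier is sized for all tasks in the stage, but tasks with empty partitions never call `wait()` (mirroring the UDF never running on empty partitions), so the tasks that do have data block until the barrier breaks.

```python
import threading

def run_partitions(partition_sizes, timeout=1.0):
    """Simulate a barrier stage where tasks with empty partitions skip
    the barrier, stranding the tasks that have data."""
    # The barrier expects EVERY task to arrive, including empty ones.
    barrier = threading.Barrier(len(partition_sizes))
    results = {}

    def task(idx, n_rows):
        if n_rows == 0:
            # UDF is never invoked on an empty partition, so this task
            # returns without ever entering the barrier.
            results[idx] = "skipped"
            return
        try:
            # Blocks until all parties arrive; times out because the
            # empty-partition tasks never show up.
            barrier.wait(timeout=timeout)
            results[idx] = "done"
        except threading.BrokenBarrierError:
            results[idx] = "stuck"

    threads = [threading.Thread(target=task, args=(i, n))
               for i, n in enumerate(partition_sizes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# With an empty partition present, the non-empty tasks end up "stuck";
# with no empty partitions, every task passes the barrier.
print(run_partitions([10, 0, 5], timeout=0.5))
print(run_partitions([3, 3]))
```

In the real system there is no timeout, so "stuck" corresponds to tasks hanging indefinitely, which matches the 6+2/8, 4+4/8, and 5+3/8 progress reported above.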