@an-ys
I am sorry your application isn't making progress. I'll chime in from the spark-rapids side, specifically around the shuffle (for now).
I see that you have configured UCX shuffle. Do you have RDMA-capable networking? Or, do you have several GPUs in one box? If you don't have these things, then we recommend the MULTITHREADED shuffle mode (default). Note also that these configs are only for the MULTITHREADED shuffle:
spark.rapids.shuffle.multiThreaded.reader.threads 24
spark.rapids.shuffle.multiThreaded.writer.threads 24
I would start by removing the `RapidsShuffleManager` to see if that unblocks your job. If it works there, then it's likely either something to do with the UCX shuffle trying to work with your hardware/OS, or a bug in it.
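For example, something like this minimal session setup (a sketch using only configs already mentioned in this thread; adjust for your build and cluster):

```python
from pyspark.sql import SparkSession

# Step 1: run with Spark's built-in shuffle (no RapidsShuffleManager) to
# see if the hang goes away. Step 2 (commented out): re-enable the RAPIDS
# shuffle in MULTITHREADED mode, which needs no RDMA/UCX hardware.
spark = (
    SparkSession.builder
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    # .config("spark.shuffle.manager",
    #         "com.nvidia.spark.rapids.spark350.RapidsShuffleManager")
    # .config("spark.rapids.shuffle.mode", "MULTITHREADED")
    # .config("spark.rapids.shuffle.multiThreaded.reader.threads", "24")
    # .config("spark.rapids.shuffle.multiThreaded.writer.threads", "24")
    .getOrCreate()
)
```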
Also, if you do get another hang, getting a jstack of the executors is useful. You can access a stack trace under the Spark UI in the Executors tab (click on "Thread Dump").
Also, looks like you may have pasted the successful executor log twice as the logs look identical. Would be great to see the bad one when you get a chance.
Thanks for your reply. I have fixed the executor logs.
I don't remember setting up RDMA, but the RDMA packages are installed and I can run `ucx_perftest` with the example from the docs: `CUDA_VISIBLE_DEVICES=3 ucx_perftest -t tag_bw -s 10000000 -n 1000 -m cuda`. (I don't think it's related, but when I tried disabling `UCX_IB_GPU_DIRECT_RDMA` for `ucx_perftest`, I got `unused environment variable: UCX_IB_GPU_DIRECT_RDMA (maybe: UCX_IB_GPU_DIRECT_RDMA?)`.) The nodes are connected via a switch and regular Ethernet (no InfiniBand).
~~Removing the multithreaded configs gave me `java.lang.IllegalStateException: The ShuffleBufferCatalog is not initialized but the RapidsShuffleManager is configured`. When I switched from UCX mode to MULTITHREADED, I stopped getting that error, but it still hangs like before.~~ When I changed back to UCX again later on, UCX worked without the multithreaded configs.
I tried running the LinReg and KMeans examples again with an updated configuration where I commented out most of the shuffle-related settings. Here is the new configuration, without the configs for the history server, Python/Java, and timeout values.
The KMeans example has no executors that were killed, but four executors, all on the same node, hang. Also, there are two pending stages: stage 7 with 192 tasks and stage 8 with 1 task, both with the same description (javaToPython at NativeMethodAccessorImpl.java:0).
The status for stage 9 (the barrier) is "(4 + 4) / 8" on the CLI, but "3/8 (5 running)" on the Spark UI. It's the same for linear regression, with stages 7 and 8 pending and stage 9 active at "(6 + 2) / 8". The executors were not killed, so there is no thread dump. Interestingly enough, the executors that hang always have the same indices on the Spark UI (2 and 3).
I thought the executors were not killed because UCX was not in use, but when I used UCX again, the two executors were still not killed. I am not sure if that is because I updated the RAPIDS packages on both servers before attempting the runs again. For linear regression, the two executors failed on the other node this time, which did not happen before, and that node has its log4j level set to TRACE, so there is more information.
For some reason, the cuML context is only initialized on the running/killed executors.
Hmm, so the examples worked after increasing the size of the dataset, even for UCX. I'm not sure why my application, which uses a large dataset, did not work then; it might be a different issue. The problem with my application is that the status is "0/8" during the barrier stage, and I keep getting messages that there are zero slots available. I tried repartitioning to a smaller number of tasks, but it didn't help. I will send another comment once I get the application fixed, because it stopped working altogether.
Thanks for the additional updates, and glad there are some signs of it working. Looking back at your previous executor logs, it is actually the executors without the `Initializing cuml context` log statements that are problematic. Somehow they are completely bypassing execution of core spark-rapids-ml code and completing their respective barrier tasks. I don't see how that could happen, and it would be very interesting for us to reproduce if possible.
When you get back to this, please also share the worker(s) and master startup configs. Looks like spark standalone mode.
Sorry for the late reply. Yes, I am using Spark Standalone mode.
I have the following "spark-env.sh" on each node:
The initial configuration for each node is almost identical to my app's spark.conf:
For my applications, I noticed that I can get it to work now when using multithreaded mode for the shuffle manager instead of UCX. Also, I am not sure if this is related, but I am getting `java.io.InvalidClassException: com.nvidia.spark.rapids.GpuCast; local class incompatible: stream classdesc serialVersionUID = -3792917713274764821, local class serialVersionUID = 2642199456390263877` if I call `df.count()`, even before upgrading from JDK 17 to JDK 21 and with both nodes updated to the same commit of Spark RAPIDS. The application can run if I remove the `df.count()` line, so I don't think the barrier issue is related to this, but it does imply that something seems to be wrong with my RAPIDS environment.
Anyway, I noticed that when I run my PCA application with a small dataset (8.1KB per `hdfs dfs -du -s -h`, though the input size shown in the Spark UI is much smaller), the application hangs at the barrier stage. When I used two datasets that are larger than the first one (97.9KB and 7.7MB), it got past the barrier. However, I noticed that the barrier stage gets stuck at "(0 + 8) / 8" for about 25s for the 97.9KB dataset and about 15s for the 7.7MB dataset, and then the stage suddenly ends without any errors.
I am not sure if it is intended behaviour to hang when the dataset is too small to have data on every executor. I tried updating Spark RAPIDS ML to the newest version containing the #464 commit, but it still does not work. As mentioned before, if I use the LinReg example as-is from the Python README.md with `df` as `df = spark.createDataFrame([(1.0, 2.0, Vectors.dense(1.0, 0.0)), (0.0, 2.0, Vectors.dense(0.0, 1.0))], ["label", "weight", "features"])`, it hangs, but with `df = spark.createDataFrame([(1.0, 2.0, Vectors.dense(1.0, 0.0)), (0.0, 2.0, Vectors.dense(0.0, 1.0))] * 4, ["label", "weight", "features"])`, the application runs.
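In other words, padding the toy input so every barrier task gets at least one row makes it run (a sketch of that workaround, assuming a live `spark` session and this cluster's 8 GPUs):

```python
from pyspark.ml.linalg import Vectors

base = [
    (1.0, 2.0, Vectors.dense(1.0, 0.0)),
    (0.0, 2.0, Vectors.dense(0.0, 1.0)),
]
num_barrier_tasks = 8  # total GPUs across the cluster in this setup

# Replicate the toy rows so every barrier task can receive at least one
# row; with only 2 rows and 8 tasks, at least 6 partitions end up empty
# and the stage hangs as described above. base * 4 gives 8 rows here.
rows = base * ((num_barrier_tasks + len(base) - 1) // len(base))
df = spark.createDataFrame(rows, ["label", "weight", "features"])
```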
Currently, the expected/intended behavior with empty partitions during training is for the tasks receiving no data to raise exceptions, which should fail the whole barrier stage doing the training. It is strange, and would be a bug, that this doesn't seem to be happening in your case; I'm not able to reproduce it for some reason. I'm also not able to reproduce barrier tasks not logging anything spark-rapids-ml related before exiting, which would mean the spark-rapids-ml udf is not being invoked at all for that task's partition. I've attempted to test whether Spark might have an optimization that avoids invoking mapInPandas on partitions with no data, but so far I'm not able to trigger this on toy examples, if it is even the case.
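For reference, the kind of toy probe I mean (a sketch, assuming a plain `spark` session; on baseline Spark the function should run once per task, including for empty partitions):

```python
def probe(batches):
    # Executed once per task; with baseline Spark this runs (and logs to
    # executor stderr) even when the partition contributes no batches.
    print("probe invoked")
    rows = 0
    for pdf in batches:
        rows += len(pdf)
        yield pdf
    print(f"partition had {rows} rows")

# 2 rows spread over 8 partitions => at least 6 empty partitions.
df = spark.range(2).repartition(8)
df.mapInPandas(probe, schema=df.schema).collect()
```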
> Also, I am not sure if this is related, but I am getting `java.io.InvalidClassException: com.nvidia.spark.rapids.GpuCast; local class incompatible: stream classdesc serialVersionUID = -3792917713274764821, local class serialVersionUID = 2642199456390263877` if I call `df.count()`, even before upgrading from JDK 17 to JDK 21 and with both nodes updated to the same commit of Spark RAPIDS. The application can run if I remove the `df.count()` line, so I don't think the barrier issue is related to this, but it does imply that something seems to be wrong with my RAPIDS environment.

The spark-rapids plugin recommends JDK 8. See https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#apache-spark-setup-for-gpu. @abellina is that still the case? In any case, it looks like you might be running into different JDK versions being invoked in different places within the application run.
> However, I noticed that the barrier stage gets stuck at "(0 + 8) / 8" for about 25s for the 97.9KB dataset and about 15s for the 7.7MB dataset, and then the stage suddenly ends without any errors.

This is actually the normal behavior. The notation means that 8 of the barrier tasks are running, and it is during this time that the GPUs carry out the distributed computation and communicate directly with each other. If it is stuck at, say, "(6 + 2) / 8", then 6 barrier tasks exited for some reason without syncing and two are hanging. That is problematic and would be a bug, even with empty data partitions, as mentioned above.
The spark-rapids plugin is tested with JDK 8, 11, and 17.
I think an issue here is likely that the Java the executor is seeing is different from the one the driver is seeing. Make sure the spark-rapids jar is exactly the same in all places.
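If it helps, a crude probe to compare what the driver and each executor see (a sketch, assuming a live `spark` session; `JAVA_HOME` is only a proxy for the JVM actually launched):

```python
import os

def env_probe(_):
    # Runs on the executors; reports each host's view of JAVA_HOME.
    import socket
    return f"{socket.gethostname()} JAVA_HOME={os.environ.get('JAVA_HOME', '<unset>')}"

# Driver's view vs. every executor's view; the values should all match.
print("driver JAVA_HOME:", os.environ.get("JAVA_HOME", "<unset>"))
print(sorted(set(spark.sparkContext.parallelize(range(64), 8).map(env_probe).collect())))
```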
Thanks for your replies. I downgraded my JDK version to 17, since I compiled Spark RAPIDS with JDK 17 (it uses the Security Manager), and I was also facing an issue where I got "Found an unexpected context classloader" when using Scala Spark.
After downgrading to JDK 17 and updating RAPIDS to 23.12.00a, I am no longer getting `java.io.InvalidClassException` on PySpark.
For the original issue, my guess is that the problem comes from building Spark RAPIDS incorrectly? I tried using the JARs from both `mvn verify` and `mvn install`, but I am using `mvn install` for now, since I cannot get `mvn verify` to work (Maven tries to look up the snapshot version of spark-rapids-jni in a repository, and I have not built it yet).
Anyway, I noticed that I can run the PySpark applications, but I cannot run spark-shell because it cannot find the OptimizerPlugin. I am not sure why it works on PySpark but not on Scala Spark anymore, or whether it's related to this issue on PySpark, since I do not face any other issues on PySpark aside from the problem with the empty partitions.
@an-ys thanks for the report, we are looking into it (https://github.com/NVIDIA/spark-rapids/issues/9498). It seems to be an issue with our Spark 3.5.0 shim, specific to spark-shell (pyspark shell, spark-submit don't exhibit this behavior)
Please note we don't recommend using 23.12 unless you are testing some cutting edge feature, as it's not released yet. 23.10 isn't entirely released yet either.
An option, if you want to try to build on your own, is to set `-DallowConventionalDistJar=true`. This will sidestep the issue while we get it fixed. I confirmed that adding this to my mvn command built a jar that was loaded successfully by Spark 3.5.0 (`mvn package -Dbuildver=350 -DskipTests -DallowConventionalDistJar=true`).
@an-ys The original hang issue is due to a combination of empty partitions and an optimization in how the spark-rapids ETL plugin handles mapInPandas vs baseline Spark (which is what we had used to test empty partitions). See https://github.com/NVIDIA/spark-rapids/issues/9480. The spark-rapids version of mapInPandas does not execute the udf on empty partitions, and hence the spark-rapids-ml barrier is never entered for those tasks, leaving the other tasks (with data) hanging. Note that even after that issue is resolved, so that spark-rapids mapInPandas matches baseline Spark mapInPandas behavior on empty partitions and thereby avoids hanging, spark-rapids-ml would currently still raise an exception in the case of empty partitions.
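To make the failure mode concrete, a toy sketch (not spark-rapids-ml code): `barrier()` waits for every task in the stage, so if the tasks with empty partitions complete without ever entering the barrier, the tasks that did enter it wait indefinitely.

```python
from pyspark import BarrierTaskContext

def train(iterator):
    rows = list(iterator)
    if not rows:
        # Mimics the skipped udf described above: the task completes
        # without ever reaching the barrier...
        return iter([])
    ctx = BarrierTaskContext.get()
    ctx.barrier()  # ...so the tasks that do reach it are never released
    return iter([len(rows)])

# 2 elements over 8 partitions => 6 empty partitions => the stage hangs.
rdd = spark.sparkContext.parallelize(range(2), 8)
rdd.barrier().mapPartitions(train).collect()  # hangs as described above
```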
I am trying to run the Linear Regression, KMeans, and PCA examples on a cluster of 2 nodes, each with 4 GPUs, but some of the executors in the examples always get stuck in the barrier when the cuML function is called (i.e., I get "6 + 2 / 8", "4 + 4 / 8", and "5 + 3 / 8", where 2, 4, and 3 executors are stuck in LinReg, KMeans, and PCA, respectively). I also tried running a KMeans application that deals with a large amount of data, so I do not think the problem is related to the small dataset.
I checked the logs for an executor that successfully ran the task and an executor that got stuck. The executor that got stuck initialized cuML. These logs are from running the LinReg example in the Python directory of this repo. The executors that are stuck have `RUNNING | NODE_LOCAL` as the status, while the successful executors have `SUCCESS | PROCESS_LOCAL`. I am using Spark RAPIDS ML branch-23.10 (daedfe56edae33c565af5e06179e992cf8fec93e and f651978a03d28ef7b3295129501da4a489709979), Spark 3.5.0 in standalone mode, and Hadoop 3.3.6 on a cluster of 2 nodes, each with 4 Titan V GPUs.
**Successful Executor**
```
23/09/27 19:42:59 INFO TorrentBroadcast: Started reading broadcast variable 3 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:42:59 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 9.3 KiB, free 47.8 GiB)
23/09/27 19:42:59 INFO TorrentBroadcast: Reading broadcast variable 3 took 13 ms
23/09/27 19:42:59 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 19.5 KiB, free 47.8 GiB)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 192, boot = -749, init = 941, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 203, boot = -723, init = 926, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 4.0 in stage 4.0 (TID 389). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 36.0 in stage 4.0 (TID 421). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 440
23/09/27 19:43:00 INFO Executor: Running task 55.0 in stage 4.0 (TID 440)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 220, boot = -692, init = 912, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 443
23/09/27 19:43:00 INFO Executor: Running task 58.0 in stage 4.0 (TID 443)
23/09/27 19:43:00 INFO Executor: Finished task 44.0 in stage 4.0 (TID 429). 2004 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 446
23/09/27 19:43:00 INFO Executor: Running task 61.0 in stage 4.0 (TID 446)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 238, boot = -679, init = 917, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 239, boot = -767, init = 1006, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 12.0 in stage 4.0 (TID 397). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 20.0 in stage 4.0 (TID 405). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 453
23/09/27 19:43:00 INFO Executor: Running task 68.0 in stage 4.0 (TID 453)
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 454
23/09/27 19:43:00 INFO Executor: Running task 69.0 in stage 4.0 (TID 454)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 280, boot = -698, init = 978, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 28.0 in stage 4.0 (TID 413). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 466
23/09/27 19:43:00 INFO Executor: Running task 81.0 in stage 4.0 (TID 466)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 159, boot = -7, init = 166, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 164, boot = -14, init = 178, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 55.0 in stage 4.0 (TID 440). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 58.0 in stage 4.0 (TID 443). 2004 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 473
23/09/27 19:43:00 INFO Executor: Running task 88.0 in stage 4.0 (TID 473)
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 474
23/09/27 19:43:00 INFO Executor: Running task 89.0 in stage 4.0 (TID 474)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 173, boot = -3, init = 176, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 68.0 in stage 4.0 (TID 453). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 479
23/09/27 19:43:00 INFO Executor: Running task 94.0 in stage 4.0 (TID 479)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 244, boot = -4, init = 248, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 61.0 in stage 4.0 (TID 446). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO PythonRunner: Times: total = 194, boot = 8, init = 186, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 489
23/09/27 19:43:00 INFO Executor: Finished task 81.0 in stage 4.0 (TID 466). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Running task 104.0 in stage 4.0 (TID 489)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 249, boot = -5, init = 254, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 494
23/09/27 19:43:00 INFO Executor: Running task 109.0 in stage 4.0 (TID 494)
23/09/27 19:43:00 INFO Executor: Finished task 69.0 in stage 4.0 (TID 454). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 499
23/09/27 19:43:00 INFO Executor: Running task 114.0 in stage 4.0 (TID 499)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 215, boot = 1, init = 214, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 89.0 in stage 4.0 (TID 474). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 507
23/09/27 19:43:00 INFO Executor: Running task 122.0 in stage 4.0 (TID 507)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 272, boot = 15, init = 256, finish = 1
23/09/27 19:43:00 INFO Executor: Finished task 88.0 in stage 4.0 (TID 473). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO PythonRunner: Times: total = 239, boot = 6, init = 233, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 515
23/09/27 19:43:00 INFO Executor: Running task 130.0 in stage 4.0 (TID 515)
23/09/27 19:43:00 INFO Executor: Finished task 94.0 in stage 4.0 (TID 479). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 519
23/09/27 19:43:00 INFO Executor: Running task 134.0 in stage 4.0 (TID 519)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 240, boot = -7, init = 247, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 114.0 in stage 4.0 (TID 499). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO PythonRunner: Times: total = 274, boot = 0, init = 274, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 259, boot = -7, init = 266, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 536
23/09/27 19:43:00 INFO Executor: Running task 151.0 in stage 4.0 (TID 536)
23/09/27 19:43:00 INFO Executor: Finished task 104.0 in stage 4.0 (TID 489). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 109.0 in stage 4.0 (TID 494). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 537
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 538
23/09/27 19:43:00 INFO Executor: Running task 152.0 in stage 4.0 (TID 537)
23/09/27 19:43:00 INFO Executor: Running task 153.0 in stage 4.0 (TID 538)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 269, boot = 9, init = 260, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 122.0 in stage 4.0 (TID 507). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 547
23/09/27 19:43:00 INFO Executor: Running task 162.0 in stage 4.0 (TID 547)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 246, boot = -10, init = 256, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 134.0 in stage 4.0 (TID 519). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 560
23/09/27 19:43:00 INFO Executor: Running task 175.0 in stage 4.0 (TID 560)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 297, boot = 6, init = 290, finish = 1
23/09/27 19:43:00 INFO Executor: Finished task 130.0 in stage 4.0 (TID 515). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 568
23/09/27 19:43:00 INFO Executor: Running task 183.0 in stage 4.0 (TID 568)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 241, boot = 3, init = 238, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 151.0 in stage 4.0 (TID 536). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 570
23/09/27 19:43:00 INFO Executor: Running task 185.0 in stage 4.0 (TID 570)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 239, boot = 7, init = 232, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 152.0 in stage 4.0 (TID 537). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 571
23/09/27 19:43:00 INFO Executor: Running task 186.0 in stage 4.0 (TID 571)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 258, boot = 14, init = 244, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 153.0 in stage 4.0 (TID 538). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 574
23/09/27 19:43:01 INFO Executor: Running task 189.0 in stage 4.0 (TID 574)
23/09/27 19:43:01 INFO PythonRunner: Times: total = 215, boot = 15, init = 200, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 162.0 in stage 4.0 (TID 547). 2004 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 162, boot = -6, init = 168, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 185.0 in stage 4.0 (TID 570). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 230, boot = -5, init = 235, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 175.0 in stage 4.0 (TID 560). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 154, boot = 0, init = 154, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 189.0 in stage 4.0 (TID 574). 2004 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 244, boot = 15, init = 229, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 183.0 in stage 4.0 (TID 568). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 219, boot = 7, init = 212, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 186.0 in stage 4.0 (TID 571). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO UCX: UCX context created
23/09/27 19:43:01 INFO UCX: UCX Worker created
23/09/27 19:43:02 INFO UCX: Started UcpListener on /
```

**Killed Executor**
```
23/09/27 19:42:59 INFO MapOutputTrackerWorker: Updating epoch to 2 and clearing cache
23/09/27 19:43:00 INFO TorrentBroadcast: Started reading broadcast variable 3 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:43:00 INFO TransportClientFactory: Successfully created connection to /
```

Here is the `spark.conf` containing the related options. I tried to disable the options related to UDFs (Scala UDF, UDF compiler, etc.), but it did not do much.
```
spark.master spark://master:7077

# Resource-related configs
spark.executor.instances 8
spark.executor.cores 6
spark.executor.memory 80G
spark.driver.memory 80G
spark.executor.memoryOverhead 1G

# Task-related
spark.default.parallelism 192
spark.sql.shuffle.partitions 192
spark.driver.maxResultSize 30G
spark.sql.files.maxPartitionBytes 4096m
# spark.sql.files.maxPartitionBytes 8192m
spark.sql.execution.sortBeforeRepartition false
spark.sql.adaptive.enabled true

# GPU-related Configs
spark.executor.resource.gpu.amount 1
spark.executor.resource.gpu.discoveryScript /usr/lib/spark/scripts/gpu/getGpusResources.sh
spark.executor.resources.discoveryPlugin com.nvidia.spark.ExclusiveModeGpuDiscoveryPlugin
spark.plugins com.nvidia.spark.SQLPlugin
spark.rapids.memory.gpu.debug STDOUT
spark.rapids.memory.gpu.pool NONE
spark.rapids.memory.pinnedPool.size 20G
spark.rapids.shuffle.multiThreaded.reader.threads 24
spark.rapids.shuffle.multiThreaded.writer.threads 24
spark.rapids.sql.concurrentGpuTasks 2
spark.rapids.sql.enabled true
spark.rapids.sql.exec.CollectLimitExec true
spark.rapids.sql.explain all
spark.rapids.sql.expression.ScalaUDF true
spark.rapids.sql.metrics.level DEBUG
spark.rapids.sql.rowBasedUDF.enabled true
spark.rapids.sql.udfCompiler.enabled true
spark.shuffle.manager com.nvidia.spark.rapids.spark350.RapidsShuffleManager
spark.task.resource.gpu.amount 0.166
spark.sql.cache.serializer com.nvidia.spark.ParquetCachedBatchSerializer
spark.rapids.shuffle.mode UCX
spark.shuffle.service.enabled false
spark.dynamicAllocation.enabled false
spark.executorEnv.UCX_ERROR_SIGNALS
spark.executorEnv.UCX_MEMTYPE_CACHE n
spark.executorEnv.UCX_IB_RX_QUEUE_LEN 1024
spark.executorEnv.UCX_TLS cuda_copy,cuda_ipc,rc,tcp
spark.executorEnv.UCX_RNDV_SCHEME put_zcopy
spark.executorEnv.UCX_MAX_RNDV_RAILS 1
spark.executorEnv.UCX_IB_GPU_DIRECT_RDMA n
```