NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

No module named 'cudf' while running spark-rapids with AWS EMR-7.3 #11668

Open Basir-mahmood opened 1 week ago

Basir-mahmood commented 1 week ago

Thanks for the great work on this library. I am using spark-rapids on EMR-7.3 for deep learning model inference with predict_batch_udf. I have been following the provided documentation for AWS EMR and, to enable GPU scheduling with pandas_udf, the instructions described in the link. I pass --py-files ${SPARK_RAPIDS_PLUGIN_JAR} in the spark-submit command and have also added "spark.rapids.sql.python.gpu.enabled": "true" to the config.json file to enable GPU scheduling for pandas UDFs. The instances I am using are m5.4xlarge (master) and g4dn.12xlarge (core).

However, the job fails with an error saying that the cudf module cannot be found.

spark-submit command:

spark-submit --deploy-mode client --py-files /usr/lib/spark/jars/rapids-4-spark_2.12-24.06.1-amzn-0.jar s3://<my-bucket>/rapids-code.py

The following lines are from the EMR error log.

24/10/28 13:48:12 WARN RapidsPluginUtils: RAPIDS Accelerator 24.06.1-amzn-0 using cudf 24.06.0, private revision 755b4dd03c753cacb7d141f3b3c8ff9f83888b69

...
...
...

24/10/28 13:48:28 INFO PythonWorkerFactory: Python daemon module in PySpark is set to [rapids.daemon] in 'spark.python.daemon.module', using this to start the daemon up. Note that this configuration only has an effect when 'spark.python.use.daemon' is enabled and the platform is not Windows.
INFO: Process 34593 found CUDA visible device(s): 0
Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1730122445157_0001/container_1730122445157_0001_01_000002/rapids-4-spark_2.12-24.06.1-amzn-0.jar/rapids/daemon.py", line 131, in manager
  File "/mnt/yarn/usercache/hadoop/appcache/application_1730122445157_0001/container_1730122445157_0001_01_000002/rapids-4-spark_2.12-24.06.1-amzn-0.jar/rapids/worker.py", line 37, in initialize_gpu_mem
    from cudf import rmm
ModuleNotFoundError: No module named 'cudf'
INFO: Process 34594 found CUDA visible device(s): 0
Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1730122445157_0001/container_1730122445157_0001_01_000002/rapids-4-spark_2.12-24.06.1-amzn-0.jar/rapids/daemon.py", line 131, in manager
  File "/mnt/yarn/usercache/hadoop/appcache/application_1730122445157_0001/container_1730122445157_0001_01_000002/rapids-4-spark_2.12-24.06.1-amzn-0.jar/rapids/worker.py", line 37, in initialize_gpu_mem
    from cudf import rmm
ModuleNotFoundError: No module named 'cudf'
INFO: Process 34595 found CUDA visible device(s): 0
Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1730122445157_0001/container_1730122445157_0001_01_000002/rapids-4-spark_2.12-24.06.1-amzn-0.jar/rapids/daemon.py", line 131, in manager
  File "/mnt/yarn/usercache/hadoop/appcache/application_1730122445157_0001/container_1730122445157_0001_01_000002/rapids-4-spark_2.12-24.06.1-amzn-0.jar/rapids/worker.py", line 37, in initialize_gpu_mem
    from cudf import rmm
ModuleNotFoundError: No module named 'cudf'
24/10/28 13:48:29 ERROR Executor: Exception in task 1.0 in stage 7.0 (TID 107)
java.io.EOFException: null
    at java.io.DataInputStream.readInt(DataInputStream.java:386) ~[?:?]
....
....

gerashegalov commented 1 week ago

The rapids-4-spark jar only provides the minimal Java bindings needed to run SparkSQL/DataFrame API queries. The pandas-like Python cudf module is separate and needs to be installed on all nodes using one of the EMR-recommended ways of installing Python libraries. cudf is available as a pip package, among other options.
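A quick way to confirm whether cudf is visible to a given Python interpreter is to probe for it without importing it. The sketch below is only a minimal check built on the standard library. The helper name `module_available` is made up for illustration, and the script would need to be run with the same interpreter that the Spark executors use on each node.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if this interpreter can locate top-level module `name`.

    Hypothetical helper: it looks the module up without importing it.
    """
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Print the interpreter path too, since executors may use a different Python.
    print(sys.executable, "cudf available:", module_available("cudf"))
```

If this reports that cudf is not available for the executor's interpreter, the `from cudf import rmm` line in the traceback above will fail in the same way.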

Basir-mahmood commented 6 days ago

@gerashegalov Thanks for the guidance. I would also like to control the number of concurrent GPU tasks created by the UDF (I am using predict_batch_udf). I have tried spark.rapids.sql.concurrentGpuTasks, but it doesn't control the number of concurrent tasks on the GPU. Currently, the number of tasks on the GPU equals 1/spark.task.resource.gpu.amount. Can you please help me with that?

eordentlich commented 5 days ago

You can add an edited version of this init script, set to run after the spark_rapids one, to install the cudf Python library: https://github.com/NVIDIA/spark-rapids-ml/blob/branch-24.10/notebooks/aws-emr/init-bootstrap-action.sh. Note that it builds and installs Python 3.10, since cudf 24.10 and later have dropped support for 3.9. You will also need to configure Spark to use this non-default Python in the driver and executors (see https://github.com/NVIDIA/spark-rapids-ml/blob/branch-24.10/notebooks/aws-emr/init-configurations.json#L71-L72). spark.rapids.sql.concurrentGpuTasks only applies to the core JVM part of spark-rapids.
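As a rough illustration of pointing Spark at a non-default Python, the sketch below builds the matching spark-submit arguments. It assumes the standard Spark properties spark.pyspark.python and spark.pyspark.driver.python are the ones involved; the linked init-configurations.json remains the authoritative reference. The interpreter path and the helper name `pyspark_python_conf` are hypothetical.

```python
# Hypothetical path: adjust to wherever the bootstrap action installs Python 3.10.
PYTHON_BIN = "/usr/local/bin/python3.10"


def pyspark_python_conf(python_bin):
    """Build spark-submit --conf arguments that select the Python interpreter."""
    confs = {
        "spark.pyspark.python": python_bin,         # driver and executors
        "spark.pyspark.driver.python": python_bin,  # driver (overrides the above)
    }
    args = []
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Prints the arguments as they would appear on the spark-submit command line.
print(" ".join(pyspark_python_conf(PYTHON_BIN)))
```

The same two properties could instead be set in the cluster configuration JSON rather than on the command line.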

The number of concurrent predict_batch_udf tasks is determined by the resource per task and resource per executor settings, as you say. Are you hoping to have different task concurrency per stage?

The spark.rapids.python.concurrentPythonWorkers config, described at https://nvidia.github.io/spark-rapids/docs/additional-functionality/rapids-udfs.html#other-configuration, might also be applicable here.
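The relationship between the resource settings and task concurrency discussed above can be sketched as a small calculation. This is a simplified model, not Spark's scheduler code. The function name `gpu_task_slots` is hypothetical, and the model assumes the task GPU amount is either a fraction of at most 0.5 or a whole number, with the CPU-side limit taken as executor cores divided by spark.task.cpus.

```python
import math


def gpu_task_slots(executor_cores, task_cpus, executor_gpus, task_gpu_amount):
    """Estimate how many tasks can run at once on one executor.

    Simplified model (an assumption, not Spark's actual scheduler code):
    concurrency is the smaller of the CPU-based and GPU-based slot counts.
    """
    cpu_slots = executor_cores // task_cpus
    if task_gpu_amount <= 0.5:
        # Fractional amount: each GPU is shared by floor(1 / amount) tasks.
        gpu_slots = executor_gpus * math.floor(1.0 / task_gpu_amount)
    else:
        # Whole-number amount: each task takes that many GPUs.
        gpu_slots = math.floor(executor_gpus / task_gpu_amount)
    return min(cpu_slots, gpu_slots)


# Example: spark.executor.cores=16, spark.task.cpus=1,
# spark.executor.resource.gpu.amount=1, spark.task.resource.gpu.amount=0.25
print(gpu_task_slots(16, 1, 1, 0.25))  # GPU-bound: 4 concurrent tasks
```

With 16 cores and a task GPU amount of 0.25, the GPU setting is the limiting factor; with only 2 cores, the same call would return 2.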