Open Basir-mahmood opened 1 week ago
The rapids-4-spark jar only provides the minimum Java bindings needed to run SparkSQL/DataFrame API queries. For the Pandas-like Python cudf module, cudf needs to be installed on all nodes using one of the EMR-recommended means for installing Python libraries. cudf is available as a pip package, among other options.
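A quick way to confirm whether cudf is visible to a given Python interpreter is to ask the import system directly. This is only a minimal sketch: `missing_modules` is a hypothetical helper name, not part of spark-rapids or cudf, and it only reports what the interpreter running it can see.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Prints ["cudf"] if cudf is not installed for this interpreter, [] otherwise.
print(missing_modules(["cudf"]))
```

Running it with the same interpreter that Spark uses on each node shows whether the module is missing there as well.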
@gerashegalov Thanks for the guidance. I also want to control the number of concurrent GPU tasks created by the UDF (I am using predict_batch_udf). I have tried spark.rapids.sql.concurrentGpuTasks, but it doesn't control the number of concurrent tasks on the GPU. Currently, the number of tasks on the GPU equals 1/spark.task.resource.gpu.amount. Can you please help me with that?
You can edit and add a version of this init script to run after the spark_rapids one to install the cudf python library: https://github.com/NVIDIA/spark-rapids-ml/blob/branch-24.10/notebooks/aws-emr/init-bootstrap-action.sh
Note that it builds and installs Python 3.10, since cudf 24.10 and later have dropped support for 3.9. You will also need to configure Spark to use this non-default Python in the driver and executors (see https://github.com/NVIDIA/spark-rapids-ml/blob/branch-24.10/notebooks/aws-emr/init-configurations.json#L71-L72).
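As a rough illustration of pointing Spark at a non-default interpreter, the sketch below builds an EMR-style `spark-defaults` classification using Spark's standard `spark.pyspark.python` and `spark.pyspark.driver.python` properties. It assumes those are the relevant settings; the interpreter path and the helper name `spark_python_config` are hypothetical, so substitute the location where your bootstrap script actually installs Python.

```python
import json

# Hypothetical install location; replace with the path used by your bootstrap script.
PYTHON_PATH = "/usr/local/bin/python3.10"


def spark_python_config(python_path):
    """Build a spark-defaults classification that sets the PySpark interpreter."""
    return [
        {
            "Classification": "spark-defaults",
            "Properties": {
                # Interpreter used by PySpark on the executors (and driver by default).
                "spark.pyspark.python": python_path,
                # Interpreter used by the driver.
                "spark.pyspark.driver.python": python_path,
            },
        }
    ]


print(json.dumps(spark_python_config(PYTHON_PATH), indent=2))
```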
spark.rapids.sql.concurrentGpuTasks only applies to the core JVM part of spark-rapids.
The number of concurrent predict_batch_udf tasks is determined by the resource per task and resource per executor settings, as you say. Are you hoping to have different task concurrency per stage?
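The arithmetic behind that can be sketched as follows. This is a simplified model, not Spark's scheduler code: it assumes one GPU per executor and a fractional spark.task.resource.gpu.amount of at most 1, and `max_concurrent_tasks` is a hypothetical helper name.

```python
from fractions import Fraction
import math


def max_concurrent_tasks(executor_cores, task_cpus, task_gpu_amount):
    """Estimate concurrent tasks per executor, assuming one GPU per executor."""
    # CPU-side limit: spark.executor.cores / spark.task.cpus.
    cpu_slots = executor_cores // task_cpus
    # GPU-side limit: 1 / spark.task.resource.gpu.amount, rounded down.
    # Fraction(str(...)) avoids floating-point error for values such as 0.1.
    gpu_slots = math.floor(1 / Fraction(str(task_gpu_amount)))
    # A task needs both a CPU slot and a GPU share, so the smaller limit applies.
    return min(cpu_slots, gpu_slots)


print(max_concurrent_tasks(executor_cores=12, task_cpus=1, task_gpu_amount=0.25))
```

Under this model, lowering the task concurrency on the GPU means either raising spark.task.resource.gpu.amount or reducing the CPU slots available per executor.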
The spark.rapids.python.concurrentPythonWorkers config described here https://nvidia.github.io/spark-rapids/docs/additional-functionality/rapids-udfs.html#other-configuration might also be applicable to this.
Thanks for such great work and an awesome library. I am using spark-rapids with EMR-7.3 for deep learning model inference with predict_batch_udf, and I have been following the provided documentation for AWS EMR. To enable GPU scheduling with pandas_udf, as described in the link, I am providing --py-files ${SPARK_RAPIDS_PLUGIN_JAR} in the spark-submit command and have also added "spark.rapids.sql.python.gpu.enabled": "true" to the config.json file. The instances I am using are m5.4xlarge (master) and g4dn.12xlarge (core). However, the task fails with an error saying that no cudf module was found.
-- spark-submit-command --
spark-submit --deploy-mode client --py-files /usr/lib/spark/jars/rapids-4-spark_2.12-24.06.1-amzn-0.jar s3://<my-bucket>/rapids-code.py
The following lines are from the EMR error log.