NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] cudf_udf nightly tests failing due to no attribute __pyx_capi__ #11693

Open jlowe opened 2 weeks ago

jlowe commented 2 weeks ago

Nightly cudf_udf test builds recently started failing with exceptions like the following:

INFO: Process 1469 found CUDA visible device(s): 0
Traceback (most recent call last):
  File "/home/...../jars/rapids-4-spark_2.12-24.12.0-SNAPSHOT-cuda11.jar/rapids/daemon.py", line 131, in manager
  File "/home/...../jars/rapids-4-spark_2.12-24.12.0-SNAPSHOT-cuda11.jar/rapids/worker.py", line 37, in initialize_gpu_mem
    from cudf import rmm
  File "/opt/conda/lib/python3.10/site-packages/cudf/__init__.py", line 19, in <module>
    _setup_numba()
  File "/opt/conda/lib/python3.10/site-packages/cudf/utils/_numba.py", line 121, in _setup_numba
    shim_ptx_cuda_version = _get_cuda_build_version()
  File "/opt/conda/lib/python3.10/site-packages/cudf/utils/_numba.py", line 16, in _get_cuda_build_version
    from cudf._lib import strings_udf
  File "/opt/conda/lib/python3.10/site-packages/cudf/_lib/__init__.py", line 4, in <module>
    from . import (
  File "avro.pyx", line 1, in init cudf._lib.avro
  File "utils.pyx", line 1, in init cudf._lib.utils
  File "column.pyx", line 1, in init cudf._lib.column
  File "/opt/conda/lib/python3.10/site-packages/rmm/__init__.py", line 17, in <module>
    from rmm import mr
  File "/opt/conda/lib/python3.10/site-packages/rmm/mr.py", line 14, in <module>
    from rmm.pylibrmm.memory_resource import (
  File "/opt/conda/lib/python3.10/site-packages/rmm/pylibrmm/__init__.py", line 15, in <module>
    from .device_buffer import DeviceBuffer
  File "device_buffer.pyx", line 1, in init rmm.pylibrmm.device_buffer
AttributeError: module 'cuda.ccudart' has no attribute '__pyx_capi__'
INFO: Process 1503 found CUDA visible device(s): 0
24/11/05 14:10:10 ERROR Executor: Exception in task 2.0 in stage 1.0 (TID 8)
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:121)
        at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:137)
        at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:136)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:106)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:121)
        at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:162)
        at org.apache.spark.sql.rapids.execution.python.GpuArrowEvalPythonExec.$anonfun$internalDoExecuteColumnar$2(GpuArrowEvalPythonExec.scala:456)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[.....]
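The `AttributeError` above fires at import time: rmm's Cython extension modules resolve the `__pyx_capi__` C-API table of `cuda.ccudart` when they are first loaded, so any `from rmm import ...` (and therefore `from cudf import rmm` in the plugin's worker) fails before the UDF ever runs. A minimal probe for this condition, using only the standard library (`has_pyx_capi` is a hypothetical helper written for illustration, not part of cudf or rmm):

```python
import importlib


def has_pyx_capi(module_name: str) -> bool:
    """Return True if the module imports cleanly and exposes __pyx_capi__,
    the table Cython-generated extensions use to share C-level APIs."""
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(mod, "__pyx_capi__")


# On an affected environment this prints False for "cuda.ccudart", which is
# exactly the condition that makes rmm.pylibrmm.device_buffer fail to init.
print(has_pyx_capi("cuda.ccudart"))
```

Running this inside the failing conda environment would distinguish "cuda-python not installed" from "installed but restructured so the C-API table is gone".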
leofang commented 2 weeks ago

This is tracked in https://github.com/NVIDIA/cuda-python/issues/215, and we are working on it. For the time being, please downgrade your cuda-python version as instructed there.
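One way to apply the suggested downgrade in a CI environment is a pip constraints file. A sketch only: the exact last-known-good version must be taken from the linked cuda-python issue, and the version bound below is an assumption, not confirmed here.

```
# constraints.txt -- hypothetical pin for the CUDA 11 environment in the log
# above; see NVIDIA/cuda-python#215 for the actual last-known-good release.
cuda-python<=11.8.3
```

Installing with `pip install -c constraints.txt ...` then keeps every transitive dependency from pulling in the restructured cuda-python release, without hard-pinning it in the package's own requirements.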

mattahrens commented 2 weeks ago

We need to update our test environment to pin the version of cuda-python.

pxLi commented 2 weeks ago

> We need to update our test environment to pin the version of cuda-python.

The cudf-udf pipeline is designed to monitor nightly cuDF Python changes. I recommend keeping it running against the latest nightly cuDF build unless we decide not to wait for the fix in this release. Thanks!