NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] Can't access Unity Catalogue data on Databricks AWS cluster #10566

Open captify-sivakhno opened 7 months ago

captify-sivakhno commented 7 months ago

Describe the bug I have set up RAPIDS on a Databricks AWS cluster (runtime 12.2.x-gpu-ml-scala2.12) as described in https://docs.nvidia.com/spark-rapids/user-guide/23.12.2/getting-started/databricks.html. I then tried to read a Delta Lake table in Unity Catalog (we have it enabled, as it is Databricks' main data catalog offering) with

data = spark.read.format("delta").table("table name").toPandas()

and get an access denied error:

: org.apache.spark.SparkException: Exception thrown in awaitResult: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 4264) (172.17.190.255 executor 2): java.nio.file.AccessDeniedException: s3://categories/8i/part-00073-d55722a3-743e-4b23-93eb-d264d9f0b897.c000.snappy.parquet: getFileStatus on s3://categories/8i/part-00073-d55722a3-743e-4b23-93eb-d264d9f0b897.c000.snappy.parquet: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden

Steps/Code to reproduce bug

Set up RAPIDS on the 12.2.x-gpu-ml-scala2.12 Databricks AWS cluster runtime using

spark.rapids.sql.concurrentGpuTasks 2
spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-23.12.2.jar:/databricks/spark/python
spark.rapids.sql.python.gpu.enabled true
spark.rapids.memory.pinnedPool.size 2G
spark.rapids.sql.format.parquet.reader.type PERFILE
spark.task.resource.gpu.amount 0.1
spark.plugins com.nvidia.spark.SQLPlugin
spark.python.daemon.module rapids.daemon_databricks

pip install cudf-cu11==23.12.1

Read any Delta Lake table in Unity Catalog.
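The failing read can be sketched as a small function. This is a hypothetical repro sketch, not the reporter's exact code: the table name is a placeholder, and the SparkSession is passed in rather than created here.

```python
def read_uc_table_to_pandas(spark, table="unity_catalog.schema.table_name"):
    """Read a Unity Catalog Delta table and pull it into pandas.

    On the RAPIDS-configured cluster described above, the scan fails with
    java.nio.file.AccessDeniedException (S3 403 Forbidden) instead of
    returning data. `table` is a placeholder three-level UC name.
    """
    return spark.read.format("delta").table(table).toPandas()
```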

Expected behavior As Unity Catalog is Databricks' main data catalog offering and is enabled by default, I would expect RAPIDS to support Unity Catalog data access. Without the RAPIDS config, the cluster can access Unity Catalog normally.


tgravescs commented 7 months ago

while this error looks different, it might be related to https://github.com/NVIDIA/spark-rapids/issues/10318

SurajAralihalli commented 7 months ago

@captify-sivakhno Could you please provide the details of the table? You can retrieve the information by executing either spark.sql(f'DESCRIBE FORMATTED unity_catalog_name.schema_name.table_name').show(truncate=False) in Spark or DESCRIBE FORMATTED unity_catalog_name.schema_name.table_name; in SQL Editor. Can you also share the complete stack trace of the error?
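The inspection query requested above can be wrapped in a small helper. This is a hypothetical sketch: `uc`, `sales`, and `orders` are placeholder names, and the three-level `catalog.schema.table` form follows Unity Catalog's naming convention.

```python
def describe_formatted_sql(catalog: str, schema: str, table: str) -> str:
    """Build the DESCRIBE FORMATTED statement for a three-level
    Unity Catalog table name (catalog.schema.table)."""
    return f"DESCRIBE FORMATTED {catalog}.{schema}.{table}"

# On a live cluster one would then run (placeholder names):
# spark.sql(describe_formatted_sql("uc", "sales", "orders")).show(truncate=False)
```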

viadea commented 7 months ago

@captify-sivakhno What is the logic of this use case? I see you first read a Spark DataFrame and then convert it to a pandas DataFrame. Does that mean you plan to do some transformation on the pandas DataFrame?
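The question above hints at a common pattern worth keeping in mind: do the heavy transformation in Spark (where the RAPIDS plugin can run it on the GPU) and call `toPandas()` only on the small aggregated result. A hypothetical sketch, where `df` is a Spark DataFrame and `"category"` a placeholder column name:

```python
def category_counts_to_pandas(df):
    """Aggregate in Spark, then convert only the small result to pandas.

    The groupBy/count runs distributed (and GPU-accelerated under RAPIDS);
    only the per-category counts are collected to the driver as pandas.
    """
    return df.groupBy("category").count().toPandas()
```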