captify-sivakhno opened 7 months ago
While this error looks different, it might be related to https://github.com/NVIDIA/spark-rapids/issues/10318.
@captify-sivakhno Could you please provide the details of the table? You can retrieve the information by executing either `spark.sql("DESCRIBE FORMATTED unity_catalog_name.schema_name.table_name").show(truncate=False)` in Spark, or `DESCRIBE FORMATTED unity_catalog_name.schema_name.table_name;` in the SQL Editor. Could you also share the complete stack trace of the error?
@captify-sivakhno What is the logic of this use case? I see you first read a Spark DataFrame and then convert it to a pandas DataFrame. Does that mean you plan to do some transformation on the pandas DataFrame?
**Describe the bug**
I have set up RAPIDS on a Databricks AWS cluster (runtime 12.2.x-gpu-ml-scala2.12) as described in https://docs.nvidia.com/spark-rapids/user-guide/23.12.2/getting-started/databricks.html. I then try to read a Delta Lake table in Unity Catalog (we have it enabled, as it is Databricks' main data catalog offering):
data = spark.read.format("delta").table("table name").toPandas()
and get an access denied error:

`org.apache.spark.SparkException: Exception thrown in awaitResult: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 4264) (172.17.190.255 executor 2): java.nio.file.AccessDeniedException: s3://categories/8i/part-00073-d55722a3-743e-4b23-93eb-d264d9f0b897.c000.snappy.parquet: getFileStatus on s3://categories/8i/part-00073-d55722a3-743e-4b23-93eb-d264d9f0b897.c000.snappy.parquet: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden`
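The `AmazonS3Exception: Forbidden` on `getFileStatus` suggests the executor-side S3A client may not be receiving the short-lived credentials that Unity Catalog normally vends to Spark readers. As a diagnostic only, one could test whether the table's S3 path is readable when S3A falls back to the EC2 instance profile. The property below is a standard Hadoop S3A setting, but using it here is my assumption for narrowing down the failure, not a confirmed fix, and it bypasses Unity Catalog's access controls:

```properties
# Hypothetical diagnostic (not a confirmed fix): make S3A use the
# EC2 instance profile instead of Unity Catalog's vended credentials.
spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.InstanceProfileCredentialsProvider
```

If the read succeeds with this set, the problem is likely credential propagation between Unity Catalog and the RAPIDS Parquet reader rather than the file format path itself.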
**Steps/Code to reproduce bug**
Set up RAPIDS on the 12.2.x-gpu-ml-scala2.12 Databricks AWS cluster runtime using the following Spark config:

spark.rapids.sql.concurrentGpuTasks 2
spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-23.12.2.jar:/databricks/spark/python
spark.rapids.sql.python.gpu.enabled true
spark.rapids.memory.pinnedPool.size 2G
spark.rapids.sql.format.parquet.reader.type PERFILE
spark.task.resource.gpu.amount 0.1
spark.plugins com.nvidia.spark.SQLPlugin
spark.python.daemon.module rapids.daemon_databricks
pip install cudf-cu11==23.12.1
Read any Delta Lake table in Unity Catalog.
**Expected behavior**
As Unity Catalog is Databricks' main data catalog offering and is enabled by default, I would expect RAPIDS to support Unity Catalog data access. Without the RAPIDS config, the cluster can access Unity Catalog normally.