Closed. GuiBou165 closed this issue 6 months ago.
The strangest part is that even though the .jar is located in the appropriate folder within the virtual environment, if I print the configured packages I only see Delta's:
print("Current SparkSession packages:", spark_session.conf.get("spark.jars.packages"))
yields:
Current SparkSession packages: io.delta:delta-spark_2.12:3.2.0
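Worth noting: jars copied straight into pyspark/jars/ end up on the classpath but are not tracked by spark.jars.packages (or spark.jars), so that output only reflects coordinates requested when the session was built. A quick way to inspect both settings (the default strings below are just placeholders):

```python
# Neither setting lists jars copied directly into pyspark/jars/; those are
# picked up from the classpath without being recorded in the Spark conf.
print("spark.jars.packages:", spark_session.conf.get("spark.jars.packages", "<not set>"))
print("spark.jars:", spark_session.conf.get("spark.jars", "<not set>"))
```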
Hi @GuiBou165 ,
You are missing the GCS connector jar; from the error message, Spark is not able to find the GCS FileSystem. Please download the latest jar from https://github.com/GoogleCloudDataproc/hadoop-connectors/releases and follow the instructions at https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/INSTALL.md to set it up.
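For a local (non-Dataproc) setup, this is roughly how the connector is wired into a SparkSession once the jar is downloaded; a minimal sketch following the INSTALL.md linked above, with the jar path and key file as placeholders for your environment:

```python
from pyspark.sql import SparkSession

# Sketch only: register the gs:// filesystem implementations and point Spark
# at the downloaded shaded connector jar (paths are placeholders).
spark = (
    SparkSession.builder
    .appName("gcs-connector-setup")
    .config("spark.jars", "/path/to/gcs-connector-hadoop3-2.2.22-shaded.jar")
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    # Service-account authentication; adjust to however you provide credentials.
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/path/to/service-account.json")
    .getOrCreate()
)
```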
Hi @isha97, whilst I had the GCS connector, I was not aware I also needed to include its dependencies (gcs-connector-hadoop3-2.2.22-shaded.jar, gcsio-2.2.22.jar, util-2.2.22.jar, util-hadoop-hadoop3-2.2.22.jar), which sorted the issue!
Thank you for your help
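For anyone landing here later: instead of copying the files into pyspark/jars/, the connector and its dependencies can also be handed to Spark explicitly via spark.jars; a sketch, with the local paths below as placeholders:

```python
from pyspark.sql import SparkSession

# Sketch: pass the connector and its dependency jars explicitly rather than
# relying on them sitting in pyspark/jars/ (local paths are placeholders).
gcs_jars = ",".join([
    "jars/gcs-connector-hadoop3-2.2.22-shaded.jar",
    "jars/gcsio-2.2.22.jar",
    "jars/util-2.2.22.jar",
    "jars/util-hadoop-hadoop3-2.2.22.jar",
])

spark = (
    SparkSession.builder
    .appName("gcs-explicit-jars")
    .config("spark.jars", gcs_jars)
    .getOrCreate()
)
```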
I'm running PySpark locally using asdf, virtualenv and direnv to manage the packages; hence, I'm placing the jar in the .direnv environment's local jars directory (.direnv/env_name/lib/python3.11/site-packages/pyspark/jars/), which are then configured in the SparkSession, as sketched below:
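The exact builder is project-specific; a minimal sketch along these lines, assuming the Delta coordinate shown in the package output above (jars placed in pyspark/jars/ are picked up from the classpath without extra configuration):

```python
from pyspark.sql import SparkSession

# Sketch of the local SparkSession; the real builder may differ. Delta is pulled
# via spark.jars.packages, while jars in pyspark/jars/ are already on the classpath.
spark_session = (
    SparkSession.builder
    .appName("local-delta")
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.2.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```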
Installations:
I have tried all versions and still the issue remains:
When running a read from GCP, I'm obtaining the following error:
Upon reading the logs, it seems the issue stems from the connector's dependencies: