GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Apache License 2.0

PySpark virtualenv: Missing dependencies lead to No FileSystem for scheme "gs" #1226

Closed: GuiBou165 closed this issue 6 months ago

GuiBou165 commented 6 months ago

I'm running PySpark locally, using asdf, virtualenv and direnv to manage the packages, so I'm placing the jar in the virtual environment's PySpark jars directory (.direnv/env_name/lib/python3.11/site-packages/pyspark/jars/) and configuring the SparkSession as follows:

builder: SparkSession.Builder = (
    SparkSession.builder.appName("ModularDW")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.13:0.38.0,com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    )
)

spark_session: SparkSession = configure_spark_with_delta_pip(builder).getOrCreate()
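
(As an aside: the last entry in spark.jars.packages above, com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS, is a fully qualified class name rather than a group:artifact:version Maven coordinate, so Spark's resolver has nothing to download for it. A coordinate-only sketch might look like the lines below; the versions are purely illustrative, pip-installed PySpark 3.5.x is built against Scala 2.12 so the _2.12 artifacts are the likely match, and configure_spark_with_delta_pip may still overwrite this setting, as discussed further below.)

# Hypothetical coordinate-only value; versions are illustrative and the GCS
# connector is published under the com.google.cloud.bigdataoss group.
packages = ",".join(
    [
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.38.0",
        "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.22",
    ]
)
builder = builder.config("spark.jars.packages", packages)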

Installations:

pyspark==3.5.1
delta-spark==3.2.0
google-cloud-bigquery>=3.16.0
google-cloud-storage>=2.15.0

I have tried several versions of these packages, and the issue still remains.

When running a read against GCS (a gs:// path), I get the following error:

Exception has occurred: Py4JJavaError
An error occurred while calling o35.parquet.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:724)
    at scala.collection.immutable.List.map(List.scala:293)
    at org.apache.spark.sql.execution.datasources.DataSource$.checkAndGlobPathIfNecessary(DataSource.scala:722)
    at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:551)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:404)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:229)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:211)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:563)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:75)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:52)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:1583)

Reading the logs, it seems the issue stems from the connector's dependencies:

24/05/16 07:58:01 WARN FileSystem: Cannot load filesystem: java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem Unable to get public no-arg constructor
24/05/16 07:58:01 WARN FileSystem: java.lang.NoClassDefFoundError: com/google/api/client/http/HttpRequestInitializer
24/05/16 07:58:01 WARN FileSystem: java.lang.ClassNotFoundException: com.google.api.client.http.HttpRequestInitializer
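
The NoClassDefFoundError above suggests the GoogleHadoopFileSystem class itself is on the classpath but the Google API client classes it depends on are not. A quick way to confirm this from the same session, using PySpark's internal _jvm gateway purely as a debugging aid:

# Probe the driver classpath for the classes the WARN lines mention.
jvm = spark_session._jvm
for name in [
    "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "com.google.api.client.http.HttpRequestInitializer",
]:
    try:
        jvm.java.lang.Class.forName(name)
        print(f"{name}: found")
    except Exception as err:
        print(f"{name}: missing ({err})")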
GuiBou165 commented 6 months ago

The strangest part is that even though the .jar is located in the appropriate folder within the virtual environment, printing the configured packages only returns Delta's:

print("Current SparkSession packages:", spark_session.conf.get("spark.jars.packages"))

yields:

Current SparkSession packages: io.delta:delta-spark_2.12:3.2.0
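
One likely explanation, assuming delta-spark 3.2.0's configure_spark_with_delta_pip behaves the way its published source does: the helper assembles its own spark.jars.packages value (Delta's coordinate plus anything passed through its extra_packages argument) and sets it on the builder, overwriting whatever was configured earlier. Passing the extra coordinates through the helper instead would look roughly like this (coordinates and versions are illustrative):

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("ModularDW")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)

# extra_packages keeps these coordinates when the helper builds its own
# spark.jars.packages value instead of dropping them.
spark_session = configure_spark_with_delta_pip(
    builder,
    extra_packages=[
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.38.0",
        "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.22",
    ],
).getOrCreate()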

isha97 commented 6 months ago

Hi @GuiBou165,

You are missing the GCS connector jar; from the error message, Spark cannot find the GCS FileSystem implementation. Please download the latest jar from https://github.com/GoogleCloudDataproc/hadoop-connectors/releases and follow the setup instructions at https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/INSTALL.md
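
For a plain local PySpark session (outside Dataproc), a minimal sketch of that setup might look like the following; the jar and key-file paths are placeholders, and the fs.gs.impl / fs.AbstractFileSystem.gs.impl keys are the ones commonly used to register the connector explicitly:

from pyspark.sql import SparkSession

# Jar path and the service-account key file below are placeholders.
spark = (
    SparkSession.builder.appName("gcs-local-test")
    .config("spark.jars", "/path/to/gcs-connector-hadoop3-2.2.22-shaded.jar")
    .config(
        "spark.hadoop.fs.gs.impl",
        "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    )
    .config(
        "spark.hadoop.fs.AbstractFileSystem.gs.impl",
        "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    )
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config(
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile",
        "/path/to/service-account.json",
    )
    .getOrCreate()
)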

GuiBou165 commented 6 months ago

Hi @isha97, whilst I had the GCS connector, I was not aware that I also needed to include its dependencies (gcs-connector-hadoop3-2.2.22-shaded.jar, gcsio-2.2.22.jar, util-2.2.22.jar, util-hadoop-hadoop3-2.2.22.jar); adding those sorted the issue!
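
As a quick sanity check after adding the jars, the gs scheme can be resolved through the same Hadoop lookup that failed in the original stack trace (again via PySpark's internal handles, so purely a debugging aid):

# Ask Hadoop which FileSystem class now backs the "gs" scheme; this is the
# lookup that previously raised UnsupportedFileSystemException.
hadoop_conf = spark_session._jsc.hadoopConfiguration()
fs_class = spark_session._jvm.org.apache.hadoop.fs.FileSystem.getFileSystemClass(
    "gs", hadoop_conf
)
print(fs_class.getName())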

Thank you for your help