GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Apache License 2.0

Error: This connector was made for Scala null, it was not meant to run on Scala 2.12 #1214

Closed malhomaid closed 2 months ago

malhomaid commented 2 months ago

Hello,

I'm using the connector in pyspark and I'm facing this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o81.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider com.google.cloud.spark.bigquery.BigQueryRelationProvider could not be instantiated
    at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:582)
    at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:804)
    at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:722)
    at java.base/java.util.ServiceLoader$3.next(ServiceLoader.java:1395)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
    at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
    at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
    at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
    at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
    at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:629)
    at org.apache.spark.sql.streaming.DataStreamReader.loadInternal(DataStreamReader.scala:158)
    at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:145)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.IllegalStateException:  This connector was made for Scala null, it was not meant to run on Scala 2.12
    at com.google.cloud.spark.bigquery.BigQueryUtilScala$.validateScalaVersionCompatibility(BigQueryUtil.scala:37)
    at com.google.cloud.spark.bigquery.BigQueryRelationProvider.<init>(BigQueryRelationProvider.scala:42)
    at com.google.cloud.spark.bigquery.BigQueryRelationProvider.<init>(BigQueryRelationProvider.scala:49)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:780)
    ... 30 more

This is the command I used (I collected the jars with mvn dependency:copy-dependencies and then passed them all explicitly; I'm not sure if there is a better way. I also tried a fat jar, but the Kafka classes were not being picked up from it):

    gcloud dataproc jobs submit pyspark \
      --project systems-staging-ce59 \
      --cluster dataproc-cluster-6c3 \
      --region me-central2 \
      --jars target/dependency/abris_2.12-6.4.0.jar,target/dependency/avro-1.10.1.jar,target/dependency/checker-qual-3.8.0.jar,target/dependency/common-utils-6.2.1.jar,target/dependency/commons-compress-1.21.jar,target/dependency/commons-lang3-3.2.1.jar,target/dependency/commons-logging-1.1.3.jar,target/dependency/commons-pool2-2.11.1.jar,target/dependency/commons_2.12-1.0.0.jar,target/dependency/error_prone_annotations-2.5.1.jar,target/dependency/failureaccess-1.0.1.jar,target/dependency/guava-30.1.1-jre.jar,target/dependency/hadoop-client-api-3.3.4.jar,target/dependency/hadoop-client-runtime-3.3.4.jar,target/dependency/j2objc-annotations-1.3.jar,target/dependency/jackson-annotations-2.10.5.jar,target/dependency/jackson-core-2.11.3.jar,target/dependency/jackson-databind-2.10.5.1.jar,target/dependency/jackson-dataformat-yaml-2.11.1.jar,target/dependency/jakarta.annotation-api-1.3.5.jar,target/dependency/jakarta.inject-2.6.1.jar,target/dependency/jakarta.ws.rs-api-2.1.6.jar,target/dependency/jersey-common-2.34.jar,target/dependency/jsr305-3.0.0.jar,target/dependency/kafka-avro-serializer-6.2.1.jar,target/dependency/kafka-clients-3.4.1.jar,target/dependency/kafka-schema-registry-client-6.2.1.jar,target/dependency/kafka-schema-serializer-6.2.1.jar,target/dependency/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar,target/dependency/lz4-java-1.8.0.jar,target/dependency/osgi-resource-locator-1.0.3.jar,target/dependency/scala-library-2.12.15.jar,target/dependency/slf4j-api-1.7.36.jar,target/dependency/snakeyaml-1.26.jar,target/dependency/snappy-java-1.1.8.4.jar,target/dependency/spark-avro_2.12-3.5.0.jar,target/dependency/spark-bigquery-with-dependencies_2.12-0.37.0.jar,target/dependency/spark-sql-kafka-0-10_2.12-3.5.0.jar,target/dependency/spark-tags_2.12-3.5.0.jar,target/dependency/spark-token-provider-kafka-0-10_2.12-3.5.0.jar,target/dependency/swagger-annotations-1.6.2.jar,target/dependency/swagger-core-1.6.2.jar,target/dependency/swagger-models-1.6.2.jar,target/dependency/xz-1.9.jar \
      --py-files abris.py \
      kafka-to-bigquery.py
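For context, kafka-to-bigquery.py does roughly the following (a minimal sketch, not the actual script: the broker, topic, table, and bucket names are placeholders, and the Abris Avro-decoding step is omitted):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-to-bigquery").getOrCreate()

    # Streaming read from Kafka. This load() is the call that fails above:
    # resolving any source makes Spark's ServiceLoader instantiate every
    # registered DataSourceRegister, including BigQueryRelationProvider.
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
        .option("subscribe", "events")                     # placeholder topic
        .load()
    )

    # (Abris Avro decoding of the Kafka value would happen here.)

    # Streaming write to BigQuery; table and bucket are placeholders.
    (
        events.selectExpr("CAST(value AS STRING) AS payload")
        .writeStream.format("bigquery")
        .option("table", "my_dataset.events")
        .option("temporaryGcsBucket", "my-temp-bucket")
        .option("checkpointLocation", "gs://my-temp-bucket/checkpoints/events")
        .start()
        .awaitTermination()
    )

Note from the stack trace that the failure surfaces on the Kafka readStream.load(), not on a BigQuery call: Spark's source lookup instantiates every DataSourceRegister on the classpath, so a broken BigQuery provider breaks unrelated loads too.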

Maven pom.xml:

    <dependencies>
        <dependency>
            <groupId>com.google.cloud.spark</groupId>
            <artifactId>spark-bigquery-with-dependencies_2.12</artifactId>
            <version>0.37.0</version>
        </dependency>
        <dependency>
            <groupId>za.co.absa</groupId>
            <artifactId>abris_2.12</artifactId>
            <version>6.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
            <version>3.5.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-avro_2.12</artifactId>
            <version>3.5.0</version>
        </dependency>
    </dependencies>

Dataproc version: 2.2.10-debian12
Spark version: 3.5.0
Scala version: 2.12.18

davidrabinowitz commented 2 months ago

Are you using Dataproc? If so, then in both Spark 3.5 offerings (image 2.2 and Serverless runtime 2.2) the connector is built into the image, so you don't need to provide it. My guess is that this is part of the problem. You can change the version or the jar of the built-in connector as explained here.
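One way to apply that in the copy-dependencies workflow above is to skip the connector when collecting the jars (excludeArtifactIds is a standard maven-dependency-plugin parameter) and then drop it from the --jars list:

    mvn dependency:copy-dependencies -DexcludeArtifactIds=spark-bigquery-with-dependencies_2.12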

The other issue is that some of the classes you provide already exist in Spark (like Avro) or interfere with other Spark dependencies (like Guava). I'd recommend creating a shaded jar containing only the dependencies whose Spark-bundled version you don't want to use (Guava is a good candidate for that).
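For instance, a minimal sketch of that shading setup for the pom.xml above, keeping and relocating only Guava (the relocation prefix is arbitrary, and the include list should be extended to whatever else actually conflicts):

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.5.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <!-- Ship only the dependencies Spark does not already provide. -->
                            <artifactSet>
                                <includes>
                                    <include>com.google.guava:guava</include>
                                </includes>
                            </artifactSet>
                            <!-- Relocate Guava classes so they cannot clash with Spark's copy. -->
                            <relocations>
                                <relocation>
                                    <pattern>com.google.common</pattern>
                                    <shadedPattern>shaded.com.google.common</shadedPattern>
                                </relocation>
                            </relocations>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>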

malhomaid commented 2 months ago

@davidrabinowitz Thanks! I stopped providing the connector jar and it worked 👍