delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, as well as APIs
https://delta.io
Apache License 2.0

ClassNotFoundException: delta.DefaultSource on Spark 3.1.2-2 #2195

Open sinban04 opened 8 months ago

sinban04 commented 8 months ago

Issue Description

Hello, I'm trying to use the delta format on Spark 3.1.2-2 with Scala. I followed the Quick Start guide, found a compatible Delta version on the releases page, and used the Maven repo with Delta 1.0.1 for Spark 3.1.2-2.
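For reference, the Maven coordinates for that combination (matching the io.delta:delta-core_2.12:1.0.1 package I use with spark-shell further down):

```xml
<!-- Delta 1.0.1 for Spark 3.1.x / Scala 2.12 -->
<dependency>
    <groupId>io.delta</groupId>
    <artifactId>delta-core_2.12</artifactId>
    <version>1.0.1</version>
</dependency>
```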

I built with the Delta dependencies and added the configuration during spark-submit.

Command & Configs

spark-submit \
    ... \
    --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
    --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

and in the source

    val spark = SparkSession.builder
                      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
                      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
                      .getOrCreate
...

    df.coalesce(16).write.format("delta").save(outputPath)

or, with `import io.delta.implicits._` in scope for the `.delta` extension method:

    df.coalesce(16).write.delta(outputPath)

Following https://rmoff.net/2023/04/05/using-delta-from-pyspark-java.lang.classnotfoundexception-delta.defaultsource/, I checked that my SparkSession contained all the Delta configs.

Logs

But I got an error with the following log:

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: delta. Please find packages at http://spark.apache.org/third-party-projects.html
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:692)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:746)
        at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:993)
        at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:311)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
        at com.naver.airspace.recsysops.App$.refine(App.scala:66)
        at com.naver.airspace.recsysops.App$.main(App.scala:107)
        at com.naver.airspace.recsysops.App.main(App.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: delta.DefaultSource
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:666)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:666)
        at scala.util.Failure.orElse(Try.scala:224)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
        ... 19 more

Related Issue

I checked the issues on this repo, but most of them use PySpark, which is not my case: https://github.com/delta-io/delta/issues/1013

Besides, I have all the class files in my uber jar (https://github.com/delta-io/delta/issues/700, https://github.com/delta-io/delta/issues/224): not only META-INF/services/org.apache.spark.sql.sources.DataSourceRegister but also all the io.delta classes. (I used the assembly plugin, so I have never run into dependency problems so far.)

  4705 io/delta/
  4706 io/delta/exceptions/
  4707 io/delta/implicits/
  4708 io/delta/sql/
  4709 io/delta/sql/parser/
  4710 io/delta/storage/
  4711 io/delta/tables/
  4712 io/delta/tables/execution/
...
 46738 org/apache/spark/sql/execution/streaming/OffsetHolder$.class
 46739 org/apache/spark/sql/execution/ui/SparkListenerDriverAccumUpdates.class
 46740 org/apache/spark/sql/execution/ui/SparkListenerSQLAdaptiveSQLMetricUpdates.class
 46741 META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
 46742 org/apache/spark/sql/jdbc/H2Dialect$.class
 46743 org/apache/spark/sql/jdbc/DB2Dialect$.class
 46744 org/apache/spark/sql/api/python/PythonSQLUtils.class
...

So the jar seems to contain the DataSourceRegister service file, but Spark still can't find the source (https://github.com/delta-io/delta/issues/947).

Could you help me out with this issue? What am I missing?

sinban04 commented 8 months ago

Spark 3.2.4 w/ Delta 2.0.2

I tried Spark 3.2.4 with Delta 2.0.2 (https://docs.delta.io/latest/releases.html), but unfortunately it returns the same error:

Exception in thread "main" java.lang.ClassNotFoundException:
Failed to find data source: delta. Please find packages at
http://spark.apache.org/third-party-projects.html

        at org.apache.spark.sql.errors.QueryExecutionErrors$.failedToFindDataSourceError(QueryExecutionErrors.scala:443)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:670)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
        at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:852)
        at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
        at io.delta.implicits.package$DeltaDataFrameWriter$.delta$extension(package.scala:59)
        at com.naver.airspace.recsysops.App$.refine(App.scala:65)
        at com.naver.airspace.recsysops.App$.main(App.scala:112)
        at com.naver.airspace.recsysops.App.main(App.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:966)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:191)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:214)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1054)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1063)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: delta.DefaultSource
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:656)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:656)
        at scala.util.Failure.orElse(Try.scala:224)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:656)
        ... 20 more
sinban04 commented 8 months ago

Spark Shell

Spark 3.4.1

When I tried it with spark-shell (https://docs.delta.io/latest/quick-start.html#spark-scala-shell), it works fine with Spark 3.4.1:

spark-shell --packages io.delta:delta-core_2.12:2.4.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

Since this spark-shell works, and PySpark 3.4.1 worked fine as well, why is there an error with Scala Spark?

Spark 3.1.2

But when I tried Spark 3.1.2-2:

bin/spark-shell --packages io.delta:delta-core_2.12:1.0.1 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

(It seems the cause of this particular error is some internal issue on my system; it works fine with the downloaded Spark distribution.)

Error log

```
:: loading settings :: url = jar:file:/home1/aa/SPARK/spark-3.1.2-2-bin-c3s-hadoop/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home1/aa/.ivy2/cache
The jars for the packages stored in: /home1/aa/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-c41fa0c7-130b-40e4-995e-45052ddbd6b7;1.0
        confs: [default]
        found io.delta#delta-core_2.12;1.0.1 in local-m2-cache
        found org.antlr#antlr4;4.7 in local-m2-cache
        found org.antlr#antlr4-runtime;4.7 in local-m2-cache
        found org.antlr#antlr-runtime;3.5.2 in local-m2-cache
        found org.antlr#ST4;4.0.8 in local-m2-cache
        found org.abego.treelayout#org.abego.treelayout.core;1.0.3 in local-m2-cache
        found org.glassfish#javax.json;1.0.4 in local-m2-cache
        found com.ibm.icu#icu4j;58.2 in local-m2-cache
downloading file:/home1/aa/.m2/repository/io/delta/delta-core_2.12/1.0.1/delta-core_2.12-1.0.1.jar ...
        [SUCCESSFUL ] io.delta#delta-core_2.12;1.0.1!delta-core_2.12.jar (2ms)
downloading file:/home1/aa/.m2/repository/org/antlr/antlr4/4.7/antlr4-4.7.jar ...
        [SUCCESSFUL ] org.antlr#antlr4;4.7!antlr4.jar (1ms)
downloading file:/home1/aa/.m2/repository/org/antlr/antlr4-runtime/4.7/antlr4-runtime-4.7.jar ...
        [SUCCESSFUL ] org.antlr#antlr4-runtime;4.7!antlr4-runtime.jar (1ms)
downloading file:/home1/aa/.m2/repository/org/antlr/antlr-runtime/3.5.2/antlr-runtime-3.5.2.jar ...
        [SUCCESSFUL ] org.antlr#antlr-runtime;3.5.2!antlr-runtime.jar (1ms)
downloading file:/home1/aa/.m2/repository/org/abego/treelayout/org.abego.treelayout.core/1.0.3/org.abego.treelayout.core-1.0.3.jar ...
        [SUCCESSFUL ] org.abego.treelayout#org.abego.treelayout.core;1.0.3!org.abego.treelayout.core.jar(bundle) (1ms)
downloading file:/home1/aa/.m2/repository/org/glassfish/javax.json/1.0.4/javax.json-1.0.4.jar ...
        [SUCCESSFUL ] org.glassfish#javax.json;1.0.4!javax.json.jar(bundle) (1ms)
downloading file:/home1/aa/.m2/repository/com/ibm/icu/icu4j/58.2/icu4j-58.2.jar ...
        [SUCCESSFUL ] com.ibm.icu#icu4j;58.2!icu4j.jar (9ms)
:: resolution report :: resolve 8017ms :: artifacts dl 21ms
        :: modules in use:
        com.ibm.icu#icu4j;58.2 from local-m2-cache in [default]
        io.delta#delta-core_2.12;1.0.1 from local-m2-cache in [default]
        org.abego.treelayout#org.abego.treelayout.core;1.0.3 from local-m2-cache in [default]
        org.antlr#ST4;4.0.8 from local-m2-cache in [default]
        org.antlr#antlr-runtime;3.5.2 from local-m2-cache in [default]
        org.antlr#antlr4;4.7 from local-m2-cache in [default]
        org.antlr#antlr4-runtime;4.7 from local-m2-cache in [default]
        org.glassfish#javax.json;1.0.4 from local-m2-cache in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   8   |   8   |   8   |   0   ||   8   |   7   |
        ---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
        [NOT FOUND  ] org.antlr#ST4;4.0.8!ST4.jar (0ms)
        ==== local-m2-cache: tried
          file:/home1/aa/.m2/repository/org/antlr/ST4/4.0.8/ST4-4.0.8.jar
        ::::::::::::::::::::::::::::::::::::::::::::::
        ::              FAILED DOWNLOADS            ::
        :: ^ see resolution messages for details  ^ ::
        ::::::::::::::::::::::::::::::::::::::::::::::
        :: org.antlr#ST4;4.0.8!ST4.jar
        ::::::::::::::::::::::::::::::::::::::::::::::
:::: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [download failed: org.antlr#ST4;4.0.8!ST4.jar]
        at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1429)
        at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

In Short,

sinban04 commented 8 months ago

I spent quite a lot of time on this and figured out some facts. Trying spark-sql across several Spark versions: from Delta 1.2.0 onward, you need to import not only delta-core but also delta-storage; before 1.2.0, delta-core alone is enough. (With those dependencies I could run PySpark without errors.)

Maven dependency problems

I succeeded in running Scala Spark by passing the Delta libraries with the --jars option, as with PySpark, and it reads Delta files fine. However, it still fails when the dependencies are injected through Maven, even with both delta-core and delta-storage declared.

On spark 3.2.4

      <!-- https://mvnrepository.com/artifact/io.delta/delta-core -->
      <dependency>
          <groupId>io.delta</groupId>
          <artifactId>delta-core_2.12</artifactId>
          <version>2.0.2</version>
      </dependency>

      <!-- https://mvnrepository.com/artifact/io.delta/delta-storage -->
      <dependency>
          <groupId>io.delta</groupId>
          <artifactId>delta-storage</artifactId>
          <version>2.0.2</version>
      </dependency>
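One thing I have not ruled out is the resource-merging step of the uber-jar build: if the jar were built with maven-shade-plugin, duplicate META-INF/services files are dropped by default unless a transformer merges them, and a dropped Delta registration produces exactly this delta.DefaultSource lookup failure. A sketch (assuming maven-shade-plugin; the assembly plugin would need its own equivalent):

```xml
<!-- Merge service-loader registrations when shading, so Delta's
     DataSourceRegister entry survives alongside Spark's own. -->
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <executions>
        <execution>
            <phase>package</phase>
            <goals><goal>shade</goal></goals>
            <configuration>
                <transformers>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                </transformers>
            </configuration>
        </execution>
    </executions>
</plugin>
```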

For Scala Spark (as far as I know), it seems clear that this is another dependency problem like the earlier ones (https://github.com/delta-io/delta/issues/224).

sinban04 commented 1 month ago

Sbt

I tried this with sbt, building with the assembly plugin, and it still shows exactly the same error as with Maven.

I added dependency

     "io.delta" % "delta-core_2.12" % "1.0.1",

and it still shows:

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: delta. Please find packages at http://spark.apache.org/third-party-projects.html
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:692)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:746)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:265)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
        at com.naver.airspace.recsysops.Main$.runSpark(Main.scala:85)
        at com.naver.airspace.recsysops.Main$.main(Main.scala:141)
        at com.naver.airspace.recsysops.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: delta.DefaultSource
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:666)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:666)
        at scala.util.Failure.orElse(Try.scala:224)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
        ... 18 more

Are you sure that "io.delta" % "delta-core_2.12" % "1.0.1" pulls in the proper dependencies?
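Since the same error appears with both build tools, the merge behavior of sbt-assembly is also worth checking: its default strategy keeps only one copy of duplicate files, which can drop Delta's DataSourceRegister entry in favor of Spark's. A hedged build.sbt fragment that merges the service files line-by-line instead (syntax for recent sbt-assembly versions; older ones use `assemblyMergeStrategy in assembly`):

```scala
// build.sbt sketch: concatenate distinct lines of duplicate
// service-loader files so Delta's registration is kept, discard the
// rest of META-INF, and pick the first copy of anything else.
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", "services", _*) => MergeStrategy.filterDistinctLines
  case PathList("META-INF", _*)             => MergeStrategy.discard
  case _                                    => MergeStrategy.first
}
```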