delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and with APIs for several languages
https://delta.io
Apache License 2.0

[QUESTION] How can one generate Uniform data using local standalone spark? #3217

frosforever commented 3 weeks ago

Question

Describe the problem

How can one use Uniform from a local standalone Spark runtime? The docs state that Uniform requires the Hive Metastore (HMS) but should otherwise work. However, I can't seem to find the incantation required to get the Iceberg metadata to write without failing. Apologies if this is not the correct forum for such questions; please feel free to direct me elsewhere if that's the case. Thanks very much!

Steps to reproduce

// build.sbt dependencies:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core",
  "org.apache.spark" %% "spark-sql",
  "org.apache.spark" %% "spark-hive"
).map(_ % "3.5.0") ++ Seq(
  "io.delta" %% "delta-spark"   % "3.2.0",
  "io.delta" %% "delta-iceberg" % "3.2.0"
)
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .config("spark.sql.catalog.spark_catalog.type", "hadoop")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._ // needed for .toDF below

Seq((1, "a"))
  .toDF("a", "b")
  .write.format("delta")
  .option("delta.enableIcebergCompatV2", "true")
  .option("delta.universalFormat.enabledFormats", "iceberg")
  .option("delta.enableDeletionVectors", false)
  .saveAsTable("spark_catalog.default.test")
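
For reference, the UniForm docs enable these as table properties at creation time; the SQL equivalent of the write above would be something like the following (same table name as the repro; a sketch only, not verified against this setup):

spark.sql("""
  CREATE TABLE spark_catalog.default.test (a INT, b STRING)
  USING DELTA
  TBLPROPERTIES (
    'delta.enableIcebergCompatV2' = 'true',
    'delta.universalFormat.enabledFormats' = 'iceberg'
  )
""")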

Observed results

Blows up with:

15:09:47.086 [        pool-616-thread-1] ERROR          org.apache.spark.util.Utils - Aborting task
org.apache.hadoop.hive.metastore.api.MetaException: Unable to update transaction database java.sql.SQLSyntaxErrorException: Table/View 'NEXT_LOCK_ID' does not exist.
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
...
Caused by: ERROR 42X05: Table/View 'NEXT_LOCK_ID' does not exist.
    at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
    at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
...
15:09:47.073 [        pool-616-thread-1] ERROR org.apache.spark.sql.delta.icebergShaded.IcebergConverter - Error when converting to Iceberg metadata
org.apache.hadoop.hive.metastore.api.MetaException: Unable to update transaction database java.sql.SQLSyntaxErrorException: Table/View 'NEXT_LOCK_ID' does not exist.
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)

Expected results

Table is created with both Delta and Iceberg metadata.

Further details

Does Delta Uniform not play well with Derby, similar to Iceberg (https://github.com/apache/iceberg/issues/8277#issuecomment-1680875724 and https://github.com/apache/iceberg/issues/7847)? Are there other dependencies missing, e.g. "org.apache.iceberg" %% "iceberg-spark-runtime-3.5" % "1.5.2"?
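
One untested theory on the Derby angle: NEXT_LOCK_ID belongs to Hive's transaction schema, which the auto-created embedded Derby metastore may lack. A sketch of pre-initializing the full schema and reusing it (the schematool pre-step, paths, and Derby location are all hypothetical):

import org.apache.spark.sql.SparkSession

// Hypothetical pre-step, run once from a local Hive distribution so that
// transaction tables such as NEXT_LOCK_ID exist in the Derby database:
//   $HIVE_HOME/bin/schematool -dbType derby -initSchema \
//     -url "jdbc:derby:;databaseName=/tmp/hms_db;create=true"

val spark = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  // Point the embedded metastore client at the pre-initialized Derby database
  .config("spark.hadoop.javax.jdo.option.ConnectionURL",
    "jdbc:derby:;databaseName=/tmp/hms_db;create=false")
  .enableHiveSupport()
  .getOrCreate()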

lzlfred commented 2 weeks ago

hey @frosforever Uniform requires HMS as the Iceberg catalog. If one just uses Hadoop as the catalog, it's expected that the Uniform Iceberg conversion cannot find HMS and thus fails.

Requiring HMS is an explicit design decision, because file-system-based Iceberg (https://iceberg.apache.org/spec/#file-system-tables) requires the file system to support atomic rename, which not all file systems do. Using file-system-based Iceberg on an unsupported file system may lead to race conditions and data corruption.
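
Concretely, that means pointing the session at a running HMS instead of a hadoop-type catalog. A minimal sketch (the thrift URI is a placeholder for wherever a metastore is actually reachable):

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  // External Hive Metastore; replaces the hadoop catalog type from the repro
  .config("spark.hadoop.hive.metastore.uris", "thrift://localhost:9083")
  .enableHiveSupport()
  .getOrCreate()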

frosforever commented 2 weeks ago

Thanks for the response @lzlfred!

I'm not sure I'm completely following. Does the HMS requirement mean that the internally supported HMS with the hadoop catalog type is not supported? Is this due to the use of Derby, and would something similar to what is suggested in https://github.com/apache/iceberg/issues/7847#issuecomment-2008290040 be required for Uniform as well?

Is there any way to write a Uniform table using local Spark, for example in integration tests, without setting up an externally running Hive?
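
For example, would spinning up a standalone metastore in Docker from the test suite work? A sketch of what I have in mind, assuming the apache/hive image's metastore mode and Testcontainers (image tag and port are assumptions):

import org.testcontainers.containers.GenericContainer
import org.testcontainers.utility.DockerImageName

// Standalone Hive Metastore in a container; apache/hive images start one
// when SERVICE_NAME=metastore (the tag here is an assumption)
val hms = new GenericContainer(DockerImageName.parse("apache/hive:4.0.0"))
hms.withEnv("SERVICE_NAME", "metastore")
hms.withExposedPorts(9083)
hms.start()

// Pass this to spark.hadoop.hive.metastore.uris when building the session
val metastoreUri = s"thrift://${hms.getHost}:${hms.getMappedPort(9083)}"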