delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and with APIs
https://delta.io
Apache License 2.0

[Feature Request] Hive Metastore compatibility between different systems #1045

Open zsxwing opened 2 years ago

zsxwing commented 2 years ago

Feature request

Overview

Currently, when we create a Delta table in the Hive Metastore using different systems, each system stores a different format in the Hive Metastore. This causes the following issue:

Similar issues happen in Presto and Flink as well. It would be great if we could define a unified format in the Hive Metastore for Delta.

Motivation

If we define a unified format in the Hive Metastore, and all systems (Spark, Hive, Presto, Flink) use the same format, then no matter how a table is created, it can be accessed by all of them.
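
As a rough sketch of the desired end state (the table name db.events and its columns are hypothetical, used only for illustration), a table created once from Spark SQL would then be readable from Hive, Presto, or Flink without a second, engine-specific registration step:

-- Created from Spark SQL
CREATE TABLE db.events (id INT, payload STRING)
USING DELTA
LOCATION '/delta/events';

-- Read from any other engine sharing the same Hive Metastore
SELECT * FROM db.events;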

dnskr commented 2 years ago

Hi @zsxwing, I use Spark 3.2.1 (Spark Thrift Server) to write Delta tables and Presto 271 to read them. Both Spark and Presto use a shared Hive Metastore 3.1.2. As far as I can see, I don't have any problems, so could you please elaborate more on the feature request?

zsxwing commented 2 years ago

For example, when reading Delta tables using Hive, we need to run the following command to create an external table:

CREATE EXTERNAL TABLE deltaTable(col1 INT, col2 STRING)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION '/delta/table/path'

In other words, the Delta connector for Hive requires io.delta.hive.DeltaStorageHandler to be present in the Hive Metastore entry for the table.

However, if we use Spark to create a Delta table today, Spark won't write io.delta.hive.DeltaStorageHandler into the Hive Metastore. Then if you use Hive to read such a table, it will fail because Hive doesn't know it needs to use io.delta.hive.DeltaStorageHandler to read it.
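
As a sketch of the Spark side (reusing the table name and path from the Hive command above), the equivalent table would typically be created from Spark SQL like this; the resulting Hive Metastore entry records the data source provider as delta rather than io.delta.hive.DeltaStorageHandler:

-- Spark SQL: registers the table in HMS without the Hive storage handler
CREATE TABLE deltaTable (col1 INT, col2 STRING)
USING DELTA
LOCATION '/delta/table/path';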

In addition, if we run the above command in Hive, it will write io.delta.hive.DeltaStorageHandler into the Hive Metastore. When we use Spark to read such a table, it will throw a ClassNotFoundException because io.delta.hive.DeltaStorageHandler doesn't exist in Spark.

I tried this before and hit the above issue. After we solve it, there may be other incompatibility issues I'm not aware of as well.

TCGOGOGO commented 1 year ago

Hi @zsxwing, is there any plan for when this feature could be done and released?

zsxwing commented 1 year ago

@TCGOGOGO this is a complicated issue. We haven't finished investigating how the HMS works across different engines. There is no ETA right now.

mtthsbrr commented 1 year ago

Hi @zsxwing, I use Spark 3.2.1 (Spark Thrift Server) to write Delta tables and Presto 271 to read them. Both Spark and Presto use a shared Hive Metastore 3.1.2. As far as I can see, I don't have any problems, so could you please elaborate more on the feature request?

@dnskr This is exactly what I'm trying to achieve. Could you share some details about your setup? I would be especially interested in your Spark cluster with all its configs. Right now I'm struggling with the "Spark writing Delta tables via the Hive Metastore in a way that Presto/Trino can read them" part.

dnskr commented 1 year ago

@dnskr This is exactly what I'm trying to achieve. Could you share some details about your setup? I would be especially interested in your Spark cluster with all its configs. Right now I'm struggling with the "Spark writing Delta tables via the Hive Metastore in a way that Presto/Trino can read them" part.

@mtthsbrr I'm running Spark Thrift Server in Kubernetes as a Deployment (de facto as one Pod) with the following command:

/opt/spark/sbin/start-thriftserver.sh --name sts --conf spark.driver.host=$(hostname -I) --conf spark.kubernetes.driver.pod.name=$(hostname) && tail -f /opt/spark/logs/*.out

There is a basic config injected into the Spark pods through a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  ...
data:
  spark-defaults.conf: |-
    spark.master                                      k8s://https://kubernetes.default.svc
    spark.kubernetes.namespace                        namespace-to-use
    spark.kubernetes.container.image                  private.registry/spark:v3.3.2-delta2.2.0
    spark.kubernetes.container.image.pullSecrets      private.registry.secret.key.name
    spark.kubernetes.authenticate.serviceAccountName  spark-service-account
    spark.hive.metastore.uris                         thrift://hive-metastore.hive-metastore-namespace.svc.cluster.local:9083
    spark.hive.server2.enable.doAs                    false
    spark.sql.catalog.spark_catalog                   org.apache.spark.sql.delta.catalog.DeltaCatalog
    spark.sql.extensions                              io.delta.sql.DeltaSparkSessionExtension
    # Driver and Executor resources properties
    ...

The custom private.registry/spark:v3.3.2-delta2.2.0 image is the official Spark image plus the delta-core and delta-storage jars.

Catalog definition for Presto (see the Delta Lake Connector documentation for more details):

connector.name=delta
hive.metastore.uri=thrift://hive-metastore.hive-metastore-namespace.svc.cluster.local:9083

Create table query executed in Spark Thrift Server:

CREATE OR REPLACE TABLE my_delta_table
USING DELTA
AS SELECT * FROM my_parquet_table;
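
For completeness, reading the table back from Presto then goes through the catalog defined above. The catalog and schema names used here (delta, default) are assumptions based on the catalog file name and Spark's default database, not something from the original setup:

-- Presto SQL: fully qualified as catalog.schema.table
SELECT * FROM delta.default.my_delta_table LIMIT 10;
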
jinmu0410 commented 1 year ago

Any conclusions about this? I'm having the same problem now

23/04/07 17:12:42 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table ods.t222 into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
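
One way to check how such a table was persisted in the metastore (a sketch, reusing the table name from the warning above) is to inspect it from Spark SQL; the Provider and Serde fields in the output show whether Hive-compatible information was written:

-- Spark SQL: shows Provider, Location, and Serde details recorded in HMS
DESCRIBE TABLE EXTENDED ods.t222;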

jinmu0410 commented 1 year ago

@zsxwing

murggu commented 1 year ago

@zsxwing I would like to know if this is coming soon. I reproduced the issue as per the doc https://github.com/delta-io/connectors/tree/master/hive:

Such compute-engine interoperability (e.g. Spark/Hive) would be nice, especially in shared HMS scenarios.

umeshpawar2188 commented 1 year ago

When will this be available to use? One of our biggest customers, who is using Hive, is not able to use Delta data because of this.

We need interoperability between Delta tables created in Spark and Hive.

caoergou commented 1 year ago

When will this be available to use? One of our biggest customers, who is using Hive, is not able to use Delta data because of this.

We need interoperability between Delta tables created in Spark and Hive.

I find myself entangled in this bug as well. I'm eagerly awaiting assistance from someone who can help resolve it.

alberttwong commented 8 months ago

Here's an example of getting it to work with Spark SQL, HMS, MinIO S3 and StarRocks. https://github.com/StarRocks/demo/tree/master/documentation-samples/deltalake

wangchao316 commented 4 months ago

Hi @zsxwing, do we have any plans to implement compatibility across different systems?