Open zsxwing opened 2 years ago
Hi @zsxwing, I use Spark 3.2.1 (Spark Thrift Server) to write Delta tables and Presto 271 to read them. Both Spark and Presto use a shared Hive Metastore 3.1.2. I don't have any problems as far as I can see, so could you please elaborate more on the feature request?
For example, when reading Delta tables using Hive, we need to run the following command to create an external table:
CREATE EXTERNAL TABLE deltaTable(col1 INT, col2 STRING)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION '/delta/table/path'
In other words, the Delta connector for Hive requires io.delta.hive.DeltaStorageHandler to be registered in the Hive Metastore.
However, if we use Spark to create a Delta table today, Spark won't write io.delta.hive.DeltaStorageHandler into the Hive Metastore. If you then use Hive to read such a table, it will fail because Hive doesn't know it needs to use io.delta.hive.DeltaStorageHandler to read it.
In addition, if we run the above command in Hive, it will write io.delta.hive.DeltaStorageHandler into the Hive Metastore. When we use Spark to read such a table, it will throw a ClassNotFoundException because io.delta.hive.DeltaStorageHandler doesn't exist in Spark.
I tried this before and hit the above issue. After we solve it, there may be other incompatibility issues I'm not aware of as well.
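To see the mismatch concretely, you can compare what each engine has registered for the table in the metastore. This is only a hedged sketch: deltaTable is the hypothetical table from the example above, and the exact output depends on the Hive and Spark versions in use.
-- In Hive: the storage handler io.delta.hive.DeltaStorageHandler shows up
-- only if the table was created with the CREATE EXTERNAL TABLE ... STORED BY command above.
DESCRIBE FORMATTED deltaTable;
-- In Spark SQL: a table created with USING DELTA is persisted as a data source
-- table, and the storage handler Hive needs is not written to the metastore.
SHOW CREATE TABLE deltaTable;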
Hi @zsxwing Is there any plan when this feature could be done and released?
@TCGOGOGO this is a complicated issue. We haven't finished the entire investigation of how HMS works across different engines. There is no ETA right now.
@dnskr This is exactly what I'm trying to achieve. Could you share some details about your setup? I would be especially interested in your Spark cluster with all its configs. Right now I'm struggling with the "Spark writing Delta tables via the Hive Metastore in a way that Presto/Trino can read it" part.
@mtthsbrr I'm running Spark Thrift Server in Kubernetes as a Deployment (de facto a single Pod) with the following command:
/opt/spark/sbin/start-thriftserver.sh --name sts --conf spark.driver.host=$(hostname -I) --conf spark.kubernetes.driver.pod.name=$(hostname) && tail -f /opt/spark/logs/*.out
There is a basic config injected into the Spark pods through a ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  ...
data:
  spark-defaults.conf: >-
    spark.master k8s://https://kubernetes.default.svc
    spark.kubernetes.namespace namespace-to-use
    spark.kubernetes.container.image private.registry/spark:v3.3.2-delta2.2.0
    spark.kubernetes.container.image.pullSecrets private.registry.secret.key.name
    spark.kubernetes.authenticate.serviceAccountName spark-service-account
    spark.hive.metastore.uris thrift://hive-metastore.hive-metastore-namespace.svc.cluster.local:9083
    spark.hive.server2.enable.doAs false
    spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
    spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension
    # Driver and Executor resources properties
    ...
The custom private.registry/spark:v3.3.2-delta2.2.0 image is the official Spark image plus the delta-core and delta-storage jars.
Catalog definition for Presto (see the Delta Lake Connector documentation for more details):
connector.name=delta
hive.metastore.uri=thrift://hive-metastore.hive-metastore-namespace.svc.cluster.local:9083
Create table query executed in Spark Thrift Server:
CREATE OR REPLACE TABLE my_delta_table
USING DELTA
AS SELECT * FROM my_parquet_table;
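For the read path, the same table can then be queried from Presto through the catalog defined above. A minimal sketch, assuming the catalog is named delta and the table was created in the default schema:
-- Run from the Presto/Trino CLI; catalog and schema names are assumptions
SHOW TABLES FROM delta.default;
SELECT * FROM delta.default.my_delta_table LIMIT 10;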
Any conclusions about this? I'm having the same problem now
23/04/07 17:12:42 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table ods.t222 into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
@zsxwing
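That warning means Spark persisted the table in its own data source format rather than as a Hive-native table, which is the incompatibility described at the top of this issue. A hedged way to confirm it from the Hive side, using the table name from the warning above (the exact properties shown depend on the Spark version):
-- Run in Hive; a table persisted in the "Spark SQL specific format" typically
-- carries a spark.sql.sources.provider=delta table property and no Delta storage handler.
SHOW CREATE TABLE ods.t222;
DESCRIBE FORMATTED ods.t222;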
@zsxwing I would like to know if this is coming soon. I reproduced the issue as per the doc https://github.com/delta-io/connectors/tree/master/hive:
Such compute engine (e.g. Spark/Hive) interoperability would be nice, especially in shared HMS scenarios.
When will this be available to use? One of our biggest customers, who is using Hive, is not able to use Delta data because of this.
We need interoperability between Delta tables created in Spark and Hive.
I'm running into this bug as well and would appreciate help from anyone who can unravel it.
Here's an example of getting it to work with Spark SQL, HMS, MinIO S3 and StarRocks. https://github.com/StarRocks/demo/tree/master/documentation-samples/deltalake
Hi @zsxwing, do we have any plans to implement compatibility across different systems?
Feature request
Overview
Currently, when we create a Delta table in the Hive Metastore using different systems, we store different formats in the Hive Metastore. This causes the following issue:
Similar issues happen in Presto and Flink as well. It would be great if we could define a unified format in the Hive Metastore for Delta.
Motivation
If we define a unified format in the Hive Metastore, and all systems (Spark, Hive, Presto, Flink) use the same format, then no matter how a table is created, it can be accessed by all of them.