apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Difference of show create table of hudi table from spark-sql and hive #11967

Open bithw1 opened 17 hours ago

bithw1 commented 17 hours ago

I am trying Hudi 0.15.0 with Spark 3.3.0.

I have put hive-site.xml under my $SPARK_HOME/conf, and I start spark-sql with the following command:

spark-sql \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'

Then, I create a table with following DDL from spark-sql cli:

CREATE TABLE hudi_table (
    ts BIGINT,
    uuid STRING,
    rider STRING,
    driver STRING,
    fare DOUBLE,
    city STRING
) USING HUDI
PARTITIONED BY (city)
LOCATION '/tmp/hudi_table'

The table is successfully created, but I have two questions.

  1. When I run show tables from the Hive CLI, hudi_table shows up, so the table definition appears to have been synced to Hive. However, I did not enable Hive sync with configurations such as hoodie.datasource.meta.sync.enable or hoodie.datasource.hive_sync.mode. How can this happen?
  2. When I run show create table from the Hive CLI, it shows an external table, which is correct because I specified the location in the DDL. But when I run show create table from the spark-sql CLI, it shows a non-external (managed) table, which is incorrect. This looks like a bug to me.

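For context on question 1, this behavior is consistent with spark_catalog being set to HoodieCatalog while hive-site.xml points Spark's session catalog at the Hive Metastore: CREATE TABLE then registers the table in HMS directly, without going through Hudi's hive-sync path. A minimal hive-site.xml that would produce this (the metastore URI below is a placeholder, not taken from this report):

```xml
<configuration>
  <!-- Points Spark's session catalog at an external Hive Metastore.
       Any CREATE TABLE issued through spark_catalog is then
       registered in that metastore directly. -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
```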
alberttwong commented 10 hours ago

What environment did you run spark-sql in? There must be Spark configs somewhere telling it where to sync.

If the table is registered in HMS, Glue, or some other data catalog, it isn't an external table.
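One way to check what the metastore actually recorded (a suggested check, not something run in this thread) is to inspect the table from the Hive CLI:

```sql
-- Shows "Table Type:" (MANAGED_TABLE vs EXTERNAL_TABLE) along with
-- the Location and serde details as stored in the Hive Metastore.
DESCRIBE FORMATTED hudi_table;
```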

bithw1 commented 3 hours ago

> What environment did you run spark-sql in? There must be Spark configs somewhere telling it where to sync.
>
> If the table is registered in HMS, Glue, or some other data catalog, it isn't an external table.

I am using CentOS 7 and Spark 3.3.2. I didn't make any extra configuration for Spark or spark-sql, except copying hive-site.xml into the Spark conf dir.
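To confirm which catalog implementation the spark-sql session is actually using (a sketch using standard Spark SQL SET commands, not something the reporter ran):

```sql
-- Prints the active catalog implementation; "hive" means the session
-- catalog is backed by the Hive Metastore from hive-site.xml.
SET spark.sql.catalogImplementation;

-- Confirms spark_catalog was overridden with HoodieCatalog.
SET spark.sql.catalog.spark_catalog;
```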

alberttwong commented 3 hours ago

It has to register the tables somewhere. It may be the Spark bundle you're using. If your use case is to see how things are supposed to work, you can check out the new Hudi docker demo that is being built: https://github.com/alberttwong/onehouse-demos/tree/main/hudi-spark-minio-trino