dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0

Add Support for Apache Hudi on Apache Spark #189

Closed dht7 closed 1 year ago

dht7 commented 1 year ago

Describe the feature

Add support for Hudi's Hive Sync tool for registering a Hudi source as an external table.

Describe alternatives you've considered

Hudi tables can be registered using the following syntax:

    - name: hudi_tbl
      description: "External table using Hudi format"
      external:
        location: <string>   # S3 file path, GCS file path, DFS path
        using: hudi

This runs the following queries when staging the source tables:

    drop table if exists hudi_tbl
    create table hudi_tbl using hudi location "<location_string>"

While this method works well for file formats such as CSV and Parquet, it can cause issues for Hudi, which natively uses the Hive Sync tool to register tables with the Hive Metastore.
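For contrast, a minimal sketch of how Hudi's native Hive Sync is typically enabled on the write path (the hoodie.datasource.hive_sync.* config names come from the Hudi documentation; showing them as Spark SQL session settings, and the database/table values, are illustrative):

    -- hedged sketch, not this package's behavior: with Hive Sync enabled,
    -- Hudi registers the table in the metastore itself at write time
    set hoodie.datasource.hive_sync.enable = true;
    set hoodie.datasource.hive_sync.mode = hms;            -- sync through the Hive Metastore
    set hoodie.datasource.hive_sync.database = analytics;  -- illustrative target database
    set hoodie.datasource.hive_sync.table = hudi_tbl;      -- illustrative table name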

Additionally, the above syntax fails when staging a source table into a database whose name differs from the table's original database name:

java.lang.AssertionError: assertion failed: The database names from this hoodie path and this catalog table is not same.
  at scala.Predef$.assert(Predef.scala:223)
  at org.apache.spark.sql.catalyst.catalog.HoodieCatalogTable.initHoodieTable(HoodieCatalogTable.scala:183)
  at org.apache.spark.sql.hudi.command.CreateHoodieTableCommand.run(CreateHoodieTableCommand.scala:71)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
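Reading the trace, HoodieCatalogTable.initHoodieTable appears to compare the database name recorded alongside the Hudi table (in its hoodie.properties) with the database of the catalog table being created. A minimal sketch of the constraint, assuming the table at <location_string> was originally written into a hypothetical database named raw:

    -- fails the assertion: catalog database differs from the one stored with the table
    create table analytics.hudi_tbl using hudi location "<location_string>";
    -- succeeds: database name matches the table's original database
    create table raw.hudi_tbl using hudi location "<location_string>";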
codope commented 1 year ago

@vingov Can you please take a look? Basically, the goal is to load the Hudi table on GCS as a source in dbt.

Sarfaraz-214 commented 1 year ago

Please check the feasibility of reading Hudi tables from GCS in dbt via the Hive Metastore.
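For reference, a minimal sketch of that feasibility check (bucket and table names are hypothetical), assuming the Spark session is configured against the Hive Metastore and has a GCS connector available:

    -- register the GCS-backed Hudi table, then read it like any metastore table
    create table if not exists default.hudi_gcs_tbl
    using hudi
    location "gs://my-bucket/warehouse/hudi_gcs_tbl";
    select * from default.hudi_gcs_tbl limit 10;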

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 1 year ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.