aws-samples / dbt-glue

This repository contains the dbt-glue adapter
Apache License 2.0

Error using HUDI with dbt-glue: `HoodieException: 'hoodie.table.name' must be set` #117

Open mmehrten opened 1 year ago

mmehrten commented 1 year ago

Describe the bug

Running dbt run on a simple dbt model with dbt-glue raises HoodieException: 'hoodie.table.name' must be set when trying to use Apache HUDI.

Steps To Reproduce

HUDI is installed via a JAR and a custom connector in AWS Glue (the code runs in GovCloud, where AWS Marketplace extensions are not available).

profiles.yml:

govcloud_demo:
  outputs:
    dev:
      type: glue
      query-comment: Glue DBT
      role_arn: role
      region: us-gov-west-1
      glue_version: "3.0"
      workers: 2
      worker_type: G.1X
      idle_timeout: 10
      schema: "analytics"
      database: "analytics"
      session_provisioning_timeout_in_seconds: 120
      location: "s3://data/path/"
      connections: hudi_connection
      conf: "spark.serializer=org.apache.spark.serializer.KryoSerializer"
      default_arguments: "--enable-metrics=true, --enable-continuous-cloudwatch-log=true, --enable-continuous-log-filter=true, --enable-spark-ui=true, --spark-event-logs-path=s3://logs/path/"
  target: dev

dbt_project.yml:

name: 'govcloud_demo'
version: '1.0.0'
config-version: 2
profile: 'govcloud_demo'
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"
models:
  +file_format: hudi
  govcloud_demo:
    example:
      +materialized: view

Model.sql (note: I have used file_format=hudi here; the same behavior occurs whether it is configured in dbt_project.yml or in the model file):

{{ config(materialized='table') }}
with source_data as (
    select 1 as id,
    "b" AS anothercol
)
select *
from source_data

Expected behavior

The model runs and creates the table in the Glue catalog / S3.

Screenshots and log output

22:45:01      '''), Py4JJavaError: An error occurred while calling o86.sql.
22:45:01    : org.apache.hudi.exception.HoodieException: 'hoodie.table.name' must be set.
22:45:01        at org.apache.hudi.common.config.HoodieConfig.getStringOrThrow(HoodieConfig.java:237)
22:45:01        at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:98)
22:45:01        at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:144)
22:45:01        at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:530)
...

System information

The output of dbt --version:

Core:
  - installed: 1.3.1
  - latest:    1.3.1 - Up to date!

Plugins:
  - spark: 1.3.0 - Up to date!

The operating system you're using: macOS

The output of python --version: Python 3.10.8

armaseg commented 1 year ago

Hello @mmehrten, if I understood correctly, you are only adding hudi-spark3-bundle_2.12-0.12.1.jar to your connector. You might need additional JARs to make it work.

Please try to add the missing JARs (more information in @l-jhon's comment: https://github.com/aws-samples/dbt-glue/issues/92#issuecomment-1289477161) to your connection, or use the JARs directly, as he does, without using a Glue connection.

Thanks,

PS: Thanks also to @l-jhon :)

mmehrten commented 1 year ago

I get the same hoodie.table.name error when providing the HUDI JARs via the extra_jars profiles config.

The hoodie.table.name setting seems like something that would or should be coming from dbt-spark or dbt-glue, based on the DataBrew setup in this immersion day. I'm not familiar with the connector code, but I can try to look for other possible issues today. Any other guidance you can provide would be fantastic!

type: glue
query-comment: Glue DBT
role_arn: ...
region: us-gov-west-1
glue_version: "3.0"
workers: 2
worker_type: G.1X
idle_timeout: 10
schema: "analytics"
database: "analytics"
session_provisioning_timeout_in_seconds: 120
location: "s3://..."
conf: "spark.serializer=org.apache.spark.serializer.KryoSerializer"
extra_jars: s3://.../hudi-utilities-bundle_2.12-0.12.1.jar,s3://.../hudi-spark3.1-bundle_2.12-0.12.1.jar,s3://.../spark-avro_2.12-3.2.2.jar,s3://.../calcite-core-1.32.0.jar
default_arguments: "--enable-metrics=true, --enable-continuous-cloudwatch-log=true, --enable-continuous-log-filter=true, --enable-spark-ui=true, --spark-event-logs-path=s3://.../dbt/"

mmehrten commented 1 year ago

I found the setting in the merge-specific API in the dbt-glue impl.py - maybe something is missing from the other APIs?

Confirmed that changing to materialized='incremental' and incremental_strategy='merge' worked, which makes me think only the incremental materialization is fully supported with HUDI right now?
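
For reference, this is roughly the model configuration that worked for me (reusing the example model above; unique_key is required by the merge strategy, as noted below):

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    file_format='hudi',
    unique_key='id'
) }}
with source_data as (
    select 1 as id,
    "b" as anothercol
)
select *
from source_data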

By the way - leaving out the unique_key configuration gave a very strange error - I'll open a new issue to improve input validation to avoid that in the future, and PR a fix.

mmehrten commented 1 year ago

@armaseg any other ideas about what could be going wrong here?

JARs I'm using here:

hudi-utilities-bundle_2.12-0.12.1.jar
hudi-spark3.1-bundle_2.12-0.12.1.jar
spark-avro_2.12-3.2.2.jar
calcite-core-1.32.0.jar

The error from Glue makes me think that there's a HUDI configuration for the table name that dbt-glue or dbt-spark isn't setting, but I could be wrong.
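
For illustration, this is roughly how that table name is normally supplied on a direct Spark write with Hudi (sketch only; the path and key fields here are hypothetical):

# Minimal sketch of a direct Hudi write outside dbt; 'hoodie.table.name'
# is the option the Glue error reports as missing. The path and key
# fields are hypothetical; 'spark' is the Glue session's SparkSession.
df = spark.createDataFrame([(1, "2023-01-01")], ["id", "created_at"])
(df.write.format("hudi")
    .option("hoodie.table.name", "some_table")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "created_at")
    .mode("overwrite")
    .save("s3://data/path/some_table/"))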

l-jhon commented 1 year ago

When you run the command to check the connection with Glue, what's the output? You can run the command dbt debug --profiles-dir profile for example.

osalloum commented 1 year ago

I am getting the same result using spark.sql with Hudi 0.12.1 and Glue interactive sessions, which dbt-glue uses:

spark.sql("""
create table if not exists some_table using hudi
location 's3://some_bucket/hudi-tables/some_table/' options (
    type = 'cow',
    primaryKey = 'email_id',
    preCombineField = 'created_at'
)
partitioned by (year, month, day) as
SELECT *
FROM source_schema.some_table;""")

I also tried overriding the JAR loading configuration:

%%configure -f
{"conf": {"spark.driver.userClassPathFirst": "true","spark.executor.userClassPathFirst": "true"}}

%number_of_workers 2
%glue_version 3.0
%extra_jars s3://some_bucket/sparkjars/hudi-common-0.12.1.jar,s3://some_bucket/sparkjars/hudi-spark3.1-bundle_2.12-0.12.1.jar,s3://some_bucket/sparkjars/hudi-utilities_2.12-0.12.1.jar,s3://some_bucket/sparkjars/calcite-core-1.32.0.jar
%spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer
%spark_conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
%spark_conf spark.hudi.use.glue.catalog=true
%spark_conf spark.jars.package=org.apache.hudi:hudi-spark3.1-bundle_2.12:0.12.1
%spark_conf spark.sql.hive.convertMetastoreParquet=false

However, if I specify hoodie.table.name in the Spark SQL, I get a different error related to KryoSerializer, even though it is set:

spark.sql("""
create table if not exists some_table using hudi
location 's3://some_bucket/hudi-tables/some_table/' options (
    type = 'cow',
    primaryKey = 'email_id',
    preCombineField = 'created_at',
    hoodie.table.name = 'some_table'
)
partitioned by (year, month, day) as
SELECT *
FROM source_schema.some_table;""")

Py4JJavaError: An error occurred while calling o72.sql.
: org.apache.hudi.exception.HoodieException: hoodie only support org.apache.spark.serializer.KryoSerializer as spark.serializer
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:106)
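
My guess is that the serializer setting is not reaching the session: spark.serializer has to be in place before the SparkContext is created and cannot be changed on a running session. On a plain PySpark setup (sketch only, outside the Glue magics) that would look like:

# Sketch: set the serializer at session build time; changing
# spark.serializer on an already-running session has no effect,
# which may be why the %spark_conf magic above seems to be ignored.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions",
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)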