mmehrten opened this issue 1 year ago
Hello @mmehrten, if I understand correctly, you are only adding hudi-spark3-bundle_2.12-0.12.1.jar to your connector. You might need additional JARs to make it work.
Please try adding the missing JARs (more information in @l-jhon's comment: https://github.com/aws-samples/dbt-glue/issues/92#issuecomment-1289477161) to your connections, or use the JARs directly as he does, without a Glue connection.
Thanks,
PS: Thanks also to @l-jhon :)
I get the same `hoodie.table.name` error when providing the HUDI JARs via the `extra_jars` profiles config. The `hoodie.table.name` setting seems like something that would / should be coming from `dbt-spark` or `dbt-glue`, based on the DataBrew setup in this immersion day. I'm not familiar with the connector code, but I can try to look for other possible issues today. Any other guidance you can provide would be fantastic!

My profile config:
```yaml
type: glue
query-comment: Glue DBT
role_arn: ...
region: us-gov-west-1
glue_version: "3.0"
workers: 2
worker_type: G.1X
idle_timeout: 10
schema: "analytics"
database: "analytics"
session_provisioning_timeout_in_seconds: 120
location: "s3://..."
conf: "spark.serializer=org.apache.spark.serializer.KryoSerializer"
extra_jars: s3://.../hudi-utilities-bundle_2.12-0.12.1.jar,s3://.../hudi-spark3.1-bundle_2.12-0.12.1.jar,s3://.../spark-avro_2.12-3.2.2.jar,s3://.../calcite-core-1.32.0.jar
default_arguments: "--enable-metrics=true, --enable-continuous-cloudwatch-log=true, --enable-continuous-log-filter=true, --enable-spark-ui=true, --spark-event-logs-path=s3://.../dbt/"
```
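One thing I'm unsure about: Hudi's Spark SQL support also wants `spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension` set. If the dbt-glue Hudi examples still apply, multiple Spark settings are chained inside the single `conf` entry with `--conf` separators, something like (a sketch, not my exact config):

```yaml
conf: "spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
```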
I found the setting in the merge-specific API in the dbt-glue impl.py - maybe something is missing from the other APIs?
Confirmed that changing to `materialized='incremental'` and `incremental_strategy='merge'` worked, so that makes me think only the incremental materialization is fully supported with HUDI right now?
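For reference, here's a minimal sketch of the model shape that worked (table and column names are hypothetical, not my actual project):

```sql
-- models/some_table.sql (hypothetical path)
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='email_id',
    file_format='hudi'
) }}

select * from source_schema.some_table
```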
By the way - leaving out the `unique_key` configuration gave a very strange error - I'll open a new issue to improve input validation to avoid that in the future, and PR a fix.
@armaseg any other ideas about what could be going wrong here?
JARs I'm using here:
hudi-utilities-bundle_2.12-0.12.1.jar
hudi-spark3.1-bundle_2.12-0.12.1.jar
spark-avro_2.12-3.2.2.jar
calcite-core-1.32.0.jar
The error from Glue makes me think that there's a HUDI configuration for the table name that `dbt-glue` or `dbt-spark` isn't setting, but I could be wrong.
When you run the command to check the connection with Glue, what's the output? You can run `dbt debug --profiles-dir profile`, for example.
I am getting the same result using spark.sql with Hudi 0.12.1 and a Glue Interactive Session, which is what dbt uses:
```python
spark.sql("""
    create table if not exists some_table using hudi
    location 's3://some_bucket/hudi-tables/some_table/'
    options (
        type = 'cow',
        primaryKey = 'email_id',
        preCombineField = 'created_at'
    )
    partitioned by (year, month, day) as
    SELECT *
    FROM source_schema.some_table;""")
```
I also tried overriding the JAR loading configuration:

```
%%configure -f
{"conf": {"spark.driver.userClassPathFirst": "true", "spark.executor.userClassPathFirst": "true"}}
%number_of_workers 2
%glue_version 3.0
%number_of_workers 2
%extra_jars s3://some_bucket/sparkjars/hudi-common-0.12.1.jar,s3://some_bucket/sparkjars/hudi-spark3.1-bundle_2.12-0.12.1.jar,s3://some_bucket/sparkjars/hudi-utilities_2.12-0.12.1.jar,s3://some_bucket/sparkjars/calcite-core-1.32.0.jar
%spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer
%spark_conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
%spark_conf spark.hudi.use.glue.catalog=true
%spark_conf spark.jars.package=org.apache.hudi:hudi-spark3.1-bundle_2.12:0.12.1
%spark_conf spark.sql.hive.convertMetastoreParquet=false
```
However, if I specify `hoodie.table.name` in the Spark SQL, I get a different error, related to the KryoSerializer, even though it is set:
```python
spark.sql("""
    create table if not exists some_table using hudi
    location 's3://some_bucket/hudi-tables/some_table/'
    options (
        type = 'cow',
        primaryKey = 'email_id',
        preCombineField = 'created_at',
        hoodie.table.name = 'some_table'
    )
    partitioned by (year, month, day) as
    SELECT *
    FROM source_schema.some_table;""")
```
```
Py4JJavaError: An error occurred while calling o72.sql.
: org.apache.hudi.exception.HoodieException: hoodie only support org.apache.spark.serializer.KryoSerializer as spark.serializer
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:106)
```
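One thing that could explain the serializer error, assuming the `%spark_conf` magic only keeps the last value it is given: the earlier `spark.serializer` setting may be overwritten before the session starts. A hedged sketch that moves everything into the `%%configure` dict (the same form used above) instead of repeated magics:

```
%%configure -f
{"conf": {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    "spark.hudi.use.glue.catalog": "true",
    "spark.sql.hive.convertMetastoreParquet": "false"
}}
```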
Describe the bug
Running `dbt run` on a simple dbt model with dbt-glue gives `HoodieException: 'hoodie.table.name' must be set` when trying to use Apache HUDI.

Steps To Reproduce
HUDI installed via JAR and a custom connector in AWS Glue (the code is running in GovCloud, where AWS Marketplace extensions are not available).
Profiles.yml:
dbt_project.yml:
Model.sql: (note: I have used file_format=hudi here - the same behavior occurs whether it is configured in dbt_project.yml or in the model file.)
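For illustration, a minimal model of the shape described (a hypothetical reconstruction, not the original file):

```sql
-- Model.sql (hypothetical reconstruction)
{{ config(materialized='table', file_format='hudi') }}

select 1 as id, current_timestamp() as created_at
```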
Expected behavior
Model runs and creates the table in the Glue catalog / S3.

Screenshots and log output

System information
The output of `dbt --version`:
The operating system you're using: MacOS
The output of `python --version`: Python 3.10.8