aws-samples / dbt-glue

This repository contains the dbt-glue adapter
Apache License 2.0
101 stars 69 forks source link

Fix null values handling in seeds #437

Closed stephanetrou closed 2 months ago

stephanetrou commented 2 months ago

resolves #430

Description

To reproduce the bug create a seed file `myseed.csv' with the following content :

adapter,version
glue,1.8.1
spanner,

Launch dbt seed and you'll get this output

❯ dbt seed
16:35:37  Running with dbt=1.8.6
16:35:37  Registered adapter: glue=1.8.1
16:35:38  Found 1 seed, 879 macros
16:35:38
16:35:38  Concurrency: 1 threads (target='dev-glue')
16:35:38
16:35:38  1 of 1 START seed file dbt.myseed .............................................. [RUN]
16:36:38  Glue adapter: Glue returned `error` for statement None for code

csv = [{"adapter": "glue", "version": "1.8.1"}, {"adapter": "spanner", "version": null}]
df = spark.createDataFrame(csv)
table_name = 'dbt.myseed'
if (spark.sql("show tables in dbt").where("tableName == 'myseed'").count() > 0):
    df.write        .mode("overwrite")        .format("parquet")        .insertInto(table_name, overwrite=True)
else:
    df.write        .option("path", "s3://dbt-glue-test-xxxxxxxxx/simple-test/dbt/myseed")        .format("parquet")        .saveAsTable(table_name)
SqlWrapper2.execute("""select * from dbt.myseed limit 1""")
, NameError: name 'null' is not defined
16:36:38  1 of 1 ERROR loading seed file dbt.myseed ...................................... [ERROR in 60.28s]
16:36:39
16:36:39  Finished running 1 seed in 0 hours 1 minutes and 1.01 seconds (61.01s).
16:36:39
16:36:39  Completed with 1 error and 0 warnings:
16:36:39
16:36:39    Database Error in seed myseed (seeds/myseed.csv)
  GlueCreateCsvFailed
16:36:39
16:36:39  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

As the jausanca write the problem come from json serialization.

The fix use json module to convert the json string to Python dict. Here an exemple of code produce with the fix applied.

❯ dbt seed
16:46:34  Running with dbt=1.8.6
16:46:34  Registered adapter: glue=1.8.1
16:46:35  Found 1 seed, 879 macros
16:46:35
16:46:35  Concurrency: 1 threads (target='dev-glue')
16:46:35
16:46:35  1 of 1 START seed file dbt.myseed .............................................. [RUN]

custom_glue_code_for_dbt_adapter
csv = [{'adapter': 'glue', 'version': '1.8.1'}, {'adapter': 'spanner', 'version': None}]
df = spark.createDataFrame(csv)
table_name = 'dbt.myseed'
if (spark.sql("show tables in dbt").where("tableName == 'myseed'").count() > 0):
    df.write        .mode("overwrite")        .format("parquet")        .insertInto(table_name, overwrite=True)
else:
    df.write        .option("path", "s3://dbt-glue-test-xxxxxxxxx/simple-test/dbt/myseed")        .format("parquet")        .saveAsTable(table_name)
SqlWrapper2.execute("""select * from dbt.myseed limit 1""")

16:46:40  1 of 1 OK loaded seed file dbt.myseed .......................................... [CREATE 2 in 4.70s]
16:46:40
16:46:40  Finished running 1 seed in 0 hours 0 minutes and 5.40 seconds (5.40s).
16:46:40
16:46:40  Completed successfully
16:46:40
16:46:40  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Checklist

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

sanga8 commented 2 months ago

Nice, working on my side. @moomindani could you please take a look when you have some time ? thanks!

moomindani commented 2 months ago

Thank you for your contribution!