dbt-labs / dbt-spark

dbt-spark contains all of the code enabling dbt to work with Apache Spark and Databricks
https://getdbt.com
Apache License 2.0
405 stars 227 forks source link

[ADAP-931] [Bug] Values in seeds that should convert to `null` aren't working for `session` connection method #901

Open vakarisbk opened 1 year ago

vakarisbk commented 1 year ago

Is this a new bug in dbt-spark?

Current Behavior

Null values in csv seeds appear as string "None" or empty string in the materialized table when using Spark Session connection or the Spark Connect method in development on #899

Expected Behavior

Null values in seeds should be materialized as null values.

Steps To Reproduce

  1. Create a seed with null values

seed_null.csv:

type,value1,value2
empty,,
doublequotes,""," "
ticks,'',''
spaces, ,
stringnone,None,None
stringnull,Null,Null
nothing
  1. Create a model for testing null values:

test_null.sql:

SELECT
  type,
  value1,
  CASE
    WHEN value1 IS NULL THEN 'Actual NULL'
    WHEN value1 = 'None' THEN 'String "None"'
    WHEN value1 = '' THEN 'Empty String'
    ELSE 'Other'
  END AS value1_type,
  value2,
  CASE
    WHEN value2 IS NULL THEN 'Actual NULL'
    WHEN value2 = 'None' THEN 'String "None"'
    WHEN value2 = '' THEN 'Empty String'
    ELSE 'Other'
  END as value2_type
FROM {{ ref('seed_null') }}
  1. Run the model:
  2. Query the created table:
spark.sql("SELECT * from test_null").show()

+------------+------+-------------+------+-------------+
|        type|value1|  value1_type|value2|  value2_type|
+------------+------+-------------+------+-------------+
|     nothing|  None|String "None"|  None|String "None"|
|doublequotes|  None|String "None"|  None|String "None"|
|       empty|  None|String "None"|  None|String "None"|
|  stringnone|  None|String "None"|  None|String "None"|
|  stringnull|  None|String "None"|  None|String "None"|
|      spaces|  None|String "None"|  None|String "None"|
|       ticks|      | Empty String|      | Empty String|
+------------+------+-------------+------+-------------+

Relevant log output

No response

Environment

- OS:
- Python: 3.11.5
- dbt-core: 1.6.5
- dbt-spark: 1.6.0

Additional Context

No response

dbeatty10 commented 9 months ago

Thanks for reaching out @vakarisbk ! And thanks for providing such a well-constructed example 🏆

Could you try this?

dbt build -s +test_null --full-refresh
dbt show --inline "select * from {{ ref('test_null') }} order by type" --limit 10

This worked for me, and here's the output that I got:

16:35:50  Running with dbt=1.6.5
16:35:50  Registered adapter: spark=1.6.2
16:35:50  Unable to do partial parsing because a project config has changed
16:35:51  Found 1 model, 1 seed, 0 sources, 0 exposures, 0 metrics, 398 macros, 0 groups, 0 semantic models
16:35:51  
16:35:55  Concurrency: 1 threads (target='http')
16:35:55  
16:35:55  1 of 2 START seed file dbt_dbeatty.seed_null ................................... [RUN]
16:36:03  1 of 2 OK loaded seed file dbt_dbeatty.seed_null ............................... [CREATE 7 in 7.76s]
16:36:03  2 of 2 START sql view model dbt_dbeatty.test_null .............................. [RUN]
16:36:05  2 of 2 OK created sql view model dbt_dbeatty.test_null ......................... [OK in 2.42s]
16:36:06  
16:36:06  Finished running 1 seed, 1 view model in 0 hours 0 minutes and 15.45 seconds (15.45s).
16:36:06  
16:36:06  Completed successfully
16:36:06  
16:36:06  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
16:36:08  Running with dbt=1.6.5
16:36:08  Registered adapter: spark=1.6.2
16:36:08  Found 1 model, 1 seed, 0 sources, 0 exposures, 0 metrics, 398 macros, 0 groups, 0 semantic models
16:36:08  
16:36:11  Concurrency: 1 threads (target='http')
16:36:11  
16:36:13  Previewing inline node:
| type         | value1 | value1_type   | value2 | value2_type   |
| ------------ | ------ | ------------- | ------ | ------------- |
| doublequotes |        | Actual NULL   |        | Actual NULL   |
| empty        |        | Actual NULL   |        | Actual NULL   |
| nothing      |        | Actual NULL   |        | Actual NULL   |
| spaces       |        | Actual NULL   |        | Actual NULL   |
| stringnone   | None   | String "None" | None   | String "None" |
| stringnull   |        | Actual NULL   |        | Actual NULL   |
| ticks        | ''     | Other         | ''     | Other         |

Since this looks like it's working, I'm going to close this as "can't reproduce". But please let me know if you still think this is an issue and we can consider re-opening this.

vakarisbk commented 9 months ago

@dbeatty10 thanks for your response. I can see from your logs that you're using target=http which I assume may be connection to a Databricks cluster. This issue only appears in the session connection (local pyspark).

I just re-ran the tests and got my originally reported results back.

dbeatty10 commented 9 months ago

Thank you for that insight @vakarisbk -- will re-open this!

Nice eyes on http vs. session 🦅

Reprex

Using the example here, the following commands do give the expected output when using the ✅ http connection method, but give unexpected output when using the ❌ session method:

dbt build -s +test_null --full-refresh
dbt show --inline "select * from {{ ref('test_null') }} order by type" --limit 10

Acceptance criteria

Summary

We are unable to prioritize this ourselves, so I'm going to lable this as "help wanted".