ronco opened this issue 1 month ago
Hi! I believe we do support JSON structured types in unit test definitions using format: dict
(see our docs here).
You'll need to use quotes though, like so:
- input: ref('table_with_super')
  format: dict
  rows:
    - { structured_column: '{"foo": "bar"}', int_column: 123 }
Hey @graciegoheen , thanks for getting back to me. I've given your suggestion a try and I'm still running into difficulty. When I have this input:
- { created_at: '2024-05-17 21:05:00', extra_info: '{"amount_usd": 50}' }
it generates this SQL for the fixture data: cast('{"amount_usd": 50}' as super)
When I try to query that fixture I get NULL:
with input_data as (select cast('{"amount_usd": 50}' as super) as extra_info)
select extra_info.amount_usd from input_data
amount_usd|
----------+
|
It seems the straight casting of the string to super isn't decomposing the data into structured values. However, if I switch the SQL to use json_parse, I get the expected value:
with input_data as (select cast(json_parse('{"amount_usd": 50}') as super) as extra_info)
select extra_info.amount_usd from input_data
amount_usd|
----------+
50 |
Is there a setting I'm missing for the Redshift adapter perhaps to make this work? Any other suggestions?
Thanks!
Hey @ronco -- @graciegoheen and I discussed this a little bit yesterday.
You are spot-on with the following:
It seems the straight casting of the string to super isn't decomposing the data into structured values
This is because dbt unit testing doesn't do any unnesting / decomposition of the SUPER object -- it just leaves it as the entire object.
In order to do a unit test for one of the nested values, then you'll need to supply expected values for all of the nested values as well. See below for a full example.
If this doesn't work for you, could you share an example of your YAML for your unit test and your SQL models and we'll take another look?
It looks like the key insight in dbt-redshift is that:
You must specify all fields in a Redshift SUPER in a unit test. You cannot use only a subset of fields in a SUPER.
This is similar to a caveat in dbt-bigquery:
You must specify all fields in a BigQuery STRUCT in a unit test. You cannot use only a subset of fields in a STRUCT.
models/stg_data_types.sql
select
1 as int_field,
2.0 as float_field,
1 as numeric_field,
'my_string' as str_field,
'my,cool''string' as str_escaped_field,
true as bool_field,
cast('2020-01-02' as date) as date_field,
cast('2013-11-03 00:00:00-0' as timestamp without time zone) as timestamp_field,
cast('2013-11-03 00:00:00-0' as timestamp with time zone) as timestamptz_field,
cast(json_parse('{"bar": "baz", "balance": 7.77, "active": false}') as super) as json_field
models/fct_data_types.sql
select * from {{ ref("stg_data_types") }}
unit_tests:
  - name: test_my_data_types
    model: fct_data_types
    given:
      - input: ref('stg_data_types')
        rows:
          - int_field: 1
            float_field: 2.0
            numeric_field: 1
            str_field: my_string
            str_escaped_field: "my,cool'string"
            bool_field: true
            date_field: 2020-01-02
            timestamp_field: 2013-11-03 00:00:00-0
            timestamptz_field: 2013-11-03 00:00:00-0
            json_field: '{"bar": "baz", "balance": 7.77, "active": false}'
    expect:
      rows:
        - int_field: 1
          float_field: 2.0
          numeric_field: 1
          str_field: my_string
          str_escaped_field: "my,cool'string"
          bool_field: true
          date_field: 2020-01-02
          timestamp_field: 2013-11-03 00:00:00-0
          timestamptz_field: 2013-11-03 00:00:00-0
          json_field: '{"bar": "baz", "balance": 7.77, "active": false}'
Note how the last line in the unit test includes bar, balance, and active, even though you might only care about testing the value of balance.
Hey @dbeatty10 , thanks for all this guidance. I am a little confused on how to pull this off. We're using our SUPER field for flexible data storage. In other words, there is no underlying defined STRUCT, the content varies from row to row. So I'm not sure what values I would need to add.
As requested, here are sample files illustrating the issue:
stg_data_types.sql:
select
1 as int_field,
2.0 as float_field,
1 as numeric_field,
'my_string' as str_field,
'my,cool''string' as str_escaped_field,
true as bool_field,
cast('2020-01-02' as date) as date_field,
cast('2013-11-03 00:00:00-0' as timestamp without time zone) as timestamp_field,
cast('2013-11-03 00:00:00-0' as timestamp with time zone) as timestamptz_field,
cast('' as super) as json_field
and fct_data_types.sql:
select
    str_field,
    sum(json_field.balance) as sum_balance
from {{ ref("stg_data_types") }}
group by 1
and finally schema.yml:
unit_tests:
  - name: test_my_data_types
    model: fct_data_types
    given:
      - input: ref('stg_data_types')
        rows:
          - int_field: 1
            float_field: 2.0
            numeric_field: 1
            str_field: my_string
            str_escaped_field: "my,cool'string"
            bool_field: true
            date_field: 2020-01-02
            timestamp_field: 2013-11-03 00:00:00-0
            timestamptz_field: 2013-11-03 00:00:00-0
            json_field: '{"bar": "baz", "balance": 7.77, "active": false}'
    expect:
      rows:
        - str_field: 'my_string'
          sum_balance: 7.77
When I run that test I get the following output:
17:06:28 1 of 1 START unit_test fct_data_types::test_my_data_types ...................... [RUN]
17:06:30 1 of 1 FAIL 1 fct_data_types::test_my_data_types ............................... [FAIL 1 in 2.58s]
17:06:31
17:06:31 Finished running 1 unit test in 0 hours 0 minutes and 6.89 seconds (6.89s).
17:06:31
17:06:31 Completed with 1 error and 0 warnings:
17:06:31
17:06:31 Failure in unit_test test_my_data_types (models/dbt-test/schema.yml)
17:06:31
actual differs from expected:
@@,str_field,sum_balance
→ ,my_string,7.77→NULL
I think the problem is the cast is treating the json_field as a plain string, rather than a structured field I can extract data from. Let me know if there's anything else I can provide to get to the bottom of this.
Thanks!
Any further thoughts here? Thanks!
After thinking a bit more about this with @dbeatty10 - it sounds like the request is to update the safe_cast macro to apply json_parse to the string input when casting to SUPER. That way, the SQL statement would look like:
cast(json_parse('{"bar": "baz", "balance": 7.77, "active": false}') as super)
instead of:
cast('{"bar": "baz", "balance": 7.77, "active": false}' as super)
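To make the idea concrete, a project-level macro override along these lines could express it. This is only a sketch under assumptions -- the dispatch name redshift__safe_cast and the simple type check are illustrative here, not shipped dbt-redshift code:

```sql
{% macro redshift__safe_cast(field, type) %}
    {#- Sketch: wrap the literal in json_parse only when targeting SUPER.
        Caveat: as written, this would break plain-TEXT casts such as
        cast('my_text' as super), which is exactly the concern raised below. -#}
    {%- if type | lower == 'super' -%}
        cast(json_parse({{ field }}) as super)
    {%- else -%}
        cast({{ field }} as {{ type }})
    {%- endif -%}
{% endmacro %}
```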
We are worried there may be unintended consequences from making this generalized change - from looking at Redshift's documentation, it looks like the SUPER data type can be used for Parquet, JSON, TEXT, or CSV. So we wouldn't want to add the json_parse in all situations.
For example, if you did want to cast a TEXT to SUPER, you would want the SQL to look like:
cast('my_text' as super)
If it's not a generalization we can make, we'd likely stick to requiring you to input this data type using format: sql.
Hey @graciegoheen , thanks for getting back to me. You've got the gist of the problem. I understand your concern about making it a generalized solution. Thanks for digging in on it.
I'm wondering, is there another approach we could take to get this to work? Doing format: sql can be a bit painful, particularly when working with wide tables. Maybe there could be a way to provide the cast/SQL statement for individual columns? Or maybe json_parse could be applied only when a dict is provided as the value of a field?
Thanks for your consideration.
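That last idea - apply json_parse only when the fixture value is structured - can be sketched outside of dbt. The helper below is hypothetical Python for illustration only (not dbt-redshift code); it shows how the rendered cast could differ based on the fixture value's type, so plain strings would still cast directly:

```python
import json

def render_fixture_cast(value, column_type):
    """Sketch of the proposed rule: wrap in json_parse only when the
    fixture value is structured (a dict or list); otherwise emit a bare
    cast. Hypothetical helper, not part of dbt-redshift."""
    if column_type.lower() == "super" and isinstance(value, (dict, list)):
        literal = "'" + json.dumps(value) + "'"
        return f"cast(json_parse({literal}) as super)"
    # Plain strings keep the current behavior, so TEXT-in-SUPER still works.
    return f"cast('{value}' as {column_type})"
```

Under this rule, the TEXT case from the earlier comment stays a plain cast, while a dict value gets json_parse applied.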
Is this your first time submitting a feature request?
Describe the feature
When writing unit tests that include semi-structured data columns (for example, SUPER in Redshift), fixture data can currently only be staged via SQL. This feature calls for dict fixtures to understand structured data.
An example fixture definition would be something like this:
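For instance, the requested behavior might let a fixture pass structured values directly as YAML mappings rather than quoted strings. This is an illustrative sketch only (the table and column names are assumptions):

```yaml
- input: ref('table_with_super')
  format: dict
  rows:
    - { structured_column: {"foo": "bar"}, int_column: 123 }
```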
Describe alternatives you've considered
This functionality can be achieved using SQL fixtures like so:
SELECT json_parse('{"foo":"bar"}')::SUPER as structured_column
This is quite cumbersome as you have to specify all significant columns in your SQL.
Who will this benefit?
Anyone making use of semi-structured data columns.
Are you interested in contributing this feature?
I am open to contributing if this is of interest.
Anything else?
No response