dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[Feature] Support Semi-Structured Data Columns in Unit Test dict fixture data #10161

Open ronco opened 1 month ago

ronco commented 1 month ago

Is this your first time submitting a feature request?

Describe the feature

When writing unit tests that include semi-structured data columns (for example, SUPER in Redshift), the fixture data can currently only be staged via SQL. This feature request asks for support for structured values in dict fixtures.

An example fixture definition would be something like this:

- input: ref('table_with_super')
  format: dict
  rows:
    - { structured_column: {foo: bar}, int_column: 123 }
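For a dict value like the one above, the rendered fixture SQL would need to parse the serialized value rather than cast it directly, along the lines of (a sketch of the desired rendering, not current behavior):

cast(json_parse('{"foo": "bar"}') as super) as structured_column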

Describe alternatives you've considered

This functionality can be achieved using SQL fixtures like so:

SELECT json_parse('{"foo":"bar"}')::SUPER as structured_column

This is quite cumbersome, as you have to specify all significant columns in your SQL.
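For a wide table that means spelling out every column in a format: sql fixture, along these lines (a sketch using dbt's documented format: sql fixtures; the extra columns are hypothetical):

- input: ref('table_with_super')
  format: sql
  rows: |
    select
        json_parse('{"foo": "bar"}')::SUPER as structured_column,
        123 as int_column,
        'abc' as str_column,
        current_date as created_at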

Who will this benefit?

Anyone making use of semi-structured data columns.

Are you interested in contributing this feature?

I am open to contributing if this is of interest.

Anything else?

No response

graciegoheen commented 1 month ago

Hi! I believe we do support JSON structured types in unit test definitions using format: dict (see our docs here).

You'll need to use quotes, though, like so:

- input: ref('table_with_super')
  format: dict
  rows:
    - { structured_column: '{"foo": "bar"}', int_column: 123 }

ronco commented 1 month ago

Hey @graciegoheen , thanks for getting back to me. I've given your suggestion a try and I'm still running into difficulty. When I have this input:

          - { created_at: '2024-05-17 21:05:00', extra_info: '{"amount_usd": 50}' }

it generates this SQL for the fixture data:

cast('{"amount_usd": 50}' as super)

When I try to query that fixture I get NULL:

with input_data as (
    select cast('{"amount_usd": 50}' as super) as extra_info
)
select extra_info.amount_usd from input_data

amount_usd|
----------+
          |

It seems the straight casting of the string to super isn't decomposing the data into structured values. However, if I switch the SQL to use json_parse I get the expected value:

with input_data as (
    select cast(json_parse('{"amount_usd": 50}') as super) as extra_info
)
select extra_info.amount_usd from input_data

amount_usd|
----------+
50        |

Is there perhaps a setting I'm missing in the Redshift adapter that would make this work? Any other suggestions?

Thanks!

dbeatty10 commented 1 month ago

Hey @ronco -- @graciegoheen and I discussed this a little bit yesterday.

You are spot-on with the following:

It seems the straight casting of the string to super isn't decomposing the data into structured values

This is because dbt unit testing doesn't do any unnesting / decomposition of the SUPER object -- it just leaves it as the entire object.

In order to unit test one of the nested values, you'll need to supply expected values for all of the nested values as well. See below for a full example.

If this doesn't work for you, could you share an example of your YAML for your unit test and your SQL models and we'll take another look?

More detail

It looks like the key constraint in dbt-redshift is similar to this caveat in dbt-bigquery:

You must specify all fields in a BigQuery STRUCT in a unit test. You cannot use only a subset of fields in a STRUCT.

Example

models/stg_data_types.sql

select
    1 as int_field,
    2.0 as float_field, 
    1 as numeric_field,
    'my_string' as str_field,
    'my,cool''string' as str_escaped_field,
    true as bool_field,
    cast('2020-01-02' as date) as date_field,
    cast('2013-11-03 00:00:00-0' as timestamp without time zone) as timestamp_field,
    cast('2013-11-03 00:00:00-0' as timestamp with time zone) as timestamptz_field,
    cast(json_parse('{"bar": "baz", "balance": 7.77, "active": false}') as super) as json_field

models/fct_data_types.sql

select * from {{ ref("stg_data_types") }}

And the unit test YAML:

unit_tests:
  - name: test_my_data_types
    model: fct_data_types
    given:
      - input: ref('stg_data_types')
        rows:
         - int_field: 1
           float_field: 2.0
           numeric_field: 1
           str_field: my_string
           str_escaped_field: "my,cool'string"
           bool_field: true
           date_field: 2020-01-02
           timestamp_field: 2013-11-03 00:00:00-0
           timestamptz_field: 2013-11-03 00:00:00-0
           json_field: '{"bar": "baz", "balance": 7.77, "active": false}'
    expect:
      rows:
         - int_field: 1
           float_field: 2.0
           numeric_field: 1
           str_field: my_string
           str_escaped_field: "my,cool'string"
           bool_field: true
           date_field: 2020-01-02
           timestamp_field: 2013-11-03 00:00:00-0
           timestamptz_field: 2013-11-03 00:00:00-0
           json_field: '{"bar": "baz", "balance": 7.77, "active": false}'

Note how the last line in the unit test includes bar, balance, and active, even though you might only care about testing the value of balance.

ronco commented 1 month ago

Hey @dbeatty10 , thanks for all this guidance. I'm a little confused about how to pull this off. We're using our SUPER field for flexible data storage. In other words, there is no underlying defined STRUCT; the content varies from row to row. So I'm not sure what values I would need to add.

As requested, here are sample files illustrating the issue.

stg_data_types.sql:

select
    1 as int_field,
    2.0 as float_field, 
    1 as numeric_field,
    'my_string' as str_field,
    'my,cool''string' as str_escaped_field,
    true as bool_field,
    cast('2020-01-02' as date) as date_field,
    cast('2013-11-03 00:00:00-0' as timestamp without time zone) as timestamp_field,
    cast('2013-11-03 00:00:00-0' as timestamp with time zone) as timestamptz_field,
    cast('' as super) as json_field

and fct_data_types.sql:

select
    str_field,
    sum(json_field.balance) as sum_balance
from {{ ref("stg_data_types") }}
group by 1

and finally schema.yml:

unit_tests:
  - name: test_my_data_types
    model: fct_data_types
    given:
      - input: ref('stg_data_types')
        rows:
         - int_field: 1
           float_field: 2.0
           numeric_field: 1
           str_field: my_string
           str_escaped_field: "my,cool'string"
           bool_field: true
           date_field: 2020-01-02
           timestamp_field: 2013-11-03 00:00:00-0
           timestamptz_field: 2013-11-03 00:00:00-0
           json_field: '{"bar": "baz", "balance": 7.77, "active": false}'
    expect:
      rows:
         - str_field: 'my_string'
           sum_balance: 7.77

When I run that test I get the following output:

17:06:28  1 of 1 START unit_test fct_data_types::test_my_data_types ...................... [RUN]
17:06:30  1 of 1 FAIL 1 fct_data_types::test_my_data_types ............................... [FAIL 1 in 2.58s]
17:06:31
17:06:31  Finished running 1 unit test in 0 hours 0 minutes and 6.89 seconds (6.89s).
17:06:31
17:06:31  Completed with 1 error and 0 warnings:
17:06:31
17:06:31  Failure in unit_test test_my_data_types (models/dbt-test/schema.yml)
17:06:31

actual differs from expected:

@@,str_field,sum_balance
→ ,my_string,7.77→NULL

I think the problem is that the cast is treating json_field as a plain string, rather than as a structured field I can extract data from. Let me know if there's anything else I can provide to get to the bottom of this.
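A quick way to confirm this in Redshift (a sketch using Redshift's IS_OBJECT type-information function for SUPER values):

select
    is_object(cast('{"balance": 7.77}' as super)) as cast_is_object,              -- false: a SUPER string scalar
    is_object(cast(json_parse('{"balance": 7.77}') as super)) as parsed_is_object -- true: a SUPER object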

Thanks!

ronco commented 1 month ago

Any further thoughts here? Thanks!

graciegoheen commented 3 weeks ago

After thinking a bit more about this with @dbeatty10 - it sounds like the request is to update the safe_cast macro to apply a json_parse to the string input when casting to SUPER. That way, the SQL statement would look like:

cast(json_parse('{"bar": "baz", "balance": 7.77, "active": false}') as super)

instead of:

cast('{"bar": "baz", "balance": 7.77, "active": false}' as super)

We are worried there may be unintended consequences from making this generalized change. From looking at Redshift's documentation, the SUPER data type can hold data from Parquet, JSON, TEXT, or CSV sources, so we wouldn't want to add json_parse in all situations.

For example, if you did want to cast a TEXT to SUPER, you would want the sql to look like:

cast('my_text' as super)

If it's not a generalization we can make, we'd likely stick to requiring you to input this data type using format: sql.
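In the meantime, a project-level override might work for projects whose SUPER columns always hold JSON. This is an untested sketch, assuming dbt's adapter dispatch picks up a redshift__safe_cast macro defined in the root project; note it would break the cast-TEXT-to-SUPER case above:

-- macros/safe_cast.sql
{% macro redshift__safe_cast(field, type) %}
    {%- if type.lower() == 'super' -%}
        {# parse the string literal into a SUPER object instead of casting it to a SUPER string scalar #}
        cast(json_parse({{ field }}) as super)
    {%- else -%}
        cast({{ field }} as {{ type }})
    {%- endif -%}
{% endmacro %}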

ronco commented 3 weeks ago

Hey @graciegoheen , thanks for getting back to me. You've got the gist of the problem. I understand your concern about making it a generalized solution, and thanks for digging in on it. I'm wondering, is there another approach we could take to get this to work? Doing format: sql can be a bit painful, particularly when working with wide tables. Maybe there could be a way to provide the cast/SQL statement for individual columns? Or maybe json_parse could be applied only when a dict is provided as the value of a field, as sketched below? Thanks for your consideration.
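To make that last idea concrete, using the fixture from the top of this issue (hypothetical behavior, not current dbt syntax):

- input: ref('table_with_super')
  format: dict
  rows:
    # dict value -> would render as cast(json_parse('{"foo": "bar"}') as super)
    # string value -> would keep rendering as cast('my_text' as super), preserving the TEXT case
    - { structured_column: {foo: bar}, text_column: my_text }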