dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[Unit Testing] Allow explicit precision testing in unit tests #9884

Open emmyoop opened 7 months ago

emmyoop commented 7 months ago


Short description

We cannot currently unit test precision. The expected output values get cast to the model's column data types, which rounds them, so they are not compared as written.
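As a minimal illustration (the exact type dbt infers depends on the model and the adapter; `numeric(10, 2)` is assumed here), the cast rounds the expected value before any comparison happens:

```sql
-- The expected value is cast to the column's inferred type before it is
-- compared to the model's actual output, losing the precision as written:
select
    cast(0.3333 as numeric(10, 2)) as expected_after_cast,  -- 0.33
    0.3333                         as expected_as_written;
```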

from @MichelleArk in #9627

I think it would likely be in get_expected_sql, since that applies casting to the 'expected' structure based on the sql of the model being tested.

I think there are a couple ways we can approach a fix here:

Explicit option for users to opt out of the automatic casting of expect values. E.g. something called '_dbt_skip_casting' or '_dbt_use_literal'. I think this option should only be valid on the expect rows as opposed to input ones, since there isn't any benefit to being able to test your model given a different input schema as those are fixed in production.
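For illustration only, a sketch of what such an opt-out could look like in the test YAML; `_dbt_use_literal` is just one of the names floated above, not an existing option:

```yaml
# Hypothetical sketch -- '_dbt_use_literal' is not implemented; it is one
# of the option names proposed in this issue. The idea: compare expected
# values exactly as written, skipping the automatic cast.
unit_tests:
  - name: test_dollar_conversion
    model: stg_payments
    expect:
      _dbt_use_literal: true
      rows:
        - {payment_id: 3, amount: 0.3333}
```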

Acceptance criteria

Suggested Tests

`model.sql`

```sql
with

source as (

    select * from {{ source('stripe', 'payment') }}

),

staged as (

    select
        id as payment_id,
        orderid as order_id,
        paymentmethod as payment_method,
        status,
        -- amount is stored in cents, convert it to dollars
        round(amount / 100, 2) as amount,
        created as created_at
    from source

)

select * from staged
```

failing unit test yaml

```yaml
unit_tests:
  - name: test_dollar_conversion # this is the unique name of the test
    model: stg_payments # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: source('stripe', 'payment')
        rows:
          - {id: 2, amount: 50}
          - {id: 3, amount: 33.33}
    expect: # the expected output given the inputs above
      rows:
        - {payment_id: 2, amount: 0.5}
        - {payment_id: 3, amount: 0.3333}
```
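Today this first test passes when it should fail: the model yields 0.33 for the second row, but dbt casts the expected 0.3333 before comparing. A rough sketch of the expected-rows SQL dbt builds, assuming `amount` resolves to `numeric(10, 2)` and `payment_id` to an integer:

```sql
-- Sketch of the expected-rows SQL generated today (column types are
-- assumptions; the real types are inferred from the model/adapter):
select cast(2 as integer) as payment_id, cast(0.5 as numeric(10, 2)) as amount
union all
select cast(3 as integer) as payment_id, cast(0.3333 as numeric(10, 2)) as amount
-- cast(0.3333 as numeric(10, 2)) evaluates to 0.33, which equals the
-- model's actual output, so the intended precision mismatch never fails.
```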

passing unit test yaml

```yaml
unit_tests:
  - name: test_dollar_conversion # this is the unique name of the test
    model: stg_payments # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: source('stripe', 'payment')
        rows:
          - {id: 2, amount: 50}
          - {id: 3, amount: 33.33}
    expect: # the expected output given the inputs above
      rows:
        - {payment_id: 2, amount: 0.50}
        - {payment_id: 3, amount: 0.33}
```

Impact to Other Teams

none

Will backports be required?

no

Context

Outcome of spike in https://github.com/dbt-labs/dbt-core/issues/9627

This will require a change to the macros in dbt-adapters and to the tests in dbt-core. The tests in dbt-core will not pass until a new version of dbt-adapters is released.

ChenyuLInx commented 6 months ago

@graciegoheen Can you take a look at the spike outcome here and see whether you think this ticket is worth pursuing?

ChenyuLInx commented 6 months ago

If we are not moving forward on this ticket, we should document that this does not work.

graciegoheen commented 6 months ago

I think: […]

The configuration is: […]

I'll sync with the unit testing squad on what interface would be best here

dbeatty10 commented 6 months ago

What would be the consequences if we cast the expected output to the widest scale/precision/character size for the relevant data type? i.e., numeric without precision and scale rather than numeric(10, 2)?

It might make the unit test fail in the way that we are after.

My understanding is that we use "wide" data types by default when enforcing model contracts. I'm not sure if there's prior art we could re-use here.

### Code example

`models/source_data.sql`

```sql
select 2 as id, 50 as amount
union all
select 3 as id, 33.33 as amount
```

`models/stg_payments.sql`

```sql
with

source as (

    select * from {{ ref("source_data") }}

),

staged as (

    select
        id as payment_id,
        -- amount is stored in cents, convert it to dollars
        cast(round(amount / 100, 2) as numeric(10, 2)) as amount
    from source

)

select * from staged
```

`models/_unit.yml`

```yaml
unit_tests:
  - name: test_dollar_conversion
    model: stg_payments
    given:
      - input: ref("source_data")
        rows:
          - {id: 2, amount: 50}
          - {id: 3, amount: 33.33}
    expect:
      rows:
        - {payment_id: 2, amount: 0.5}
        - {payment_id: 3, amount: 0.3333}
```

```shell
dbt build
cp target/run/my_project/models/_unit.yml/models/test_dollar_conversion.sql analyses/test_dollar_conversion__current.sql
cp analyses/test_dollar_conversion__current.sql analyses/test_dollar_conversion__proposed.sql
```

Then modify `analyses/test_dollar_conversion__proposed.sql` like this:

```diff
55c55
< cast(0.5 as numeric(10,2))
---
> cast(0.5 as numeric)
63c63
< cast(0.3333 as numeric(10,2))
---
> cast(0.3333 as numeric)
```

```shell
dbt show -s analyses/test_dollar_conversion__current.sql
dbt show -s analyses/test_dollar_conversion__proposed.sql
```

Experiment

I tried changing this line to use column.dtype (the bare type name, e.g. numeric) instead of column.data_type (which includes precision and scale, e.g. numeric(10, 2)).

It worked when I tried it, but only if I quoted the "0.50" portion.

```yaml
    expect:
      rows:
        - {payment_id: 2, amount: "0.50"}
        - {payment_id: 3, amount: 0.33}
```

Otherwise, I got an error:

[error screenshot]

I didn't try any adapters other than dbt-postgres. The behavior might depend on the way each Python driver renders the value to a string 🤷.

Using the sql format like this, along with the data_type -> dtype change, also worked:

```yaml
    expect:
        format: sql
        rows: |
          select 2 as payment_id, cast(0.50 as numeric(10, 2)) as amount union all
          select 3 as payment_id, cast(0.33 as numeric(10, 2)) as amount
```

KaneMorgan commented 3 months ago

We ran into this and had a related discussion on the dbt Slack here.

What would be the consequences if we cast the expected output to the widest scale/precision/character size for the relevant data type?

For Redshift at least, casting to just numeric results in numeric(18, 0).
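A minimal sketch of that consequence (Redshift defaults an unqualified numeric to numeric(18, 0)):

```sql
-- On Redshift, numeric with no precision/scale means numeric(18, 0),
-- so a "wide" cast would round the expected value to an integer:
select
    cast(0.3333 as numeric)        as wide_cast,    -- 0 on Redshift
    cast(0.3333 as numeric(10, 4)) as precise_cast; -- 0.3333
```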

An option to use the types defined in our contract for the model would have been a nice way to solve this from our perspective. Being not too familiar with the implementation, I thought the contract was the source of truth for the expected data types. But that may not be possible if the contract uses "wide" data types.
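For illustration, a model contract already pins exact column types (standard contract syntax below); having unit tests cast expected values to these contracted types, rather than inferred or "wide" ones, is the suggestion here, not current behavior:

```yaml
# Standard dbt contract syntax; reusing these types for unit test
# expected-value casting is the (hypothetical) suggestion above.
models:
  - name: stg_payments
    config:
      contract:
        enforced: true
    columns:
      - name: payment_id
        data_type: int
      - name: amount
        data_type: numeric(10, 2)
```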