dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.92k stars 1.63k forks source link

[CT-3496] allow "missing" inputs for inputs where you want to use empty mock data for unit test #9282

Open graciegoheen opened 11 months ago

graciegoheen commented 11 months ago

Is this your first time submitting a feature request?

Describe the feature

Screenshot 2023-12-13 at 12 15 18 PM

Let's say I'm working in the above dbt project.

I want to add a unit test to my dim_wizards - specifically, I want to test my is_valid_email_address logic:

with wizards as (

    select * from {{ ref('stg_wizards') }}

),

worlds as (

    select * from {{ ref('stg_worlds') }}

),

accepted_email_domains as (

    select * from {{ ref('top_level_email_domains') }}

),

check_valid_emails as (

    select  
        wizards.wizard_id,
        wizards.wizard_name,
        wizards.email,
        wizards.phone_number,
        wizards.world_id,

        coalesce (regexp_like(
            wizards.email, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$'
        )
        = true
        and accepted_email_domains.tld is not null,
        false) as is_valid_email_address

    from wizards
    left join accepted_email_domains
        on wizards.email_top_level_domain = lower(accepted_email_domains.tld)

)

select
    check_valid_emails.wizard_id,
    check_valid_emails.wizard_name,
    check_valid_emails.email,
    check_valid_emails.is_valid_email_address,
    check_valid_emails.phone_number,
    worlds.world_name
from check_valid_emails
left join worlds
    on check_valid_emails.world_id = worlds.world_id

Even though dim_wizards depends on three nodes:

I only want to supply mock input data for the nodes that are relevant to my test.

unit_tests:
  - name: a # this is the unique name of the test
    model: dim_wizards # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: ref('stg_wizards')
        rows:
          - {wizard_id: 1, email: cool@example.com,     email_top_level_domain: example.com}
          - {wizard_id: 2, email: cool@unknown.com,     email_top_level_domain: unknown.com}
          - {wizard_id: 3, email: badgmail.com,         email_top_level_domain: gmail.com}
          - {wizard_id: 4, email: missingdot@gmailcom,  email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
    expect: # the expected output given the inputs above
      rows:
        - {wizard_id: 1, is_valid_email_address: true}
        - {wizard_id: 2, is_valid_email_address: false}
        - {wizard_id: 3, is_valid_email_address: false}
        - {wizard_id: 4, is_valid_email_address: false}

In this case, that's just a few columns from stg_wizards and a column from top_level_email_domains. The stg_worlds model is completely irrelevant for this unit test.

If I don't include stg_worlds in my unit test definition, however, dbt build doesn't work quite right:

(unit_test_before_models_core) ➜  coalesce-ci-demo-2023 git:(unit_testing_example) ✗ dbt build
18:02:26  Running with dbt=1.8.0-a1
18:02:26  Registered adapter: postgres=1.8.0-a1
18:02:27  Found 6 models, 5 seeds, 0 sources, 0 exposures, 0 metrics, 409 macros, 0 groups, 0 semantic models, 1 unit test
18:02:27  
18:02:27  Concurrency: 5 threads (target='postgres')
18:02:27  
18:02:27  1 of 11 START seed file dbt_dev.orders ......................................... [RUN]
18:02:27  2 of 11 START seed file dbt_dev.top_level_email_domains ........................ [RUN]
18:02:27  3 of 11 START seed file dbt_dev.wands .......................................... [RUN]
18:02:27  4 of 11 START seed file dbt_dev.wizards ........................................ [RUN]
18:02:27  5 of 11 START seed file dbt_dev.worlds ......................................... [RUN]
18:02:27  1 of 11 OK loaded seed file dbt_dev.orders ..................................... [INSERT 3 in 0.32s]
18:02:27  2 of 11 OK loaded seed file dbt_dev.top_level_email_domains .................... [INSERT 5 in 0.32s]
18:02:27  3 of 11 OK loaded seed file dbt_dev.wands ...................................... [INSERT 3 in 0.32s]
18:02:27  5 of 11 OK loaded seed file dbt_dev.worlds ..................................... [INSERT 6 in 0.32s]
18:02:27  4 of 11 OK loaded seed file dbt_dev.wizards .................................... [INSERT 6 in 0.32s]
18:02:27  6 of 11 START sql view model dbt_dev.stg_orders ................................ [RUN]
18:02:27  7 of 11 START sql view model dbt_dev.stg_wands ................................. [RUN]
18:02:27  8 of 11 START sql view model dbt_dev.stg_worlds ................................ [RUN]
18:02:27  9 of 11 START sql view model dbt_dev.stg_wizards ............................... [RUN]
18:02:28  8 of 11 OK created sql view model dbt_dev.stg_worlds ........................... [CREATE VIEW in 0.32s]
18:02:28  6 of 11 OK created sql view model dbt_dev.stg_orders ........................... [CREATE VIEW in 0.33s]
18:02:28  9 of 11 OK created sql view model dbt_dev.stg_wizards .......................... [CREATE VIEW in 0.35s]
18:02:28  7 of 11 OK created sql view model dbt_dev.stg_wands ............................ [CREATE VIEW in 0.35s]
18:02:28  11 of 11 START unit_test valid_email_address ................................... [RUN]
18:02:28  11 of 11 ERROR valid_email_address ............................................. [ERROR in 0.03s]
18:02:28  10 of 11 SKIP relation dbt_dev.dim_wizards ..................................... [SKIP]
18:02:28  12 of 11 SKIP relation dbt_dev.fct_orders ...................................... [SKIP]
18:02:28  
18:02:28  Finished running 5 seeds, 6 view models, 1 unit_test in 0 hours 0 minutes and 1.12 seconds (1.12s).
18:02:28  
18:02:28  Completed with 1 error and 0 warnings:
18:02:28  
18:02:28    Compilation Error in unit_test dim_wizards__valid_email_address (models/marts/unit_tests.yml)
  Unit_Test 'unit_test.coalesce_ci_demo_2023.dim_wizards.valid_email_address' (models/marts/unit_tests.yml) depends on a node named 'dim_wizards__valid_email_address__stg_worlds' which was not found
18:02:28  
18:02:28  Done. PASS=9 WARN=0 ERROR=1 SKIP=2 TOTAL=12

I think this error is because:

If I reference stg_worlds in my unit test definition, but don't supply any mock data...

unit_tests:
  - name: a # this is the unique name of the test
    model: dim_wizards # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: ref('stg_wizards')
        rows:
          - {wizard_id: 1, email: cool@example.com,     email_top_level_domain: example.com}
          - {wizard_id: 2, email: cool@unknown.com,     email_top_level_domain: unknown.com}
          - {wizard_id: 3, email: badgmail.com,         email_top_level_domain: gmail.com}
          - {wizard_id: 4, email: missingdot@gmailcom,  email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
      - input: ref('stg_worlds')
        rows: {}
    expect: # the expected output given the inputs above
      rows:
        - {wizard_id: 1, is_valid_email_address: true}
        - {wizard_id: 2, is_valid_email_address: false}
        - {wizard_id: 3, is_valid_email_address: false}
        - {wizard_id: 4, is_valid_email_address: false}

I get a different error:

(unit_test_before_models_core) ➜  coalesce-ci-demo-2023 git:(unit_testing_example) ✗ dbt build
18:06:22  Running with dbt=1.8.0-a1
18:06:22  Registered adapter: postgres=1.8.0-a1
18:06:23  Encountered an error:
Parsing Error
  Invalid unit_tests config given in FilePath(searched_path='models', relative_path='marts/unit_tests.yml', modification_time=1702490400.4283197, project_root='/Users/gracegoheen/dev/coalesce-ci-demo-2023') @ unit_tests: {'name': 'valid_email_address', 'model': 'dim_wizards', 'given': [{'input': "ref('stg_wizards')", 'rows': [{'wizard_id': 1, 'email': 'cool@example.com', 'email_top_level_domain': 'example.com'}, {'wizard_id': 2, 'email': 'cool@unknown.com', 'email_top_level_domain': 'unknown.com'}, {'wizard_id': 3, 'email': 'badgmail.com', 'email_top_level_domain': 'gmail.com'}, {'wizard_id': 4, 'email': 'missingdot@gmailcom', 'email_top_level_domain': 'gmail.com'}]}, {'input': "ref('top_level_email_domains')", 'rows': [{'tld': 'example.com'}, {'tld': 'gmail.com'}]}, {'input': "ref('stg_worlds')", 'rows': {}}], 'expect': {'rows': [{'wizard_id': 1, 'is_valid_email_address': True}, {'wizard_id': 2, 'is_valid_email_address': False}, {'wizard_id': 3, 'is_valid_email_address': False}, {'wizard_id': 4, 'is_valid_email_address': False}]}} - at path ['given'][2]['rows']: {} is not valid under any of the given schemas

If I don't include rows: at all...

unit_tests:
  - name: a # this is the unique name of the test
    model: dim_wizards # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: ref('stg_wizards')
        rows:
          - {wizard_id: 1, email: cool@example.com,     email_top_level_domain: example.com}
          - {wizard_id: 2, email: cool@unknown.com,     email_top_level_domain: unknown.com}
          - {wizard_id: 3, email: badgmail.com,         email_top_level_domain: gmail.com}
          - {wizard_id: 4, email: missingdot@gmailcom,  email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
      - input: ref('stg_worlds')
    expect: # the expected output given the inputs above
      rows:
        - {wizard_id: 1, is_valid_email_address: true}
        - {wizard_id: 2, is_valid_email_address: false}
        - {wizard_id: 3, is_valid_email_address: false}
        - {wizard_id: 4, is_valid_email_address: false}

I get a different error:

(unit_test_before_models_core) ➜  coalesce-ci-demo-2023 git:(unit_testing_example) ✗ dbt build
18:06:38  Running with dbt=1.8.0-a1
18:06:38  Registered adapter: postgres=1.8.0-a1
18:06:39  Encountered an error:
Parsing Error
  Unable to find seed 'coalesce_ci_demo_2023.stg_worlds' for unit tests in directories: ['seeds']

Describe alternatives you've considered

The work-around I used was adding a dummy single row for stg_worlds:

unit_tests:
  - name: a # this is the unique name of the test
    model: dim_wizards # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: ref('stg_wizards')
        rows:
          - {wizard_id: 1, email: cool@example.com,     email_top_level_domain: example.com}
          - {wizard_id: 2, email: cool@unknown.com,     email_top_level_domain: unknown.com}
          - {wizard_id: 3, email: badgmail.com,         email_top_level_domain: gmail.com}
          - {wizard_id: 4, email: missingdot@gmailcom,  email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
      - input: ref('stg_worlds')
        rows:
          - {world_id: 1}
    expect: # the expected output given the inputs above
      rows:
        - {wizard_id: 1, is_valid_email_address: true}
        - {wizard_id: 2, is_valid_email_address: false}
        - {wizard_id: 3, is_valid_email_address: false}
        - {wizard_id: 4, is_valid_email_address: false}

I would like some way to be able to input no mock data for certain inputs.

Who will this benefit?

No response

Are you interested in contributing this feature?

No response

Anything else?

No response

graciegoheen commented 11 months ago

Did some more digging, and figured out the proper syntax!

unit_tests:
  - name: a # this is the unique name of the test
    model: dim_wizards # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: ref('stg_wizards')
        rows:
          - {wizard_id: 1, email: cool@example.com,     email_top_level_domain: example.com}
          - {wizard_id: 2, email: cool@unknown.com,     email_top_level_domain: unknown.com}
          - {wizard_id: 3, email: badgmail.com,         email_top_level_domain: gmail.com}
          - {wizard_id: 4, email: missingdot@gmailcom,  email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
      - input: ref('stg_worlds')
        rows: []
    expect: # the expected output given the inputs above
      rows:
        - {wizard_id: 1, is_valid_email_address: true}
        - {wizard_id: 2, is_valid_email_address: false}
        - {wizard_id: 3, is_valid_email_address: false}
        - {wizard_id: 4, is_valid_email_address: false}

@dbeatty10 Do we feel comfortable closing this out (and adding to our planned documentation) - or do you think there's any further improvement we could do here?

One enhancement I can think of, is not having to supply the input at all:

unit_tests:
  - name: a # this is the unique name of the test
    model: dim_wizards # name of the model I'm unit testing
    given: # the mock data for your inputs
      - input: ref('stg_wizards')
        rows:
          - {wizard_id: 1, email: cool@example.com,     email_top_level_domain: example.com}
          - {wizard_id: 2, email: cool@unknown.com,     email_top_level_domain: unknown.com}
          - {wizard_id: 3, email: badgmail.com,         email_top_level_domain: gmail.com}
          - {wizard_id: 4, email: missingdot@gmailcom,  email_top_level_domain: gmail.com}
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
    expect: # the expected output given the inputs above
      rows:
        - {wizard_id: 1, is_valid_email_address: true}
        - {wizard_id: 2, is_valid_email_address: false}
        - {wizard_id: 3, is_valid_email_address: false}
        - {wizard_id: 4, is_valid_email_address: false}
graciegoheen commented 8 months ago

Note: if you don't include a seed direct parent, we'll use the seed