dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[Bug] Null unit test inputs raise error for non-nullable columns #10418

Closed · smoothml closed this 1 month ago

smoothml commented 1 month ago

Is this a new bug in dbt-core?

Current Behavior

When using ClickHouse, it is common practice not to allow columns to be nullable, so as to improve the efficiency of analytical queries. However, this is not compatible with how unit tests in dbt handle missing values, and so it results in an error such as the following:

DB::Exception: Cannot convert NULL to a non-nullable type: In scope SELECT CAST(1, 'UInt8') AS id, CAST(NULL, 'String') AS foo.

Expected Behavior

If a column is non-nullable and a value is not provided in the test input, the value used should be the appropriate non-null default. For example, ClickHouse would typically use 0 instead of NULL for a non-nullable numeric column.
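To make the expectation concrete, compare the cast dbt currently emits for the omitted column with a sketch of the proposed behavior (the empty-string default for String is an assumption based on ClickHouse's usual defaults for non-nullable types):

-- current: dbt casts the missing value to NULL, which ClickHouse rejects
select CAST(1, 'UInt8') AS id, CAST(NULL, 'String') AS foo

-- expected: substitute the type's non-null default instead
select CAST(1, 'UInt8') AS id, CAST('', 'String') AS foo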

Steps To Reproduce

Consider the following example, based on the output of the dbt init command:

-- my_first_dbt_model.sql
{{ config(materialized='table') }}

select 1 as id, 'a' as foo
union all
select 2 as id, 'b' as foo

-- my_second_dbt_model.sql
select *
from {{ ref('my_first_dbt_model') }}
where id = 1

-- schema.yml
version: 2

models:
  - name: my_first_dbt_model
    description: "A starter dbt model"
    columns:
      - name: id
        data_type: uint64
      - name: foo
        data_type: string
  - name: my_second_dbt_model
    description: "A starter dbt model"
    columns:
      - name: id
        data_type: uint64
      - name: foo
        data_type: string
unit_tests:
  - name: test_not_null
    model: my_second_dbt_model
    given:
      - input: ref('my_first_dbt_model')
        rows:
          - {id: 1}
          - {id: 2}
    expect:
      rows:
        - {id: 1}

Values for the column foo are omitted from the test rows; only the id values are provided. Running dbt build or dbt test then produces the error shown above.

Relevant log output

>>> poetry run dbt build
11:10:02  Running with dbt=1.8.3
11:10:03  Registered adapter: clickhouse=1.8.0
11:10:03  Found 2 models, 443 macros, 1 unit test
11:10:03
11:10:03  Concurrency: 1 threads (target='test')
11:10:03
11:10:03  1 of 3 START sql table model `dap_test`.`my_first_dbt_model` ................... [RUN]
11:10:03  1 of 3 OK created sql table model `dap_test`.`my_first_dbt_model` .............. [OK in 0.07s]
11:10:03  2 of 3 START unit_test my_second_dbt_model::test_not_null ...................... [RUN]
11:10:03  2 of 3 ERROR my_second_dbt_model::test_not_null ................................ [ERROR in 0.04s]
11:10:03  3 of 3 SKIP relation dap_test.my_second_dbt_model .............................. [SKIP]
11:10:03
11:10:03  Finished running 1 table model, 1 unit test, 1 view model in 0 hours 0 minutes and 0.27 seconds (0.27s).
11:10:03
11:10:03  Completed with 1 error and 0 warnings:
11:10:03
11:10:03    Runtime Error in unit_test test_not_null (models/example/schema.yml)
  An error occurred during execution of unit test 'test_not_null'. There may be an error in the unit test definition: check the data types.
   Database Error
    Code: 70.
    DB::Exception: Cannot convert NULL to a non-nullable type: In scope SELECT CAST(1, 'UInt8') AS id, CAST(NULL, 'String') AS foo. Stack trace:

    0. Poco::Exception::Exception(String const&, int) @ 0x000000010ef61640
    1. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000010906f044
    2. DB::Exception::Exception(PreformattedMessage&&, int) @ 0x0000000104cb991c
    3. DB::Exception::Exception<>(int, FormatStringHelperImpl<>) @ 0x0000000104cc5178
    4. DB::(anonymous namespace)::FunctionCast::prepareUnpackDictionaries(std::shared_ptr<DB::IDataType const> const&, std::shared_ptr<DB::IDataType const> const&) const @ 0x000000010b9cb094
    5. DB::(anonymous namespace)::FunctionCast::prepare(std::vector<DB::ColumnWithTypeAndName, std::allocator<DB::ColumnWithTypeAndName>> const&) const @ 0x000000010b9c89e8
    6. DB::QueryAnalyzer::resolveFunction(std::shared_ptr<DB::IQueryTreeNode>&, DB::IdentifierResolveScope&) @ 0x000000010c68e090
    7. DB::QueryAnalyzer::resolveExpressionNode(std::shared_ptr<DB::IQueryTreeNode>&, DB::IdentifierResolveScope&, bool, bool, bool) @ 0x000000010c679854
    8. DB::QueryAnalyzer::resolveExpressionNodeList(std::shared_ptr<DB::IQueryTreeNode>&, DB::IdentifierResolveScope&, bool, bool) @ 0x000000010c678dd8
    9. DB::QueryAnalyzer::resolveProjectionExpressionNodeList(std::shared_ptr<DB::IQueryTreeNode>&, DB::IdentifierResolveScope&) @ 0x000000010c698918
    10. DB::QueryAnalyzer::resolveQuery(std::shared_ptr<DB::IQueryTreeNode> const&, DB::IdentifierResolveScope&) @ 0x000000010c674a04
    11. DB::QueryAnalyzer::resolveUnion(std::shared_ptr<DB::IQueryTreeNode> const&, DB::IdentifierResolveScope&) @ 0x000000010c676848
    12. DB::QueryAnalyzer::resolveExpressionNode(std::shared_ptr<DB::IQueryTreeNode>&, DB::IdentifierResolveScope&, bool, bool, bool) @ 0x000000010c679ae4
    13. DB::QueryAnalyzer::resolveQueryJoinTreeNode(std::shared_ptr<DB::IQueryTreeNode>&, DB::IdentifierResolveScope&, DB::QueryExpressionsAliasVisitor&) @ 0x000000010c69a694
    14. DB::QueryAnalyzer::resolveQuery(std::shared_ptr<DB::IQueryTreeNode> const&, DB::IdentifierResolveScope&) @ 0x000000010c6749c0
    15. DB::QueryAnalyzer::resolveExpressionNode(std::shared_ptr<DB::IQueryTreeNode>&, DB::IdentifierResolveScope&, bool, bool, bool) @ 0x000000010c679764
    16. DB::QueryAnalyzer::resolveQueryJoinTreeNode(std::shared_ptr<DB::IQueryTreeNode>&, DB::IdentifierResolveScope&, DB::QueryExpressionsAliasVisitor&) @ 0x000000010c69a694
    17. DB::QueryAnalyzer::resolveQuery(std::shared_ptr<DB::IQueryTreeNode> const&, DB::IdentifierResolveScope&) @ 0x000000010c6749c0
    18. DB::QueryAnalyzer::resolve(std::shared_ptr<DB::IQueryTreeNode>&, std::shared_ptr<DB::IQueryTreeNode> const&, std::shared_ptr<DB::Context const>) @ 0x000000010c673f88
    19. DB::QueryAnalysisPass::run(std::shared_ptr<DB::IQueryTreeNode>&, std::shared_ptr<DB::Context const>) @ 0x000000010c6738a0
    20. DB::QueryTreePassManager::run(std::shared_ptr<DB::IQueryTreeNode>) @ 0x000000010c5f47fc
    21. DB::(anonymous namespace)::buildQueryTreeAndRunPasses(std::shared_ptr<DB::IAST> const&, DB::SelectQueryOptions const&, std::shared_ptr<DB::Context const> const&, std::shared_ptr<DB::IStorage> const&) @ 0x000000010cd88d40
    22. DB::InterpreterSelectQueryAnalyzer::InterpreterSelectQueryAnalyzer(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context const> const&, DB::SelectQueryOptions const&, std::vector<String, std::allocator<String>> const&) @ 0x000000010cd86e58
    23. DB::InterpreterSelectQueryAnalyzer::getSampleBlock(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context const> const&, DB::SelectQueryOptions const&) @ 0x000000010cd89160
    24. DB::InterpreterCreateQuery::getTablePropertiesAndNormalizeCreateQuery(DB::ASTCreateQuery&, DB::LoadingStrictnessLevel) const @ 0x000000010ccedbec
    25. DB::InterpreterCreateQuery::createTable(DB::ASTCreateQuery&) @ 0x000000010ccf2adc
    26. DB::InterpreterCreateQuery::execute() @ 0x000000010ccfc658
    27. DB::executeQueryImpl(char const*, char const*, std::shared_ptr<DB::Context>, DB::QueryFlags, DB::QueryProcessingStage::Enum, DB::ReadBuffer*) @ 0x000000010d0691b0
    28. DB::executeQuery(String const&, std::shared_ptr<DB::Context>, DB::QueryFlags, DB::QueryProcessingStage::Enum) @ 0x000000010d065fd4
    29. DB::TCPHandler::runImpl() @ 0x000000010dd52f00
    30. DB::TCPHandler::run() @ 0x000000010dd68268
    31. Poco::Net::TCPServerConnection::start() @ 0x000000010f012378
11:10:03
11:10:03  Done. PASS=1 WARN=0 ERROR=1 SKIP=1 TOTAL=3

Environment

- OS: macOS
- Python: 3.12.3
- dbt: 1.8.3
- dbt-clickhouse: 1.8.0

Which database adapter are you using with dbt?

other (mention it in "Additional Context")

Additional Context

We are using ClickHouse.

I previously brought this up in the Slack community and was directed here to raise an issue.

dbeatty10 commented 1 month ago

Thanks for reaching out @smoothml !

It sounds to me like the root cause is that dbt-clickhouse doesn't yet support dbt unit tests rather than this being a bug in dbt-core. So I'm going to close this issue in favor of you opening an issue in dbt-clickhouse instead.

Specifically, it looks like dbt-clickhouse will need to override this line within get_empty_schema_sql:

      {{ cast('null', col['data_type']) }} as {{ col_name }}{{ ", " if not loop.last }}
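For illustration only, a minimal sketch of what such an override could look like in dbt-clickhouse, assuming the dispatch name and iteration shape of the default implementation; clickhouse_default_for is a hypothetical helper, not an existing macro:

{# hypothetical override of the dispatched macro in dbt-clickhouse #}
{% macro clickhouse__get_empty_schema_sql(columns) %}
    select
    {%- for i in columns %}
    {%- set col = columns[i] %}
      {# cast a non-null, type-appropriate default instead of null #}
      {{ cast(clickhouse_default_for(col['data_type']), col['data_type']) }} as {{ col['name'] }}{{ ", " if not loop.last }}
    {%- endfor %}
{% endmacro %}

{% macro clickhouse_default_for(data_type) %}
    {#- hypothetical helper: map a type family to a non-null literal -#}
    {%- set t = (data_type or '') | lower -%}
    {%- if 'string' in t -%} ''
    {%- elif 'int' in t or 'float' in t or 'decimal' in t -%} 0
    {%- else -%} null
    {%- endif -%}
{% endmacro %}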
smoothml commented 1 month ago

Thanks for getting back to me so quickly @dbeatty10. As you recommended, I have raised the issue with dbt-clickhouse.

steffen030 commented 1 month ago

Hello @dbeatty10 , I think the problem is rather in this line, where the unit test boldly assumes that null could be an appropriate value. However, for non-nullable data it is obviously not a great choice of default value. In its current implementation, that piece is also hard for an adapter to override, as it is not an adapter-specific macro.

One might tamper with the safe_cast macro, but it might interfere in a different context. It could make sense to create a separate adapter-specific macro that allows the selection of an appropriate default value.
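Sketching that idea: dbt-core could route the choice of default through an adapter-dispatched macro, keeping null as the global default while letting an adapter substitute a value that is valid for its non-nullable types. The macro name and signature below are hypothetical, not an existing dbt API (clickhouse_default_for is the same hypothetical mapping sketched above):

{# hypothetical dispatch hook in dbt-core #}
{% macro unit_test_missing_value(col) %}
    {{ return(adapter.dispatch('unit_test_missing_value', 'dbt')(col)) }}
{% endmacro %}

{# today's behavior: null for any missing fixture value #}
{% macro default__unit_test_missing_value(col) %}
    {{ return(safe_cast('null', col['data_type'])) }}
{% endmacro %}

{# adapter override: fall back to the type's non-null default #}
{% macro clickhouse__unit_test_missing_value(col) %}
    {{ return(safe_cast(clickhouse_default_for(col['data_type']), col['data_type'])) }}
{% endmacro %}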

dbeatty10 commented 1 month ago

Thanks for identifying the relevant portion of the code @steffen030 ! 🤩

My belief is that null is the best value here for a couple reasons:

  1. Most databases support mandatory Feature ID E131 "Null value support (nulls in lieu of values)" from ANSI SQL by default (rather than requiring an opt-in)
  2. It's the one value that is valid across most data types

It looks like the crux of the issue is that ClickHouse has some restrictions / differences in behavior related to E131 that prevent null from being used in this way. That will naturally have some consequences for the adapter implementation for https://github.com/ClickHouse/dbt-clickhouse/issues/315, possibly including relatively complicated macro overrides.

If it turns out there is a macro abstraction that dbt should consider, we'd welcome a feature request.