markfickett opened this issue 3 months ago
Thanks for reaching out about this @markfickett ! 💡
As you saw, dbt unit tests only support strict equality. We don't have any current plans to expand this to other types of assertions, so you'd need to continue to use one of the alternatives you mentioned.
If we did expand this beyond strict equality, it would essentially mean being able to attach multiple data tests to the actual output of the unit test, rather than just a single expected output. A YAML spec incorporating a data test option might look something like this:
```yaml
unit_tests:
  - name: test_is_valid_email_address
    given:
      - input: ref('top_level_email_domains')
        rows:
          - {tld: example.com}
          - {tld: gmail.com}
    model:
      - name: dim_customers
        data_tests:
          - dbt_expectations.expect_table_row_count_to_equal:
              value: 2
        columns:
          - name: email
            data_tests:
              - not_null
              - dbt_expectations.expect_column_values_to_match_regex:
                  regex: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
```
But since we do not have any current plans to do something like that, I'm going to close this as "not planned".
We're seeing floating point errors even when using `numeric` (albeit without specifying precision and scale). Explicitly writing out the trailing zeros still fails. Is there a way to handle this besides rounding the output of the model?
```
actual differs from expected:

@@,_surrogate_key,resultat
→ ,deadbeef ,300.0→300.00000000000000000000
```
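The trailing-zero mismatch in the diff suggests the comparison is happening on the rendered text rather than the numeric value. A rough Python analogue using the standard-library `decimal` module (illustrative only; this is not dbt's actual comparison code):

```python
from decimal import Decimal

short_scale = Decimal("300.0")
full_scale = Decimal("300.00000000000000000000")

# Numerically the two are the same value...
assert short_scale == full_scale

# ...but the rendered text differs, because Decimal (like an unconstrained
# NUMERIC column in Postgres) remembers the scale it was written with.
assert str(full_scale) == "300.00000000000000000000"
assert str(short_scale) != str(full_scale)
```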
@polmonso which dbt adapter are you using?
Do you happen to know the exact data type, precision, and scale that your model is writing to the database for the `resultat` column?
Postgres. We're using unconstrained `numeric` as the type. I presume we'll have to explicitly state the precision and scale for the unit test comparison to work, since it can't know beforehand how many digits are required. @dbeatty10
This issue makes dbt unit tests useless for any table/column with floats. It's important to have a solution here! (We're on BigQuery, fwiw)
We see this literally half the time for float columns in BigQuery with very simple math; here `avg(x)` is the average of 0.7 and 0.75. I've had to conditionally mask this `avg` with a `round` to get this test to pass.
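The rounding workaround amounts to comparing at a fixed precision instead of bit-for-bit. A Python sketch of the idea, with `round()` standing in for SQL's `ROUND` (illustrative; the precision of 10 decimal places is an arbitrary choice):

```python
# Averaging 0.7 and 0.75 in binary floating point can pick up trailing
# error, since 0.7 has no exact binary representation. Rounding both
# sides to a fixed number of decimal places before comparing avoids
# requiring bit-for-bit equality.
raw_avg = (0.7 + 0.75) / 2

assert round(raw_avg, 10) == round(0.725, 10)
```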
@jasonkb or @mchonofsky do either of you have a simple example that highlights this in BigQuery?
I tried the following files, and it worked for me:
models/my_model.sql

```sql
select 1 as id, avg(x) as score
from unnest([
    cast(0.7 as float64),
    cast(0.75 as float64)
]) as x
```
models/_unit_tests.yml

```yaml
unit_tests:
  - name: dbt_core_10470
    model: my_model
    given: []
    expect:
      rows:
        - {score: 0.725}
```
Run this command:

```shell
dbt build -s models/my_model.sql
```
Got this output:

```
$ dbt build -s models/my_model.sql
01:37:50 Running with dbt=1.8.6
01:37:51 Registered adapter: bigquery=1.8.2
01:37:52 Found 1 model, 473 macros, 1 unit test
01:37:52
01:38:16 Concurrency: 10 threads (target='dev')
01:38:16
01:38:16 1 of 2 START unit_test my_model::dbt_core_10470 ................................ [RUN]
01:38:22 1 of 2 PASS my_model::dbt_core_10470 ........................................... [PASS in 5.82s]
01:38:22 2 of 2 START sql view model dbt_dbeatty.my_model ............................... [RUN]
01:38:23 2 of 2 OK created sql view model dbt_dbeatty.my_model .......................... [CREATE VIEW (0 processed) in 1.85s]
01:38:23
01:38:23 Finished running 1 unit test, 1 view model in 0 hours 0 minutes and 31.81 seconds (31.81s).
01:38:24
01:38:24 Completed successfully
01:38:24
01:38:24 Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
```
@polmonso I see what you are saying about unconstrained numeric in Postgres.
If I use the following files, I get the "actual differs from expected" like you did:
models/my_model.sql

```sql
with numbers as (
    select cast(0.7 as numeric) as x union all
    select cast(0.75 as numeric) as x
)

select avg(x) as score
from numbers
```
models/_unit_tests.yml

```yaml
unit_tests:
  - name: dbt_core_10470
    model: my_model
    given: []
    expect:
      rows:
        - {score: 0.725}
```
But if I change the unit test to the following, then it works (note the quotes that treat it as a string!):
```yaml
unit_tests:
  - name: dbt_core_10470
    model: my_model
    given: []
    expect:
      rows:
        - {score: "0.72500000000000000000"}
```
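Quoting works because the expected value is then compared against the full-scale text that Postgres renders. A small Python `decimal` sketch of two ways to make the comparison line up (illustrative; the 20-digit scale matches what Postgres printed in the diff above):

```python
from decimal import Decimal

expected = Decimal("0.725")                 # what you'd like to write
actual = Decimal("0.72500000000000000000")  # what Postgres avg() renders

# The values are numerically equal, so quoting the full-scale string in
# the YAML makes the rendered comparison match exactly:
assert expected == actual
assert str(actual) == "0.72500000000000000000"

# An alternative: normalize both sides to a common scale before comparing.
assert actual.quantize(Decimal("0.001")) == expected
```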
Hey @dbeatty10 that's pretty close to what I would try to construct were I to try to give you some sample code. For me some of the frustration is that it happens stochastically and I can't always get it to fail. But it looks like https://github.com/dbt-labs/dbt-core/issues/9884 will resolve the issue! Is that your read? I'll poke around in our bigquery instance/dbt repo and see if I can get you some sample code that reliably fails for me.
@mchonofsky Initially, I thought https://github.com/dbt-labs/dbt-core/issues/9884 would resolve this too! But I think it was just a replacement for https://github.com/dbt-labs/dbt-core/issues/9627. The issue there was that an alpha version of our implementation would truncate numeric values before comparison.
Totally understand the frustration related to the stochastic behavior. Searching for the word "deterministic" in the BigQuery docs relating to `float` here and here turns up a couple of caveats. Do you think that could be related?
Is this your first time submitting a feature request?
Describe the feature
When writing expected values in unit tests, some values I'd like to validate are floats. Like other test frameworks, I'd like to have the option to allow some wiggle room with float comparison.
Describe alternatives you've considered
It seems like I can put the full printed precision in my expected values (e.g. 0.0005137135810924898), however this is verbose, I think it will be brittle, and it doesn't really represent what I want to require of the output.

Who will this benefit?
Anyone using unit tests and processing numeric data.
Are you interested in contributing this feature?
No response
Anything else?
This could be similar to `pytest.approx` or the `rtol` and `atol` arguments to `pandas.testing.assert_frame_equal`.