dbt-labs / spark-utils

Utility functions for dbt projects running on Spark
https://hub.getdbt.com/fishtown-analytics/spark_utils/latest/
Apache License 2.0
30 stars 15 forks source link

New `assert_true()` breaks when timestamps are NULL #12

Closed foundinblank closed 3 years ago

foundinblank commented 3 years ago

So this PR to assert the existence of timestamps is breaking one of the models in the Segment dbt package.

The segment_web_page_sessionized model runs a lag() on timestamps and then uses dbt_utils.date_diff() to calculate the time difference between a timestamp and its lagged-by-one-row counterpart. Obviously, the first row will always have a NULL previous_tstamp in it, causing assert_true() to fail.

This also diverges from how the default dbt_utils.datediff() macro works (which doesn't any assertion).

What do you think would be the best way to fix it? We could modify the assert_true() macro to skip the assertion if the field is NULL

foundinblank commented 3 years ago

Would something like this be a good fix?

{% macro assert_not_null(function, arg) %}

    nvl2({{function}}({{arg}}), assert_true({{function}}({{arg}}) is not null), null) 

{% endmacro %}

So if {{function}}({{arg}}) is not null, go ahead and assert it, if it's not, return null?

jtcohen6 commented 3 years ago

@foundinblank Ah thanks for raising this! You're right, this does diverge from the standard dbt_utils.datediff() behavior. The motivation here was that SparkSQL was giving an especially unhelpful error if you passed a nonexistent date (e.g. 2021-02-30).

I think your proposal is a good one—let's only assert assert not-null if the value is not explicitly null. Would you be up for opening a PR with that fix?