Open adamcunnington-mlg opened 1 year ago
I'm not sure whether a nested data test wrapper is also possible to produce - in a generic fashion - for table-level tests - but we're also exploring this.
Regardless, this FR currently focuses on column-level functionality - which is very doable - and I think very sensible!
Here's our wrapper:
{%- test nested_data_test_wrapper(model, column_name, test) -%}
{%- set match = modules.re.match("^(?:.+\s)?`?(?P<database>.+?)`?\.`?(?P<schema>.+?)`?\.`?(?P<table>.+?)`?(?:\s.+)?$",
model|string) -%}
{%- set relation = adapter.get_relation(match.database, match.schema, match.table) -%}
{%- set column_name_hierarchy = column_name.split(".")|map("trim", "`")|list -%}
{%- set columns = adapter.get_columns_in_relation(relation) -%}
{%- set parent_column = columns|selectattr("name", "equalto", column_name_hierarchy[0])|list -%}
{%- set ns = namespace(ancestor_columns=[], unnest_alias="",
from_sql="FROM (SELECT ROW_NUMBER() OVER () AS row_number, * FROM %s)" % model) -%}
{%- for child_column in parent_column recursive -%}
{%- set ns.child_column_name = child_column.quoted -%}
{%- set ns.ancestor_columns = ns.ancestor_columns + [ns.child_column_name] -%}
{%- set ns.unnest_alias = ns.ancestor_columns|join(".")|string -%}
{%- if child_column.mode == "REPEATED" -%}
{%- set ns.from_sql = ns.from_sql + " CROSS JOIN UNNEST (%s) AS %s" % (ns.unnest_alias, ns.child_column_name) -%}
{%- set ns.ancestor_columns = [ns.child_column_name] -%}
{%- endif -%}
{%- if child_column.fields -%}
{{ loop(child_column.fields|selectattr("name", "equalto", column_name_hierarchy[loop.depth])|list) }}
{%- endif -%}
{%- endfor -%}
{%- set model = "(SELECT row_number, %s AS %s %s)" % (ns.unnest_alias, ns.child_column_name, ns.from_sql) -%}
{%- if test == "accepted_values" -%} -- column-level test
{{ test_accepted_values(model, ns.child_column_name, kwargs.get("values"), kwargs.get("quote", True)) }}
{%- elif test == "dbt_expectations.expect_column_values_to_be_in_set" -%} -- column-level test
{{ dbt_expectations.test_expect_column_values_to_be_in_set(model, ns.child_column_name, kwargs.get("value_set"),
quote_values=kwargs.get("quote_values"),
row_condition=kwargs.get("row_condition")) }}
{%- elif test == "dbt_expectations.expect_column_values_to_be_null" -%} -- column-level test
{{ dbt_expectations.test_expect_column_values_to_be_null(model, ns.child_column_name,
row_condition=kwargs.get("row_condition")) }}
{%- elif test == "dbt_expectations.expect_column_values_to_match_regex" -%} -- column-level test
{{ dbt_expectations.test_expect_column_values_to_match_regex(model, ns.child_column_name, kwargs.get("regex"),
row_condition=kwargs.get("row_condition"),
is_raw=kwargs.get("is_raw", False),
flags=kwargs.get("flags", "")) }}
{%- elif test == "dbt_utils.expression_is_true" -%} -- column-level test
{{ dbt_utils.test_expression_is_true(model, kwargs.get("expression"), ns.child_column_name) }}
{%- elif test == "dbt_utils.mutually_exclusive_ranges" -%} -- model-level test
{%- set partition_by = ["row_number"] -%}
{%- for partition_by_column_name in kwargs.get("partition_by", []) -%}
{%- do partition_by.append("%s.%s" % (ns.child_column_name, partition_by_column_name)) -%}
{%- endfor -%}
{{ dbt_utils.test_mutually_exclusive_ranges(model, lower_bound_column=kwargs.get("lower_bound_column"),
upper_bound_column=kwargs.get("upper_bound_column"),
partition_by=partition_by|join(","),
gaps=kwargs.get("gaps", "allowed"),
zero_length_range_allowed=kwargs.get("zero_length_range_allowed", False)) }}
{%- elif test == "dbt_utils.unique_combination_of_columns" -%} -- model-level test
{%- set combination_of_columns = ["row_number"] -%}
{%- for combination_of_columns_column_name in kwargs.get("combination_of_columns", []) -%}
{%- do combination_of_columns.append("%s.%s" % (ns.child_column_name, combination_of_columns_column_name)) -%}
{%- endfor -%}
{{ dbt_utils.test_unique_combination_of_columns(model, combination_of_columns,
quote_columns=kwargs.get("quote_columns", False)) }}
{%- elif test == "not_null" -%} -- column-level test
{{ test_not_null(model, ns.child_column_name) }}
{%- elif test == "relationships" -%} -- column-level test
{{ test_relationships(model, ns.child_column_name, kwargs.get("to"), kwargs.get("field")) }}
{%- elif test == "unique" -%} -- column-level test
{{ test_unique(model, ns.child_column_name) }}
{%- endif -%}
{%- endtest -%}
and example usage:
columns:
  - name: foo.bar.baz
    tests:
      - nested_data_test_wrapper:
          config:
            where: "`some_other_column` IS NOT NULL" # for example, just showing this still works
          test: accepted_values
          values:
            - value1
            - value2
            - value3
      - nested_data_test_wrapper:
          test: unique
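As an aside, the `modules.re.match` pattern at the top of the wrapper pulls `database`/`schema`/`table` out of the rendered relation string (which may be backtick-quoted). A minimal Python sketch exercising that same pattern outside dbt (the sample relation strings below are illustrative):

```python
import re

# Pattern copied from the wrapper above: optionally skip a leading token,
# then capture the three dot-separated, optionally backtick-quoted parts.
PATTERN = re.compile(
    r"^(?:.+\s)?`?(?P<database>.+?)`?\.`?(?P<schema>.+?)`?\.`?(?P<table>.+?)`?(?:\s.+)?$"
)

def parse_relation(rendered: str) -> dict:
    """Return the database/schema/table parts of a rendered relation string."""
    match = PATTERN.match(rendered)
    if match is None:
        raise ValueError(f"not a relation string: {rendered!r}")
    return match.groupdict()

print(parse_relation("`my-project`.`my_dataset`.`my_table`"))
```

This is what makes `adapter.get_relation(...)` possible inside the test, since the `model` argument arrives as a rendered string rather than a Relation object.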
Heh, no one is interested in this?
@adamcunnington-mlg Sorry for the delay - I'm not uninterested :)
dbt-core doesn't really know anything about nested fields right now. dbt-bigquery knows a little bit more (thanks to https://github.com/dbt-labs/dbt-bigquery/pull/738), but still not very much. As you say, it's worked out nicely that BigQuery lets you reference `foo.bar` if `foo` is a struct — but not, as you rightly point out, if `foo` is an array.
We have some open issues (related to our work on "model contracts") to better handle when `foo` is an array, and to teach dbt-core more about nested fields (by upstreaming the logic from dbt-bigquery):
Your proposal strikes me as totally reasonable:
This can be achieved by having the test wrap do nothing more than setting the model variable to be a sub-query. dbt actually does the exact same thing when you provide config/where. This way, the nested data test can be a totally generic wrapper, constructing a single column of data and then calling the relevant test.
Considerations:
- Only the single constructed column can be passed through to the downstream test, which limits what `--store-failures` can record. I agree that feels like a reasonable tradeoff. For the purposes of testing, we treat the repeated record as if it were its own standalone table.
- The unnest syntax is warehouse-specific: on BigQuery it's `[cross join] unnest`; for SparkSQL it's `explode`; I'm sure there are more. FWIW, I don't think we can support Snowflake's `lateral flatten` here, since it does not have a real concept of structured (typed) nested fields — only `variant` types (JSON strings) where anything goes.
- In dbt-core, we'd need to special-case `.` as an indicator of field nesting. That's true on BigQuery/Spark/etc, but it doesn't have that denotation on Postgres/Redshift/Snowflake, and we'd risk doing the wrong thing with someone's `"weird.Column.Naming.Convention"`.

As far as the implementation - it looks like you took inspiration from JT's excellent discourse post several years ago. I think the goal would be to:
- extend the point where `get_where_subquery` is called — where the goal would be to interpolate the subquery and modify the `column_name` accordingly

While this is a neat problem to think through, I don't see it as very high priority for us to implement. If a member of the community wants to take a crack at this, I'd be all for it.
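For context, the `config/where` mechanism referred to here swaps the relation for a filtered sub-select before the test macro ever sees it. A rough Python sketch of that behaviour (the function name mirrors dbt's hook, but the implementation here is a simplified illustration, not dbt-core source):

```python
from typing import Optional

def get_where_subquery(relation: str, where: Optional[str]) -> str:
    """If a where filter is configured, hand the test a filtered sub-select
    in place of the relation; otherwise pass the relation through unchanged."""
    if where is None:
        return relation
    return f"(select * from {relation} where {where}) dbt_subquery"

# A nested-field-aware version would hook in at this same point, also
# rewriting column_name after adding the necessary UNNESTs.
print(get_where_subquery("`p`.`d`.`t`", "`some_other_column` IS NOT NULL"))
```

Because every generic test simply selects from whatever it is given as `model`, this substitution is invisible to the test itself, which is exactly the property the nested-data wrapper exploits.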
@jtcohen6 thanks for the super-considered response as always! It makes complete sense and sounds like the right implementation route.
I agree on this not being a priority given there is a workaround, and perhaps nested types are still not used as much as they should be, so this impacts the community less than it ought to!
Could you comment on how table-level tests might work? Would it be that once dbt-core is aware of nesting, table-level tests could simply have reference columns within arrays and dbt-core would handle the column-level sub-querying? This feels more complex when a table-level test references multiple columns that are within different arrays / depths. This is probably invalid (logically) but something would still need to "handle" this.
The one thing I did want to call out, though, is that the `foo.bar.baz` syntax is convenient for referencing columns within a repeated column (array), but you could argue it's unfortunate and ambiguous vs. struct/nested references. Some JSONPath implementations would prefer something like `foo[*].bar` to be explicit that `foo` is an array, not a struct. Indeed, it's slightly strange that I can annotate both nested and array columns with `foo.bar`, but I understand why this works thanks to the adapter-level magic. What do you think dbt-core should expect in the future?
Here's our wrapper:
{%- test nested_data_test_wrapper(model, column_name, test, from_model=None) -%}
{%- set ns = namespace(field_selector_part="", column_list=[], unnest_identifier="") -%}
{%- set ns.from_sql = "FROM %s" % model -%}
{%- set hierarchy = column_name.split(".") -%}
{%- set columns = adapter.get_columns_in_relation(model if from_model == None else from_model) -%}
{%- set root = columns | selectattr("name", "equalto", hierarchy[0]) | list -%}
{%- for child in root recursive -%}
    {%- set child_name = "`" ~ child.name ~ "`" -%}
    {%- set ns.column_list = ns.column_list + [child_name] -%}
    {%- set ns.unnest_identifier = ns.column_list | join(".") | string -%}
    {%- if child.mode == "REPEATED" -%}
        {%- set ns.from_sql = ns.from_sql ~ " CROSS JOIN UNNEST (%s) AS %s" % (ns.unnest_identifier, child_name) -%}
        {%- set ns.column_list = [child_name] -%}
    {%- endif -%}
    {%- if child.fields -%}
        {{ loop(child.fields | selectattr("name", "equalto", hierarchy[loop.depth]) | list) }}
    {%- endif -%}
{%- endfor -%}
{%- set model = "(SELECT %s AS `column` %s)" % (ns.unnest_identifier, ns.from_sql) -%}
{%- if test == "accepted_values" -%}
    {{ test_accepted_values(model, "column", kwargs.get("values"), kwargs.get("quote", True)) }}
{%- elif test == "not_null" -%}
    {{ test_not_null(model, "column") }}
{%- elif test == "relationships" -%}
    {{ test_relationships(model, "column", kwargs.get("to"), kwargs.get("field")) }}
{%- elif test == "unique" -%}
    {{ test_unique(model, "column") }}
{%- endif -%}
{%- endtest -%}
and example usage:
columns:
  - name: foo.bar.baz
    tests:
      - nested_data_test_wrapper:
          config:
            where: "`some_other_column` IS NOT NULL" # for example, just showing this still works
          test: accepted_values
          values:
            - value1
            - value2
            - value3
      - nested_data_test_wrapper:
          test: unique
@adamcunnington-mlg thank you very much for this proposition. I'm stealing your code because it's too good ! Great job 👌 🙏
This is amazing - Thank you very much, this saved my day! 🙏
P.S. I have updated the test wrapper code above, as there have been a few bug fixes and additional tests added since then.
Please, this needs to be fixed - I already spent a day figuring out what the issue was. Who can fix this issue with dbt? @adamcunnington-mlg
Is there any way around this - a quick fix? @adamcunnington-mlg
@msiddiq1400 what issue are you talking about - you've not specified.
This feature seems to have already been implemented.
@vittorfp really? where - can you share link to updated docs? was this in 1.8/1.9?
Is this your first time submitting a feature request?
Describe the feature
Context
When adding column metadata, for example, to a model YML file, it is already possible to reference a nested column in order to add a description. E.g.
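For illustration (the model and column names here are hypothetical, not from the original post), the kind of yaml being referred to:

```yaml
models:
  - name: my_model          # hypothetical model name
    columns:
      - name: foo.bar       # nested column referenced via dot notation
        description: "A field nested inside the `foo` struct (or array)"
```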
Presumably, dbt is doing something clever and traversing the dot notation. This works today and, if configured as such and supported by the adapter, this description will be written back to the database. Notably, this works regardless of whether foo is a struct or an array (using BQ terminology here, but the same concepts exist in other DBs). I assume this works regardless of depth too, but I've not explicitly verified this. Generic tests will even work if foo is a struct (again, presumably with arbitrary depth).
However, generic tests won't work in the case of a repeated/array field - and this is not surprising - different SQL would be needed to unnest repeated fields. E.g. this will cause an error during the not_null test execution.
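A hypothetical schema entry showing the failing case (column names are illustrative): with `foo` as a REPEATED (array) field, the compiled test SQL references `foo.bar` directly, which BigQuery rejects because `foo` must be unnested first:

```yaml
columns:
  - name: foo.bar      # foo is REPEATED (an array)
    tests:
      - not_null       # errors at execution: foo must be UNNESTed first
```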
Someone from the community came up with an implementation for a nested not_null test a while ago. It makes for some good inspiration, but it has fallen short of what it should have been! The nested data logic should not be limited to not_null - it should be a generic wrapper test that calls out to some other generic test. This can be achieved by having the test wrapper do nothing more than setting the `model` variable to be a sub-query. dbt actually does the exact same thing when you provide `config/where`. This way, the nested data test can be a totally generic wrapper, constructing a single column of data and then calling the relevant test. The only limitation is that not all columns can be passed to the downstream test, which the downstream test would ideally like to have in the case of `store_failures = True` - but this is a small (and reasonable) limitation.
I have an alternative implementation to this test that works in a generic way and we are using it for several downstream generic tests. The only limitation is that the test has to have some logic like the below which explicitly calls the relevant test:
As far as I can tell, within a Jinja-only context, this can't be gotten around for two reasons:
- the `test` variable cannot be used to dynamically determine the macro to call - Jinja's not quite object-oriented enough for that.

The Feature Request
dbt-core should do the SQL wrapping before it calls an existing generic test. It should do something similar to what my Jinja is doing, but far easier in Python: basically parse the `foo.bar.baz` column name, ask the adapter for the column metadata, and then traverse it and build the subquery before calling the test. It's pretty straightforward stuff - just a case of adding `CROSS JOIN UNNEST(<x>) AS <y>` whenever it encounters a repeated field. There's slightly more to it than that, but the community post does a great job of explaining the logic and I'm happy to provide our improvement too.
Describe alternatives you've considered
Doing it myself via a custom generic test that is a clever wrapper - but, as described in the main part of the post, this has some unavoidable limitations, as well as being a missed opportunity to benefit the rest of the community.
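The subquery-building the feature request describes can be sketched in Python. This is a simplified illustration, not dbt-core code: the function name is hypothetical, and `repeated` stands in for the column-mode metadata that dbt-core would actually fetch from the adapter:

```python
def build_test_subquery(relation, column_name, repeated):
    """Walk `foo.bar.baz` left to right; whenever the path so far names a
    REPEATED (array) field, add a CROSS JOIN UNNEST and restart the path
    at the unnest alias. Returns (from_clause, leaf_column_reference)."""
    from_sql = f"FROM {relation}"
    prefix = []
    for part in column_name.split("."):
        prefix.append(part)
        dotted = ".".join(prefix)
        if dotted in repeated:  # this level is an array: unnest it
            from_sql += f" CROSS JOIN UNNEST ({dotted}) AS {part}"
            prefix = [part]
    return from_sql, ".".join(prefix)

frm, col = build_test_subquery("`p`.`d`.`t`", "foo.bar.baz", repeated={"foo"})
# The wrapper then tests a single constructed column, e.g.:
subquery = f"(SELECT {col} AS `column` {frm})"
print(subquery)
```

The same walk handles arbitrarily deep nesting: each REPEATED level contributes one more `CROSS JOIN UNNEST`, and struct levels just extend the dotted reference.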
Who will this benefit?
Everyone that wants to test nested data. Huge impact!
Are you interested in contributing this feature?
Not familiar with dbt-core - hopefully someone can take the logic and just implement it in the right place, having thought through any implications.
Anything else?
No response