dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.86k stars 1.63k forks source link

[CT-2573] [Feature] Support column-level tests on nested data, natively #7613

Open adamcunnington-mlg opened 1 year ago

adamcunnington-mlg commented 1 year ago

Is this your first time submitting a feature request?

Describe the feature

Context

When adding column metadata, for example, to a model YML file, it is already possible to reference a nested column in order to add a description. E.g.

columns:
- name: foo.bar
  description: baz

Presumably, DBT is doing something clever and traversing dot notation. This works today and if configured as such and supported by the adapter, this description will be written back to the database. Notably, this works regardless of whether foo is a struct or an array (using BQ terminology here but the same concepts exist in other DBs). I assume this works regardless of depth too but I've not explicitly verified this. Generic tests will even work if foo is a struct (again, presumably with arbitrary depth).

However, generic tests won't work in the case of a repeated/array field - and this is not surprising - different SQL would be needed to unnest repeated fields. E.g. this will cause an error during the not_null test execution.

  - name: foo.bar  # foo is an array
    tests:
    - not_null

Someone from the community came up with an implementation for a nested not null test a while ago. It makes for some good inspiration but it has fallen short of what it should have been! The nested data logic should not be limited to not_null - it should be a generic wrapper test that calls out to some other generic test. This can be achieved by having the test wrap do nothing more than setting the model variable to be a sub-query. DBT actually does the exact same thing when you provide config/where. This way, the nested data test can be a totally generic wrapper, constructing a single column of data and then calling the relevant test. The only limitation is not all columns can be passed to the downstream test which the downstream test would ideally like to have in the case of store_failures = True but this is a small (and reasonable limitation).

I have an alternative implementation to this test that works in a generic way and we are using it for several downstream generic tests. The only limitation is that the test has to have some logic like the below which explicitly calls the relevant test:

  {% if test == "accepted_values" %}
{{ test_accepted_values(model, "column", kwargs.get("values"), kwargs.get("quote", True))}}
  {% elif test == "not_null" -%}
{{ test_not_null(model, "column") }}
  {% elif test == "relationships" %}
{{ test_relationships(model, "column", kwargs.get("to"), kwargs.get("field")) }}
  {% elif test == "unique" %}
{{ test_unique(model, "column") }}
  {%- endif %}

As far as I can tell, within a jinja-only context, this can't be gotten around for two reasons:

  1. Jinja doesn't allow you to pass/expand/unpack kwargs - e.g. some_test(**kwargs)
  2. My test variable cannot be used to dynamically determine the macro to call - jinja's not quite object-oriented enough for that.

The Feature Request

DBT core should do the SQL wrapper before it calls an existing generic test. It should do something similar to what my jinja is doing, but far easier in python, and basically parse the foo.bar.baz column name, ask the adapter for the column metadata and then traverse it and build the subquery before calling the test. It's pretty straight forward stuff - just a case of adding CROSS JOIN UNNEST(<x>) AS <y> whenever it encounters a repeated field. There's slightly more to it than that but the community post does a great job of explaining the logic and I'm happy to provide our improvement too.

Describe alternatives you've considered

Doing it myself via a custom generic test that is a clever wrapper - but as described in main part of the post, has some unavoidable limitations - as well as being a missed opportunity to benefit the rest of the community.

Who will this benefit?

Everyone that wants to test nested data. Huge impact!

Are you interested in contributing this feature?

Not familiar with dbt-core - hopefully someone can take the logic and just implement it in the right place, having thought through any implications.

Anything else?

No response

adamcunnington-mlg commented 1 year ago

I'm not sure whether a nested data test wrapper is also possible to produce - in a generic fashion - for table-level tests - but we're also exploring this.

Regardless, this FR currently focusses on column-level functionality - which is very doable - and I think very sensible!

adamcunnington-mlg commented 1 year ago

Here's our wrapper:

{%- test nested_data_test_wrapper(model, column_name, test) -%}
  {%- set match = modules.re.match("^(?:.+\s)?`?(?P<database>.+?)`?\.`?(?P<schema>.+?)`?\.`?(?P<table>.+?)`?(?:\s.+)?$",
                                   model|string) -%}
  {%- set relation = adapter.get_relation(match.database, match.schema, match.table) -%}
  {%- set column_name_hierarchy = column_name.split(".")|map("trim", "`")|list -%}
  {%- set columns = adapter.get_columns_in_relation(relation) -%}
  {%- set parent_column = columns|selectattr("name", "equalto", column_name_hierarchy[0])|list -%}
  {%- set ns = namespace(ancestor_columns=[], unnest_alias="",
                         from_sql="FROM (SELECT ROW_NUMBER() OVER () AS row_number, * FROM %s)" % model) -%}
  {%- for child_column in parent_column recursive -%}
    {%- set ns.child_column_name = child_column.quoted -%}
    {%- set ns.ancestor_columns = ns.ancestor_columns + [ns.child_column_name] -%}
    {%- set ns.unnest_alias = ns.ancestor_columns|join(".")|string -%}
    {%- if child_column.mode == "REPEATED" -%}
      {%- set ns.from_sql = ns.from_sql + " CROSS JOIN UNNEST (%s) AS %s" % (ns.unnest_alias, ns.child_column_name) -%}
      {%- set ns.ancestor_columns = [ns.child_column_name] -%}
    {%- endif -%}
    {%- if child_column.fields -%}
      {{ loop(child_column.fields|selectattr("name", "equalto", column_name_hierarchy[loop.depth])|list) }}
    {%- endif -%}
  {%- endfor -%}
  {%- set model = "(SELECT row_number, %s AS %s %s)" % (ns.unnest_alias, ns.child_column_name, ns.from_sql) -%}
  {%- if test == "accepted_values" -%}  -- column-level test
{{ test_accepted_values(model, ns.child_column_name, kwargs.get("values"), kwargs.get("quote", True)) }}
  {%- elif test == "dbt_expectations.expect_column_values_to_be_in_set" -%}  -- column-level test
{{ dbt_expectations.test_expect_column_values_to_be_in_set(model, ns.child_column_name, kwargs.get("value_set"),
                                                           quote_values=kwargs.get("quote_values"),
                                                           row_condition=kwargs.get("row_condition")) }}
  {%- elif test == "dbt_expectations.expect_column_values_to_be_null" -%}  -- column-level test
{{ dbt_expectations.test_expect_column_values_to_be_null(model, ns.child_column_name,
                                                         row_condition=kwargs.get("row_condition")) }}
  {%- elif test == "dbt_expectations.expect_column_values_to_match_regex" -%}  -- column-level test
{{ dbt_expectations.test_expect_column_values_to_match_regex(model, ns.child_column_name, kwargs.get("regex"),
                                                             row_condition=kwargs.get("row_condition"),
                                                             is_raw=kwargs.get("is_raw", False),
                                                             flags=kwargs.get("flags", "")) }}
  {%- elif test == "dbt_utils.expression_is_true" -%}  -- column-level test
{{ dbt_utils.test_expression_is_true(model, kwargs.get("expression"), ns.child_column_name) }}
  {%- elif test == "dbt_utils.mutually_exclusive_ranges" -%}  -- model-level test
    {%- set partition_by = ["row_number"] -%}
    {%- for partition_by_column_name in kwargs.get("partition_by", []) -%}
      {{ partition_by.append("%s.%s" % ns.child_column_name, partition_by_column_name) }}
    {%- endfor -%}
{{ dbt_utils.test_mutually_exclusive_ranges(model, lower_bound_column=kwargs.get("lower_bound_column"),
                                            upper_bound_column=kwargs.get("upper_bound_column"),
                                            partition_by=partition_by.join(","),
                                            gaps=kwargs.get("gaps", "allowed"),
                                            zero_length_range_allowed=kwargs.get("zero_length_range_allowed", False)) }}
  {%- elif test == "dbt_utils.unique_combination_of_columns" -%}  -- model-level test
    {%- set combination_of_columns = ["row_number"] -%}
    {%- for combination_of_columns_column_name in kwargs.get("combination_of_columns", []) -%}
      {{ combination_of_columns.append("%s.%s" % ns.child_column_name, combination_of_columns_column_name) }}
    {%- endfor -%}
{{ dbt_utils.test_unique_combination_of_columns(model, combination_of_columns,
                                                quote_columns=kwargs.get("quote_columns", False)) }}
  {%- elif test == "not_null" -%}  -- column-level test
{{ test_not_null(model, ns.child_column_name) }}
  {%- elif test == "relationships" -%}  -- column-level test
{{ test_relationships(model, ns.child_column_name, kwargs.get("to"), kwargs.get("field")) }}
  {%- elif test == "unique" -%}  -- column-level test
{{ test_unique(model, ns.child_column_name) }}
  {%- endif -%}
{%- endtest -%}

and example usage:

columns:
- name: foo.bar.baz
  tests:
  - nested_data_test_wrapper:
      config:
        where: "`some_other_column` IS NOT NULL" -- for example, just showing this still works
      test: accepted_values
      values:
      - value1
      - value2
      - value3
  - nested_data_test_wrapper:
      test: unique
adamcunnington-mlg commented 1 year ago

Heh, no one is interested in this?

jtcohen6 commented 1 year ago

@adamcunnington-mlg Sorry for the delay - I'm not uninterested :)

dbt-core doesn't really know anything about nested fields right know. dbt-bigquery knows a little bit (more thanks to https://github.com/dbt-labs/dbt-bigquery/pull/738), but still not very much. As you say, it's worked out nicely that BigQuery lets you:

We have some open issues (related to our work on "model contracts") to better handle when foo is an array, and to teach dbt-core more about nested fields (by upstreaming the logic from dbt-bigquery):

Your proposal strikes me as totally reasonable:

This can be achieved by having the test wrap do nothing more than setting the model variable to be a sub-query. dbt actually does the exact same thing when you provide config/where. This way, the nested data test can be a totally generic wrapper, constructing a single column of data and then calling the relevant test.

Considerations:

As far as the implementation - it looks like you took inspiration from JT's excellent discourse post several years ago. I think the goal would be to:

While this is a neat problem to think through, I don't see it as very high priority for us to implement. If a member of the community wants to take a crack at this, I'd be all for it.

adamcunnington-mlg commented 1 year ago

@jtcohen6 thanks for the super-considered response as always! It makes complete sense and sounds like the right implementation route.

I agree on this not being priority given there is a workaround and perhaps nested types are still not used as much as they should and thus is impacting the community less than it ought!

Could you comment on how table-level tests might work? Would it be that once dbt-core is aware of nesting, table-level tests could simply have reference columns within arrays and dbt-core would handle the column-level sub-querying? This feels more complex when a table-level test references multiple columns that are within different arrays / depths. This is probably invalid (logically) but something would still need to "handle" this.

The one thing i did want to call out though is that foo.bar.baz syntax is convenient for referencing columns within a repeated column (array) but you could argue it's unfortunate and ambiguous vs. struct/nested references. Some jsonPath implementations would prefer something like foo[*].bar to be explicit that foo is an array, not a struct. Indeed, it's slightly strange that I can annotate both nested and array columns with foo.bar but I understand why this works thanks to the adapter-level magic. What do you think dbt-core should expect in the future?

hugohahn009 commented 1 year ago

Here's our wrapper:

{%- test nested_data_test_wrapper(model, column_name, test, from_model=None) -%}
  {%- set ns = namespace(field_selector_part="", column_list=[], unnest_identifier="") -%}
  {%- set ns.from_sql = "FROM %s" % model -%}
  {%- set hierarchy = column_name.split(".") -%}
  {%- set columns = adapter.get_columns_in_relation(model if from_model == None else from_model) -%}
  {%- set root = columns | selectattr("name", "equalto", hierarchy[0]) | list -%}
  {%- for child in root recursive -%}
    {%- set child_name = "`" ~ child.name ~ "`" -%}
    {%- set ns.column_list = ns.column_list + [child_name] -%}
    {%- set ns.unnest_identifier = ns.column_list | join(".") | string -%}
    {%- if child.mode == "REPEATED" -%}
      {%- set ns.from_sql = ns.from_sql ~ " CROSS JOIN UNNEST (%s) AS %s" % (ns.unnest_identifier, child_name) -%}
      {%- set ns.column_list = [child_name] -%}
    {%- endif -%}
    {%- if child.fields -%}
      {{ loop(child.fields | selectattr("name", "equalto", hierarchy[loop.depth]) | list) }}
    {%- endif -%}
  {%- endfor -%}
  {%- set model = "(SELECT %s AS `column` %s)" % (ns.unnest_identifier, ns.from_sql) -%}
  {%- if test == "accepted_values" -%}
{{ test_accepted_values(model, "column", kwargs.get("values"), kwargs.get("quote", True))}}
  {%- elif test == "not_null" -%}
{{ test_not_null(model, "column") }}
  {%- elif test == "relationships" -%}
{{ test_relationships(model, "column", kwargs.get("to"), kwargs.get("field")) }}
  {%- elif test == "unique" -%}
{{ test_unique(model, "column") }}
  {%- endif -%}
{%- endtest -%}

and example usage:

columns:
- name: foo.bar.baz
  tests:
  - nested_data_test_wrapper:
      config:
        where: "`some_other_column` IS NOT NULL" -- for example, just showing this still works
      test: accepted_values
      values:
      - value1
      - value2
      - value3
  - nested_data_test_wrapper:
      test: unique

@adamcunnington-mlg thank you very much for this proposition. I'm stealing your code because it's too good ! Great job 👌 🙏

Jeniffen commented 1 year ago

Here's our wrapper:

{%- test nested_data_test_wrapper(model, column_name, test, from_model=None) -%}
  {%- set ns = namespace(field_selector_part="", column_list=[], unnest_identifier="") -%}
  {%- set ns.from_sql = "FROM %s" % model -%}
  {%- set hierarchy = column_name.split(".") -%}
  {%- set columns = adapter.get_columns_in_relation(model if from_model == None else from_model) -%}
  {%- set root = columns | selectattr("name", "equalto", hierarchy[0]) | list -%}
  {%- for child in root recursive -%}
    {%- set child_name = "`" ~ child.name ~ "`" -%}
    {%- set ns.column_list = ns.column_list + [child_name] -%}
    {%- set ns.unnest_identifier = ns.column_list | join(".") | string -%}
    {%- if child.mode == "REPEATED" -%}
      {%- set ns.from_sql = ns.from_sql ~ " CROSS JOIN UNNEST (%s) AS %s" % (ns.unnest_identifier, child_name) -%}
      {%- set ns.column_list = [child_name] -%}
    {%- endif -%}
    {%- if child.fields -%}
      {{ loop(child.fields | selectattr("name", "equalto", hierarchy[loop.depth]) | list) }}
    {%- endif -%}
  {%- endfor -%}
  {%- set model = "(SELECT %s AS `column` %s)" % (ns.unnest_identifier, ns.from_sql) -%}
  {%- if test == "accepted_values" -%}
{{ test_accepted_values(model, "column", kwargs.get("values"), kwargs.get("quote", True))}}
  {%- elif test == "not_null" -%}
{{ test_not_null(model, "column") }}
  {%- elif test == "relationships" -%}
{{ test_relationships(model, "column", kwargs.get("to"), kwargs.get("field")) }}
  {%- elif test == "unique" -%}
{{ test_unique(model, "column") }}
  {%- endif -%}
{%- endtest -%}

and example usage:

columns:
- name: foo.bar.baz
  tests:
  - nested_data_test_wrapper:
      config:
        where: "`some_other_column` IS NOT NULL" -- for example, just showing this still works
      test: accepted_values
      values:
      - value1
      - value2
      - value3
  - nested_data_test_wrapper:
      test: unique

This is amazing - Thank you very much, this saved my day! 🙏

adamcunnington-mlg commented 1 year ago

P.s. I have updated the test wrapper code above as there are a few bug fixes and additional tests added since

msiddiq1400 commented 5 months ago

plz this need to be fixed, already spent a day, figuring out what the issue was, who can fix this issue with DBT ? @adamcunnington-mlg

msiddiq1400 commented 5 months ago

is there any way around this, a quick fix ? @adamcunnington-mlg

adamcunnington-mlg commented 5 months ago

@msiddiq1400 what issue are you talking about - you've not specified.

vittorfp commented 1 day ago

This feature seems to have already been implemented.

adamcunnington-mlg commented 1 day ago

@vittorfp really? where - can you share link to updated docs? was this in 1.8/1.9?