dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.64k stars 1.6k forks source link

[Feature] Postgres 15+ unique index NULLS NOT DISTINCT #9472

Open FrankTub opened 7 months ago

FrankTub commented 7 months ago

Is this your first time submitting a feature request?

Describe the feature

On some of the models we have we have created an unique index. By default the creation of an unique index (on multiple columns) on Postgres was done with ignoring NULL values as unique values. See this for more info.

Since Postgres 15 it is an option to create an index with the option NULLS NOT DISTINCT. This has benefits for query plans in Postgres since with the current setup of an index in Postgres it does not know for sure if a combination of columns is truly unique. In the Postgres docs you will also find:

NULLS NOT DISTINCT Specifies whether for a unique index, null values should be considered distinct (not equal). The default is that they are > distinct, so that a unique index could contain multiple null values in a column.

Current situation

So currently I have a model with something like:

{{ config(
    indexes=[
      {'columns': ['date_actual', 'customer_no'], 'unique': True},
    ]
) }}

This generate the following SQL:

    create unique index if not exists
  "0dbafe9be87c73a825dcef744a055fd8"
  on "staging"."transformations"."daily_customer_balances" 
  (date_actual, customer_no);

Requested new situation

After my proposal I would like to have the option to set a flag at creation of the index, like something below:

{{ config(
    indexes=[
      {'columns': ['date_actual', 'customer_no'], 'unique': True, 'nulls_not_distinct': True},
    ]
) }}

And this should generate the following SQL:

    create unique index if not exists
  "0dbafe9be87c73a825dcef744a055fd8"
  on "staging"."transformations"."daily_customer_balances" 
  (date_actual, customer_no) nulls not distinct;

It would be perfectly acceptable to leave the default option of nulls_not_distinct to False.

Describe alternatives you've considered

There are two ways to do this anyways, but I think the proposed solution is more neat for my use case.

1) Create a post-hook on your dbt model and create the index yourself writing plain SQL. 2) Enforce a dbt model contract

Post-hook

This defies the purpose of the config index block in my opinion.

Model contract

This enforces a user to force all sorts of other constraints the data should look like that is not relevant to my use case.

The whole purpose of this request is to enhance downstream query plans and thereby improving performance of your transformation pipeline.

Who will this benefit?

Only users that use the dbt-postgres adapter that use Postgres 15+.

Are you interested in contributing this feature?

Yes, but really not sure where to start or how to do it.

Anything else?

No response

dbeatty10 commented 7 months ago

Thanks for raising this @FrankTub !

This wouldn't be a high priority for us to implement, so I'm going to label it as "help wanted".

Your proposed syntax looks good to me πŸ‘:

{{ config(
    indexes=[
      {'columns': ['date_actual', 'customer_no'], 'unique': True, 'nulls_not_distinct': True},
    ]
) }}

Implementation hints

Some of the relevant code is here.

To hard-code your project to always use nulls not distinct for unique indexes, you could add something like this to your project (untested!):

{% macro postgres__get_create_index_sql(relation, index_dict) -%}
  {%- set index_config = adapter.parse_index(index_dict) -%}
  {%- set comma_separated_columns = ", ".join(index_config.columns) -%}
  {%- set index_name = index_config.render(relation) -%}

  create {% if index_config.unique -%}
    unique
  {%- endif %} index if not exists
  "{{ index_name }}"
  on {{ relation }} {% if index_config.type -%}
    using {{ index_config.type }}
  {%- endif %}
  ({{ comma_separated_columns }}){% if index_config.unique -%} nulls not distinct{%- endif %};
{%- endmacro %}

Alternatively, to make it configurable, you could try changing {% if index_config.unique -%} nulls not distinct{%- endif %} to be {% if index_config.nulls_not_distinct -%} nulls not distinct{%- endif %} instead. I'm not sure how strict the config validation is here (via parse_index), so that may not work as-is and need further code changes.

Side note: While not directly related to your feature proposal, https://github.com/dbt-labs/dbt-core/issues/6997 also involves SQL's three-valued logic.

FrankTub commented 7 months ago

Thanks for such a quick response @dbeatty10 ! Totally agree that this is not a high prio case. Will definitely test out to make it work in our project.

Do you see any value in adding it as I proposed? If so I will open up a PR, otherwise just keep it in my own project.

dbeatty10 commented 7 months ago

Do you see any value in adding it as I proposed? If so I will open up a PR

Your proposed syntax looks good to me, and you can open a PR πŸ‘

We might not be able to review it right away due to priority level though.

If you can get it to work with a simple macro override (and share the details), at least it will give other people code to use in the meantime.

dbeatty10 commented 7 months ago

Something to note: our integration tests don't currently use Postgres 15. So https://github.com/dbt-labs/dbt-core/issues/8273 would need to be implemented in order to actually add it to the tests.