dbt-msft / tsql-utils

dbt-utils for the dbt-msft family of packages
MIT License
25 stars 25 forks source link

Consider using varbinary in hash() instead of varchar #26

Closed infused-kim closed 3 years ago

infused-kim commented 3 years ago

Hi Guys,

First of all, thank you for creating this package. Your efforts are much appreciated.

I noticed the hash adapter converts the md5 bytes into a varchar:

{% macro sqlserver__hash(field) %}
    convert(varchar(50), hashbytes('md5', {{field}}), 2)
{% endmacro %}

If we leave the hash as varbinary, it uses only 16 bytes of data, but as varchar it uses 32. I suspect it's probably also slower to do joins on 32 character varchar than on a varbinary.

Would it be possible to take out that conversion?

I have been using varbinary in my own adapter for several months without problems. The only issue is that Power Bi doesn't support relationships on varbinary columns.

To solve that problem I created a separate macro that does the conversion when it's needed:

{% macro cast_hash_to_str(col) -%}
  convert(varchar(32), {{ col }}, 2)
{%- endmacro %}

And in my final model that is being consumed by PBI I just do something like...

final as (
    select
        {{ cast_hash_to_str('url_key') }} as UrlKey,
        ...
    from whatever
)

This way the hashes use less space in the DB, perform faster joins and when necessary, such as in Power Bi they are converted to strings.

mikaelene commented 3 years ago

This is interesting. It will be a "breaking change" as all snapshots etc will stop working. As well as a decrease in user experience when you need to cast them for use in a BI-tool. What do other think? Maybe it can be a configuration?

dataders commented 3 years ago

@mikaelene agreed that we could introduce it as a config option. I know dbt-utils does this already, perhaps @jtcohen6 could point us to a good example of how the pattern of introducing breaking changes to macros can be carried out.

Another example of this would be the introduction of the MERGE functionality that is now in preview for Synapse.

jtcohen6 commented 3 years ago

Check out a comparable example: https://github.com/fishtown-analytics/dbt-utils/issues/303. I think this makes sense as an optional "argument," default false for now with a deprecation warning, someday switch the default true. Given that dbt_utils.hash only takes one argument, I think it would make sense to enable that via var instead:

{% macro sqlserver__hash(field) %}
  {% if var('tsql_utils_hash_return_binary', false) %}
    convert(varchar(50), hashbytes('md5', {{field}}), 2)
  {% else %}
    hashbytes('md5', {{field}})
  {% endif %}
{% endmacro %}
vars:
  dbt_utils_dispatch_list: ['tsql_utils']
  tsql_utils_hash_return_binary: true     # default false (current behavior)

That said, we've added the string typecast on other adapters whose hash functions return binary. (PR for BigQuery, from a while ago: https://github.com/fishtown-analytics/dbt-utils/pull/74.) This was for quality-of-life reasons: easier to join/filter/etc on a string column.

infused-kim commented 3 years ago

Thanks guys, the variable checking option looks great, but if other adapters are implementing the type as string, then perhaps it's best to leave it as it is.

Today I was able to update to the latest dbt, dbt-utils and tsql-utils version. And I was able to configure it to use my own adapter.

In case someone else wants to change the behavior, here is how you can do it:

In dbt_project.yml set...:

vars:
  dbt_utils_dispatch_list: ['your_dbt_project_name', 'tsql_utils']

Since your project is listed first, DBT utils will look for adapter macros in your project first and only if it can't find one look in tsql_utils. This effectively allows you to override the default tsql-utils behavior.

To override an adapter, simply create a macros/sqlserver_adaptors.sql:

{%- macro sqlserver__hash(field) -%}
    hashbytes('md5', {{field}})
{%- endmacro -%}