Datavault-UK / automate-dv

A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0
513 stars 131 forks source link

[FEATURE] Possibility to hashing algorithm as parameter #78

Closed koillinengit closed 2 years ago

koillinengit commented 3 years ago

Microsoft databases like Synapse, SQL server, Azure SQL use hashing algorithm called hashbytes instead of md5_binary.

I would like that I could give the algorithm as parameter, same way as hash type, concat_string and null_placeholder_string.

I tried to make function called md5_binary in synapse that would redirect it to hashbytes built-in function, but it did not seem to be possible. That is because built-in functions are without schema name and it is not possible to do other kind of functions without schema.

Best place to do this modifications is (I suspect) macros/supporting/hash.sql script. Probably it needs as separate format clause, because hashbytes function requires additional input parameter which tells hash type (https://docs.microsoft.com/en-us/sql/t-sql/functions/hashbytes-transact-sql?view=sql-server-ver15)

{#- Select hashing algorithm -#}
{%- if hash == 'MD5' -%}
    {%- set hash_alg = 'MD5_BINARY' -%}
    {%- set hash_size = 16 -%}
{%- elif hash == 'SHA' -%}
    {%- set hash_alg = 'SHA2_BINARY' -%}
    {%- set hash_size = 32 -%}
{%- else -%}
    {%- set hash_alg = 'MD5_BINARY' -%}
    {%- set hash_size = 16 -%}
{%- endif -%}

{%- set standardise = "NULLIF(UPPER(TRIM(CAST([EXPRESSION] AS VARCHAR))), '')" %}

{#- Alpha sort columns before hashing if a hashdiff -#}
{%- if is_hashdiff and dbtvault.is_list(columns) -%}
    {%- set columns = columns|sort -%}
{%- endif -%}

{#- If single column to hash -#}
{%- if columns is string -%}
    {%- set column_str = dbtvault.as_constant(columns) -%}
    {{- "CAST(({}({})) AS BINARY({})) AS {}".format(hash_alg, standardise | replace('[EXPRESSION]', column_str), hash_size, alias) | indent(4) -}}
DVAlexHiggs commented 3 years ago

We do not currently support SQL Server, however we are currently developing it. We will be releasing it early next year.

For now, you may override the hash macro with your own implementation, or use multi dispatch to provide a SQL Server specific implementation.

https://dbtvault.readthedocs.io/extending/

https://docs.getdbt.com/reference/dbt-jinja-functions/dispatch

koillinengit commented 3 years ago

Yes, I will do my own custom version for now and start waiting for the SQL server or in fact Synapse version...

koillinengit commented 3 years ago

I also note that there has been used some non ansi-sql statements which are not compatible with different database technologies.

1) For example IFNULL statements (hash.sql)

This is how it is now: {{- "\nIFNULL({}, '{}')".format(standardise | replace('[EXPRESSION]', column_str), null_placeholder_string) | indent(4) -}}

Same with case statement which is ansi-sql compatible: {{- "\nCASE WHEN {} IS NOT NULL THEN {} ELSE '{}' END".format(standardise | replace('[EXPRESSION]', column_str), standardise | replace('[EXPRESSION]', column_str), null_placeholder_string) | indent(4) -}}

2) "QUALIFY row_number = 1" is used in below scripts. I suspect that simple where clause like "WHERE row_number = 1" would do the trick also. And it is more compatible.

eff_sat.sql hub.sql link.sql ma_sat.sql sat.sql

3) Also there is limitation in Microsoft with use of the CONCAT_WS function. It only accepts 3 to 254 parameters (hash.sql). So if there is a lot columns in query, the concatenation have to be re-concatenated after 254 concatenations.

DVAlexHiggs commented 3 years ago

Yes, some of the SQL used in Snowflake is Snowflake only. We do this because it may be more efficient, concise etc to use those statements instead of more common ones.

Some of these things we will either be making macros for (e.g. CONCAT_WS) or just using alternatives in the database specific implementation.

DVAlexHiggs commented 2 years ago

Closing this as the issue stems from the current lack of support for particular platforms. This is something we will solve when implementing hash on new platforms. Once we have a few platforms implemented, we will then understand more how we can refactor our macros to be more standard across platforms. Thanks raising this issue, as it makes s great point.