Collisions in dbt_scd_id while calculating snapshots

rafaelkrysciak commented 5 months ago

Describe the bug

The snapshot calculation relies on the Teradata HASHROW function. The dbt_scd_id is generated for each row based on the provided unique_key and the current timestamp. However, the HASHROW function produces a 4-byte hash, which is highly prone to collisions. For instance, the values d3dadd49420542fb49ffbf6a77349b45 and 34f325fe5a4216f27357328b61c9eccb both produce the same hash 02-27-E3-B4. Similarly, the numbers 162181727 and 880145039 generate the same hash 2E-5B-FE-DD. In a source with 36 million numbers, we have over 180 thousand duplicate dbt_scd_id.

These collisions cause the snapshot update to fail with the error: [Error 7547] Target row updated by multiple source rows.

Steps To Reproduce

Create a source with the provided values as IDs and then try to create a snapshot of them.

Expected behavior

Calculating the snapshot without errors.

Screenshots and log output

The output of dbt --version:

Core:
  - installed: 1.7.11
  - latest:    1.8.0  - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - teradata: 1.7.2 - Up to date!

The operating system you're using: Windows 11

The output of python --version: Python 3.11.3

Additional context

datenbaecker commented 5 months ago

I had the same problem, would be nice if they can fix that! @rafaelkrysciak In the mean time you can redefine the macro that generates the hash:

{% macro teradata__snapshot_hash_arguments(args) -%}
    usrlib.hash_md5({%- for arg in args -%}
        coalesce(cast({{ arg }} as varchar(200)), '')
        {% if not loop.last %} || '|' || {% endif %}
    {%- endfor -%})
{%- endmacro %}

You may have to install hash_md5 on your system first...

rafaelkrysciak commented 5 months ago

Thanks @datenbaecker. It works fine 👍

tallamohan commented 4 months ago

@rafaelkrysciak , the fix for this issue is available in dbt-teradata 1.8.0 release.

Teradata / dbt-teradata