Datavault-UK / automate-dv

A free-to-use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open-source data engineering tool; dbt is a registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0

Choosing a hashing algorithm in dbtvault: hash or lossless #35

Closed · JensGorman closed this issue 3 years ago

JensGorman commented 3 years ago

Hi, in https://dbtvault.readthedocs.io/en/v0.6/best_practices/#hashing-best-practices you recommend using MD5 or SHA-256 to generate surrogate primary keys. I think a lossless generator would be a better choice.

By default, dbt_utils.surrogate_key uses Snowflake's MD5 function to generate a surrogate primary key from a natural key. Hash algorithms are lossy: there is a small probability of hash collisions, i.e. two different natural keys getting the same hash value. The risk can be reduced by using SHA-256 or even SHA-512, but a very, very small probability of collisions remains. SHA-256 is designed for cryptographic purposes, so it takes 64 rounds to calculate the hash value and uses quite a lot of CPU time compared to a simple lossless function such as Huffman coding (https://en.wikipedia.org/wiki/Huffman_coding). So an alternative is:

  1. Concatenate all columns of the natural key.
  2. Use a simple lossless compression algorithm.

Snowflake supports SNAPPY, ZLIB, ZSTD and BZ2 (https://docs.snowflake.com/en/sql-reference/functions/compress.html).
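For illustration, a minimal Snowflake sketch of the two options side by side. The table and column names (raw_orders, customer_id, order_id, order_line) are made up, and the NULL handling is only roughly what dbt_utils.surrogate_key does:

```sql
-- Hypothetical composite natural key: customer_id, order_id, order_line
WITH keyed AS (
    SELECT
        CONCAT_WS('||',
            COALESCE(TO_VARCHAR(customer_id), ''),
            COALESCE(TO_VARCHAR(order_id),    ''),
            COALESCE(TO_VARCHAR(order_line),  '')
        ) AS natural_key_string
    FROM raw_orders                                     -- hypothetical source table
)
SELECT
    MD5(natural_key_string)              AS hashed_pk,     -- lossy, always 32 hex characters
    COMPRESS(natural_key_string, 'ZSTD') AS compressed_pk  -- lossless, variable-length BINARY
FROM keyed;
```

Note that compression formats add framing overhead, so for short natural keys the "compressed" value can actually come out longer than the input string.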

What do you think?

Best regards, Jens

balmasi commented 3 years ago

Even with a 128-bit hashing algorithm like MD5, you'd have to generate roughly 580 million records a second for the next 1,000 years before even reaching a 50% chance of a single collision.

Having a fallback mechanism is recommended, but you probably won't see a hash collision in your lifetime unless you're dealing with a truly enormous amount of data.

Ideally, you'd also want the keys to stay small. Lossless compression output grows with the input and is unbounded in length, which makes it a poor choice for a key.
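For anyone who wants to check the arithmetic, the standard birthday approximation p ≈ n² / 2^(b+1) for an ideal b-bit hash can be evaluated directly in Snowflake; the 580-million-rows-per-second figure above works out to roughly 2^64 rows:

```sql
-- ~580 million rows/second for 1,000 years ≈ 1.8e19 rows (about 2^64)
-- Birthday approximation for a 128-bit hash: p ≈ n^2 / 2^129
SELECT
    5.8e8 * 3.156e10                                  AS rows_hashed,           -- ≈ 1.8e19
    POWER(5.8e8 * 3.156e10, 2) / POWER(2::FLOAT, 129) AS collision_probability; -- ≈ 0.5
```

For comparison, a table with a few billion rows comes out around 1e-20 with the same formula.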

DatavaultUK commented 3 years ago

In a sense, the 'problem' you describe has been discussed extensively and solved by the community:

  i) hash keys are optional; you can use the business key instead
  ii) if you do hash, use an algorithm with an addressable space large enough to reduce collisions to a negligible probability
  iii) most of the time MD5 is adequate; only occasionally will you need a larger space

Sometimes even a one-in-a-million probability of a collision is too much for a detail-oriented architect/engineer. Feel free to use SHA-256, or use the natural key approach set out below.
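For reference, moving to the larger space in Snowflake is just a change of function; the literal below is a stand-in for the concatenated business key, and the trade-off is a key twice as wide:

```sql
SELECT
    MD5('customer||1001')       AS md5_pk,     -- 128-bit digest, 32 hex characters
    SHA2('customer||1001', 256) AS sha256_pk;  -- 256-bit digest, 64 hex characters
```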

If you look at your data sets, you'll find that only a small number of tables have a lot of rows, and a lot of rows usually indicates a transaction or log table. Once these run to several billion rows, the MD5 collision risk across the whole data set starts to feel less comfortable. What we sometimes do is use an MD5 hash PK for all tables apart from these high-volume ones, where we fall back to using the log/transaction id in its natural raw form. That works well for us; we've never had a collision.
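A rough sketch of that split, with made-up table and column names (raw_customers as a normal-volume source, raw_clickstream_events as the high-volume log):

```sql
-- Normal-volume table: MD5 hash of the business key as the primary key
SELECT
    MD5(TO_VARCHAR(customer_id)) AS customer_pk,
    customer_id,
    customer_name
FROM raw_customers;

-- High-volume log/transaction table: keep the natural id as the key, no hashing
SELECT
    event_id AS event_pk,
    event_timestamp,
    payload
FROM raw_clickstream_events;
```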

You may find that the log/transaction natural key is shorter than the MD5 hash string anyway(!).

JensGorman commented 3 years ago

I have one table with 3 billion new rows every day, and in the worst case a collision there could send a shipment down the wrong route. I will go with the natural key. Thank you for sharing your insights.