Datavault-UK / automate-dv

A free-to-use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open-source data engineering tool; dbt is a registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0
509 stars 130 forks

[BUG] In PostgreSQL hashes are doubled in size #176

Closed: dbaibakov closed this issue 1 year ago

dbaibakov commented 1 year ago

Description

In PostgreSQL, when a hash is calculated it is first converted to a hex string, upper-cased, and only then binary-encoded. As a result we get a 32-byte binary for storing a 16-byte MD5, and a 64-byte binary for storing a SHA256. Here is a bit of code from my compiled model, just for illustration:

    with cte as (
        select CAST(UPPER(MD5(CONCAT(
            COALESCE(NULLIF(UPPER(TRIM(CAST('iuagdfiusbdofusbdosibfor' AS VARCHAR))), ''), '^^')
        ))) AS BYTEA) AS hk
    )
    select hk
         , encode(hk, 'hex')
         , length(hk)
    from cte

It seems we should first convert the MD5-produced hex string into a binary string with DECODE(MD5(...), 'hex'), and then cast the result to BYTEA. There is also an additional behaviour for the SHA functions implemented in the macros: the functions return binary strings as their results, but the macros convert these binary strings to hex strings, upper-case them, and then convert them back to binary. So we have two unneeded conversions and a doubling of size. It seems we shouldn't convert them to hex at all, just store the binaries we get from the SHA functions.
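For illustration, here is a minimal sketch of the suggested fix, assuming PostgreSQL 11+ (where sha256() returns BYTEA natively); the string literal is just a placeholder, not the package's generated macro SQL:

```sql
-- Hypothetical illustration of the proposed fix: decode the MD5 hex text straight
-- into BYTEA, and take the SHA-256 digest as BYTEA with no re-encoding.
with cte as (
    select
        decode(md5('iuagdfiusbdofusbdosibfor'), 'hex')         as md5_hk, -- 16-byte BYTEA
        sha256(convert_to('iuagdfiusbdofusbdosibfor', 'UTF8')) as sha_hk  -- 32-byte BYTEA (PostgreSQL 11+)
)
select md5_hk, length(md5_hk) as md5_len, sha_hk, length(sha_hk) as sha_len
from cte;
```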

In MS SQL the implementation is correct: hashes are calculated and stored as plain binaries, 16 and 32 bytes in size.

Environment

dbt version: 1.3.1
dbtvault version: 0.9.2
Database/Platform: docker image postgres:latest

To Reproduce

Make a simple model, run it, check the hash column size.

Expected behavior

16-byte MD5 hash, not 32-byte
32-byte SHA hash, not 64-byte

Screenshots

(image attached)

DVAlexHiggs commented 1 year ago

Hi, thanks for your report! This will be investigated in the new year.

This is odd as all of our tests are still passing and they have not been modified, so they should still be producing the correct hashes (correct lengths and values etc.)

They are converted to hex because our hashing approach is intended to be consistent across all platforms and to produce the same hash values whether they are encoded as BINARY or just as strings, depending on the platform.

My instinct here is that this is intended functionality due to the differences in hashing in postgres; there was a reason we implemented it this way.

Will keep this updated!

dbaibakov commented 1 year ago

Yeah, sure, Merry Christmas and Happy New Year ))

Just for your future investigation:

Snowflake (default) - calculated as binary, stored as binary
MS SQL - calculated as binary, stored as binary
Databricks - calculated as hex-string, stored as text
BigQuery - calculated as hex-string, stored as text
Postgres - calculated as hex-string, stored as binary

So it seems there is no one-size-fits-all approach. There are two things to note:

1) All the engines calculate values the same way they are stored: binary as binary, text as text. Only Postgres calculates as text and then stores that text as binary.
2) The data type selection - binary or text - depends on the engine's capabilities. Unfortunately I don't have enough expertise in BigQuery and Databricks, so I can't say why hashes are stored as text there. For Databricks it's probably a compatibility issue, as it works with files in a Lakehouse architecture. But I think Postgres is much more like MS SQL than Databricks: they are traditional RDBMSes with proprietary data storage formats, and their abilities and features should be very close.

Anyway thanks in advance for your investigation))

DVAlexHiggs commented 1 year ago

It's impossible to do exactly the same thing on every platform due to platform-specific differences; however, the aim is to have the same hash value/output regardless of platform.

You're right that Postgres is inconsistent, but as I say, I think it's due to the way Postgres treats hashes. This was contributed by the community, so I'm not 100% sure of all the details.

Thanks for the details you have provided!

@johnoscott may be able to provide some insight here

johnoscott commented 1 year ago

I will investigate further. If I recall, there was a need to convert to uppercase after the text output of md5(), which was lowercase hexadecimal. I will double-check for any unnecessary conversions though.
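For reference while checking, a quick sketch (illustrative only, not the package's macro code): md5() does return lowercase hex text, but decode(..., 'hex') accepts either case, so the upper-casing only matters while the hash is kept as a hex string, not once it has been decoded to BYTEA:

```sql
-- md5() yields lowercase hex text; decode(..., 'hex') accepts upper or lower case,
-- so the decoded bytes are identical with or without UPPER().
select
    md5('test')                                                    as hex_lower,   -- 32-char lowercase hex text
    decode(md5('test'), 'hex')                                     as bytes_plain, -- 16-byte BYTEA
    decode(upper(md5('test')), 'hex')                              as bytes_upper, -- same 16-byte BYTEA
    decode(md5('test'), 'hex') = decode(upper(md5('test')), 'hex') as same_bytes;  -- true
```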

DVTimWilson commented 1 year ago

Thank you both for all the details. I think you are correct with:

It seems we should first convert the MD5-produced hex string into a binary string with DECODE(MD5(...), 'hex'), and then cast the result to BYTEA.

Looks like we have taken a wrong turn en route from hex string to bytea! I shall do some investigative testing.
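As a starting point for that testing, a small sketch (illustrative only) of where the size doubles: casting the 32-character hex text straight to BYTEA stores the bytes of the hex characters themselves, whereas decoding the hex yields the 16 raw digest bytes:

```sql
-- CAST(hex_text AS BYTEA) keeps the ASCII bytes of the 32 hex characters (32 bytes),
-- while decode(hex_text, 'hex') yields the 16 raw MD5 digest bytes.
select
    length(cast(upper(md5('test')) as bytea)) as wrong_turn_len, -- 32
    length(decode(md5('test'), 'hex'))        as decoded_len;    -- 16
```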

isoctcolo commented 1 year ago

Do you need some testing here? I'm in the process of building a larger warehouse on PostgreSQL with dbtvault.

johnoscott commented 1 year ago

@isoctcolo sure, that would help I think.

dbaibakov commented 1 year ago

Dear colleagues, any news on this issue? Perhaps I could help with the coding? In that case I would need access to the dev repository.

DVTimWilson commented 1 year ago

Sorry for the slow response on this. Yes, we have a solution, and we will be testing it soon once we have internal resources available.

DVAlexHiggs commented 1 year ago

Fixed in 0.9.7. On the surface it looks like a rather minor change, but it required a lot of testing and experimentation to get right 😊

Thank you all for your patience!