The current implementation of diepvries does not ensure that the generated hashes (hashkey/hashdiff) are deterministic. The hash generation expression simply converts the field stored in the extraction table into a string. However, the string representation it produces depends on the data type used in the extraction table, so the same data can yield different hashes.
Example:
An extraction table contains a field called modified_timestamp, stored as an epoch (integer) with value 1628787326.
This field will populate a field with the same name, but stored as a timestamp_ntz in the data vault table.
When a hashkey using this field is generated, the current implementation of the framework does this: COALESCE(CAST(modified_timestamp AS VARCHAR), '').
If, for some reason, the extraction process is changed and this field starts being stored as a string in yyyy-mm-ddThh24:mi:ss format, the hashkey for the same data set would change, because the same instant would now reach the hash expression as the string 2021-08-12T16:55:26Z instead of the integer 1628787326.
So the hash input for this field would effectively be COALESCE(CAST(1628787326 AS VARCHAR), '') before the extraction process update and COALESCE(CAST('2021-08-12T16:55:26Z' AS VARCHAR), '') after it, which produces different strings and, consequently, different hashkeys/hashdiffs.
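To make the effect concrete, here is a minimal Python sketch (plain hashlib, not diepvries code; MD5 is used purely for illustration) hashing the two string representations of the same instant:

```python
import hashlib

# The same instant, as it reaches the hash expression before and after the
# hypothetical extraction change described above.
before_change = "1628787326"            # epoch integer cast to VARCHAR
after_change = "2021-08-12T16:55:26Z"   # string representation of the instant

print(hashlib.md5(before_change.encode()).hexdigest())
print(hashlib.md5(after_change.encode()).hexdigest())
# The digests differ, so the hashkey/hashdiff changes even though the
# underlying data did not.
```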
Solution:
In order to ensure deterministic hashes, diepvries should do a two-step conversion:
1. Convert the field in the extraction table to its target data type in the data vault table;
2. Convert the result of step 1 to a string in a deterministic way.
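As an illustration, here is a small Python sketch of what the two-step expression could look like for the modified_timestamp example; the variable names and the exact COALESCE/TO_CHAR wrapping are assumptions for illustration, not the actual diepvries output:

```python
field_name = "modified_timestamp"
target_data_type = "TIMESTAMP_NTZ"            # step 1: target data type in the DV table
string_format = "yyyy-mm-dd hh24:mi:ss.ff9"   # step 2: explicit, deterministic format

hash_input_sql = (
    f"COALESCE(TO_CHAR(CAST({field_name} AS {target_data_type}), "
    f"'{string_format}'), '')"
)
print(hash_input_sql)
# COALESCE(TO_CHAR(CAST(modified_timestamp AS TIMESTAMP_NTZ),
#          'yyyy-mm-dd hh24:mi:ss.ff9'), '')
```

The idea is that, because the value is cast to its target data type before being rendered with an explicit format, the resulting string no longer depends on how the extraction table happens to store the value.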
After analysing all Snowflake data types, I concluded that most of them have a deterministic representation when converted to a string; the exceptions are the date/time data types (timestamp_ntz, timestamp_tz, timestamp_ltz, time, date) and geography fields.
Given this, the following rules should be applied:
time: apply a TO_CHAR using hh24:mi:ss.ff9 format;
date: apply a TO_CHAR using yyyy-mm-dd format;
timestamp_ntz: apply a TO_CHAR using yyyy-mm-dd hh24:mi:ss.ff9 format;
timestamp_tz/timestamp_ltz: apply a TO_CHAR using yyyy-mm-dd hh24:mi:ss.ff9 tzhtzm format;
geography: apply ST_ASTEXT;
Every other data type: apply a simple CAST to VARCHAR.
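These rules could be summarised in code roughly as follows. This is an illustrative Python sketch with hypothetical names (HASH_CONVERSION_RULES, hash_input_expression), not the actual diepvries implementation:

```python
# Hypothetical mapping from Snowflake data type to the expression template
# used as hash input; anything not listed falls back to a plain CAST.
HASH_CONVERSION_RULES = {
    "TIME": "TO_CHAR(CAST({field} AS TIME), 'hh24:mi:ss.ff9')",
    "DATE": "TO_CHAR(CAST({field} AS DATE), 'yyyy-mm-dd')",
    "TIMESTAMP_NTZ": (
        "TO_CHAR(CAST({field} AS TIMESTAMP_NTZ), 'yyyy-mm-dd hh24:mi:ss.ff9')"
    ),
    "TIMESTAMP_TZ": (
        "TO_CHAR(CAST({field} AS TIMESTAMP_TZ), "
        "'yyyy-mm-dd hh24:mi:ss.ff9 tzhtzm')"
    ),
    "TIMESTAMP_LTZ": (
        "TO_CHAR(CAST({field} AS TIMESTAMP_LTZ), "
        "'yyyy-mm-dd hh24:mi:ss.ff9 tzhtzm')"
    ),
    "GEOGRAPHY": "ST_ASTEXT(CAST({field} AS GEOGRAPHY))",
}
DEFAULT_RULE = "CAST({field} AS VARCHAR)"  # every other data type


def hash_input_expression(field: str, data_type: str) -> str:
    """Build the deterministic string expression for a single field."""
    template = HASH_CONVERSION_RULES.get(data_type.upper(), DEFAULT_RULE)
    return f"COALESCE({template.format(field=field)}, '')"


print(hash_input_expression("modified_timestamp", "timestamp_ntz"))
```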
Implementation:
Added support for GEOGRAPHY in the FieldDataType class.
Created a new property, hash_concatenation_sql, in the Field class, applying the rules described in the Solution section above.
Created a new property, data_type_sql, in the Field class. It renders the field's data type as a SQL string and was introduced to avoid code duplication between ddl_in_staging and hash_concatenation_sql, given that both need to represent the field data type as SQL. (A sketch of how these properties fit together is shown after this list.)
Changed ddl_in_staging to use the data_type_sql property.
Changed Table.hashkey_sql to use the new Field.hash_concatenation_sql property instead of the SQL template constants used before;
Changed Satellite.hashdiff_sql to use the new Field.hash_concatenation_sql property instead of the SQL template constants used before;
Removed the SQL template constants that became obsolete with the introduction of hash_concatenation_sql (BUSINESS_KEY_SQL_TEMPLATE, CHILD_KEY_SQL_TEMPLATE, DESCRIPTIVE_FIELD_SQL_TEMPLATE);
Adapted the test suite to the new hash generation formulas;
Added new fields to the hs_customer test table, ensuring that all data types are covered;
Added new fields to hs_customer in the snowflake deserializer tests;
(while there) Applied isort in doc/snippets;
(while there) Changed asserts across the test suite to use assert test_result == expected_result instead of assert expected_result == test_result;
(while there) Renamed expected_results to expected_result in the test suite; the plural form makes it seem that there are multiple results when there is only one.
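For orientation, here is a self-contained Python sketch of how the new pieces could fit together. The field names, the hash delimiter, the use of MD5, and the simplification to a single timestamp rule are all illustrative assumptions rather than the actual diepvries code:

```python
from dataclasses import dataclass

HASH_DELIMITER = "||"  # illustrative delimiter between concatenated fields


@dataclass
class Field:
    name: str
    data_type: str  # e.g. "NUMBER", "TIMESTAMP_NTZ", "GEOGRAPHY"

    @property
    def data_type_sql(self) -> str:
        # Single place where the data type is rendered as SQL, shared by the
        # staging DDL and the hash expression.
        return self.data_type

    @property
    def hash_concatenation_sql(self) -> str:
        # Applies the Solution rules; only the TIMESTAMP_NTZ rule is shown,
        # everything else falls back to a plain CAST.
        if self.data_type_sql == "TIMESTAMP_NTZ":
            return (
                f"COALESCE(TO_CHAR(CAST({self.name} AS TIMESTAMP_NTZ), "
                f"'yyyy-mm-dd hh24:mi:ss.ff9'), '')"
            )
        return f"COALESCE(CAST({self.name} AS VARCHAR), '')"


def hashkey_sql(fields: list[Field]) -> str:
    """Hash the delimited concatenation of the per-field expressions."""
    concatenated = f" || '{HASH_DELIMITER}' || ".join(
        field.hash_concatenation_sql for field in fields
    )
    return f"MD5({concatenated})"


print(hashkey_sql([Field("customer_id", "NUMBER"),
                   Field("modified_timestamp", "TIMESTAMP_NTZ")]))
```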