Make hashkey and hashdiff generation deterministic

Context

The current implementation of diepvries does not ensure that the hashes (hashkey/hashdiff) generated are deterministic. The hash generation expression simply converts the field stored in the extraction table into a string. Although, the string representation of some data types may not be deterministic.

Example:

An extraction table contains a field called modified_timestamp, stored as an epoch (integer) with value 1628787326.

This field will populate a field with the same name, but stored as a timestamp_ntz in the data vault table.

When a hashkey using this field is generated, the current implementation of the framework does this: COALESCE(CAST(modified_timestamp AS VARCHAR), '').

If, for some reason, the extraction process is changed and this field starts being stored as a string in yyyy-mm-ddThh24:mi:ss format, the hashkey for the same data set would be different, given that the representation of 1628787326 in varchar is 2021-08-12T16:55:26Z.

So, the hashdiff for this field would effectively be calculated using COALESCE(CAST(1628787326 AS VARCHAR), '') before the extraction process update and using COALESCE(CAST('2021-08-12T16:55:26Z' AS VARCHAR), '') after the extraction process update, which produces different results, hence different hashkeys.

Solution:

In order to ensure deterministic hashes, diepvries should do a two step conversion:

Convert the field in the extraction table to its target data type in DV;
Convert the result from step 1 to string in a deterministic way.

After analysing all snowflake data types, I concluded that most of them have a deterministic representation when converted to a string, exception made for time/date data types (timestamp_ntz, timestamp_tz, timestamp_ltz, time, date) and the geography fields.

Given this, the following rules should be applied:

time: apply a TO_CHAR using hh24:mi:ss.ff9 format;
date: apply a TO_CHAR using yyyy-mm-dd format;
timestamp_ntz: apply a TO_CHAR using yyyy-mm-dd hh24:mi:ss.ff9 format;
timestamp_ts/timestamp_ltz: apply a TO_CHAR using yyyy-mm-dd hh24:mi:ss.ff9 tzhtzm format;
geography: apply a ST_ASTEXT;
Every other data type: Apply a simple CAST to VARCHAR.

Implementation:

Added support for GEOGRAPHY in FieldDataType class.
New property called hash_concatenation_sql created in Field class, applying the rules mentioned in the Solution section above.
New property called data_type_sql created in Field class. This property represents the field data type in a SQL string. It was created to avoid code duplication between the ddl_in_staging and hash_concatenation_sql, given that both of them need to represent the field data type as SQL.
Changed ddl_in_staging to use data_type_sql property.
Change Table.hashkey_sql to use the new Field.hash_concatenation_sql property instead of the SQL template constants used before;
Change Satellite.hashdiff_sql to use the new Field.hash_concatenation_sql property instead of the SQL template constants used before;
Removed SQL template constants that became obsolete with the introduction of hash_concatenation_sql ( BUSINESS_KEY_SQL_TEMPLATE, CHILD_KEY_SQL_TEMPLATE, DESCRIPTIVE_FIELD_SQL_TEMPLATE)
Adapted test suite to new hash generation formulas;
Added new fields to hs_customer test table, ensuring that all data types are covered;
Added new fields to hs_customer in the snowflake deserializer tests;
(while there) applied isort in doc/snippets
(while there) changed asserts across the test suite to apply an assert test_result == expected_result as opposed to assert expected_result == test_result.
(while there) Renamed expected_results to expected_result in test suite. The plural form makes it seem that there are multiple results when there's only one.

PicnicSupermarket / diepvries