PicnicSupermarket / diepvries

The Picnic Data Vault framework.
https://diepvries.picnic.tech
MIT License

Make hashkey and hashdiff generation deterministic #17

dlouseiro closed 3 years ago

dlouseiro commented 3 years ago

Context

The current implementation of diepvries does not ensure that the generated hashes (hashkey/hashdiff) are deterministic. The hash generation expression simply casts the field stored in the extraction table to a string. However, the string representation of some data types may not be deterministic.

Example:

An extraction table contains a field called modified_timestamp, stored as an epoch (integer) with value 1628787326.

This field will populate a field with the same name, but stored as a timestamp_ntz in the data vault table.

When a hashkey using this field is generated, the current implementation of the framework does this: COALESCE(CAST(modified_timestamp AS VARCHAR), '').

If, for some reason, the extraction process is changed and this field starts being stored as a string in yyyy-mm-ddThh24:mi:ss format, the hashkey for the same data set would be different, given that the value previously extracted as 1628787326 would now arrive as the varchar 2021-08-12T16:55:26Z.

So, the hashkey for this field would effectively be calculated using COALESCE(CAST(1628787326 AS VARCHAR), '') before the extraction process update and using COALESCE(CAST('2021-08-12T16:55:26Z' AS VARCHAR), '') after it. The two expressions produce different strings, hence different hashkeys.
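The mismatch is easy to reproduce outside the database. The following is a minimal Python sketch, not diepvries code: `naive_hash` is a hypothetical helper, and MD5 is only an assumed stand-in for whatever hash function the framework applies.

```python
import hashlib


def naive_hash(value) -> str:
    """Mimic COALESCE(CAST(value AS VARCHAR), '') followed by a hash.

    Hypothetical helper for illustration; MD5 is an assumed stand-in
    for the hash function used by the framework.
    """
    as_text = "" if value is None else str(value)
    return hashlib.md5(as_text.encode("utf-8")).hexdigest()


# The same instant, as delivered by two versions of the extraction process.
hash_before = naive_hash(1628787326)             # epoch integer
hash_after = naive_hash("2021-08-12T16:55:26Z")  # ISO-like string

print(hash_before == hash_after)  # False: the two inputs differ as strings
```

Nothing about the underlying data changed between the two calls; only its textual form did, which is enough to change the hash.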

Solution:

In order to ensure deterministic hashes, diepvries should do a two-step conversion:

  1. Convert the field in the extraction table to its target data type in DV;
  2. Convert the result from step 1 to string in a deterministic way.
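The two steps can be sketched in Python as follows. This is an illustration under assumptions rather than the framework's implementation: `to_target_timestamp`, `to_deterministic_string`, `deterministic_hash`, and the chosen output format are all hypothetical, MD5 is an assumed stand-in hash, and only the epoch and ISO-like inputs from the example above are handled.

```python
import hashlib
from datetime import datetime, timezone

# Assumed fixed output format; what matters is that it never changes.
FIXED_FORMAT = "%Y-%m-%dT%H:%M:%S"


def to_target_timestamp(raw) -> datetime:
    """Step 1: convert the extracted value to the target type (naive UTC timestamp)."""
    if isinstance(raw, int):
        # Epoch seconds, interpreted as UTC and stored without a time zone.
        return datetime.fromtimestamp(raw, tz=timezone.utc).replace(tzinfo=None)
    # ISO-like string with a trailing "Z" (UTC).
    return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")


def to_deterministic_string(value: datetime) -> str:
    """Step 2: render the typed value with one fixed, explicit format."""
    return value.strftime(FIXED_FORMAT)


def deterministic_hash(raw) -> str:
    """Hash the deterministic string form of the typed value."""
    text = to_deterministic_string(to_target_timestamp(raw))
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Both extraction formats now yield the same hash.
print(deterministic_hash(1628787326) == deterministic_hash("2021-08-12T16:55:26Z"))  # True
```

Because the string is always produced from the typed value with a fixed format, the format used in the extraction table no longer influences the hash.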

After analysing all Snowflake data types, I concluded that most of them have a deterministic representation when converted to a string, with the exception of the date/time data types (timestamp_ntz, timestamp_tz, timestamp_ltz, time, date) and the geography type.

Given this, the following rules should be applied:

Implementation: