Closed · daplaci closed this 4 months ago
Would it be better, then, to do the hashing in Python where we can choose from many more hashing algorithms?
Perhaps we should check whether there are collisions after dividing by 2?
Alternatively, maybe we could add this extension to Postgres (if you haven't already tried it): https://github.com/petere/pguint
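For reference, a minimal sketch of what Python-side hashing could look like, assuming we fold the digest into a signed 64-bit value so it still fits a Postgres BIGINT (the function name and example value are purely illustrative):

```python
import hashlib

def hash_to_bigint(value: str, algorithm: str = "sha256") -> int:
    """Hash a string with any hashlib algorithm and fold it into a signed 64-bit int.

    Postgres BIGINT is a signed 64-bit type, so we keep only 63 bits of the digest.
    """
    digest = hashlib.new(algorithm, value.encode("utf-8")).digest()
    # Take the first 8 bytes of the digest and clear the sign bit.
    return int.from_bytes(digest[:8], "big") & 0x7FFF_FFFF_FFFF_FFFF

# hashlib.algorithms_available lists everything the local OpenSSL build supports.
print(sorted(hashlib.algorithms_available))
print(hash_to_bigint("CPRENC123456"))
print(hash_to_bigint("CPRENC123456", "md5"))
```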
@epiben I have read a bit about the hashing used by DuckDB (we could also consider md5 -> bigint or sha-256), and the likelihood of a collision, even after dividing by two, seems extremely remote. Also, if it ever did happen, since we are hashing primary keys we would simply get a duplicate-key error (I tested this very simply by dividing by 2**64). But if you fancy another hashing algorithm not supported in DuckDB, we could move the hashing to Python and use that; it should not be a problem.
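To put a rough number on how remote this is: a birthday-bound estimate for n keys hashed into a 63-bit space (i.e. a signed BIGINT after halving) gives p ≈ n(n-1)/2^64. A quick sketch, with purely illustrative key counts:

```python
# Birthday-bound approximation of collision probability when hashing n distinct
# keys into a 63-bit space (a signed BIGINT after dividing the UINT64 hash by two).
# The key counts below are illustrative assumptions, not figures from this project.
def collision_probability(n_keys: int, bits: int = 63) -> float:
    return n_keys * (n_keys - 1) / (2 * 2 ** bits)

for n in (10**6, 10**7, 10**8):
    print(f"{n:>11,} keys -> p ~ {collision_probability(n):.1e}")
```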
Right, it's a good point that we will know thanks to the PK nature of the column. Then, all good! But we should hash the combination of shak and courseid; see the comment in the code. Otherwise, LGTM.
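For illustration, hashing the combination could look something like the DuckDB expression below (run here through the Python client; the table, separator, and sample rows are assumptions, only the shak/courseid column names come from the comment):

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE courses (shak VARCHAR, courseid VARCHAR)")
con.execute("INSERT INTO courses VALUES ('1301', 'A1'), ('1301', 'A2')")

# Concatenate both columns with a separator so e.g. ('13', '01A') and
# ('1301', 'A') cannot collide, then halve to stay within a signed BIGINT.
rows = con.execute(
    "SELECT shak, courseid, hash(shak || '|' || courseid) // 2 AS course_hash FROM courses"
).fetchall()
print(rows)
```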
Done. I will open a new issue to exploit this new hash in the merge ETL.
Closes #106
By default, the HASH function in DuckDB returns a UINT64, which is not supported in PostgreSQL. I spent some time trying to make UINT64 work but could not manage, so I just divided the output by two and left the columns as BigInteger types.
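For anyone reading along, a small sketch of the problem and the workaround (the literal value is just an example):

```python
import duckdb

con = duckdb.connect()

# DuckDB's hash() returns a UBIGINT (unsigned 64-bit), which can exceed the
# signed 64-bit range that a Postgres BIGINT column accepts.
raw = con.execute("SELECT hash('CPRENC123456')").fetchone()[0]

# Halving guarantees the value fits into a signed BIGINT column.
halved = con.execute("SELECT hash('CPRENC123456') // 2").fetchone()[0]

print(raw, raw < 2**63)        # the raw hash may exceed the BIGINT maximum
print(halved, halved < 2**63)  # always True after dividing by two
```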
Tests are still passing. The only decision we have to make is whether to hash only the raw source value or the source_value_column name together with the value ('CPRENC123456' vs 'cpr_enc|CPRENC123456'). At the moment, this solution only encodes the actual raw value.
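To make the two options concrete, a quick sketch (the 'cpr_enc|...' form is the one quoted above; the separator and values are illustrative):

```python
import duckdb

con = duckdb.connect()

source_value_column = "cpr_enc"
raw_value = "CPRENC123456"

# Option 1 (current behaviour): hash only the raw source value.
opt1 = con.execute("SELECT hash(?) // 2", [raw_value]).fetchone()[0]

# Option 2: hash the column name together with the value, so identical values
# originating from different source columns map to different hashes.
opt2 = con.execute("SELECT hash(?) // 2", [f"{source_value_column}|{raw_value}"]).fetchone()[0]

print(opt1, opt2)
```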