INCEPTdk / omop_etl


add hash for person_id and visit_occurrence_id #107

Closed daplaci closed 4 months ago

daplaci commented 4 months ago

Closes #106

By default, the HASH function in DuckDB returns a UINT64, which is not supported in PostgreSQL. I spent some time trying to make UINT64 work but couldn't manage it, so I just divided the output by two and left the columns as BigInteger types.

Tests are still passing. The only decision we have to make is whether to hash only the source value or the full content of the source_value_column ('CPRENC123456' vs 'cpr_enc|CPRENC123456'). At the moment this solution only encodes the actual raw value.
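The divide-by-two trick can be sketched in plain Python. This is not the ETL's actual code; `hash_to_bigint` is a hypothetical helper, and SHA-256 stands in for DuckDB's HASH as a source of unsigned 64-bit values:

```python
import hashlib

def hash_to_bigint(value: str) -> int:
    """Map a source value to a non-negative signed 64-bit integer.

    Illustrative stand-in for DuckDB's HASH(): derive an unsigned
    64-bit integer from the first 8 bytes of a SHA-256 digest, then
    floor-divide by 2 so the result fits PostgreSQL's signed BIGINT
    range (max 2**63 - 1).
    """
    uint64 = int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")
    return uint64 // 2  # always < 2**63, so it fits a signed BIGINT

hash_to_bigint("CPRENC123456")
```

Dividing by two effectively drops one bit of the hash, leaving a 63-bit key space.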

epiben commented 4 months ago

Would it be better, then, to do the hashing in Python where we can choose from many more hashing algorithms?

Perhaps we should check whether there are collisions after dividing by 2?

Alternatively, maybe we could add this extension to Postgres (if you haven't already tried): https://github.com/petere/pguint

daplaci commented 4 months ago

@epiben I have read a bit about the hashing used by DuckDB (we could also consider md5 -> bigint or sha-256), and it seems the likelihood of a collision, even after dividing by two, is extremely remote. And even if one did happen, since we are hashing primary keys we would get a duplicate-key error (I have tested this very simply by dividing by 2**64). But if you fancy another hashing algorithm not supported in DuckDB, we could move the hashing to Python and use it there; that should not be a problem.
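The "extremely remote" claim can be quantified with the standard birthday-bound approximation. This is a back-of-the-envelope sketch, not part of the PR; the 63-bit space reflects the division by two, and the row counts are made-up examples:

```python
import math

def collision_probability(n_keys: int, bits: int = 63) -> float:
    """Approximate probability of at least one collision among n_keys
    uniformly random values in a 2**bits space (birthday bound):
    p ~= 1 - exp(-n(n-1) / (2 * 2**bits)).
    """
    space = 2 ** bits
    return 1.0 - math.exp(-n_keys * (n_keys - 1) / (2 * space))

# Even with ten million person/visit keys, the collision probability
# stays in the one-in-hundreds-of-thousands range.
p_small = collision_probability(10_000_000)
p_large = collision_probability(1_000_000_000)  # a billion keys: still small
```

And since person_id and visit_occurrence_id are primary keys, any collision would surface as a duplicate-key error at load time rather than silently corrupting data.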

epiben commented 4 months ago

Right, it's a good point that we will know thanks to the PK nature of the column. Then, all good! But, we should hash the combination of shak and courseid; see comment in code. Otherwise, LGTM.
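Hashing a combination of fields is worth doing with a separator so that different field splits can't produce the same input string. A minimal sketch, assuming a hypothetical `composite_hash` helper (the field names shak and courseid come from the comment above; the values are invented):

```python
import hashlib

def composite_hash(*fields: str, sep: str = "|") -> int:
    """Hash a combination of fields (e.g. shak and courseid) into a
    signed-BIGINT-safe integer.

    Joining with a separator before hashing prevents ambiguity from
    plain concatenation ('a' + 'bc' == 'ab' + 'c').
    """
    digest = hashlib.sha256(sep.join(fields).encode()).digest()
    return int.from_bytes(digest[:8], "big") // 2

composite_hash("1301", "42")  # shak, courseid (example values)
```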

daplaci commented 4 months ago

Done. I will open a new issue to make use of this new hash in the merge ETL.