cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0
47 stars 12 forks source link

Clean type2 data in warehouse for downstream calculations #251

Closed machow closed 3 years ago

machow commented 3 years ago

Let's create versions of each dataset in gtfs_schedule_type2 with the suffix _clean. E.g. gtfs_schedule_type2.stop_times_clean.

TODO

Easy things to do for now:

Notes:

Can hash using code like...

SELECT to_hex(md5(CONCAT(CAST(1 AS STRING), "___", CAST(2 AS STRING))))

(We can you doublecheck what FARM_FINGERPRINT does compared to to_hex(md5(...)), and put answer here? If it's a much smaller datatype let's definitely do that)

Nkdiaz commented 3 years ago

I am choosing to go with Farm_fingerprint instead of md5 because our only requirement is that it be a unique value for each unique input to avoid collisions, it has a simpler requirements than cryptographic hashes like md5 which require the hash to be reversible/random. More importantly it's return type is INT64 (compared to bytes for md5) and runs faster making it very useful for generating surrogate keys for large volumes of data

machow commented 3 years ago

Thanks for the explanation--glancing at SO, it definitely seems like farm_fingerprint is a better choice (for the reasons you gave)! I wonder why so many people are using an md5 hash :o. Maybe familiarity or something..

https://stackoverflow.com/a/57401816/1144523

machow commented 3 years ago

Actually--I wonder if it's because farm fingerprint is only 64 bit, while md5 hash is 128 bit?

machow commented 3 years ago

In any event, let's keep it the way you've got it, since it's a quick swap out if we need md5 later on!