Clean type2 data in warehouse for downstream calculations

machow commented 3 years ago

Let's create versions of each dataset in gtfs_schedule_type2 with the suffix _clean. E.g. gtfs_schedule_type2.stop_times_clean.

TODO

Easy things to do for now:

- parse any date column (check https://developers.google.com/transit/gtfs/reference for DATE columns)
- create a surrogate key. This is a key that's unique for each entry in a table. It should be a hash of the ID column and calitp_extracted_at.
  - stops - stop_key (i.e., hash ~~stop_id~~, calitp_hash, calitp_extracted_at)
  - routes - route_key
  - trips - trip_key
  - Can we use FARM_FINGERPRINT instead of to_hex(md5(...))? It sounds like it would be ideal, since it returns an INT64.
- Go through the GTFS spec for our tables in gtfs_schedule_type2 and flag any other columns that might need cleaning.
- Once done, or if you want to test on the pipeline, sync with @cvc5185.
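
For the date-parsing item, here is a minimal Python sketch of the conversion, assuming the GTFS Date format (YYYYMMDD, e.g. 20180913). The helper name parse_gtfs_date is hypothetical and not part of the pipeline. In the warehouse itself the same format string, %Y%m%d, should work with BigQuery's PARSE_DATE.

```python
from datetime import date, datetime
from typing import Optional


def parse_gtfs_date(raw: Optional[str]) -> Optional[date]:
    """Parse a GTFS YYYYMMDD string into a date.

    Returns None for missing or malformed values instead of raising,
    so bad rows can be flagged rather than failing the whole load.
    (Hypothetical helper, for illustration only.)
    """
    if raw is None:
        return None
    text = raw.strip()
    # Require exactly 8 digits; strptime alone would also accept
    # shorter strings such as "2018913".
    if len(text) != 8 or not text.isdigit():
        return None
    try:
        return datetime.strptime(text, "%Y%m%d").date()
    except ValueError:
        # Eight digits but not a real calendar date, e.g. "20181332".
        return None
```

Returning None for bad input is a design choice for this sketch. Raising an error may be preferable if malformed dates should stop the load.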

Notes:

Can hash using code like...

SELECT to_hex(md5(CONCAT(CAST(1 AS STRING), "___", CAST(2 AS STRING))))
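
To spot-check keys outside the warehouse, the same md5 construction can be mirrored in Python. This is a sketch, not pipeline code: md5_key is a hypothetical name, and it assumes each part is stringified the same way BigQuery's CAST(... AS STRING) would do it, which holds for small integers but may not hold for timestamps.

```python
import hashlib

# Same separator as the SQL snippet above.
SEPARATOR = "___"


def md5_key(*parts: object) -> str:
    """Join the parts with the separator and return the md5 hex digest.

    Mirrors to_hex(md5(CONCAT(a, "___", b))) for values whose str()
    matches BigQuery's string cast. (Hypothetical helper.)
    """
    joined = SEPARATOR.join(str(part) for part in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Without a separator, ("1", "23") and ("12", "3") would both
# concatenate to "123" and hash to the same key.
key_a = md5_key("1", "23")
key_b = md5_key("12", "3")
```

The separator matters more than the choice of hash function here: it keeps different column combinations from collapsing into the same input string.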

(Can you double-check what FARM_FINGERPRINT does compared to to_hex(md5(...)), and put the answer here? If it's a much smaller datatype, let's definitely do that.)

Nkdiaz commented 3 years ago

I am choosing to go with FARM_FINGERPRINT instead of md5 because our only requirement is that distinct inputs map to distinct values, i.e. that collisions are rare enough to ignore. That is a simpler requirement than what cryptographic hashes like md5 are designed for, which is to be hard to reverse and to look random. More importantly, its return type is INT64 (compared to BYTES for md5) and it runs faster, which makes it very useful for generating surrogate keys for large volumes of data.

machow commented 3 years ago

Thanks for the explanation--glancing at SO, it definitely seems like farm_fingerprint is a better choice (for the reasons you gave)! I wonder why so many people are using an md5 hash :o. Maybe familiarity or something..

https://stackoverflow.com/a/57401816/1144523

machow commented 3 years ago

Actually--I wonder if it's because farm fingerprint is only 64 bit, while md5 hash is 128 bit?
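
One way to gauge whether 64 bits is enough is a rough birthday-bound estimate, sketched here in Python. It assumes hash outputs are uniformly distributed, which is an idealization for FARM_FINGERPRINT, and the row counts are made up for illustration rather than taken from our tables.

```python
import math


def collision_probability(n_keys: int, bits: int) -> float:
    """Approximate chance of at least one collision among n_keys hashes.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / (2 * 2^bits)),
    assuming a uniformly distributed hash (an idealization).
    """
    exponent = n_keys * (n_keys - 1) / (2 * 2**bits)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-exponent)


# Hypothetical table sizes, for illustration only.
for n in (10**6, 10**9):
    p64 = collision_probability(n, 64)
    p128 = collision_probability(n, 128)
    print(f"{n:>13,} rows: 64-bit {p64:.2e}, 128-bit {p128:.2e}")
```

Under these assumptions, a million keys gives a 64-bit collision chance on the order of 1e-8, while a billion keys gives roughly 2.7%. So the 64 vs 128 bit difference only starts to matter for very large tables.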

machow commented 3 years ago

In any event, let's keep it the way you've got it, since it's a quick swap out if we need md5 later on!

cal-itp / data-infra

Clean type2 data in warehouse for downstream calculations #251
