Closed machow closed 3 years ago
I am choosing to go with Farm_fingerprint instead of md5 because our only requirement is that it be a unique value for each unique input to avoid collisions, it has a simpler requirements than cryptographic hashes like md5 which require the hash to be reversible/random. More importantly it's return type is INT64 (compared to bytes for md5) and runs faster making it very useful for generating surrogate keys for large volumes of data
Thanks for the explanation--glancing at SO, it definitely seems like farm_fingerprint is a better choice (for the reasons you gave)! I wonder why so many people are using an md5 hash :o. Maybe familiarity or something..
Actually--I wonder if it's because farm fingerprint is only 64 bit, while md5 hash is 128 bit?
In any event, let's keep it the way you've got it, since it's a quick swap out if we need md5 later on!
Let's create versions of each dataset in gtfs_schedule_type2 with the suffix
_clean
. E.g.gtfs_schedule_type2.stop_times_clean
.TODO
Easy things to do for now:
date
column (check https://developers.google.com/transit/gtfs/reference for DATE columns),stop_id
calitp_hash
,calitp_extracted_at
)to_hex(md5(...))
? It sounds like it would be ideal, since it returns an INT64gtfs_schedule_type2
and flag any other columns that might need cleaning.Notes:
Can hash using code like...
(We can you doublecheck what FARM_FINGERPRINT does compared to
to_hex(md5(...))
, and put answer here? If it's a much smaller datatype let's definitely do that)