[Feature] Consistent ClientID (or new HashID) generated by SHA1 of unique (PII) fields

TomNUSDS commented 4 months ago

ClientIDs in the CSV specification are just string32.

A consistent ClientID could be generated with something like SHA1(SocialSecurityNumber + Full Last Name + Initial of First Name + Date of Birth), where all of these fields are normalized first.

Normalization:

SocialSecurityNumber: just the numeric digits
Date of Birth: YYYYMMDD
Full Last Name: Only Alpha characters uppercased (strip out hyphens/spaces, convert é to e)
Initial of First Name: Only a single Alpha character uppercased.
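
A minimal sketch of the normalization rules above, using only the Python standard library. The function names are hypothetical, and using Unicode decomposition to turn é into e is an assumption about how accents should be handled:

```python
import re
import unicodedata
from datetime import date


def strip_accents(text: str) -> str:
    # Decompose accented letters (é -> e + combining mark), then drop the marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_ssn(ssn: str) -> str:
    # Keep just the numeric digits.
    return re.sub(r"[^0-9]", "", ssn)


def normalize_last_name(name: str) -> str:
    # Uppercase, then keep only A-Z (drops hyphens, spaces, apostrophes).
    return re.sub(r"[^A-Z]", "", strip_accents(name).upper())


def normalize_first_initial(name: str) -> str:
    # First alphabetic character of the first name, uppercased.
    return normalize_last_name(name)[:1]


def normalize_dob(dob: date) -> str:
    # YYYYMMDD
    return dob.strftime("%Y%m%d")


print(normalize_ssn("123-00-6789"))          # 123006789
print(normalize_last_name("McCoy-René"))     # MCCOYRENE
print(normalize_first_initial("Taylor"))     # T
print(normalize_dob(date(1999, 7, 4)))       # 19990704
```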

Example with invented name containing accents and dashes:

Name: Taylor McCoy-René
DoB: July 4, 1999
SSN: 123-00-6789

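Combined String: "123006789 MCCOYRENE R 07041999"
(Note: strictly following the rules above, the initial for Taylor would be T and the date would be 19990704, giving "123006789 MCCOYRENE T 19990704".)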
SHA1 (80 rounds + Base62 encoding): ZLKRmaqw2sASJgXOvJFn0yeOnBi

(Computed using CyberChef.)

A SHA1 digest is 160 bits, which can be encoded into at most 27 characters using Base62 (alphanumeric).
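
The hashing and encoding step could be sketched as below. The Base62 alphabet order (digits, then uppercase, then lowercase) is an assumption, so the output is not guaranteed to match the CyberChef value quoted above; `base62_encode` and `sha1_base62` are hypothetical helper names.

```python
import hashlib
import math

# Assumed alphabet order; other tools may order the 62 symbols differently.
BASE62_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def base62_encode(data: bytes) -> str:
    # Treat the bytes as one big-endian integer and convert it to base 62.
    number = int.from_bytes(data, "big")
    if number == 0:
        return BASE62_ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(BASE62_ALPHABET[remainder])
    return "".join(reversed(digits))


def sha1_base62(combined: str) -> str:
    digest = hashlib.sha1(combined.encode("utf-8")).digest()  # 20 bytes = 160 bits
    return base62_encode(digest)


# Upper bound on the encoded length of a 160-bit value in base 62.
max_chars = math.ceil(160 / math.log2(62))
print(max_chars)  # 27

client_id = sha1_base62("123006789 MCCOYRENE R 07041999")
print(len(client_id) <= max_chars)  # True
```

Because leading zero digits are dropped by the integer conversion, an encoded value can occasionally be shorter than 27 characters; a fixed-width ID would need left-padding.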

Pros:

Different systems will generate the same ClientID consistently and help with duplicate resolution.
SHA1 should protect PII while still using it to create the ClientID (verify this assumption).

Issues:

If there's a typo in the last name, it will generate a DIFFERENT ClientID. This would probably need some human reconciliation process for cases where two different ClientIDs have the same SSN and DoB. Flagging potential duplicates for resolution probably requires some additional status fields in the Client data.
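
One way the flagging step might look, as a rough sketch: group records by SSN and DoB and report any group that carries more than one ClientID. The record layout and the field names `client_id`, `ssn`, and `dob` are hypothetical, not taken from the CSV specification.

```python
from collections import defaultdict


def find_potential_duplicates(records):
    """Return {(ssn, dob): set of ClientIDs} for groups with more than one ClientID."""
    groups = defaultdict(set)
    for record in records:
        # Skip records where either key field is missing; they cannot be grouped.
        if record["ssn"] and record["dob"]:
            groups[(record["ssn"], record["dob"])].add(record["client_id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


records = [
    {"client_id": "A", "ssn": "123006789", "dob": "19990704"},
    {"client_id": "B", "ssn": "123006789", "dob": "19990704"},  # same person, typo in name?
    {"client_id": "C", "ssn": "223006789", "dob": "19990704"},
    {"client_id": "D", "ssn": "", "dob": "19990704"},
]
print(find_potential_duplicates(records))
```

The flagged groups would still need human review, since a shared SSN and DoB could also come from a data-entry error on a different person's record.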

Changing a single digit to see how the SHA1 changes:

TEST String with last character changed: "123006789 MCCOYRENE R 07041998"
SHA1 (Base62): SEa3N39nU6jNJRzoQDRsRXU98PA

TEST String with first character changed: "223006789 MCCOYRENE R 07041999"
SHA1 (Base62): ZoQloT5QtmuYhFSOpXVZ9aVykMk
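
The same experiment can be reproduced by counting how many of the 160 digest bits differ between two inputs. This is only an illustrative sketch; `differing_bits` is a hypothetical helper.

```python
import hashlib


def sha1_int(text: str) -> int:
    # SHA-1 digest as a 160-bit integer.
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")


def differing_bits(a: str, b: str) -> int:
    # XOR the two digests and count the set bits.
    return bin(sha1_int(a) ^ sha1_int(b)).count("1")


original = "123006789 MCCOYRENE R 07041999"
last_changed = "123006789 MCCOYRENE R 07041998"
first_changed = "223006789 MCCOYRENE R 07041999"

print(differing_bits(original, last_changed), "of 160 bits differ")
print(differing_bits(original, first_changed), "of 160 bits differ")
```

For a well-behaved hash, roughly half of the bits are expected to flip for any one-character change, regardless of where in the string the change occurs.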

TomNUSDS commented 4 months ago

Another idea:

If the SHA1 only uses 27 characters, then there's space in a Char32 to prepend an algorithm + version prefix such as v1s1 (version 1, SHA 1). This would be useful for forwards compatibility if the algorithm changes, and for backwards compatibility, since it identifies which IDs do not use this approach.
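
A small sketch of how the prefix could be applied and recognized. The helper names are hypothetical, and the 32-character limit is taken from the string32 field size mentioned at the top of this issue.

```python
PREFIX = "v1s1"   # version 1, SHA 1
MAX_LENGTH = 32   # string32 field


def add_prefix(hash_base62: str) -> str:
    candidate = PREFIX + hash_base62
    if len(candidate) > MAX_LENGTH:
        raise ValueError(f"ID is {len(candidate)} characters; limit is {MAX_LENGTH}")
    return candidate


def split_prefix(client_id: str):
    # Returns (prefix, hash) for IDs in this scheme, (None, original) otherwise.
    if client_id.startswith(PREFIX):
        return PREFIX, client_id[len(PREFIX):]
    return None, client_id


prefixed = add_prefix("ZLKRmaqw2sASJgXOvJFn0yeOnBi")
print(prefixed, len(prefixed))        # v1s1ZLKRmaqw2sASJgXOvJFn0yeOnBi 31
print(split_prefix("legacy-id-42"))   # (None, 'legacy-id-42')
```

A legacy ID that happens to begin with the same four characters would be misidentified, so a prefix check like this is a heuristic unless legacy IDs are known never to start that way.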

TomNUSDS commented 3 months ago

One interesting issue with this approach: if the user doesn't supply complete and correct information (e.g. if they don't recall their SSN), the generated ID will be wrong.

Will changing the LastName or fixing the SocialSecurityNumber change the PersonID?

Maybe the PersonID should be a UUID (random) and a new field named HashID should be used for bi-directional syncing (or finding matching records across systems). If any of the primary fields are updated, then a new HashID is generated.

TomNUSDS commented 3 months ago

A further thought: the above approach can be used to generate a NEW field called HashID.

Make the field String32
Prepend v1. to string
Update if the fields are updated (but NOT if they are cleared)
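
The update rule could be sketched as follows. This assumes the fields are already normalized, that "cleared" means empty or missing, and that the combined string is space-separated as in the earlier example; the field names and the Base62 alphabet order are hypothetical.

```python
import hashlib

SOURCE_FIELDS = ("ssn", "last_name", "first_initial", "dob")  # hypothetical names
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def base62(data: bytes) -> str:
    number = int.from_bytes(data, "big")
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits)) or ALPHABET[0]


def compute_hash_id(fields: dict) -> str:
    combined = " ".join(fields[name] for name in SOURCE_FIELDS)
    digest = hashlib.sha1(combined.encode("utf-8")).digest()
    return "v1." + base62(digest)  # 3 + at most 27 characters fits in String32


def next_hash_id(current_hash_id, fields: dict):
    # If any source field has been cleared, keep the existing HashID.
    if any(not fields.get(name) for name in SOURCE_FIELDS):
        return current_hash_id
    # Otherwise recompute from the current field values.
    return compute_hash_id(fields)


fields = {"ssn": "123006789", "last_name": "MCCOYRENE",
          "first_initial": "T", "dob": "19990704"}
hash_id = next_hash_id(None, fields)
cleared = dict(fields, ssn="")
print(next_hash_id(hash_id, cleared) == hash_id)  # True: clearing keeps the old HashID
```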

Benefits:

HashID allows quick comparison of these fields between two systems without having to transmit the PII
If PII fields are ever cleared for privacy reasons, the HashID stays the same. So, if a client returns to the system after being cleared, the old record could still be found.
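
As an illustration of the comparison, two systems could exchange only their HashIDs and intersect them. The mapping shape (HashID to a system-local record ID) and the placeholder values are assumptions made for the example.

```python
def match_by_hash_id(local: dict, remote: dict) -> list:
    """Both arguments map HashID -> that system's own record ID.

    Returns (local_record_id, remote_record_id) pairs for shared HashIDs.
    """
    shared = local.keys() & remote.keys()
    return sorted((local[h], remote[h]) for h in shared)


# Placeholder HashIDs, not real hashes.
system_a = {"v1.AAAA": "a-001", "v1.BBBB": "a-002"}
system_b = {"v1.BBBB": "b-917", "v1.CCCC": "b-918"}

print(match_by_hash_id(system_a, system_b))  # [('a-002', 'b-917')]
```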

(NOTE: if the HashID can be longer than String32, then investigate more robust schemes such as HMAC-SHA256: https://en.wikipedia.org/wiki/HMAC)
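
For comparison, a keyed variant using Python's standard `hmac` module might look like this. The secret key, how it would be shared between systems, and the unpadded URL-safe Base64 encoding are all assumptions for the sketch; the result is 43 characters, so it would not fit in String32.

```python
import base64
import hashlib
import hmac


def hmac_hash_id(combined: str, secret_key: bytes) -> str:
    # HMAC-SHA256 produces a 32-byte tag that depends on both the key and the message.
    tag = hmac.new(secret_key, combined.encode("utf-8"), hashlib.sha256).digest()
    # URL-safe Base64 without padding: 43 characters for 32 bytes.
    return base64.urlsafe_b64encode(tag).rstrip(b"=").decode("ascii")


key = b"example-shared-secret"  # placeholder; a real key would be random and kept private
hash_id = hmac_hash_id("123006789 MCCOYRENE T 19990704", key)
print(len(hash_id))  # 43

# Constant-time comparison when checking a received HashID against a local one.
print(hmac.compare_digest(hash_id, hmac_hash_id("123006789 MCCOYRENE T 19990704", key)))  # True
```

Without the key, the HashID cannot be recomputed from guessed field values, which is the main difference from a plain hash; the trade-off is that every participating system has to hold the same key.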

TomNUSDS commented 3 months ago

One potential issue is the SSN being empty (because a person refused to give it or didn't know it). If this is common, then this approach could fail frequently.

Could a synthetic SSN help? Basically, generate a random SSN but keep the two center digits as -00- (which is disallowed by SSN rules).
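
A rough sketch of generating and recognizing such a value. The choice of fully random first and last groups is an assumption; the only rule taken from the text above is the fixed -00- center.

```python
import re
import secrets


def synthetic_ssn() -> str:
    area = secrets.randbelow(1000)     # first three digits, 000-999
    serial = secrets.randbelow(10000)  # last four digits, 0000-9999
    return f"{area:03d}-00-{serial:04d}"


def is_synthetic(ssn: str) -> bool:
    # A center group of 00 marks the value as synthetic.
    return re.fullmatch(r"[0-9]{3}-00-[0-9]{4}", ssn) is not None


value = synthetic_ssn()
print(value, is_synthetic(value))
print(is_synthetic("123-45-6789"))  # False
```

Because the value is random, two systems would generate different synthetic SSNs for the same person unless the value is shared between them.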

It would also be useful to add a HudUUID value to records to take the place of PersonIDs, which databases may use as the primary key but which may also be auto-incremented and therefore overlap across different CoCs.

HUD-Data-Lab / Data.Exchange.and.Interoperability

[Feature] Consistent ClientID (or new HashID) generated by SHA1 of unique (PII) fields #26