HUD-Data-Lab / Data.Exchange.and.Interoperability

Repository for Homeless Management Information System (HMIS) development and management of products to support data exchange and interoperability
GNU General Public License v3.0

[Feature] Consistent ClientID (or new HashID) generated by SHA1 of unique (PII) fields #26

Open TomNUSDS opened 3 months ago

TomNUSDS commented 3 months ago

ClientIDs in the CSV specification are just string32.

Something like SHA1(SocialSecurityNumber + Full Last Name + Initial of First Name + Date of Birth) where all these fields are normalized.

Normalization:

Example with invented name containing accents and dashes:

Name: Taylor McCoy-René
DoB: July 4, 1999
SSN: 123-00-6789

Combined String: "123006789 MCCOYRENE R 07041999"
SHA1 (80 rounds + Base62 encoding): ZLKRmaqw2sASJgXOvJFn0yeOnBi

Using CyberChef

A SHA1 digest is 160 bits, which can be encoded into 27 characters using Base62 (alphanumeric).
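A minimal sketch of the normalize, hash, and encode steps above, in Python. The normalization rules here (strip accents and punctuation, uppercase, digits-only SSN, MMDDYYYY date) are assumptions inferred from the example string, and the Base62 alphabet order (0-9, A-Z, a-z) is also an assumption, so the output may not match the CyberChef value character for character. All function names are hypothetical.

```python
import hashlib
import unicodedata
from datetime import date

# Assumed alphabet order; other Base62 variants order the ranges differently.
BASE62_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def normalize_name(name: str) -> str:
    """Strip accents, drop anything that is not a letter, and uppercase."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only if ch.isalpha()).upper()


def normalize_ssn(ssn: str) -> str:
    """Keep only the digits of the SSN."""
    return "".join(ch for ch in ssn if ch.isdigit())


def normalize_dob(dob: date) -> str:
    """Format the date of birth as MMDDYYYY."""
    return dob.strftime("%m%d%Y")


def base62(data: bytes) -> str:
    """Encode bytes as a big-endian integer written in base 62."""
    number = int.from_bytes(data, "big")
    if number == 0:
        return BASE62_ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(BASE62_ALPHABET[remainder])
    return "".join(reversed(digits))


def hash_id(combined: str) -> str:
    """SHA1 of the combined string, Base62 encoded (at most 27 characters)."""
    digest = hashlib.sha1(combined.encode("utf-8")).digest()
    return base62(digest)


if __name__ == "__main__":
    print(normalize_ssn("123-00-6789"), normalize_name("McCoy-René"),
          normalize_dob(date(1999, 7, 4)))
    print(hash_id("123006789 MCCOYRENE R 07041999"))
```

The hash function takes the already-combined string so that the field order and separators stay an explicit, separate decision.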

Pros:

Issues:

Changing a single digit to see how it changes the SHA1:

TEST String with last character changed: "123006789 MCCOYRENE R 07041998"
SHA1 (Base62): SEa3N39nU6jNJRzoQDRsRXU98PA

TEST String with first character changed: "223006789 MCCOYRENE R 07041999"
SHA1 (Base62): ZoQloT5QtmuYhFSOpXVZ9aVykMk
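The same comparison can be made at the bit level. The sketch below counts how many of the 160 digest bits differ between the original string and each one-character variant; the function name is hypothetical, and the exact counts are not stated here because they depend on the actual digests.

```python
import hashlib


def sha1_bit_difference(a: str, b: str) -> int:
    """Count the bits that differ between the SHA1 digests of two strings."""
    digest_a = hashlib.sha1(a.encode("utf-8")).digest()
    digest_b = hashlib.sha1(b.encode("utf-8")).digest()
    return sum(bin(x ^ y).count("1") for x, y in zip(digest_a, digest_b))


original = "123006789 MCCOYRENE R 07041999"
changed_last = "123006789 MCCOYRENE R 07041998"
changed_first = "223006789 MCCOYRENE R 07041999"

if __name__ == "__main__":
    for label, variant in (("last", changed_last), ("first", changed_first)):
        bits = sha1_bit_difference(original, variant)
        print(f"{label} character changed: {bits} of 160 bits differ")
```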

TomNUSDS commented 3 months ago

Another idea:

TomNUSDS commented 2 months ago

One interesting issue with this approach: if the user doesn't supply complete and correct information (e.g. if they don't recall their SSN), the generated ID will be wrong.

Will changing the LastName or fixing the SocialSecurityNumber change the PersonID?

Maybe the PersonID should be a UUID (random) and a new field named HashID should be used for bi-directional syncing (or finding matching records across systems). If any of the primary fields are updated, then a new HashID is generated.
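A rough sketch of that split, assuming a hypothetical PersonRecord type: the PersonID is a random UUID assigned once, while the HashID is recomputed whenever one of the hashed fields changes. The field names and the plain hex SHA1 are illustrative assumptions, not part of the CSV specification.

```python
import hashlib
import uuid
from dataclasses import dataclass, field

# Fields that feed the HashID (assumed to be already normalized).
HASHED_FIELDS = ("ssn", "last_name", "first_initial", "dob")


@dataclass
class PersonRecord:
    ssn: str
    last_name: str
    first_initial: str
    dob: str
    # Random and stable: never derived from PII. uuid4().hex is 32 characters.
    person_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    # Derived from the hashed fields; regenerated on every update.
    hash_id: str = field(init=False, default="")

    def __post_init__(self) -> None:
        self.refresh_hash_id()

    def refresh_hash_id(self) -> None:
        combined = " ".join(getattr(self, name) for name in HASHED_FIELDS)
        self.hash_id = hashlib.sha1(combined.encode("utf-8")).hexdigest()

    def update(self, **changes: str) -> None:
        """Change one or more hashed fields and regenerate the HashID."""
        for name, value in changes.items():
            if name not in HASHED_FIELDS:
                raise ValueError(f"not a hashed field: {name}")
            setattr(self, name, value)
        self.refresh_hash_id()


if __name__ == "__main__":
    record = PersonRecord("123006789", "MCCOYRENE", "R", "07041999")
    before = (record.person_id, record.hash_id)
    record.update(ssn="223006789")  # e.g. a corrected SSN
    print(before[0] == record.person_id, before[1] == record.hash_id)
```

With this layout, correcting a LastName or SSN changes only the HashID, so references keyed on the PersonID stay valid.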

TomNUSDS commented 2 months ago

A further thought: the above approach can be used to generate a NEW field called HashID.

Benefits:

(NOTE: if the HashID can be longer than Str32, then investigate more robust schemes like HMAC-SHA256 https://en.wikipedia.org/wiki/HMAC)
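For comparison, a minimal HMAC-SHA256 sketch using Python's standard hmac module. The secret key shown is a placeholder, and how a key would be shared between systems is an open question this sketch does not address; the function name is hypothetical. The hex output is 64 characters, so it would not fit in a Str32 field.

```python
import hashlib
import hmac


def hmac_hash_id(secret_key: bytes, combined: str) -> str:
    """Keyed hash of the combined string; returns 64 hex characters."""
    return hmac.new(secret_key, combined.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    key = b"placeholder-shared-secret"  # assumption: a key agreed between systems
    print(hmac_hash_id(key, "123006789 MCCOYRENE R 07041999"))
```

Unlike a bare SHA1, the result cannot be reproduced by someone who knows the PII fields but not the key.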

TomNUSDS commented 2 months ago

One potential issue is the SSN being empty (because a person refused to give it or didn't know it). If this is common, then this approach could fail frequently.

Could a synthetic SSN help fill the gap? Basically, generate a random SSN but keep the two center digits as -00- (which is disallowed by SSN rules).
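A small sketch of that idea, with hypothetical function names: random digits for the first three and last four positions, and a fixed -00- in the middle so a synthetic value can be told apart from a real SSN.

```python
import re
import secrets


def synthetic_ssn() -> str:
    """Generate a placeholder SSN whose two center digits are always 00."""
    area = secrets.randbelow(1000)     # first three digits, 000-999
    serial = secrets.randbelow(10000)  # last four digits, 0000-9999
    return f"{area:03d}-00-{serial:04d}"


def is_synthetic_ssn(ssn: str) -> bool:
    """True if the value has the NNN-00-NNNN shape used for synthetic SSNs."""
    return re.fullmatch(r"\d{3}-00-\d{4}", ssn) is not None


if __name__ == "__main__":
    value = synthetic_ssn()
    print(value, is_synthetic_ssn(value))
```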

It would also be useful to add a HudUUID value to records that takes the place of PersonIDs (which could be used by databases as the primary key, but may also be auto-incremented and thus overlap across different CoCs).