kaungst commented 3 months ago

Request for Comments: Client Deduplication API

Summary

This RFC proposes the addition of a new REST endpoint to help deduplicate client records. The endpoint will take a client's first name, last name, date of birth, and social security number as inputs and return a client ID. This client ID will either be an existing client ID if a match is found, or a new client will be created and its ID will be provided for future requests.

Motivation

Non-HMIS vendors do not have a standard means of confirming whether an individual exists within an HMIS. To decrease the likelihood of duplicate records, this endpoint provides a common set of inputs across HMIS implementations that CoC vendors can use to confirm the existence of, and pull records related to, a homeless individual.

Specification

Endpoint

POST /clients

Request

The request body should contain the following fields:

firstName (string, required): The client's first name.
lastName (string, required): The client's last name.
dateOfBirth (string, required): The client's date of birth in YYYY-MM-DD format.
socialSecurityNumber (string, required): The client's social security number.

Example:

{
  "firstName": "John",
  "lastName": "Doe",
  "dateOfBirth": "1980-01-01",
  "socialSecurityNumber": "123-45-6789"
  ... other existing client fields
}

Response

The response will return a client ID, either an existing client ID if a match is found or a new partial client ID if no match exists.

clientId (string): The client ID.
... other existing client fields
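A minimal sketch of the match-or-create behaviour described above, using an in-memory store. The names `InMemoryClientStore`, `match_key`, and `find_or_create` are hypothetical, and the matching rule (exact match on the four required fields after trimming whitespace and ignoring case) is an assumption for illustration; the RFC does not prescribe one.

```python
import uuid

# The four required request fields from the RFC.
MATCH_FIELDS = ("firstName", "lastName", "dateOfBirth", "socialSecurityNumber")


def match_key(record: dict) -> tuple:
    # Assumption: exact match after trimming whitespace and ignoring case.
    return tuple(str(record[field]).strip().casefold() for field in MATCH_FIELDS)


class InMemoryClientStore:
    """Hypothetical stand-in for the HMIS client table."""

    def __init__(self) -> None:
        self._by_key = {}   # match key -> clientId
        self._records = {}  # clientId -> stored client fields

    def find_or_create(self, request: dict) -> dict:
        key = match_key(request)
        client_id = self._by_key.get(key)
        if client_id is None:
            # No match: create a new client and remember its ID.
            client_id = str(uuid.uuid4())
            self._by_key[key] = client_id
            self._records[client_id] = dict(request)
        # Either way, the response carries a clientId plus the client fields.
        return {"clientId": client_id, **self._records[client_id]}
```

Sending the same identifying fields twice returns the same `clientId`, which is the property a caller would rely on for future requests.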

Error Handling

The API should handle the following error scenarios:

400 Bad Request: If any required fields are missing or invalid.
500 Internal Server Error: If there is an unexpected error during processing.
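As a rough illustration of the 400 case, the check below collects problems with the required fields. The function name `validate_client_request` and the error messages are hypothetical. Social security number format is deliberately not checked here, since the RFC does not define one.

```python
import re
from datetime import date

REQUIRED_FIELDS = ("firstName", "lastName", "dateOfBirth", "socialSecurityNumber")


def validate_client_request(body: dict) -> list:
    """Return a list of problems; a non-empty list would map to 400 Bad Request."""
    errors = []
    for field in REQUIRED_FIELDS:
        value = body.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"{field} is required")

    dob = body.get("dateOfBirth")
    if isinstance(dob, str) and dob.strip():
        # Require the literal YYYY-MM-DD shape, then confirm it is a real date.
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", dob):
            errors.append("dateOfBirth must be in YYYY-MM-DD format")
        else:
            try:
                date.fromisoformat(dob)
            except ValueError:
                errors.append("dateOfBirth is not a valid calendar date")
    return errors
```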

HUD-Data-Lab commented 3 months ago

Draft language:

Deduplication

As outlined in the HMIS Data Dictionary, it is critical that HMIS and comparable database implementations have the capability to deduplicate client records. Whenever possible, it is recommended that there is a 1:1 relationship between an individual and a Personal ID. If duplicate records are created, they should be merged into a single record unless this would violate a privacy constraint.

Regardless of whether an individual has multiple records in the system, internal deduplication for the purposes of the Personal Identifier ensures that they are only counted once in reporting. External deduplication is needed in data integration projects to ensure that clients in the incoming data set who already have a record in the receiving system see the new information added to their existing record, rather than having an entirely new Personal ID assigned to them for the incoming data.

There is significant flexibility in implementing deduplication algorithms. Possible deduplication schemas include, but are by no means limited to, the following:

Schema 1

Exact match on:

Name
Social Security Number
Date of Birth
Race and Ethnicity
Gender
Veteran Status

Schema 2

Exact match on:
Last four digits of Social Security Number
Date of Birth

Fuzzy match on:

Name

Exact match on two or more of:

Race and Ethnicity
Gender
Veteran Status

Schema 3

Exact match on:
Name
Date of Birth

AND EITHER:

Exact match on:
- Last four digits of Social Security Number
OR
Social Security Number is null AND
Exact match on two or more of:
- Race and Ethnicity
- Gender
- Veteran Status

Vendors are encouraged to choose a robust deduplication method and to be transparent with their communities about how this deduplication is performed.
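To make one of these schemas concrete, here is a sketch of Schema 3 as a pairwise comparison. The `ClientRecord` type, its field names, and the helper `last_four` are hypothetical. Two points are assumptions rather than part of the draft language: "Social Security Number is null" is read as "missing on either record", and a demographic field only counts toward the "two or more" rule when both records have a value for it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClientRecord:
    name: str
    date_of_birth: str  # YYYY-MM-DD
    ssn: Optional[str] = None
    race_and_ethnicity: Optional[str] = None
    gender: Optional[str] = None
    veteran_status: Optional[str] = None


def last_four(ssn: Optional[str]) -> Optional[str]:
    """Last four digits of an SSN, or None if fewer than four digits are present."""
    if not ssn:
        return None
    digits = "".join(ch for ch in ssn if ch.isdigit())
    return digits[-4:] if len(digits) >= 4 else None


def schema3_match(a: ClientRecord, b: ClientRecord) -> bool:
    # Exact match on Name and Date of Birth is always required.
    if a.name != b.name or a.date_of_birth != b.date_of_birth:
        return False

    # EITHER: exact match on the last four digits of the SSN...
    a4, b4 = last_four(a.ssn), last_four(b.ssn)
    if a4 is not None and a4 == b4:
        return True

    # ...OR: SSN is null (assumed: on either record) and two or more
    # of the demographic fields match exactly.
    if a.ssn is None or b.ssn is None:
        pairs = [
            (a.race_and_ethnicity, b.race_and_ethnicity),
            (a.gender, b.gender),
            (a.veteran_status, b.veteran_status),
        ]
        agreeing = sum(1 for x, y in pairs if x is not None and x == y)
        return agreeing >= 2
    return False
```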

eric-jahn-bitfocus commented 3 months ago

We have a few suggestions regarding deduplication-related changes and identifiers.

1. Distinguish a client ID from Personal Identifier (HMIS Data Element 5.08)

RFC 7 appears to use the unique identifier for a record (“clientID”) interchangeably with the unique identifier for a person (“personalID”; HMIS Data Element 5.08).

For context: Clarity assigns a unique ClientID to each client record and uses Personal Identifier (5.08) to deduplicate across client records. Much as a patient may have distinct records with multiple providers linked together by a common ID, clients in Clarity often have multiple records maintained by various providers and linked by a common Personal Identifier.

A user’s visibility into individual records is governed by their sharing and access rights. For example, two programs may share a common client (with a single Personal Identifier) but, due to strict sharing rules, be unable to access each other’s records (each with a unique ClientID). In other words, neither program has any way of knowing that its client is being served by another program.

A Personal Identifier (5.08) linked to multiple client records allows for deduplication across organizations and accurate client counts.
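The relationship described here can be sketched as a small data model: several ClientIDs link to one Personal Identifier, counts are taken over Personal Identifiers, and visibility is filtered per project. The type `ClientRecordLink` and both functions are hypothetical names, and representing sharing rules as a simple set of accessible project IDs is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientRecordLink:
    client_id: str    # unique per client record
    personal_id: str  # Personal Identifier (5.08), shared by linked records
    project_id: str   # project that maintains this record


def unduplicated_count(records: list) -> int:
    """Count people, not records, by counting distinct Personal Identifiers."""
    return len({r.personal_id for r in records})


def visible_records(records: list, accessible_projects: set) -> list:
    """Records a caller may see, assuming sharing rules reduce to a project set."""
    return [r for r in records if r.project_id in accessible_projects]


records = [
    ClientRecordLink("c-1", "p-1", "shelter"),
    ClientRecordLink("c-2", "p-1", "outreach"),  # same person, second provider
    ClientRecordLink("c-3", "p-2", "shelter"),
]
```

With these sample rows there are three records but two people, and a caller limited to the `outreach` project sees only `c-2`.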

This distinction presents a few challenges:

a. Identifying which client ID to act on

A ClientID must be specified to access or edit a record. Where a Personal Identifier (5.08) is associated with multiple ClientIDs, the call must specify which of the linked records to use. The current version of the draft API writes directly to the Personal Identifier demographics. These are controlled by the HMIS deduplication algorithm, which compares all records, including ones likely inaccessible to the current API user (see screenshot below).

[Screenshot "put2": excerpt from v1.0.0 of draft HUD Data Lab API]

Instead, the draft API should provide methods that write to the specific linked ClientID rather than to the PersonalID.

b. Impermanence of Personal ID (5.08) values

System administrators can manually manage Personal Identifiers (linking, unlinking, and relinking one or more ClientIDs). Unless the HMIS stores all linking history, this relinking could break subsequent API calls that use an outdated Personal Identifier (5.08). Moreover, improvements to the match algorithm could also cause Personal Identifiers (5.08) to be relinked.

Instead, a globally unique clientID should serve as the API-editable and durable identifier for a homeless client’s record, bounded also by project sharing rules.

2. Support (at a minimum) the additional fields required for HUD deduplication

This was likely an intentional omission falling under “ ... other existing client fields”, but the following additional identifiers and demographics should be included for deduplication.

a. Name

include Middle (3.01.2)
include Suffix (3.01.4)
b. Race and Ethnicity

c. Gender

d. Veteran Status

Note: SSN should also be able to handle empty value placeholders (often a single character ‘x’).
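One way to handle such placeholders is to normalize the SSN before matching. This is only a sketch: the function name `normalize_ssn` is hypothetical, and treating an empty string or a lone 'x' as "no SSN" is an assumption based on the note above, not a documented rule.

```python
from typing import Optional


def normalize_ssn(raw: Optional[str]) -> Optional[str]:
    """Return only the digits of an SSN, or None when no usable value was supplied."""
    if raw is None:
        return None
    value = raw.strip()
    # Assumed placeholders for "no SSN": empty string or a single 'x'.
    if value == "" or value.lower() == "x":
        return None
    digits = "".join(ch for ch in value if ch.isdigit())
    # A value with no digits at all (e.g. "xxx-xx-xxxx") is also treated as missing.
    return digits or None
```

A partially masked value keeps whatever digits are present, so a last-four comparison can still be attempted downstream.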

We encourage HUD to consider support for all client fields, including customer-defined data elements.

3. Obtain the User Identifier (5.07) from the authentication process

Many of the draft API methods contain the “userId” property. For security and accountability reasons, the acting single-user context should be obtained during authentication rather than read from the request body.

a. Sharing and Permissions reasons

Role-based access control within an HMIS requires the access of each API credential to be scoped and maintained independently. For example, project sharing rules often do not permit a given HMIS user to access all of a given Personal Identifier's linked client IDs, or certain areas of a client's record. Acting on behalf of a User Identifier other than the authenticated one presents a security risk, so there should be no need to declare a userId property (it will always be the authenticated User Identifier).

b. Compliance with User Identifier (5.07) for write actions.

Unless an entire API call/token is mapped to a single named HMIS user, we cannot accurately log user actions as required by element 5.07.
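A sketch of what this could look like on the server side: the acting user comes from the authentication context, and a body-supplied userId is never trusted. `AuthContext` and `resolve_acting_user` are hypothetical names, and rejecting a mismatched userId (instead of silently ignoring it) is one possible design choice, not something the draft API specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthContext:
    # User Identifier (5.07) resolved from the credential during authentication.
    user_id: str


def resolve_acting_user(auth: AuthContext, body: dict) -> str:
    """Return the user to log the action against, always taken from authentication."""
    claimed = body.get("userId")
    if claimed is not None and claimed != auth.user_id:
        # Assumption: a conflicting userId is treated as an error.
        raise PermissionError("userId in request body does not match the authenticated user")
    return auth.user_id
```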

eanders commented 3 months ago

I'll second a need for awareness of varying levels of access. In the case of client de-duplication, it may be necessary to specify a project at the time of creation to disambiguate the client on the HMIS side.

My reading of the RFC is that the client ID returned by the API does not need to match an existing ID in the system. It could be a value attached to an existing client and used specifically for the API, but that does hide some complexity in the implementation that @eric-jahn-bitfocus is calling out.

I agree with @eric-jahn-bitfocus as well that the UserID should be fetched from the authorization. However, there may be more complexity there, since this could be a system-to-system (API) communication: the user submitting the data may not have set up the integration, and therefore may not exist in the HMIS or be the user associated with the credentials.

eric-jahn-bitfocus commented 3 months ago

Good thoughts all-around @eanders. For system-to-system syncing, we could specify a different OAuth 2 flow (for a user acting not on their own behalf, but on behalf of others), where the user IDs are required within the request body. It could function a lot like the user metadata in HMIS CSV and XML currently does.

kaungst commented 3 months ago

@eric-jahn-bitfocus thanks for the suggestions!

Re: 1 What you're describing re: the usage of clientId vs. personalId makes sense, especially given sharing rules across programs.

Do you think it would make sense to propose the following?

adding a clientId to take the place of a unique identifier for the Client resource
relaxing the uniqueness constraint to a uniqueness of a person within the system
- unsure if it would then be useful to have a "Person" object. maybe not, since part of the problem is the intentional obfuscation of an individual's existence within HMIS to support privacy rules

Doing so might also help with the impermanence of personal ids, hopefully with minimal disruption to other resources?

Re: 2 Agree that it'd be useful to allow additional information to be sent to help with de-duplication. Maybe those fields could be listed as optional? Also want to define a minimum set of fields that can be used across the board for de-duplication and ensure those are required.

Re: 3 Might be worth spinning that out into a separate issue?

HUD-Data-Lab / Data.Exchange.and.Interoperability

[RFC] Update POST /clients to handle deduplication on intake #7