Closed chrisroederucdenver closed 2 weeks ago
I implemented a simple hash of the ID root and extension for person and visit, and added those values in extra columns at the end of the resulting table/csv/dataset.
Added unit tests of the HASH field type to test/test_field_types.py.
The same technique will be used for additional sections.
See also #125
The Problem
Different documents have different types of IDs for the patients. We need to pick one to serve two purposes. First is that we need a unique ID for the patient. Second is that we might use that ID to find the patient and match up with other documents that are about the same patient. I don't think choosing the ID types in a certain order will work when not all documents for a given patient have all the IDs (see chart below).
Solutions
The solution is just to pick the first available ID and make a hash/token of its root and extension. This solves the unique ID problem. If another document produces the same hash, we assume it's legitimately the same patient.
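A minimal sketch of that idea, not the code in this PR. The function name `hash_first_available`, the use of SHA-256, the `|` separator, and the sample roots and extensions are all assumptions made for illustration.

```python
import hashlib

def hash_first_available(ids):
    """Hash the first (root, extension) pair where both parts are present.

    `ids` is a list of (root, extension) tuples in document order.
    Hypothetical helper; SHA-256 is an assumed choice of hash.
    """
    for root, extension in ids:
        if root and extension:
            # Separator keeps ("1.2", "34") distinct from ("1.23", "4"),
            # assuming "|" never appears in either part.
            key = f"{root}|{extension}".encode("utf-8")
            return hashlib.sha256(key).hexdigest()
    return None  # no usable ID in this document

# Made-up sample values: the first ID of doc_1 is empty, so the second is used.
doc_1 = [(None, None), ("1.2.3.4", "12345")]
doc_2 = [("1.2.3.4", "12345")]
print(hash_first_available(doc_1) == hash_first_available(doc_2))
```

Two documents that land on the same root and extension get the same hash, which is the "assume it's the same patient" behavior described above.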
Solve the linking problem by saving the root and extension fields in array types for later matching and analysis.
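One way the saved arrays could be used later, as a sketch under assumptions: each document keeps every (root, extension) pair it has, and two documents are link candidates if they share any pair. The names `id_pairs` and `share_any_id` and the sample values are hypothetical.

```python
def id_pairs(doc_ids):
    """Return the set of complete (root, extension) pairs for one document."""
    return {(root, ext) for root, ext in doc_ids if root and ext}

def share_any_id(doc_a, doc_b):
    """True if the two documents have at least one identical ID pair."""
    return not id_pairs(doc_a).isdisjoint(id_pairs(doc_b))

# Made-up sample data: both documents carry the pair ("1.2.3.5", "S1").
doc_a = [("1.2.3.4", "M1"), ("1.2.3.5", "S1")]
doc_b = [("1.2.3.5", "S1"), ("1.2.3.6", "N1")]
doc_c = [("1.2.3.6", "N2")]

print(share_any_id(doc_a, doc_b))  # overlap on the second pair of doc_a
print(share_any_id(doc_a, doc_c))  # no pair in common
```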
More detail is in the ID Analysis doc.
Why choosing a single field according to an order won't work for matching documents.
I had to run through a simple proof by contradiction on paper to see why an order does you no good. Assume that each document A-H is about the same person and that all the MRNs are the same, all the SSNs and NPIs too. Then create a hash for each by selecting the first non-null ID field.
Documents A, B, D, and G will match on MRN. C and E will match on SSN. F won’t match with anyone because they all chose different IDs. E misses out on matching with A or G because E chose SSN while they chose MRN.
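The same argument can be checked on a smaller, made-up example. The three documents X, Y, and Z below are hypothetical and are not the A-H chart; the field order MRN, SSN, NPI is also an assumption. All three are about one person, yet picking the first non-null ID gives each a different key, while comparing all IDs still links them.

```python
from itertools import combinations

ORDER = ("mrn", "ssn", "npi")  # assumed priority order

# Hypothetical documents about the same person; None means the ID is absent.
docs = {
    "X": {"mrn": "M1", "ssn": "S1", "npi": None},
    "Y": {"mrn": None, "ssn": "S1", "npi": "N1"},
    "Z": {"mrn": None, "ssn": None, "npi": "N1"},
}

def first_non_null(doc):
    """Return (field, value) for the first ID present, following ORDER."""
    for field in ORDER:
        if doc[field] is not None:
            return (field, doc[field])
    return None

def overlap(a, b):
    """True if any ID field is present in both documents with the same value."""
    return any(a[f] is not None and a[f] == b[f] for f in ORDER)

chosen = {name: first_non_null(doc) for name, doc in docs.items()}
pairs = list(combinations(sorted(docs), 2))

# Matching on the single chosen ID finds nothing.
matched_by_choice = [(a, b) for a, b in pairs if chosen[a] == chosen[b]]
# Matching on any shared ID links X-Y (SSN) and Y-Z (NPI).
matched_by_overlap = [(a, b) for a, b in pairs if overlap(docs[a], docs[b])]

print(matched_by_choice)
print(matched_by_overlap)
```

X and Z share no ID directly, so they are only linked through Y, which is one reason to keep all the root and extension values for later analysis.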