RickHawesUSDS closed this issue 2 years ago
@brick-green-agile6 @MauriceReeves-usds this is the ticket I flagged in Slack last week. This is high priority and Rick asked that we complete it this sprint.
Acceptance criteria for this ticket:
1. The `deidentify` function in `Report` updates the `patient_zip` according to the rules in the referenced doc above (https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html#zip)
   a. The last two digits of ALL US zip codes must be replaced with `00`; for example, zip code `22191` is converted to `22100`.
   b. If the geographic region designated by those first three digits contains 20,000 or fewer residents, then the entire US zip code MUST be replaced with all zeros. For example, `03603` is within a geographic unit with fewer than 20,000 residents, so it must be changed to `00000`.
2. `patient_zip` is updated with this functionality, and only within the `deidentify` method.
3. The `covid_result_metadata` table applies this same logic to new records we're storing there.
4. A migration for `covid_result_metadata` that will apply this same logic to the data already there.

NOTE: The 17 restricted zip codes in the HHS documentation are from the 2000 census data and are almost certainly not correct. Therefore, before committing this code, please find and use the most up-to-date census data for geographic units that have 20,000 or fewer residents for a three-digit zip code.
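The two rules above can be sketched roughly as follows (a Python illustration only; the real implementation is the Kotlin `deidentify` function in `Report`, and `RESTRICTED_PREFIXES` here is a hypothetical placeholder that must be generated from current census data, per the NOTE above):

```python
# Hypothetical subset of 3-digit prefixes covering <= 20,000 residents;
# "036" appears because the 036 (NH) prefix is under 20k in the data below.
RESTRICTED_PREFIXES = {"036", "059", "102", "203"}

def deidentify_zip(zip_code: str) -> str:
    """Rule a: keep the first three digits, zero the last two.
    Rule b: if the 3-digit prefix is restricted, return all zeros."""
    if zip_code[:3] in RESTRICTED_PREFIXES:
        return "00000"
    return zip_code[:3] + "00"
```

For example, `deidentify_zip("22191")` yields `22100` and `deidentify_zip("03603")` yields `00000`.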
@RickHawesUSDS curious if you know more - do we need to change the zip code per the Safe Harbor rules or is it okay to always blank it out - for example just turn everything to 00000 always? Do they use the two digit code for the non-restricted zip code for anything?
@RickHawesUSDS @sarahnw another open question on this is if we also need to do this for non-US addresses. We actually are receiving quite a bit of data for non-US patients now, which means we need to be deliberate about how we treat those as well.
SimpleReport supports ZIP+4 formats, e.g. `12345-6789`, so we need to strip out the `-####` suffix.
It also supports international postal codes; Canadian postal codes look like `A1A 1A1`.
Getting the population of zip codes using Census data is a known hard problem (see "The Trouble with Zip Codes"). Basically, the boundaries of zip code regions don't always match the boundaries of census tracts.
If we want to take it on, then these are useful: https://www.census.gov/geographies/reference-files/2020/geo/relationship-files.html#zctacomp and https://udsmapper.org/zip-code-to-zcta-crosswalk/
I've been trying to parse the guidance. It's not entirely clear whether the resulting de-identified zip has prefixed or suffixed zeros (e.g. `##000` or `000##`).
This translation of the rule helps:
> All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census: (1) The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and (2) The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000
Q: why does it say `000` and not `00000`? This is just confusing.
You'd also think that since this has been required for the past 22 years, it would be easy to find lots of example code and/or rules based off of more recent census data... but it isn't.
The set of `###**` prefixes with population < 20k is a pretty short list: the outdated list of these prefixes from 2000 is 17 items. We could generate an updated prefix list, but the issue is that any NEW zip codes created between 2020-2030 might be too small and would NOT be included in the "population-too-small" set.
So, really, it should be a set of zip-code-prefixes known to be big enough. There are 43k zip codes, but the prefix reduces that down to a maximum of 999 items... still a pretty large set, but not unreasonable for a hard-coded btree lookup. Bloom filters might also be a good match, but are less standard.
I used zipcode-to-population data from https://simplemaps.com/data/us-zips (Creative Commons Attribution).
I created a Jupyter Notebook to process the data and here are the results up to population sums of 30k.
sum_for_prefix | zip_prefix | state_id |
---|---|---|
0 | 969 | MP |
0 | 753 | TX |
0 | 205 | DC |
0 | 204 | DC |
0 | 969 | GU |
0 | 202 | DC |
0 | 772 | TX |
0 | 008 | VI |
1088 | 203 | DC |
1183 | 821 | WY |
3559 | 059 | VT |
8531 | 692 | NE |
11221 | 893 | NV |
13374 | 036 | NH |
15413 | 823 | WY |
15554 | 102 | NY |
15781 | 556 | MN |
15964 | 879 | NM |
16235 | 884 | NM |
16269 | 878 | NM |
17340 | 369 | AL |
21643 | 576 | SD |
22098 | 999 | AK |
22110 | 830 | WY |
22680 | 994 | WA |
23019 | 516 | IA |
23249 | 831 | WY |
23591 | 669 | KS |
24031 | 414 | KY |
24229 | 822 | WY |
24755 | 690 | NE |
28004 | 739 | TX |
28004 | 739 | OK |
28282 | 418 | KY |
28606 | 051 | VT |
29428 | 679 | KS |
29917 | 266 | WV |
30300 | 677 | KS |
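The notebook's aggregation can be sketched like this (Python; the `zip` and `population` column names are an assumption based on the simplemaps dataset, and the function names are my own):

```python
import csv
from collections import defaultdict

def prefix_populations(path: str) -> dict[str, int]:
    """Sum population by 3-digit zip prefix from a zip->population CSV.
    Assumes 'zip' and 'population' columns, as in the simplemaps data."""
    sums: dict[str, int] = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            zip5 = row["zip"].zfill(5)           # restore leading zeros
            pop = int(float(row["population"] or 0))
            sums[zip5[:3]] += pop
    return dict(sums)

def small_prefixes(sums: dict[str, int], threshold: int = 20_000) -> list[str]:
    """Prefixes whose combined population is at or below the threshold."""
    return sorted(p for p, s in sums.items() if s <= threshold)
```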
Created a PR for the base development. @jorg3lopez will add unit test cases and @greene858 will add a flyway script to de-identify all existing patient_zip in the database.
Another interesting thought.
https://en.wikipedia.org/wiki/List_of_ZIP_Code_prefixes goes through ALL 3 digit zip code prefixes.
There are 18 military base prefixes that have their own zip codes: `090**`-`099**`, `340**`, `962**`-`966**`. The populations of these zip codes are not published. We should probably set them all to `00000`.
Again, we could flip the table to A) only-allow-large-enough-zip-codes versus the simpler approach of B) remove-zip-codes-that-are-too-small.
A) is ~880 3-digit items; B) is 22 3-digit items.
While A) is larger, it rejects by default and would catch zip codes that B) misses.
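Approach A can be sketched as a default-reject allow-list (Python, illustrative only; `LARGE_ENOUGH_PREFIXES` is a tiny hypothetical stand-in for the real ~880-item census-derived set):

```python
# Illustrative stand-in for the ~880 prefixes known to exceed 20,000 people.
LARGE_ENOUGH_PREFIXES = frozenset({"100", "221", "606"})

def zip_prefix_allowed(zip_code: str) -> bool:
    # Any prefix NOT known to be large enough is rejected by default.
    # This also covers brand-new prefixes created after the list was built,
    # which a deny-list would miss.
    return zip_code[:3] in LARGE_ENOUGH_PREFIXES
```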
That's a really interesting point. I'd really like at least one SME to guide us on this; it seems likely that some of those codes have populations > 20k: for example 962 and 963 are bases in Korea and Japan respectively. Does that meet safe harbor criteria? Does it matter to the end user for this de-identified data which country a result came from?
Agreed.
Thinking more on the algorithm aspect once we figure out the list:
There's a maximum of 999 3-digit prefixes. Let's assume that ~880 are known to be large-enough zip codes (this number may change depending on whether we include large military bases, but the magnitude is approximately correct).
That leaves ~119 unknown or known-to-be-too-small prefixes. This is a reasonably sized list to check against.
If we convert the disallow-list to an `int16` array, then it's even smaller! Once this static list is sorted, checking a value becomes a basic O(log n) binary search where n ≈ 119. Ideally, this would be done in code to avoid the overhead of hitting a database.
Efficiently de-identifying the zip codes already in tables is something we still need to figure out, but a ~119-element int array is pretty reasonable.
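The sorted-array lookup above might look like this (a Python sketch; the deny-list values here are illustrative, not the real census-derived set, and in Kotlin the equivalent would be a sorted `ShortArray` with `binarySearch`):

```python
import bisect

# Illustrative deny-list of 3-digit prefixes stored as ints (fits in int16).
DENY_PREFIXES = sorted([8, 36, 59, 102, 203, 369, 556, 692, 821, 823, 878, 893])

def is_small_population_prefix(zip_code: str) -> bool:
    """O(log n) binary search over the sorted deny-list, n ~= 119."""
    prefix = int(zip_code[:3])
    i = bisect.bisect_left(DENY_PREFIXES, prefix)
    return i < len(DENY_PREFIXES) and DENY_PREFIXES[i] == prefix
```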
Update: Working on the integration test, paused at the moment due to resends.
This is extremely valuable work. Would it be possible to ask that the program that generated your list of zip codes be published so that we could reference it? Is the following listing your final determination as of 2022?
Note that `969` occurs twice in this list. Is this intentional?
Hello Clark,
The project no longer has the script you are referencing (it was never version controlled). The project will be working with the CDC soon to determine a solution for the zip code issue. We may just end up de-identifying all patient zip codes instead of creating a list of small-population zip codes and referencing that.
Describe the bug
HHS has a guideline (https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html#zip) for de-identifying zip codes for HIPAA purposes. ReportStream should follow this HHS guideline.
Impact We are sending data to HHS protect that is not properly de-identified.
To Reproduce Look at the feed to HHS protect and notice that there are 5-digit zip codes.
Expected behavior See `deidentify()` in `Report.kt` for the de-identification code; zip codes should be de-identified per the guideline (restricted prefixes replaced with zeros), including in the `covid metadata` table.