CDCgov / prime-reportstream

ReportStream is a public intermediary tool for delivery of data between different parts of the healthcare ecosystem.
https://reportstream.cdc.gov
Creative Commons Zero v1.0 Universal

Improve Zip Code de-identification #5232

Closed RickHawesUSDS closed 2 years ago

RickHawesUSDS commented 2 years ago

Describe the bug

HHS has a guideline (https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html#zip) for de-identifying zip codes for HIPAA purposes. ReportStream should follow this HHS guideline.

Impact We are sending data to HHS Protect that is not properly de-identified.

To Reproduce Look at the feed to HHS Protect and notice that there are 5-digit zip codes.

Expected behavior See deidentify() in Report.kt for the de-identification code.

Adrian-Brewster commented 2 years ago

@brick-green-agile6 @MauriceReeves-usds this is the ticket I flagged in Slack last week. This is high priority and Rick asked that we complete it this sprint.

MauriceReeves-usds commented 2 years ago

Acceptance criteria for this ticket:

  1. The deidentify function in Report updates the patient_zip according to the rules in the referenced doc above (https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html#zip):
     a. The last two digits of ALL US zip codes must be replaced with 00. For example, zip code 22191 is converted to 22000.
     b. If the geographic region designated by those first three digits contains 20,000 or fewer residents, then the entire US zip code MUST be replaced with all zeros. For example, 03603 is within a geographic unit with fewer than 20,000 residents; therefore it must be changed to 00000.
  2. Only patient_zip is updated with this functionality, and only within the deidentify method
  3. Update the method to save results into the covid_result_metadata table to apply this same logic to new records we're storing there
  4. A flyway script is created that will update all of the existing records in covid_result_metadata that will apply this same logic to the data there.
  5. The zip codes that this applies to are not hardcoded into the logic, but instead come from a table loaded in the DB using our new table logic like LIVD data, etc
  6. Lots of tests written for this
  7. Documentation for how to update the table to add and remove restricted zip codes

NOTE: The 17 restricted zip codes in the HHS documentation are from the 2000 census data and are almost certainly not correct. Therefore, before committing this code, please find and use the most up-to-date census data for geographic units that have 20,000 or fewer residents for a three-digit zip code prefix.

sarahnw commented 2 years ago

@RickHawesUSDS curious if you know more - do we need to change the zip code per the Safe Harbor rules or is it okay to always blank it out - for example just turn everything to 00000 always? Do they use the two digit code for the non-restricted zip code for anything?

MauriceReeves-usds commented 2 years ago

@RickHawesUSDS @sarahnw another open question on this is if we also need to do this for non-US addresses. We actually are receiving quite a bit of data for non-US patients now, which means we need to be deliberate about how we treat those as well.

TomNUSDS commented 2 years ago

SimpleReport supports ZIP+4 formats, e.g. 12345-6789, so we need to strip out the -####. It also supports intl zip codes. Canadian zip codes look like STHUBT 1M1S
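
A minimal sketch of the ZIP+4 stripping described above, in Python for illustration. The name `normalize_us_zip` is hypothetical, and returning `None` for anything that does not look like a US zip code is an assumption; how non-US postal codes should be treated is still an open question in this thread.

```python
import re
from typing import Optional

# Accepts 12345 or 12345-6789 (ZIP+4). Anything else is treated as
# "not a US zip code" in this sketch.
US_ZIP_PATTERN = re.compile(r"(\d{5})(?:-\d{4})?")


def normalize_us_zip(raw: str) -> Optional[str]:
    """Return the 5-digit zip code, or None if the value is not a US zip."""
    match = US_ZIP_PATTERN.fullmatch(raw.strip())
    return match.group(1) if match else None


print(normalize_us_zip("12345-6789"))
print(normalize_us_zip("K1A 0B1"))
```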

TomNUSDS commented 2 years ago

Getting the population of zip codes using Census data is a known hard problem (see "Trouble with Zip Codes"). Basically, the boundaries of zip code regions don't always match the boundaries of census tracts.

If we want to take it on, then these are useful: https://www.census.gov/geographies/reference-files/2020/geo/relationship-files.html#zctacomp and https://udsmapper.org/zip-code-to-zcta-crosswalk/

TomNUSDS commented 2 years ago

I've been trying to parse the guidance. It's not entirely clear if the resulting de-identification produces prefixed or suffixed zeros (e.g. ###00 or 00###).

This translation of the rule helps:

All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census: (1) The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and (2) The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000

source

Q: why does it say 000 and not 00000? This is just confusing.

You'd also think that since this has been required for the past 22 years, it would be easy to find lots of example code and/or rules based off of more recent census data... but it isn't.

The set of ###** prefixes with population < 20k is a pretty short list. The outdated list of these prefixes from 2000 has 17 items. We could generate an updated prefix list, but the issue is that any NEW zip codes that are created between 2020-2030 might be too small but would NOT be included in the "population-too-small" set.

So, really, it should be a set of zip-code-prefixes known to be big enough. There are 43k zip codes, but the prefix reduces that down to a maximum of 999 items... still a pretty large set, but not unreasonable for a hard-coded btree lookup. Bloom filters might also be a good match, but are less standard.

TomNUSDS commented 2 years ago

I used zipcode-to-population data from https://simplemaps.com/data/us-zips (Creative Commons Attributed).

I created a Jupyter Notebook to process the data and here are the results up to population sums of 30k.

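| sum_for_prefix | zip_prefix | state_id |
| --- | --- | --- |
| 0 | 969 | MP |
| 0 | 753 | TX |
| 0 | 205 | DC |
| 0 | 204 | DC |
| 0 | 969 | GU |
| 0 | 202 | DC |
| 0 | 772 | TX |
| 0 | 008 | VI |
| 1088 | 203 | DC |
| 1183 | 821 | WY |
| 3559 | 059 | VT |
| 8531 | 692 | NE |
| 11221 | 893 | NV |
| 13374 | 036 | NH |
| 15413 | 823 | WY |
| 15554 | 102 | NY |
| 15781 | 556 | MN |
| 15964 | 879 | NM |
| 16235 | 884 | NM |
| 16269 | 878 | NM |
| 17340 | 369 | AL |
| 21643 | 576 | SD |
| 22098 | 999 | AK |
| 22110 | 830 | WY |
| 22680 | 994 | WA |
| 23019 | 516 | IA |
| 23249 | 831 | WY |
| 23591 | 669 | KS |
| 24031 | 414 | KY |
| 24229 | 822 | WY |
| 24755 | 690 | NE |
| 28004 | 739 | TX |
| 28004 | 739 | OK |
| 28282 | 418 | KY |
| 28606 | 051 | VT |
| 29428 | 679 | KS |
| 29917 | 266 | WV |
| 30300 | 677 | KS |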

See gist for notebook and full data
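
The notebook itself is not reproduced here, but the kind of aggregation it describes could look roughly like the following. This is a sketch, not the actual notebook: the column names `zip` and `population` and the sample rows are made up for illustration, and the 20,000 threshold follows the "20,000 or fewer" wording of the rule.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows, NOT real census or simplemaps data.
SAMPLE_CSV = """zip,population
03601,5000
03603,3000
22191,60000
22192,55000
"""


def population_by_prefix(csv_text):
    """Sum population over all zip codes sharing the same first 3 digits."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["zip"][:3]] += int(row["population"])
    return dict(totals)


def small_prefixes(totals, threshold=20_000):
    """Prefixes whose combined population is at or below the threshold."""
    return sorted(p for p, total in totals.items() if total <= threshold)


totals = population_by_prefix(SAMPLE_CSV)
print(totals)
print(small_prefixes(totals))
```

Zip codes are kept as strings throughout so that leading zeros (e.g. 036) survive the grouping.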

oslynn commented 2 years ago

Created PR for all base development. @jorg3lopez will add unit test cases and @greene858 will add a flyway script to de-identify all existing patient_zip in the database.

TomNUSDS commented 2 years ago

Another interesting thought.

https://en.wikipedia.org/wiki/List_of_ZIP_Code_prefixes goes through ALL 3 digit zip code prefixes.

There are 18 military bases that have their own zip codes. The populations of these zip codes are not published. We should probably set them all to 00000:

090**-099**, 340**, 962**-966**
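
If these ranges were added to a restricted list, they could be expanded into individual three-digit prefixes with something like the sketch below. The helper name `expand_prefix_ranges` is hypothetical, and whether these prefixes should be restricted at all is still undecided.

```python
def expand_prefix_ranges(ranges):
    """Expand inclusive (low, high) integer ranges into 3-digit prefix strings."""
    prefixes = []
    for low, high in ranges:
        # Zero-pad so that 90 becomes "090".
        prefixes.extend(f"{n:03d}" for n in range(low, high + 1))
    return prefixes


# The ranges listed above: 090-099, 340, 962-966.
MILITARY_RANGES = [(90, 99), (340, 340), (962, 966)]
MILITARY_PREFIXES = expand_prefix_ranges(MILITARY_RANGES)

print(MILITARY_PREFIXES)
```

The listed ranges expand to 16 three-digit prefixes.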

TomNUSDS commented 2 years ago

Again, consider flipping the table: A) only-allow-large-enough-zip-codes versus the simpler approach of B) remove-zip-codes-that-are-too-small.

A) is 880x 3-digit items; B) is 22x 3-digit items.

While A) is larger, it default-rejects and would catch zip codes that B) misses.

greene858 commented 2 years ago

That's a really interesting point. I'd really like at least one SME to guide us on this; it seems likely that some of those codes have populations > 20k: for example 962 and 963 are bases in Korea and Japan respectively. Does that meet safe harbor criteria? Does it matter to the end user for this de-identified data which country a result came from?

TomNUSDS commented 2 years ago

@greene858:

> That's a really interesting point. I'd really like at least one SME to guide us on this; it seems likely that some of those codes have populations > 20k: for example 962 and 963 are bases in Korea and Japan respectively. Does that meet safe harbor criteria? Does it matter to the end user for this de-identified data which country a result came from?

Agreed.

Thinking more on the algorithm aspect once we figure out the list:

There's a maximum of 999x 3-digit-prefixes. Let's assume that 880x are known to be large-enough zip codes (this number may change based on if we include large military bases, but the magnitude is approx correct).

That leaves ~119 unknown or known-to-be-too-small prefixes. This is a reasonably sized list to check against.

If we convert the disallow-list to an [int16], then it's even smaller! Once this static list is sorted, then checking a value becomes an O(log n) basic Binary Search operation where n=~119. Ideally, this would be done in code to avoid the overhead of hitting a database.

De-identifying existing zip codes already in tables efficiently is something we still need to figure out. But ~119 int array is pretty reasonable.
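
A rough sketch of the sorted-integer-list lookup described above, using Python's standard `bisect` module. The `DISALLOWED` values are only a small illustrative subset taken from the population table earlier in this thread, not a final list, and `is_disallowed` is a hypothetical name.

```python
from bisect import bisect_left

# Illustrative subset of prefixes with summed population under 20k,
# stored as sorted integers (036 -> 36, 059 -> 59, ...).
DISALLOWED = sorted([36, 59, 102, 203, 821])


def is_disallowed(prefix: int, table=DISALLOWED) -> bool:
    """O(log n) membership test on a sorted list via binary search."""
    i = bisect_left(table, prefix)
    return i < len(table) and table[i] == prefix


# Converting the string prefix to int drops the leading zero: "036" -> 36.
print(is_disallowed(int("03603"[:3])))
print(is_disallowed(int("22191"[:3])))
```

At n of roughly 119, a hash set would perform about the same in practice; the sorted array mainly keeps the data compact and easy to hardcode.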

jorg3lopez commented 2 years ago

Update: Working on the integration test, paused at the moment due to resends.

clarkevans commented 7 months ago

This is extremely valuable work. Would it be possible to ask that the program that generated your list of zip codes be published so that we could reference it? Is the following listing your final determination as of 2022?

https://github.com/CDCgov/prime-reportstream/blob/master/prime-router/metadata/tables/local/restricted_zip_code.csv

Note that 969 occurs twice in this list. Is this intentional?

arnejduranovic commented 7 months ago

> This is extremely valuable work. Would it be possible to ask that the program that generated your list of zip codes be published so that we could reference it? Is the following listing your final determination as of 2022?
>
> https://github.com/CDCgov/prime-reportstream/blob/master/prime-router/metadata/tables/local/restricted_zip_code.csv
>
> Note that 969 occurs twice in this list. Is this intentional?

Hello Clark,

The project no longer has the script you are referring to (it was never version controlled). The project will be working with the CDC soon to determine a solution for the zip code issue. We may just end up de-identifying all patient zip codes instead of creating a list of small-population zip codes and referencing that.