Improve Zip Code de-identification

RickHawesUSDS commented 2 years ago

Describe the bug

HHS has a guideline (https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html#zip) for de-identifying zip codes for HIPAA purposes. ReportStream should follow this HHS guideline.

Impact: We are sending data to HHS protect that is not properly de-identified.

To Reproduce: Look at the feed to HHS protect and notice that there are 5-digit zip codes.

Expected behavior: See deidentify() in Report.kt for the de-identification code.

Read and understand the HHS rule and its implication for ReportStream.
Confirm that this rule only applies to a patient's zipcode.
Research the proper way to change a five-digit zipcode to 3 digits. (I think it is just zeroing out the last 2 digits, but it needs research.)
Ensure that restricted 3-digit zipcodes are changed to 000
Ensure that 3-digit zipcodes are passed to HHS protect.
Ensure that 3-digit zipcodes are stored in the covid metadata table.
Unit tests are updated to properly handle zipcodes.

Adrian-Brewster commented 2 years ago

@brick-green-agile6 @MauriceReeves-usds this is the ticket I flagged in Slack last week. This is high priority and Rick asked that we complete it this sprint.

MauriceReeves-usds commented 2 years ago

Acceptance criteria for this ticket:

The deidentify function in Report updates the patient_zip according to the rules in the HHS guidance referenced above (https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html#zip):
a. The last two digits of ALL US zip codes must be replaced with 00. For example, zip code 22191 is converted to 22000.
b. If the geographic region designated by those first three digits contains 20,000 or fewer residents, then the entire US zip code MUST be replaced with all zeros. For example, 03603 is within a geographic unit with 20,000 or fewer residents, so it must be changed to 00000.
Only patient_zip is updated with this functionality, and only within the deidentify method
Update the method to save results into the covid_result_metadata table to apply this same logic to new records we're storing there
A flyway script is created that applies this same logic to all of the existing records in covid_result_metadata.
The zip codes that this applies to are not hardcoded into the logic, but instead come from a table loaded in the DB using our new table logic like LIVD data, etc
Lots of tests written for this
Documentation for how to update the table to add and remove restricted zip codes

NOTE: The 17 restricted zip codes in the HHS documentation are from the 2000 census data and are almost certainly not correct. Therefore, before committing this code, please find and use the most up-to-date census data for geographic units (three-digit zip code prefixes) that have 20,000 or fewer residents.

sarahnw commented 2 years ago

@RickHawesUSDS curious if you know more - do we need to change the zip code per the Safe Harbor rules or is it okay to always blank it out - for example just turn everything to 00000 always? Do they use the two digit code for the non-restricted zip code for anything?

MauriceReeves-usds commented 2 years ago

@RickHawesUSDS @sarahnw another open question on this is if we also need to do this for non-US addresses. We actually are receiving quite a bit of data for non-US patients now, which means we need to be deliberate about how we treat those as well.

TomNUSDS commented 2 years ago

SimpleReport supports ZIP+4 formats, e.g. 12345-6789, so we need to strip out the -#### suffix. It also supports intl zip codes. Canadian zip codes look like STHUBT 1M1S
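
One possible way to handle the ZIP+4 suffix before de-identification is sketched below in Python. The function name normalize_us_zip is hypothetical, and returning None for anything that is not a US zip code is an assumption: how non-US postal codes should be treated is still an open question in this thread, so the sketch only flags them and leaves the decision to the caller.

```python
import re

# Five digits, optionally followed by a 4-digit extension with or without a hyphen.
_US_ZIP = re.compile(r"([0-9]{5})(?:-?[0-9]{4})?")


def normalize_us_zip(raw):
    """Return the 5-digit part of a US ZIP or ZIP+4, or None if it is neither."""
    match = _US_ZIP.fullmatch(raw.strip())
    return match.group(1) if match else None


print(normalize_us_zip("12345-6789"))  # 12345
print(normalize_us_zip("K1A 0B1"))     # None (not a US zip code)
```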

TomNUSDS commented 2 years ago

Getting the population of zip codes using Census data is a known hard problem (see "Trouble with Zip Codes"). Basically, the boundaries of zip code regions don't always match the boundaries of census tracts.

If we want to take it on, then these are useful: https://www.census.gov/geographies/reference-files/2020/geo/relationship-files.html#zctacomp and https://udsmapper.org/zip-code-to-zcta-crosswalk/

TomNUSDS commented 2 years ago

I've been trying to parse the guidance. It's not entirely clear if the resulting de-identification results in prefixed or suffixed zeros (e.g. ##000 or 000##).

This translation of the rule helps:

All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census: (1) The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and (2) The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000

source

Q: why does it say 000 and not 00000? This is just confusing.

You'd also think that since this has been required for the past 22 years, it would be easy to find lots of example code and/or rules based off of more recent census data... but it isn't.

The set of ###** prefixes with population < 20k is a pretty short list. The outdated list of these prefixes from 2000 is 17 items. We could generate an updated prefix list, but the issue is that any NEW zip codes that are created between 2020-2030 might be too small but would NOT be included in the "population-too-small" set.

So, really, it should be a set of zip-code-prefixes known to be big enough. There are 43k zip codes, but the prefix reduces that down to a maximum of 999 items... still a pretty large set, but not unreasonable for a hard-coded btree lookup. Bloom filters might also be a good match, but are less standard.

TomNUSDS commented 2 years ago

I used zipcode-to-population data from https://simplemaps.com/data/us-zips (Creative Commons Attributed).

I created a Jupyter Notebook to process the data and here are the results up to population sums of 30k.

sum_for_prefix	zip_prefix	state_id
0	969	MP
0	753	TX
0	205	DC
0	204	DC
0	969	GU
0	202	DC
0	772	TX
0	008	VI
1088	203	DC
1183	821	WY
3559	059	VT
8531	692	NE
11221	893	NV
13374	036	NH
15413	823	WY
15554	102	NY
15781	556	MN
15964	879	NM
16235	884	NM
16269	878	NM
17340	369	AL
21643	576	SD
22098	999	AK
22110	830	WY
22680	994	WA
23019	516	IA
23249	831	WY
23591	669	KS
24031	414	KY
24229	822	WY
24755	690	NE
28004	739	TX
28004	739	OK
28282	418	KY
28606	051	VT
29428	679	KS
29917	266	WV
30300	677	KS

See gist for notebook and full data
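
For readers without access to the notebook, the aggregation can be approximated with the standard library alone. This is a sketch, not the notebook's code: the column names zip, state_id, and population are assumptions about the CSV layout, and the sample rows are made-up placeholders, not real population figures. Unlike the table above, it keys on the prefix alone, so a prefix that spans two states produces a single entry.

```python
import csv
import io
from collections import defaultdict


def population_by_prefix(rows):
    """Sum population per 3-digit zip prefix."""
    totals = defaultdict(int)
    for row in rows:
        prefix = row["zip"].zfill(5)[:3]  # restore leading zeros if they were lost
        totals[prefix] += int(row["population"] or 0)
    return dict(totals)


def restricted_prefixes(totals, threshold=20_000):
    """Prefixes whose combined population is at or below the threshold."""
    return sorted(p for p, pop in totals.items() if pop <= threshold)


# Placeholder rows for illustration only (not real zip codes or populations).
SAMPLE_CSV = """zip,state_id,population
00001,AA,12000
00002,AA,5000
00101,BB,15000
00102,BB,9000
"""

totals = population_by_prefix(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(totals)                       # {'000': 17000, '001': 24000}
print(restricted_prefixes(totals))  # ['000']
```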

oslynn commented 2 years ago

Created PR for all base development. @jorg3lopez will add unit test cases and @greene858 will add a flyway script to de-identify all existing patient_zip in the database.

TomNUSDS commented 2 years ago

Another interesting thought.

https://en.wikipedia.org/wiki/List_of_ZIP_Code_prefixes goes through ALL 3 digit zip code prefixes.

There are 18 military bases that have their own zipcodes. The populations of these zip codes are not published. We should probably set them all to 00000

090**-099**, 340**, 962**-966**
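
If these ranges were added to a restricted list, the shorthand above could be expanded into individual 3-digit prefixes roughly like this. The helper name expand_prefix_ranges is hypothetical, and whether these prefixes should be zeroed at all is still undecided here.

```python
def expand_prefix_ranges(spec):
    """Expand a spec like '090**-099**, 340**' into a set of 3-digit prefixes."""
    prefixes = set()
    for part in spec.replace("*", "").split(","):
        part = part.strip()
        if not part:
            continue
        low, _, high = part.partition("-")
        high = high or low  # a single prefix is a range of one
        for number in range(int(low), int(high) + 1):
            prefixes.add(f"{number:03d}")
    return prefixes


MILITARY_PREFIXES = expand_prefix_ranges("090**-099**, 340**, 962**-966**")
print(sorted(MILITARY_PREFIXES))
```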

TomNUSDS commented 2 years ago

Again, we could flip the table: A) only-allow-large-enough-zip-codes versus the simpler approach of B) remove-zip-codes-that-are-too-small.

A) is 880x 3-digit items.
B) is 22x 3-digit items.

While A) is larger, it default-rejects and would catch zip codes that B) misses.

greene858 commented 2 years ago

That's a really interesting point. I'd really like at least one SME to guide us on this; it seems likely that some of those codes have populations > 20k: for example 962 and 963 are bases in Korea and Japan respectively. Does that meet safe harbor criteria? Does it matter to the end user for this de-identified data which country a result came from?

TomNUSDS commented 2 years ago

@greene858:

Agreed.

Thinking more on the algorithm aspect once we figure out the list:

There's a maximum of 999x 3-digit-prefixes. Let's assume that 880x are known to be large-enough zip codes (this number may change based on if we include large military bases, but the magnitude is approx correct).

That leaves ~119 unknown or known-to-be-too-small prefixes. This is a reasonably sized list to check against.

If we convert the disallow-list to an [int16], then it's even smaller! Once this static list is sorted, then checking a value becomes an O(log n) basic Binary Search operation where n=~119. Ideally, this would be done in code to avoid the overhead of hitting a database.

De-identifying existing zip codes already in tables efficiently is something we still need to figure out. But ~119 int array is pretty reasonable.
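
A rough Python sketch of the sorted-array lookup described above, using the standard bisect module. The function names are hypothetical, and the three sample prefixes are taken from the under-20k rows of the table earlier in this thread, not from a finalized list.

```python
from array import array
from bisect import bisect_left


def build_disallow_list(prefixes):
    """Store the prefixes as a sorted array of 16-bit signed integers."""
    return array("h", sorted({int(p) for p in prefixes}))


def is_disallowed(zip5, disallow):
    """Binary search (O(log n)) for the zip code's 3-digit prefix."""
    value = int(zip5[:3])
    index = bisect_left(disallow, value)
    return index < len(disallow) and disallow[index] == value


DISALLOW = build_disallow_list(["893", "036", "059"])
print(is_disallowed("03603", DISALLOW))  # True
print(is_disallowed("22191", DISALLOW))  # False
```

Converting the prefix to an integer drops the leading zero ("036" becomes 36), which is harmless for membership checks but means the array cannot be used to print prefixes back out without re-padding.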

jorg3lopez commented 2 years ago

Update: Working on the integration test, paused at the moment due to resends.

clarkevans commented 7 months ago

This is extremely valuable work. Would it be possible to ask that the program that generated your list of zip codes be published so that we could reference it? Is the following listing your final determination as of 2022?

https://github.com/CDCgov/prime-reportstream/blob/master/prime-router/metadata/tables/local/restricted_zip_code.csv

Note that 969 occurs twice in this list. Is this intentional?

arnejduranovic commented 7 months ago

Hello Clark,

The project does not have the script you are referencing anymore (it was never version controlled). The project will be working with the CDC soon to determine a solution for the zip code issue. We may just end up de-identifying all patient zip codes instead of creating a list of small-population zip codes and referencing that.

CDCgov / prime-reportstream

Improve Zip Code de-identification #5232