ddsjoberg / gtsummary

Presentation-Ready Data Summary and Analytic Result Tables
http://www.danieldsjoberg.com/gtsummary

Feature request: Automated suppression of small counts #1691

Closed barretmonchka closed 4 months ago

barretmonchka commented 5 months ago

Data privacy regulations in different jurisdictions require the suppression of small counts. For example, within the Canadian Province of Manitoba, we suppress all counts between 1 and 5 to avoid having any individuals identified in knowledge dissemination outputs (e.g., papers, presentations, conference posters). Not only must counts less than 6 be suppressed, but descriptive tables must also be constructed so that small counts cannot be calculated from column headers, row totals, table titles, or other information in the table. The suppression threshold (e.g., <6) varies by geographic region.

For example, the following table violates data privacy regulations in Manitoba by reporting counts less than six:

Summary of cohort (N=180)

| Characteristic | N (%) |
| --- | --- |
| Age | |
| <18 | 3 (1.7%) |
| 18+ | 177 (98.3%) |

The following table also violates data privacy regulations, even though the count is suppressed to "<6" and the percent to "<3.3%" (since 6/180 = 3.3%). The number of individuals less than 18 years of age can still be calculated from the table title and the number of individuals 18 or older: 180 - 177 = 3.

Summary of cohort (N=180)

| Characteristic | N (%) |
| --- | --- |
| Age | |
| <18 | <6 (< 3.3%) |
| 18+ | 177 (98.3%) |
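The masking step used in the table above can be sketched as follows. This is a minimal illustration in Python of the logic only, not gtsummary code; the function name `suppress_count` and its arguments are hypothetical, and the default threshold of 6 follows the Manitoba rule described earlier.

```python
def suppress_count(n: int, total: int, threshold: int = 6) -> str:
    """Format a count and percent, masking counts from 1 to threshold - 1."""
    if n == 0:
        return "0"  # zero counts may be reported as-is
    if n < threshold:
        # Report the threshold as an upper bound for both count and percent.
        return f"<{threshold} (<{100 * threshold / total:.1f}%)"
    return f"{n} ({100 * n / total:.1f}%)"


print(suppress_count(3, 180))    # masked cell
print(suppress_count(177, 180))  # reported as-is
```

On its own this step is not enough, because the masked count can still be recovered from the displayed total.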

To properly suppress the previous table, we would need to suppress both categories.

Summary of cohort (N=180)

| Characteristic | N (%) |
| --- | --- |
| Age | |
| <18 | <6 (< 3.3%) |
| 18+ | >=174 (>=96.7%) |

If we have more than two age categories, the two smallest categories could be suppressed to conform to data privacy requirements.

Summary of cohort (N=200) - doesn't adhere to data privacy regulations

| Characteristic | N (%) |
| --- | --- |
| Age | |
| <18 | <6 (< 3.0%) |
| 18-29 | 177 (88.5%) |
| 30+ | 20 (10%) |

Summary of cohort (N=200) - meets data privacy regulations

| Characteristic | N (%) |
| --- | --- |
| Age | |
| <18 | <6 (< 3.0%) |
| 18-29 | 177 (88.5%) |
| 30+ | <23 (<11.5%) |
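The complementary suppression shown above (masking a second cell so the first cannot be recovered by subtraction) could be sketched like this. It is a simple heuristic written in Python for illustration, not an existing gtsummary feature. The name `suppress_column` is hypothetical, and the sketch assumes the column total is displayed, that only counts (not percents) are formatted, and that the extra masked cell is labelled with the sum of the hidden counts as an upper bound, as in the N=200 example.

```python
def suppress_column(counts: list[int], threshold: int = 6) -> list[str]:
    """Mask small counts in one column whose total is assumed to be shown."""
    # Primary suppression: non-zero counts below the threshold.
    masked = {i for i, n in enumerate(counts) if 0 < n < threshold}
    if len(masked) == 1:
        # A single masked cell equals the total minus the visible cells,
        # so also mask the smallest remaining non-zero cell.
        rest = [i for i, n in enumerate(counts) if n > 0 and i not in masked]
        if rest:
            masked.add(min(rest, key=lambda i: counts[i]))
    hidden_sum = sum(counts[i] for i in masked)
    labels = []
    for i, n in enumerate(counts):
        if i not in masked:
            labels.append(str(n))
        elif n < threshold:
            labels.append(f"<{threshold}")
        else:
            # Complementary cell: bounded above by the sum of hidden counts.
            labels.append(f"<{hidden_sum}")
    return labels


print(suppress_column([3, 177, 20]))
```

With only two categories this convention gives an uninformative upper bound equal to the total, which is why the two-category example above reports a lower bound instead. Cases such as several very small cells whose hidden sum is itself revealing are not handled here.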

We also need to consider whether small counts could be calculated using the column totals:

Summary of cohort (N=200), stratified by treatment status

| Characteristic | Treated (N=98, 49%) | Not treated (N=102, 51%) |
| --- | --- | --- |
| Age | | |
| <18 | 8 (8.2%) | 6 (5.9%) |
| 18-29 | 80 (81.6%) | 94 (92.2%) |
| 30+ | 12 (12.2%) | 2 (2.0%) |

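Because the column total is displayed, suppressing only the small count (2) in the "Not treated" column is not enough: it could be recovered by subtracting the other counts from the total. The two smallest values in that column could be suppressed to adhere to data privacy legislation: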

Summary of cohort (N=200), stratified by treatment status

| Characteristic | Treated (N=98, 49%) | Not treated (N=102, 51%) |
| --- | --- | --- |
| Age | | |
| <18 | 8 (8.2%) | <8 (<7.8%) |
| 18-29 | 80 (81.6%) | 94 (92.2%) |
| 30+ | 12 (12.2%) | <6 (<5.9%) |

However, if row totals were reported, then additional logic would be needed.
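One way the additional logic might look, assuming both row and column totals are displayed, is to repeat the "exactly one masked cell" check over rows and columns until nothing changes. The Python sketch below is a common heuristic, not a guarantee against every way of recovering a count, and it is not part of gtsummary; the name `mask_two_way` is hypothetical.

```python
def mask_two_way(table: list[list[int]], threshold: int = 6) -> list[list[bool]]:
    """Return a True/False mask of cells to suppress in a two-way count table."""
    n_rows, n_cols = len(table), len(table[0])
    # Primary suppression: non-zero counts below the threshold.
    mask = [[0 < n < threshold for n in row] for row in table]

    def fix_line(cells: list[tuple[int, int]]) -> bool:
        """If exactly one cell in a row or column is masked, mask one more."""
        hidden = [(r, c) for r, c in cells if mask[r][c]]
        if len(hidden) != 1:
            return False
        visible = [(r, c) for r, c in cells
                   if not mask[r][c] and table[r][c] > 0]
        if not visible:
            return False  # nothing left to mask; needs manual review
        r, c = min(visible, key=lambda rc: table[rc[0]][rc[1]])
        mask[r][c] = True
        return True

    changed = True
    while changed:  # each change masks a new cell, so this terminates
        changed = False
        for r in range(n_rows):
            changed |= fix_line([(r, c) for c in range(n_cols)])
        for c in range(n_cols):
            changed |= fix_line([(r, c) for r in range(n_rows)])
    return mask


# Rows are the age groups; columns are Treated and Not treated.
print(mask_two_way([[8, 6], [80, 94], [12, 2]]))
```

For the stratified table above, this masks both cells in the <18 and 30+ rows, which is more than the column-only rule requires.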

Note that it's acceptable to report zero counts:

Summary of cohort (N=180)

| Characteristic | N (%) |
| --- | --- |
| Age | |
| <18 | 0 |
| 18+ | 180 (100.0%) |

Automated suppression logic needs to consider whether the following information is being reported: 1) row and/or column percents, 2) row and/or column totals, 3) table titles, 4) number of categories.

This is a feature we require of the software libraries we use to generate descriptive tables. An acceptable solution would be replacing small counts with the threshold value, for example, replacing all counts that are less than 6 with "6".

All edge cases will need to be documented prior to implementation, but I wanted to first check if implementing automated suppression of small counts was of interest to the developers and the user community, and whether any work on this has already begun.

Also, see these additional resources:
https://www.irb-cisr.gc.ca/en/statistics/Pages/small-value-suppression.aspx
https://www.cdc.gov/cancer/uscs/technical_notes/stat_methods/suppression.htm
https://www.health.nsw.gov.au/hsnsw/Publications/privacy-small-numbers.pdf

ddsjoberg commented 5 months ago

Dear @barretmonchka, thanks for the post. 🍁

I think these two Stack Overflow posts will get you started:
https://stackoverflow.com/questions/76258280
https://stackoverflow.com/questions/71954590

To be honest, I think those solutions get you most of the way there, but it would be difficult to cover every case you mentioned.

BUT, the updates that are coming in the next release of gtsummary could make your request somewhat trivial. I am hoping to make a soft release in mid-June to request feedback from the community about the updates before submitting the updated version to CRAN.

In the meantime, you could check out the card_summary() function in the v2.0 branch here on GitHub. Essentially, what you'll do is create an Analysis Results Dataset (ARD), then update the formatting functions to convert the statistics into strings that meet your standards. You'd then pass that ARD to card_summary() to create the table. This work is in beta, so please refrain from making feature requests or bug reports on that work until the soft release is made (when I merge v2.0 branch into main).

ddsjoberg commented 4 months ago

Hi @barretmonchka,

FYI we're about to make our soft release. The card_summary() function has been renamed to tbl_ard_summary(). The general approach mentioned above still holds: you calculate your summary stats and can change the formatting functions to mask small counts when present. Happy coding!
https://www.danieldsjoberg.com/gtsummary/dev/reference/tbl_ard_summary.html