LTHTR-DST / hdruk_avoidable_admissions

HDRUK Data Science Collaboration on Avoidable Admissions in the NHS.
https://lthtr-dst.github.io/hdruk_avoidable_admissions/
MIT License

Low number suppression added #38

Closed: MattStammers closed this 1 year ago

MattStammers commented 1 year ago

Added low-number suppression in demo pipeline

vvcb commented 1 year ago

Thanks for this, @MattStammers. This applies low-number suppression to the entire data frame rather than to the output tables. The latter is tricky to achieve.

For instance, Penal is one of the categories in Discharge Destination. Let's say there are 6 patients discharged to this destination. This would not get suppressed with the current method using a threshold of 5.

However, when cross-tabulating Discharge Destination against ACSC vs non-ACSC, we may find that of these 6 patients, 1 patient had ACSC while 5 did not. This will still be visible in the output. If we are applying statistical disclosure control (SDC), it should be at this stage.
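The Penal example above can be sketched with pandas. This is a minimal illustration on made-up data, not the pipeline's code: the column names `discharge_destination` and `acsc_status`, the "Home" category, and the rule "a non-zero count below the threshold is low" are all assumptions.

```python
import pandas as pd

THRESHOLD = 5  # assumed rule: non-zero counts below this are "low"

# Hypothetical toy data: 6 Penal discharges (1 ACSC, 5 non-ACSC)
# plus 40 discharges to a made-up "Home" category.
df = pd.DataFrame({
    "discharge_destination": ["Penal"] * 6 + ["Home"] * 40,
    "acsc_status": ["ACSC"] + ["non-ACSC"] * 5
                   + ["ACSC"] * 15 + ["non-ACSC"] * 25,
})

# Check on the feature alone: Penal has 6 rows, so nothing is flagged.
marginal = df["discharge_destination"].value_counts()
flagged_on_feature = marginal[marginal < THRESHOLD]

# Check on the output table: the Penal/ACSC cell holds a single patient.
table = pd.crosstab(df["discharge_destination"], df["acsc_status"])
low_cells = (table > 0) & (table < THRESHOLD)

print(marginal)
print(table)
print(low_cells)
```

The feature-level check passes, yet the cross-tabulated cell is still disclosive, which is why the check belongs at the table stage.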

A simplistic solution would be to just mask the cells with the low numbers. But as @quindavies pointed out, it is trivial to recalculate cell counts from the row totals. An alternative would be to mask the entire row for the Penal category - but it will still be possible to recalculate this from the column totals!
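Both failure modes can be shown on the same made-up table. This is a sketch under assumptions: the "Home" row and all counts are hypothetical, and it assumes the row and column totals are published alongside the masked table.

```python
import numpy as np
import pandas as pd

# Hypothetical counts; only the Penal row mirrors the example above.
counts = pd.DataFrame(
    {"ACSC": [1, 15], "non-ACSC": [5, 25]},
    index=["Penal", "Home"],
)
row_totals = counts.sum(axis=1)  # assumed to be published
col_totals = counts.sum(axis=0)  # assumed to be published

# Attempt 1: mask only the low cell.
cell_masked = counts.astype(float)
cell_masked.loc["Penal", "ACSC"] = np.nan
# Row total minus the visible cells in that row (NaN is skipped by sum).
recovered_cell = row_totals["Penal"] - cell_masked.loc["Penal"].sum()

# Attempt 2: mask the whole Penal row.
row_masked = counts.astype(float)
row_masked.loc["Penal", :] = np.nan
# Column totals minus the visible rows give the masked row back.
recovered_row = col_totals - row_masked.sum(axis=0)

print(recovered_cell)  # the masked Penal/ACSC count
print(recovered_row)   # the masked Penal row
```

Preventing this kind of differencing generally means suppressing further cells or totals as well, which is part of why it is hard to automate.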

For certain ordered features such as age bands, it may be possible to merge low-count bins into neighbouring bins. Others will need to be dealt with on a feature-by-feature basis.
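One possible way to merge bins for an ordered feature is a greedy pass that keeps adding neighbouring bins until the running count reaches the threshold. The function `merge_small_bins`, the age-band labels, and the counts below are hypothetical and only illustrate the idea. The merged bands would still have to be applied before cross-tabulating, and the resulting cells checked again.

```python
import pandas as pd


def merge_small_bins(counts: pd.Series, threshold: int) -> pd.Series:
    """Greedily merge adjacent ordered bins until each group reaches threshold."""
    groups = []            # list of (labels, total) for finished groups
    labels, total = [], 0  # group currently being built
    for label, n in counts.items():
        labels.append(label)
        total += int(n)
        if total >= threshold:
            groups.append((labels, total))
            labels, total = [], 0
    if labels:
        # Leftover bins at the end fold into the previous group, if any.
        if groups:
            prev_labels, prev_total = groups.pop()
            groups.append((prev_labels + labels, prev_total + total))
        else:
            groups.append((labels, total))
    return pd.Series({" + ".join(g): t for g, t in groups})


# Hypothetical age-band counts, in age order.
age_counts = pd.Series({"0-17": 2, "18-39": 30, "40-64": 45, "65-79": 50, "80+": 3})
merged = merge_small_bins(age_counts, threshold=5)
print(merged)
```

Here the two small end bands are absorbed by their neighbours while the overall total is unchanged. Unordered categories such as Discharge Destination have no natural neighbour, so this approach does not carry over to them.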

I am not confident we will be able to achieve low-number suppression the way it probably should be done, and alternative strategies may end up being tokenistic. My preference would be to send the tables as is to Sheffield, and for the lead team to combine the tables and apply SDC to the final outputs before publishing, but we can discuss on Thursday.

MattStammers commented 1 year ago

You are right, @vvcb. In the end we have stuck with the above locally and raised the suppression threshold (to 20 in our case) to guarantee that no column has n<5. However, as you say, we will see what happens with the national decision-making. If no suppression is to be applied, we can easily regenerate the dataset.