LTHTR-DST / hdruk_avoidable_admissions

HDRUK Data Science Collaboration on Avoidable Admissions in the NHS.
https://lthtr-dst.github.io/hdruk_avoidable_admissions/
MIT License

Low Number Suppression #34

Closed MattStammers closed 1 year ago

MattStammers commented 1 year ago

I have managed to make the validator work with the ACRO package to suppress all counts where n<=10. We may want to add this as a feature, but it needs a decision by the central teams before it can go ahead.

vvcb commented 1 year ago

Awesome. It looks like there are quite a few options for SDC (statistical disclosure control) described here: https://github.com/AI-SDC/ACRO/blob/main/notebooks/test.ipynb. @quindavies has written some code that does cell suppression at the end of our pipeline, but it will be good to see both approaches.

As always PRs welcome :-).

MattStammers commented 1 year ago

This is how we have implemented this locally:

import numpy as np

# Columns to exclude from low-number suppression, if needed
# (df is the existing pandas DataFrame to be suppressed)
keep_cols = ['patient_id']
relevant_cols = list(
    set(df.columns.to_list()) - set(keep_cols)
)

# Replace values that occur <= threshold times in a column with NaN
def supress_lownum(col, threshold, relevant_cols):
    # Columns outside relevant_cols are returned unchanged
    if col.name in relevant_cols:
        counts = col.value_counts()
        to_supress = counts[counts <= threshold].index
        return col.replace(to_supress, np.nan)

    return col

# Set the threshold and apply the function column by column
THRESHOLD = 5
df_supressed = df.apply(supress_lownum, relevant_cols=relevant_cols, threshold=THRESHOLD)
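
To illustrate what this column-wise approach does, here is a minimal self-contained sketch on toy data. The DataFrame contents, the `site` column, and the function name `suppress_rare_values` are hypothetical and only for illustration; the masking is written with `isin`/`mask` rather than `replace`, which should be equivalent for this purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "patient_id": range(8),
    "site": ["A"] * 6 + ["B"] * 2,
})

def suppress_rare_values(col, threshold, relevant_cols):
    """Replace values occurring <= threshold times in a column with NaN."""
    if col.name not in relevant_cols:
        return col  # excluded columns pass through unchanged
    counts = col.value_counts()
    rare = counts[counts <= threshold].index
    return col.mask(col.isin(rare), np.nan)

# Suppress every column except the identifier.
relevant_cols = [c for c in df.columns if c != "patient_id"]
toy_suppressed = df.apply(
    suppress_rare_values, threshold=5, relevant_cols=relevant_cols
)
print(toy_suppressed)
```

In this toy case the value "B" occurs only twice in `site`, so both of its rows become NaN, while "A" (six occurrences) and the `patient_id` column are left untouched.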

MattStammers commented 1 year ago

I have opened a pull request: #38

MattStammers commented 1 year ago

Following further discussion, an alternative option would be to suppress only the final aggregated dataset at the n<5 or n<10 level, but this comes with its own issues. We have done this for the chest pain data.
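
As a rough sketch of what suppressing only the aggregated output could look like, assuming the final dataset is a table of counts built with pandas: the data, column names, and threshold below are hypothetical, and this is not the code used for the chest pain data.

```python
import pandas as pd

# Hypothetical record-level data; column names are illustrative only.
records = pd.DataFrame({
    "site": ["A"] * 12 + ["B"] * 3,
    "outcome": ["admitted"] * 7 + ["discharged"] * 5
               + ["admitted"] * 2 + ["discharged"] * 1,
})

THRESHOLD = 5  # assumed cut-off: suppress cells with n < THRESHOLD

# Aggregate first, then suppress small cells in the aggregated table.
counts = pd.crosstab(records["site"], records["outcome"])

# Keep cells at or above the threshold; smaller cells become NaN.
suppressed = counts.where(counts >= THRESHOLD)
print(suppressed)
```

Here the record-level data is left as it is and only the published table is masked, which is the main difference from the column-wise approach above.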

vvcb commented 1 year ago

Closed per discussion in #38