Closed MattStammers closed 1 year ago
Awesome. Looks like there are quite a few options for SDC described here: https://github.com/AI-SDC/ACRO/blob/main/notebooks/test.ipynb. @quindavies has written some code that does cell suppression at the end of our pipeline, but it will be good to see both approaches.
As always PRs welcome :-).
This is how we have implemented this locally:

import numpy as np

# Columns to exclude from low-number suppression, if needed
keep_cols = ['patient_id']
relevant_cols = list(
    set(df.columns.to_list()) - set(keep_cols)
)

# Replace values that occur `threshold` times or fewer in a column with NaN
def supress_lownum(col, threshold, relevant_cols):
    if col.name in relevant_cols:
        counts = col.value_counts()
        to_supress = counts[counts <= threshold].index
        return col.replace(to_supress, np.nan)
    return col

# Set threshold and apply column by column
THRESHOLD = 5
df_supressed = df.apply(supress_lownum, relevant_cols=relevant_cols, threshold=THRESHOLD)
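To make the effect of column-wise suppression concrete, here is a small self-contained sketch on toy data. The frame `toy`, the column `site` and the helper name `suppress_low_counts` are made up for illustration, and the helper uses `Series.mask` with `isin` rather than `replace`, so it is a variant of the approach above rather than the exact local code.

```python
import pandas as pd

def suppress_low_counts(col, threshold, relevant_cols):
    """Blank out values that occur `threshold` times or fewer in a column."""
    if col.name not in relevant_cols:
        return col  # leave excluded columns (e.g. identifiers) untouched
    counts = col.value_counts()
    rare = counts[counts <= threshold].index
    # mask() sets matching entries to NaN and keeps everything else
    return col.mask(col.isin(rare))

# Hypothetical toy data: site "C" appears only once
toy = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5, 6, 7],
    "site": ["A", "A", "A", "B", "B", "B", "C"],
})

suppressed = toy.apply(suppress_low_counts, threshold=1, relevant_cols=["site"])
print(suppressed)
```

With a threshold of 1, only the single "C" row loses its `site` value; `patient_id` is left as-is because it is not in `relevant_cols`.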
I have opened a pull request for this: #38
Following further discussion, an alternative option would be to simply suppress the final aggregated dataset at the n<5 or n<10 level, but this comes with other issues. We have done this for the chest pain data.
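A minimal sketch of suppression at the aggregate level might look like the following. The function name `suppress_small_cells`, the category labels and the counts are all hypothetical, and the strict `<` comparison is an assumption chosen to match "n<5". It only blanks small cells; it does nothing about secondary disclosure, for example recovering a blanked cell from a published total.

```python
from collections import Counter

def suppress_small_cells(counts, threshold=5, marker=None):
    """Return a copy of `counts` with any cell below `threshold` replaced by `marker`."""
    return {key: (n if n >= threshold else marker) for key, n in counts.items()}

# Hypothetical record-level categories, aggregated to counts first
records = ["chest pain"] * 12 + ["fracture"] * 3 + ["asthma"] * 7
table = Counter(records)

# Suppress only after aggregation, at the n<5 level
safe = suppress_small_cells(table, threshold=5)
print(safe)
```

Here the "fracture" cell (3 records) is replaced by `None`, while the other two counts are kept unchanged.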
Closed per discussion in #38
I have managed to make the validator work with the ACRO package to suppress all n<=10. We may want to add this as a feature but it needs a decision by the central teams before it can go ahead.