CC-HIC / ccanonym

Critical care data anonymisation package

Differentiating between removed data and missing data #28

Closed tompollard closed 7 years ago

tompollard commented 7 years ago

When de-identification methods are applied (e.g. k-anonymity), data is removed and replaced with NA. It may be helpful to distinguish between data that was not available (NA) and data that was removed ().
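To make the proposal concrete, here is a minimal sketch in Python of a suppression step that marks removed values differently from values that were never recorded. It is not the package's implementation: the function `suppress_small_groups`, the `REMOVED` marker, and the column names are all hypothetical, and the suppression rule is a simplified stand-in for a real k-anonymity procedure.

```python
from collections import Counter

MISSING = None        # value was never recorded (NA in this thread)
REMOVED = "REMOVED"   # hypothetical marker for values suppressed on purpose


def suppress_small_groups(rows, quasi_identifiers, k):
    """Replace quasi-identifier values with REMOVED in every row whose
    quasi-identifier combination occurs fewer than k times."""
    def key(row):
        return tuple(row[q] for q in quasi_identifiers)

    counts = Counter(key(row) for row in rows)
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input is left untouched
        if counts[key(row)] < k:
            for q in quasi_identifiers:
                new_row[q] = REMOVED
        result.append(new_row)
    return result


# Illustrative records only; "lactate" is missing in the third row.
rows = [
    {"age_band": "60-69", "sex": "F", "lactate": 1.2},
    {"age_band": "60-69", "sex": "F", "lactate": 2.4},
    {"age_band": "90+", "sex": "M", "lactate": MISSING},
]
released = suppress_small_groups(rows, ["age_band", "sex"], k=2)
```

In the third released row the quasi-identifiers carry the `REMOVED` marker while `lactate` stays `None`, so a recipient could tell the two cases apart.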

sinanshi commented 7 years ago

The default marker for missing data is usually "NULL", and data removed in the anonymisation process is . We can differentiate between them, but users will not be able to see either missing data or removed data in the RData file they receive. We think this might be more secure than the other way around.

docsteveharris commented 7 years ago

@tompollard, thanks, but we think this is actually part of the security. 'Missing' is counted as a category in l-diversity, so if we add a further 'removed' category we will artificially inflate l-diversity.

However, we do have a plan to impute missing data so that the release is more usable but also more secure (you then don't know whether a visible value is real or imputed).
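The l-diversity concern can be illustrated with a small sketch in Python. It assumes the simplest "distinct" form of l-diversity, where l is the number of distinct sensitive values in one equivalence class and a missing value counts as its own category. The function name `distinct_l`, the `REMOVED` marker, and the example values are hypothetical and not taken from the package.

```python
REMOVED = "REMOVED"  # hypothetical marker for suppressed values


def distinct_l(sensitive_values):
    """Distinct l-diversity of one equivalence class: the number of
    distinct sensitive values, with None (missing) counted as a category."""
    return len(set(sensitive_values))


# One equivalence class, three records. In `merged`, a suppressed value is
# released as missing (None); in `split`, it gets its own marker.
merged = ["sepsis", None, None]
split = ["sepsis", None, REMOVED]

l_merged = distinct_l(merged)  # categories: "sepsis", missing
l_split = distinct_l(split)    # categories: "sepsis", missing, removed
```

Here `l_merged` is 2 and `l_split` is 3, although the class contains only one real clinical value in both cases. The extra category raises the measured diversity without adding any real protection, which is the inflation described above.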