arx-deidentifier / arx

ARX is a comprehensive open source data anonymization tool that aims to provide scalability and usability. It supports various anonymization techniques, methods for analyzing data quality and re-identification risks, and well-known privacy models such as k-anonymity, l-diversity, t-closeness and differential privacy.
http://arx.deidentifier.org/
Apache License 2.0

How to deactivate default removal of records at risk #248

Closed jonasruelfing closed 5 years ago

jonasruelfing commented 5 years ago

While trying to calculate the risk (JournalistRisk) of the bottom node ([0,0,0,0]), I noticed some irregularities. In my understanding, the bottom node should always be the unanonymized input data set and should therefore always have the same risk (1/size of the equivalence classes). When I change the parameters of the anonymization (for example k), the calculation of the equivalence class sizes for the bottom node no longer seems to work properly.

Simplified code:

    ARXResult result; // result of the anonymization
    ARXNode bottom = result.getLattice().getBottom();
    risk = result.getOutput(bottom, false).getRiskEstimator().getSampleBasedReidentificationRisk().getEstimatedJournalistRisk();

When executing this code multiple times with the same input data but different anonymization parameters, shouldn't the value always be the same?

prasser commented 5 years ago

Hi Jonas,

thanks for your interest in ARX!

No, in the default configuration the value is not necessarily the same. As a "security feature", output datasets are always transformed in such a manner that the specified privacy guarantees are fulfilled.

So, if you, for example, specify k-anonymity with varying k, the output dataset will change (also for the bottom node).

You can change this behavior by calling ARXConfiguration.setSuppressionAlwaysEnabled(false), but I don't recommend doing that.

Best,
Fabian
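To illustrate the effect described above, here is a minimal plain-Java sketch that does not use the ARX library. It assumes that the reported risk is 1 divided by the size of the smallest equivalence class, and that records in classes smaller than k are suppressed and then left out of the estimate. The class `BottomNodeRisk`, its methods and the sample records are hypothetical and only mimic the behavior discussed in this thread; they are not ARX's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BottomNodeRisk {

    /** Groups records by their quasi-identifier values and counts each class. */
    public static Map<List<String>, Integer> classSizes(List<List<String>> records) {
        Map<List<String>, Integer> sizes = new HashMap<>();
        for (List<String> record : records) {
            sizes.merge(record, 1, Integer::sum);
        }
        return sizes;
    }

    /** Assumption: suppression drops every record whose class has fewer than k members. */
    public static List<List<String>> suppressSmallClasses(List<List<String>> records, int k) {
        Map<List<String>, Integer> sizes = classSizes(records);
        List<List<String>> kept = new ArrayList<>();
        for (List<String> record : records) {
            if (sizes.get(record) >= k) {
                kept.add(record);
            }
        }
        return kept;
    }

    /** Risk = 1 / size of the smallest remaining class (0 if no record remains). */
    public static double journalistRisk(List<List<String>> records) {
        if (records.isEmpty()) {
            return 0.0;
        }
        int smallest = Integer.MAX_VALUE;
        for (int size : classSizes(records).values()) {
            smallest = Math.min(smallest, size);
        }
        return 1.0 / smallest;
    }

    /** Hypothetical, ungeneralized input: classes of size 1, 2 and 3. */
    public static List<List<String>> sampleRecords() {
        List<List<String>> records = new ArrayList<>();
        records.add(Arrays.asList("34", "male"));
        records.add(Arrays.asList("45", "female"));
        records.add(Arrays.asList("45", "female"));
        records.add(Arrays.asList("51", "male"));
        records.add(Arrays.asList("51", "male"));
        records.add(Arrays.asList("51", "male"));
        return records;
    }

    public static void main(String[] args) {
        // Same input and no generalization, but a different k each time.
        for (int k = 1; k <= 3; k++) {
            List<List<String>> output = suppressSmallClasses(sampleRecords(), k);
            System.out.println("k=" + k + " risk=" + journalistRisk(output));
        }
    }
}
```

With these sample records the risk drops from 1.0 (k=1) to 0.5 (k=2) to roughly 0.33 (k=3), even though no attribute is generalized. This is the same pattern as a bottom-node output whose risk changes with k while suppression is enabled.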

jonasruelfing commented 5 years ago

Hi Fabian,

Thanks a lot for your clarifications! That makes a lot of sense now.

/jonas