arx-deidentifier / arx

ARX is a comprehensive open source data anonymization tool aiming to provide scalability and usability. It supports various anonymization techniques, methods for analyzing data quality and re-identification risks, and well-known privacy models such as k-anonymity, l-diversity, t-closeness and differential privacy.
http://arx.deidentifier.org/
Apache License 2.0

[BUG] Running ProfitabilityProsecutor with an already suppressed dataset (GUI and API) #457

Open mhalilovic opened 9 months ago

mhalilovic commented 9 months ago

I encounter an error when anonymizing a fully suppressed dataset using the API, with similar behavior observed in the GUI.

Example to reproduce using the ARX GUI: import a fully suppressed dataset (all values suppressed), apply generalization hierarchies with just one level, and configure ProfitabilityProsecutor with a suppression limit of 100%.

When attempting to anonymize, I get the message: Cannot anonymize data: Value (NaN) out of range [0,1]

Description of the API behavior:
The same issue appears to occur when using the API with Java. Here is part of my logs:

Caused by: java.lang.IllegalStateException: Value (NaN) out of range [0,1]
    at org.deidentifier.arx.metric.v2.MetricSDNMEntropyBasedInformationLoss.getEntropyBasedInformationLoss(MetricSDNMEntropyBasedInformationLoss.java:109)
    at org.deidentifier.arx.criteria.ProfitabilityProsecutor.isAnonymous(ProfitabilityProsecutor.java:121)
    at org.deidentifier.arx.framework.check.groupify.HashGroupify.isPrivacyModelFulfilled(HashGroupify.java:758)
    at org.deidentifier.arx.framework.check.groupify.HashGroupify.analyzeWithEarlyAbort(HashGroupify.java:653)
    at org.deidentifier.arx.framework.check.groupify.HashGroupify.stateAnalyze(HashGroupify.java:447)
    at org.deidentifier.arx.framework.check.TransformationChecker.check(TransformationChecker.java:217)
    at org.deidentifier.arx.framework.check.TransformationChecker.check(TransformationChecker.java:170)
    at org.deidentifier.arx.algorithm.FLASHAlgorithmImpl.traverse(FLASHAlgorithmImpl.java:128)
    at org.deidentifier.arx.ARXAnonymizer.anonymize(ARXAnonymizer.java:777)
    at org.deidentifier.arx.ARXAnonymizer.anonymize(ARXAnonymizer.java:226)
    at org.deidentifier.arx.distributed.ARXWorkerLocal$1.call(Unknown Source)
    at org.deidentifier.arx.distributed.ARXWorkerLocal$1.call(Unknown Source)

prasser commented 9 months ago

This should be relatively easy to fix. Can you please investigate the semantics of the value in [0, 1] usually returned from getEntropyBasedInformationLoss? Is it 0 for no information loss and 1 for maximum information loss, or the other way around (0 for maximum information loss and 1 for no information loss)? Please let me know here.

mhalilovic commented 9 months ago

0 for no information loss and 1 for maximum information loss
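
For illustration only, here is a minimal sketch of how a NaN can reach a [0,1] range check and how the semantics above would decide which constant a guard returns. This is not ARX's actual code and not necessarily what any fix does: the class `InformationLossSketch` and its methods are hypothetical, and it assumes the loss is normalized by dividing by a maximum that becomes zero when every value is already suppressed.

```java
public class InformationLossSketch {

    /** Hypothetical normalized loss: 0 = no information loss, 1 = maximum loss. */
    public static double rawLoss(double lost, double maximum) {
        // With doubles, 0.0 / 0.0 evaluates to NaN instead of throwing
        return lost / maximum;
    }

    /** Range check; NaN fails it because every comparison with NaN is false. */
    public static double checkRange(double value) {
        if (!(value >= 0d && value <= 1d)) {
            throw new IllegalStateException("Value (" + value + ") out of range [0,1]");
        }
        return value;
    }

    /** Assumed guard: if there is nothing left to lose, report maximum loss (1). */
    public static double guardedLoss(double lost, double maximum) {
        if (maximum == 0d) {
            return 1d;
        }
        return checkRange(lost / maximum);
    }

    public static void main(String[] args) {
        System.out.println(rawLoss(0d, 0d));      // prints NaN
        System.out.println(guardedLoss(0d, 0d));  // prints 1.0
        System.out.println(guardedLoss(1d, 4d));  // prints 0.25
    }
}
```

Returning 1 in the degenerate case follows from reading 1 as maximum information loss. Under the opposite convention, the guard would return 0 instead.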

prasser commented 9 months ago

Please check whether the recent commit 984f38f fixes the problem.

mhalilovic commented 9 months ago

My issue with the API is resolved. Thank you!

The GUI now also "anonymizes" the dataset without an error message. Most quality models have NaN or N/A values in the Quality models tab now. I do not know if this is expected behavior.

prasser commented 9 months ago

> Most quality models have NaN or N/A values in the Quality models tab now. I do not know if this is expected behavior.

Are you sure that this is caused by this commit? Please check.