arx-deidentifier / arx

ARX is a comprehensive open source data anonymization tool aiming to provide scalability and usability. It supports various anonymization techniques, methods for analyzing data quality and re-identification risks and it supports well-known privacy models, such as k-anonymity, l-diversity, t-closeness and differential privacy.
http://arx.deidentifier.org/
Apache License 2.0
623 stars, 216 forks

[ENHANCEMENT] Make use of functional hierarchies more transparent in the UI #433

Open jenno-verdonck opened 1 year ago

jenno-verdonck commented 1 year ago

Describe the bug: ARX gives different optimal solutions when using hierarchy builders than when using the hierarchies created from those builders.

To reproduce: steps to reproduce the behavior:

  1. Open the example project in ARX
  2. Anonymize and note down the best node
  3. Write all hierarchies to CSV
  4. Load all hierarchies back in from CSV so that you no longer use builders
  5. Note the different solution.

Expected behavior: I expected to get the same solution in both situations.

Files: example.zip

prasser commented 1 year ago

Thanks. This issue doesn't contain enough info to understand the potential bug. Please provide further details.

prasser commented 1 year ago

PS: I'm pretty sure that this isn't a bug but intended behavior, but to be sure and to explain what is going on I need more details.

jenno-verdonck commented 1 year ago

Yeah, my bad. I accidentally posted the report before finishing it.

prasser commented 1 year ago

OK, thanks. As already suspected, this is not a bug but expected behaviour. In ARX, hierarchies that have been generated using the builders are associated with a "functional definition" of the hierarchy as meta-information. This information can be used to measure information loss more accurately. One example:

Assume you have a dataset with an integer attribute. In the records, you have three values: 1, 3 and 7.

When using an interval-based hierarchy builder, you specify the interval [0, 10[. As a result, ARX knows that [0, 10[ is a generalization of 10 integer values and might, e.g., estimate information loss as 1/10 = 0.1.

When loading a hierarchy from a CSV file, ARX cannot "understand" what the entries in the hierarchy mean. In the case of our example, it can only see that "[0, 10[" is a generalization of 1, 3 and 7 and might, e.g., estimate information loss as 1/|{1, 3, 7}| = 1/3 ≈ 0.33.
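The two estimates above can be reproduced with a small sketch. This is plain Python for illustration, not ARX code: the function names are hypothetical, and the formulas simply mirror the 1/10 and 1/3 figures from the example rather than any specific ARX quality model.

```python
from fractions import Fraction


def loss_with_functional_definition(lower: int, upper: int) -> Fraction:
    """Estimate based on the interval definition [lower, upper[.

    The interval is known to cover (upper - lower) integer values,
    whether or not they occur in the data.
    """
    return Fraction(1, upper - lower)


def loss_from_observed_values(values, lower: int, upper: int) -> Fraction:
    """Estimate based only on the values seen in the dataset.

    Without a functional definition, the label "[0, 10[" is an opaque
    string, so only the distinct values mapped to it can be counted.
    """
    covered = {v for v in values if lower <= v < upper}
    return Fraction(1, len(covered))


records = [1, 3, 7]
with_definition = loss_with_functional_definition(0, 10)
from_csv = loss_from_observed_values(records, 0, 10)
print(float(with_definition))     # 0.1
print(round(float(from_csv), 2))  # 0.33
```

The same label yields two different estimates, which is why the optimal solution can change once the hierarchy is reloaded from CSV.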

You can also save and load the functional definitions of hierarchies in the wizards, using the "Save..." and "Load..." buttons.

jenno-verdonck commented 1 year ago

Thanks for the clarification.

I already suspected something like this. I can however see how this may be confusing for some users that expect the same result when visually seeing the same hierarchy in the GUI.

Calculating the score the way it is done with the CSV files seems to make more sense to me, as it takes into account the properties of the dataset in use and more accurately reflects the score specific to that dataset. I therefore suspect that the utility of the dataset obtained using CSV files will be higher.

prasser commented 1 year ago

> Calculating the score the way it is done with the CSV files seems to make more sense to me, as it takes into account the properties of the dataset in use and more accurately reflects the score specific to that dataset. I therefore suspect that the utility of the dataset obtained using CSV files will be higher.

Not sure. I think this depends on the context and use case.

> I already suspected something like this. I can however see how this may be confusing for some users that expect the same result when visually seeing the same hierarchy in the GUI.

I turned this issue into an "enhancement". We could make it more transparent in the UI whether a functional definition of a hierarchy is available and whether it is being used. Please note that, as a workaround, you can remove the functional representation by manually editing the hierarchy in the hierarchy viewer (not in the wizard).

idhamari commented 1 year ago

What about exporting and importing the functional definition of the hierarchies together with the hierarchies themselves? This way, if a functional definition is available, it can be used for more accurate loss calculation and one gets the same result every time.
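A minimal sketch of this idea, under stated assumptions: the hierarchy is written as a semicolon-delimited CSV file and the functional definition goes into a JSON file next to it. The file layout, the JSON fields and both function names are hypothetical and are not part of ARX.

```python
import csv
import json
import tempfile
from pathlib import Path


def export_hierarchy(directory, name, rows, definition=None):
    """Write the rows to <name>.csv and, if given, the functional
    definition to <name>.json in the same directory (hypothetical layout)."""
    directory = Path(directory)
    with open(directory / f"{name}.csv", "w", newline="") as handle:
        csv.writer(handle, delimiter=";").writerows(rows)
    if definition is not None:
        (directory / f"{name}.json").write_text(json.dumps(definition))


def import_hierarchy(directory, name):
    """Load the rows and, when the JSON file exists, the definition too."""
    directory = Path(directory)
    with open(directory / f"{name}.csv", newline="") as handle:
        rows = list(csv.reader(handle, delimiter=";"))
    sidecar = directory / f"{name}.json"
    definition = json.loads(sidecar.read_text()) if sidecar.exists() else None
    return rows, definition


rows = [["1", "[0, 10[", "*"], ["3", "[0, 10[", "*"], ["7", "[0, 10[", "*"]]
definition = {"type": "interval", "lower": 0, "upper": 10}

with tempfile.TemporaryDirectory() as tmp:
    export_hierarchy(tmp, "age", rows, definition)
    loaded_rows, loaded_definition = import_hierarchy(tmp, "age")
```

If the JSON file is missing, the import returns `None` for the definition, which corresponds to the plain CSV case discussed earlier in the thread.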

jenno-verdonck commented 1 year ago

> What about exporting and importing the functional definition of the hierarchies together with the hierarchies themselves? This way, if a functional definition is available, it can be used for more accurate loss calculation and one gets the same result every time.

This would probably solve the import/export problems in the UI. A fix for this in the API could be to prevent users from building the Hierarchy from a HierarchyBuilder themselves, or to give a warning when they do so. This would avoid scenarios where the user builds the Hierarchy and passes the result to the configuration, thereby removing the functional definition. At the moment, a user could do this without knowing the difference between Hierarchies and HierarchyBuilders.

Another option would be to merge the hierarchy and builder representations and to work with a toggle that enables or disables the functional definition when it is available. This would, however, require a major restructuring, I think.
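As a rough illustration of the toggle idea, here is a toy model in Python. It is a sketch only: the class `MergedHierarchy`, its fields and the `group_size` method are hypothetical and do not correspond to any ARX class.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class MergedHierarchy:
    """Hypothetical merged representation: materialized levels plus an
    optional functional definition and a switch to use or ignore it."""
    levels: Dict[str, List[str]]                        # value -> generalizations
    domain_size: Optional[Callable[[str], int]] = None  # label -> covered values
    use_functional_definition: bool = True

    def group_size(self, label: str) -> int:
        """Number of values a generalized label is taken to cover."""
        if self.use_functional_definition and self.domain_size is not None:
            return self.domain_size(label)
        # Fall back to counting the dataset values that map to this label.
        return sum(1 for path in self.levels.values() if label in path)


age = MergedHierarchy(
    levels={"1": ["[0, 10["], "3": ["[0, 10["], "7": ["[0, 10["]},
    domain_size=lambda label: 10 if label == "[0, 10[" else 1,
)
size_with_toggle_on = age.group_size("[0, 10[")   # uses the interval definition
age.use_functional_definition = False
size_with_toggle_off = age.group_size("[0, 10[")  # counts observed values only
```

With the toggle on, the label covers 10 values; with it off, only the three values present in the data are counted, matching the two cases from the earlier example.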

idhamari commented 1 year ago

> This would probably solve the import/export problems in the UI.

I think one can do the same in the API, e.g. by saving both the hierarchy and the functional definition and then loading them together. I will try the above solution and propose a PR.

jenno-verdonck commented 1 year ago

After investigating this behavior a bit further, I noticed that the code only calculates the shares in the scoring functions differently when using Redaction- and Interval-based builders. For all other builder types, the shares are calculated exactly as they would be without a functional definition. The utility metrics, on the other hand, are only calculated differently when using Redaction-based builders.

prasser commented 1 year ago

It's true that not all utility models make use of additional info provided by functional hierarchies and that not all hierarchy types provide such information.