Open jenno-verdonck opened 1 year ago
Thanks. This issue doesn't contain enough info to understand the potential bug. Please provide further details.
PS: I'm pretty sure that this isn't a bug but intended behavior, but to be sure and to explain what is going on I need more details.
Yeah, my bad. I accidentally posted the report before finishing it.
OK, thanks. As already suspected, this is not a bug but expected behaviour. In ARX, hierarchies that have been generated using the builders are associated with a "functional definition" of the hierarchy as meta-information. This information can be used to measure information loss more accurately. One example:
Assume you have a dataset with an integer attribute. In the records, you have three values: 1, 3 and 7.
When using an interval-based hierarchy builder, you specify the interval [0, 10[. As a result, ARX knows that [0, 10[ is a generalization of 10 integer values and might, e.g., estimate information loss as 1/10 = 0.1
When loading a hierarchy from a CSV file, ARX cannot "understand" what the entries in the hierarchy mean. In the case of our example, it can just see that "[0, 10[" is a generalization of 1, 3 and 7 and might, e.g., estimate information loss as 1/|{1, 3, 7}| = 1/3 ≈ 0.33
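The arithmetic in this example can be reproduced with a small sketch. This is not ARX code and does not use the ARX API; the two function names are hypothetical and only mirror the two estimates described above (1/10 with a functional definition, 1/3 when only the values in the data are visible).

```python
from fractions import Fraction


def loss_with_functional_definition(lower, upper):
    """Estimate for the builder case: the interval [lower, upper[ is known
    to cover (upper - lower) integer values. Hypothetical helper."""
    return Fraction(1, upper - lower)


def loss_from_observed_values(observed_values):
    """Estimate for the CSV case: only the distinct values present in the
    data are known to fall under the generalized entry. Hypothetical helper."""
    return Fraction(1, len(set(observed_values)))


values = [1, 3, 7]  # the three values from the example
with_definition = loss_with_functional_definition(0, 10)
from_csv = loss_from_observed_values(values)

print(float(with_definition))
print(round(float(from_csv), 2))
```

The same hierarchy entry "[0, 10[" therefore yields two different estimates, depending only on whether the functional definition is available.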
You can also save and load the functional definitions of hierarchies in the wizards, using the "Save..." and "Load..." buttons.
Thanks for the clarification.
I already suspected something like this. I can, however, see how this may be confusing for users who expect the same result when they see the same hierarchy in the GUI.
Calculating the score the way it is done with the CSV files seems to make more sense to me, as it takes into account the properties of the dataset in use and more accurately reflects the score specific to that dataset. I therefore suspect that the utility of the dataset obtained using CSV files will be higher.
Not sure. I think this depends on the context and use case.
I turned this issue into an "enhancement". We could make it more transparent in the UI whether a functional definition of a hierarchy is available and whether it should be used. Please note that, as a workaround, you can remove the functional representation by manually editing the hierarchy in the hierarchy viewer (not in the wizard).
What about exporting and importing the functional definition of the hierarchies together with the hierarchies themselves? This way, if a functional definition is available, it can be used for more accurate loss calculation and one gets the same result every time.
This would probably solve the import/export problems in the UI. A fix for this in the API could be to prevent users from building the hierarchy from the HierarchyBuilder themselves, or to give a warning when they do so. This would avoid scenarios where the user builds the Hierarchy and passes the result to the configuration, which removes the functional definition. At the moment, a user could do this without knowing the difference between Hierarchies and HierarchyBuilders.
Another option would be to merge the hierarchy and builder representations and work with a toggle that enables or disables the functional definition when available. This would, however, require a major restructuring, I think.
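To make the toggle idea concrete, here is a minimal sketch of what such a merged representation could look like. Everything in it is hypothetical: the class, its fields and the `group_size` method are illustrative names, not part of the ARX API, and the functional definition is reduced to a single integer interval.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class HierarchyWithDefinition:
    """Hypothetical merged representation: the materialized hierarchy rows
    plus an optional functional definition and a toggle for using it."""
    rows: Sequence[Tuple[str, ...]]              # e.g. ("1", "[0, 10[", "*")
    interval: Optional[Tuple[int, int]] = None   # functional definition, if any
    use_definition: bool = True                  # the toggle

    def group_size(self) -> int:
        """Number of values the generalized entry is taken to cover."""
        if self.use_definition and self.interval is not None:
            lower, upper = self.interval
            return upper - lower
        # Fall back to the distinct values actually present in the rows.
        return len({row[0] for row in self.rows})


rows = [("1", "[0, 10[", "*"), ("3", "[0, 10[", "*"), ("7", "[0, 10[", "*")]
hierarchy = HierarchyWithDefinition(rows, interval=(0, 10))

size_with = hierarchy.group_size()      # functional definition enabled
hierarchy.use_definition = False
size_without = hierarchy.group_size()   # same rows, definition ignored
```

With the toggle on, the entry counts as covering 10 values; with it off, only the 3 values in the data count, which matches the builder and CSV cases discussed earlier in the thread.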
I think one can do the same in the API, e.g. by saving both the hierarchy and the functional definition and then loading them. I will try the above solution and propose a PR.
After investigating this behavior a bit further, I noticed that the code only calculates the shares in the scoring functions differently when using Redaction- and Interval-based builders. For all other builder types, the shares are calculated exactly as they are without a functional definition. The utility metrics, on the other hand, are only calculated differently when using Redaction-based builders.
It's true that not all utility models make use of additional info provided by functional hierarchies and that not all hierarchy types provide such information.
Describe the bug: ARX gives different optimal solutions when using HierarchyBuilders in comparison to using the hierarchy created from this builder.
To Reproduce Steps to reproduce the behavior:
Expected behavior: I expected to get the same solution in both situations.
Files example.zip
ARX GUI (please complete the following information):