arx-deidentifier / arx

ARX is a comprehensive open source data anonymization tool aiming to provide scalability and usability. It supports various anonymization techniques, methods for analyzing data quality and re-identification risks and it supports well-known privacy models, such as k-anonymity, l-diversity, t-closeness and differential privacy.
http://arx.deidentifier.org/
Apache License 2.0
620 stars 213 forks source link

Performance drop for sets with 11+ attributes (QIDs) #209

Closed felix134 closed 6 years ago

felix134 commented 6 years ago

Hello. I've noticed that there is a significant performance drop in case of processing time, when I anonymize approximate 11 or more QIDs with the arx framework. The question is: Are there config values I can modify to recieve a better result? Or is this just an ordinary behavior of the used k-anonymity approach (with k=5)? Note: The problem I've encountered may be related to #54. (?)

Example: I tried to simplify my actual problem, and created tests structured as follows: I used for all tests a data set of size 10. For each test I removed a QID attribute (and its hierarchy). In the following an extract of the code with 11 attributes (10 of them are QIDs):

    //init arx anonymizer
    ARXAnonymizer defaultAnonymizer = new ARXAnonymizer();
    defaultAnonymizer.setMaximumSnapshotSizeDataset(0.2);
    defaultAnonymizer.setMaximumSnapshotSizeSnapshot(0.8);
    defaultAnonymizer.setHistorySize(200);

    //init arx config
    ARXConfiguration config = ARXConfiguration.create();
    config.setMaxOutliers(0.02);
    config.addPrivacyModel(new KAnonymity(5));

    //create sample data
    String[][] set = new String[11][];
    set[0] = new String[]{"Category", "City", "Country", "Customer ID ID", "Customer Name", "Discount", "Order Date", "Order ID", "Postal Code", "Product ID", "Product Name"};
    set[1] = new String[]{"Furniture", "Henderson", "United States", "EA4B256A655A0D35D3035DB79C1D0F67A3705129", "Claire Gute", "0", "08/11/2016", "CA-2016-152156", "42420", "FUR-BO-10001798", "Bush Somerset Collection Bookcase"};
    set[2] = new String[]{"Furniture", "Henderson", "United States", "EA4B256A655A0D35D3035DB79C1D0F67A3705129", "Claire Gute", "0", "08/11/2016", "CA-2016-152156", "42420", "FUR-CH-10000454", "Hon Deluxe Fabric Upholstered Stacking Chairs  Rounded Back"};
    //[...]
    Data data = Data.create(set);

    //list of attributes present in the data set
    String[] attrList = new String[]{"Category", "City", "Country", "Customer ID ID", "Customer Name", "Discount", "Order Date", "Order ID", "Postal Code", "Product ID", "Product Name"};
    Arrays.sort(attrList);

    //creating hierarchies
    List<AttributeType.Hierarchy> hierarchyList = new LinkedList<>();
    String[] h0e0 = new String[]{"Furniture", "*"};
    String[] h0e1 = new String[]{"Office Supplies", "*"};
    String[] h0e2 = new String[]{"Technology", "*"};
    String[][] h0 = new String[][]{h0e0, h0e1, h0e2};
    AttributeType.Hierarchy hierarchy0 = AttributeType.Hierarchy.create(h0);
    hierarchyList.add(hierarchy0);
    //[...]

    //assigning hierarchies
    for (int i = 0; i < attrList.length; i++) {
        data.getDefinition().setAttributeType(attrList[i], hierarchyList.get(i));
    }

    //defining and assigning attribute types
    String[] privacyClasses = new String[]{"qid", "qid", "qid", "nonSen", "qid", "qid", "qid", "qid", "qid", "qid", "qid"};
    for (int i = 0; i < attrList.length; i++) {
        switch (privacyClasses[i]) {
            case "qid":
                data.getDefinition().setAttributeType(attrList[i], AttributeType.QUASI_IDENTIFYING_ATTRIBUTE);
                break;
            case "nonSen":
                data.getDefinition().setAttributeType(attrList[i], AttributeType.INSENSITIVE_ATTRIBUTE);
                break;
        }
    }

    //running anonymization
    ARXResult result = null;
    try {
        result = defaultAnonymizer.anonymize(data, config);
    } catch (IOException e) {
        e.printStackTrace();
    }
    assert result == null;
    System.out.println(result.getTime());
    System.out.println(result.isResultAvailable());`

The results are: Number of QIDs | processing time

Sincerely, Felix Bölz.

prasser commented 6 years ago

Dear Felix,

thanks for your interest in ARX! This "issue" is not related to #54 but with the runtime complexity of the underlying optimal search algorithm (execution time grows exponentially with the number of attributes). Please see our publications for further details. To mitigate this, you can use the heuristic algorithm implemented in ARX. You can find an example, in Example34.java

Best Fabian

felix134 commented 6 years ago

Thank you very much for your answer!