Performance drop for sets with 11+ attributes (QIDs)

felix134 commented 6 years ago

Hello. I've noticed that there is a significant performance drop in case of processing time, when I anonymize approximate 11 or more QIDs with the arx framework. The question is: Are there config values I can modify to recieve a better result? Or is this just an ordinary behavior of the used k-anonymity approach (with k=5)? Note: The problem I've encountered may be related to #54. (?)

Example: I tried to simplify my actual problem, and created tests structured as follows: I used for all tests a data set of size 10. For each test I removed a QID attribute (and its hierarchy). In the following an extract of the code with 11 attributes (10 of them are QIDs):

    //init arx anonymizer
    ARXAnonymizer defaultAnonymizer = new ARXAnonymizer();
    defaultAnonymizer.setMaximumSnapshotSizeDataset(0.2);
    defaultAnonymizer.setMaximumSnapshotSizeSnapshot(0.8);
    defaultAnonymizer.setHistorySize(200);

    //init arx config
    ARXConfiguration config = ARXConfiguration.create();
    config.setMaxOutliers(0.02);
    config.addPrivacyModel(new KAnonymity(5));

    //create sample data
    String[][] set = new String[11][];
    set[0] = new String[]{"Category", "City", "Country", "Customer ID ID", "Customer Name", "Discount", "Order Date", "Order ID", "Postal Code", "Product ID", "Product Name"};
    set[1] = new String[]{"Furniture", "Henderson", "United States", "EA4B256A655A0D35D3035DB79C1D0F67A3705129", "Claire Gute", "0", "08/11/2016", "CA-2016-152156", "42420", "FUR-BO-10001798", "Bush Somerset Collection Bookcase"};
    set[2] = new String[]{"Furniture", "Henderson", "United States", "EA4B256A655A0D35D3035DB79C1D0F67A3705129", "Claire Gute", "0", "08/11/2016", "CA-2016-152156", "42420", "FUR-CH-10000454", "Hon Deluxe Fabric Upholstered Stacking Chairs  Rounded Back"};
    //[...]
    Data data = Data.create(set);

    //list of attributes present in the data set
    String[] attrList = new String[]{"Category", "City", "Country", "Customer ID ID", "Customer Name", "Discount", "Order Date", "Order ID", "Postal Code", "Product ID", "Product Name"};
    Arrays.sort(attrList);

    //creating hierarchies
    List<AttributeType.Hierarchy> hierarchyList = new LinkedList<>();
    String[] h0e0 = new String[]{"Furniture", "*"};
    String[] h0e1 = new String[]{"Office Supplies", "*"};
    String[] h0e2 = new String[]{"Technology", "*"};
    String[][] h0 = new String[][]{h0e0, h0e1, h0e2};
    AttributeType.Hierarchy hierarchy0 = AttributeType.Hierarchy.create(h0);
    hierarchyList.add(hierarchy0);
    //[...]

    //assigning hierarchies
    for (int i = 0; i < attrList.length; i++) {
        data.getDefinition().setAttributeType(attrList[i], hierarchyList.get(i));
    }

    //defining and assigning attribute types
    String[] privacyClasses = new String[]{"qid", "qid", "qid", "nonSen", "qid", "qid", "qid", "qid", "qid", "qid", "qid"};
    for (int i = 0; i < attrList.length; i++) {
        switch (privacyClasses[i]) {
            case "qid":
                data.getDefinition().setAttributeType(attrList[i], AttributeType.QUASI_IDENTIFYING_ATTRIBUTE);
                break;
            case "nonSen":
                data.getDefinition().setAttributeType(attrList[i], AttributeType.INSENSITIVE_ATTRIBUTE);
                break;
        }
    }

    //running anonymization
    ARXResult result = null;
    try {
        result = defaultAnonymizer.anonymize(data, config);
    } catch (IOException e) {
        e.printStackTrace();
    }
    assert result == null;
    System.out.println(result.getTime());
    System.out.println(result.isResultAvailable());`

The results are: Number of QIDs | processing time

10: 0.2 sec
11: 1.7sec
12: 2.8 sec
13: 9.7 sec [...]
19: 57.3 sec

Sincerely, Felix Bölz.

prasser commented 6 years ago

Dear Felix,

thanks for your interest in ARX! This "issue" is not related to #54 but with the runtime complexity of the underlying optimal search algorithm (execution time grows exponentially with the number of attributes). Please see our publications for further details. To mitigate this, you can use the heuristic algorithm implemented in ARX. You can find an example, in Example34.java

Best Fabian

felix134 commented 6 years ago

Thank you very much for your answer!

arx-deidentifier / arx

Performance drop for sets with 11+ attributes (QIDs) #209