Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform
2 stars 0 forks source link

group by grouping sets - adding a subgroup can change unrelated results. #4826

Open dandanlen opened 3 years ago

dandanlen commented 3 years ago

Not sure if this is a bug or a feature... either way I don't think it adds any privacy benefit since it can be worked around by decomposing the group by grouping sets query into multiple group by statements.

The following were run against https://demo.aircloak.com/data_sources/NAV_W1_TENANT_WS001.

                select
                    grouping_id(
                        "Quantity"
                    ),
                    "Quantity",
                    count(*),
                    count_noise(*)
                from "SalesCrMemoLine"
                group by grouping sets ((2))

☝️ This is equivalent to just group by 2 and I get the same result whichever way I write the query statement. So far, so good. Link

                select
                    grouping_id(
                        "Type","Quantity"
                    ),
                    "Type","Quantity",
                    count(*),
                    count_noise(*)
                from "SalesCrMemoLine"
                group by grouping sets ((2), (3))

☝️ This is equivalent to the union of group by 2 and group by 3 and I get the results I expect: The results contain the same two rows as in the query above (with different grouping_ids). Link

Now if instead I add a column with the bucket functions something weird happens.

                select
                    grouping_id(
                        bucket ("Line No_" by 2000.0),"Quantity"
                    ),
                    bucket ("Line No_" by 2000.0),"Quantity",
                    count(*),
                    count_noise(*)
                from "SalesCrMemoLine"
                group by grouping sets ((2), (3))

☝️ The two rows for the "Quantity" grouping are still there, but the numbers are totally different. It looks like the mere fact of including another group has added extra noise to the unrelated column.

In other examples, this can even cause a column to be suppressed or removed entirely from the result set.

TLDR; group by grouping sets ought to be equivalent to the union of the individual group by ${subgroup} queries. But in some cases adding an extra (and unrelated) grouping set can change results from other grouping sets.

cristianberneanu commented 3 years ago

Nice find!

Each grouping expression generates noise layers, which should apply only to those buckets containing it. It seems the Cloak always applies these noise layers globally, to all the buckets, returning different noise values for a bucket which should not be affected by unrelated grouping expressions.