diffix / pg_diffix

Implementation of the Open Diffix anonymization mechanism for PostgreSQL.
https://www.open-diffix.org
MIT License
18 stars 2 forks source link

Update flattening mechanism to match `reference` #138

Closed pdobacz closed 2 years ago

pdobacz commented 2 years ago

I think that this is the issue which initially caused the prop tests from https://github.com/diffix/reference/pull/302 to fail.

Example query which diverges:

reference

dotnet run --project src/OpenDiffix.CLI/ -- --aid-columns customers.id --low-threshold 2 --low-mean-gap 0 --layer-noise-sd 0 --low-layer-sd 0 --outlier-count 1 1 --top-count 6 6  -f data/data.sqlite -q "SELECT company_name, count(*), count(distinct id) as count_distinct FROM customers GROUP BY 1 ORDER BY 1 ;"

"company_name","count","count_distinct"
"Alpha Centauri Inc.",60,60
"Beta Centauri Ltd",98,98
"Gamma Centauri GmbH",42,42
"Outlier Inc.",2,2

pg_diffix

prepare matching anon params (user must have anonymizing access, but be SUPERUSER)

UPDATE pg_settings SET setting = 0 WHERE name = 'pg_diffix.noise_sigma';
UPDATE pg_settings SET setting = 1 WHERE name = 'pg_diffix.outlier_count_min'; UPDATE pg_settings SET setting = 1 WHERE name = 'pg_diffix.outlier_count_max'; UPDATE pg_settings SET setting = 6 WHERE name = 'pg_diffix.top_count_min'; UPDATE pg_settings SET setting = 6 WHERE name = 'pg_diffix.top_count_max';

query:

SELECT company_name, count(*), count(distinct id) as count_distinct FROM customers GROUP BY 1 ORDER BY 1 ;

    company_name     | count | count_distinct 
---------------------+-------+----------------
 Alpha Centauri Inc. |    60 |             60
 Beta Centauri Ltd   |    98 |             98
 Gamma Centauri GmbH |    42 |             42
 Outlier Inc.        |       |               
(4 rows)

(NOTE the difference in the last row 2 vs Null)

The pg_diffix used uses a workaround for #134, contained in this piotr/spike-property-tests branch

(NOTE 2 the top_count and outlier_count param values cause the flattening to be noiseless)

pdobacz commented 2 years ago

I'm going to trim this down to anon_count and anon_count_any, so without anon_count_distinct. The latter will be up to a new issue