gda-score / code

Tools for generating General Data Anonymity Scores (www.gda-score.org)
MIT License
7 stars 1 forks source link

uber_db, return synthesized sql #68

Open yoid2000 opened 3 years ago

yoid2000 commented 3 years ago

@reinhardmunz

uber_db generates its own sql from the sql supplied by the analyst.

it would be useful to see what that sql is.

Could you look into how easy or hard it would be to return that sql along with the answer?

yoid2000 commented 3 years ago

Actually, there is a bit more to this issue.

I am sending this SQL query to the uber_db:

       select count(*)
       from accounts
       where 1/(acct_district_id - 11) = 1

If I look at the log of the postgres database (suppled by Pascal) I see this entry:

2020-11-10 10:57:07.464 CET [46127] rohan@rhrk.uni-kl.de@raw_banking[139.19.217.23] ERROR:  division by zero
2020-11-10 10:57:07.464 CET [46127] rohan@rhrk.uni-kl.de@raw_banking[139.19.217.23] STATEMENT:  
           select count(*)
           from accounts
           where 1/(acct_district_id - 11) = 1

In other words, a divide-by-zero error was triggered. However, this is what the uber_db app returns:

{   'answer': [[4213695455.074402]],
    'cells': 1,
    'query': {   'cache': True,
                 'epsilon': 1e-09,
                 'sql': '\n'
                        '       select count(*)\n'
                        '       from accounts\n'
                        '       where 1/(acct_district_id - 11) = 1\n'
                        '    '},
    'remaining_dp_budget': 0.499999999,
    'stillToCome': 0}

In other words, the fact that there was an error is hidden, and an answer is returned instead.

On the other hand, if I execute this:

       select count(*)
       from no_table
       where some_garbage

Then I do see an error message from uber_db. (I still need to check if it is being executed on the postgres db or if uber is generating the error itself)

reinhardmunz commented 3 years ago

On first thought I would say that this might be intentional behavior because otherwise you could reveal secret data through an error. Differential privacy demands that that there are no outputs like errors. So if the query is syntactically and schema-wise ok it must run and produce a numerical output. Otherwise you could check for values by triggering the error. Meaning if value is X there is the error and if it is Y there is no error.

Your query for some_garbage should fail as it does not pass the schema check.

Similarly I think the problem with cli_district_id is the schema check, just that there is a typo in the schema file of the Uber tool. I will fix that now.

reinhardmunz commented 3 years ago

I would not recommend to change the Uber tool. That looks like a lot of work for the desired feature.

For debugging purposes the queries are in the log files.

Example of flask-server log:

Nov 10 15:22:43 srv-53-01 python3.6[14811]: JSON sent by client: {'sid': 437948228, 'dbname': 'raw_banking', 'budget': '6.0', 'query': 'Select count(*) from accounts', 'epsilon': '2.0'}
Nov 10 15:22:43 srv-53-01 python3.6[14811]: Session 437948228: Client provided epsilon: 2.0
Nov 10 15:22:43 srv-53-01 python3.6[14811]: Session 437948228: JSON Request written to /home/flask/files/jsonreq/data437948228.json is {'query': 'Select count(*) from accounts', 'budget': '6.0', 'epsilon': '2.0', 'sid': 437948228, 'used_budget': 3.0, 'dbname': 'raw_bankin
Nov 10 15:22:48 srv-53-01 python3.6[14811]: Session 437948228: Reading result file with most recent inode metadata changes (oldest ctime): /home/flask/files/noisyres/result2020-11-10_15-22-44.txt
Nov 10 15:22:48 srv-53-01 python3.6[14811]: Session 437948228: Returning result={5369.448218592423}
Nov 10 15:22:48 srv-53-01 python3.6[14811]: 127.0.0.1 - - [10/Nov/2020 15:22:48] "GET /data HTTP/1.1" 200 -

Example of ElasticSense (Uber tool) log:

Nov 10 15:22:43 srv-53-01 java[14735]: /home/flask/files/jsonreq/data437948228.json
Nov 10 15:22:44 srv-53-01 java[14735]: Query sent by client: Select count(*) from accounts
Nov 10 15:22:44 srv-53-01 java[14735]: Private result: 5369.0
Nov 10 15:22:44 srv-53-01 java[14735]: Noisy result: 5369
Nov 10 15:22:44 srv-53-01 java[14735]: File Created!

Postgres only logs queries if there is an error. Example:

2020-11-10 14:48:42.625 CET [10005] rohan@rhrk.uni-kl.de@raw_banking[139.19.217.23] ERROR:  division by zero
2020-11-10 14:48:42.625 CET [10005] rohan@rhrk.uni-kl.de@raw_banking[139.19.217.23] STATEMENT:  
               select count(*)
               from accounts
               where 1/(acct_district_id - 11) = 1