Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform
2 stars 0 forks source link

Stuck in processing - regrouping in anon slow? #2396

Closed sebastian closed 6 years ago

sebastian commented 6 years ago

When there are a high number of buckets we spend a disproportionate amount of time in the "processing" (well anonymization) stage.

Take the following query on the Taxi database on the demo system:

SELECT 
  "jan08"."pickup_latitude" AS "pickup_latitude",
  "jan08"."pickup_longitude" AS "pickup_longitude"
FROM "jan08"
GROUP BY 1, 2

My hunch is that this is due to the fact that we are attempting to regroup the results to avoid anonymizing away all the data. This needs some optimizing before release!

Here some execution traces:

From our attack cloak (running 18.1.2):

Total query time: 5:30min, of those 3:45min were spent in processing low count users.

13:21:36.806 [debug] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 Running statement `SELECT
  "jan08"."pickup_latitude" AS "pickup_latitude",
  "jan08"."pickup_longitude" AS "pickup_longitude"
FROM "jan08"
GROUP BY 1, 2` ...
13:21:36.806 [debug] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 Query 7915a488-f378-42d8-95b3-2c3f372c3f56 state changed to: parsing...
13:21:36.856 [debug] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 Query 7915a488-f378-42d8-95b3-2c3f372c3f56 state changed to: compiling...
13:21:36.859 [debug] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 Connecting to `taxi` ...
13:21:36.859 [debug] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 Query 7915a488-f378-42d8-95b3-2c3f372c3f56 state changed to: awaiting_data...
13:21:36.872 [debug] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 Selecting data ...
13:21:36.873 [debug] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 Processing final rows ...
13:21:36.874 [debug] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 Processing data using 2 processes ...
13:21:37.756 [debug] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 Query 7915a488-f378-42d8-95b3-2c3f372c3f56 state changed to: ingesting_data...
13:22:09.897 [debug] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 Integrating partial results ...
13:22:24.905 [debug] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 Initializing anonymizer ...
13:22:24.905 [debug] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 Query 7915a488-f378-42d8-95b3-2c3f372c3f56 state changed to: processing...
13:23:22.613 [debug] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 Processing low count users ...
13:27:07.763 [debug] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 Aggregating groups ...
13:27:09.462 [debug] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 Making explicit buckets ...
13:27:09.464 [debug] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 Query 7915a488-f378-42d8-95b3-2c3f372c3f56 state changed to: post_processing...
13:27:10.096 [debug] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 Disconnecting ...
13:27:10.390 [info] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 JSON_LOG {"type":"query_complete","status":"Successful.","query_id":"7915a488-f378-42d8-95b3-2c3f372c3f56","execution_time":333583}
13:27:10.394 [info] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 sending query result to Air
13:27:10.396 [info] query_id=7915a488-f378-42d8-95b3-2c3f372c3f56 query result sent

From a demo cloak (running master):

Total query time: 7:50min, of those 6:12min were spent in processing low count users.

13:19:00.560 [debug] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 Running statement `SELECT
  "jan08"."pickup_latitude" AS "pickup_latitude",
  "jan08"."pickup_longitude" AS "pickup_longitude"
FROM "jan08"
GROUP BY 1, 2` ...
13:19:00.560 [debug] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 Query 55fe4bc5-6add-459f-8eb0-d15c2b986a06 state changed to: parsing...
13:19:00.603 [debug] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 Query 55fe4bc5-6add-459f-8eb0-d15c2b986a06 state changed to: compiling...
13:19:00.607 [debug] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 Acquiring connection to `Taxi` ...
13:19:00.608 [debug] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 Query 55fe4bc5-6add-459f-8eb0-d15c2b986a06 state changed to: awaiting_data...
13:19:00.624 [debug] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 Selecting data ...
13:19:00.626 [debug] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 Processing final rows ...
13:19:00.629 [debug] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 Processing data using 8 processes ...
13:19:01.393 [debug] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 Query 55fe4bc5-6add-459f-8eb0-d15c2b986a06 state changed to: ingesting_data...
13:19:08.338 [debug] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 Integrating partial results ...
13:19:33.717 [debug] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 Initializing anonymizer ...
13:19:33.717 [debug] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 Query 55fe4bc5-6add-459f-8eb0-d15c2b986a06 state changed to: processing...
13:20:24.254 [debug] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 Processing low count users ...
13:26:36.432 [debug] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 Aggregating groups ...
13:26:47.065 [debug] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 Making explicit buckets ...
13:26:49.025 [debug] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 Query 55fe4bc5-6add-459f-8eb0-d15c2b986a06 state changed to: post_processing...
13:26:49.038 [info] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 JSON_LOG {"type":"query_complete","status":"Successful.","query_id":"55fe4bc5-6add-459f-8eb0-d15c2b986a06","execution_time":468478}
13:26:49.161 [warn] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 operation `:encode_chunks` took 116ms
13:26:49.161 [info] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 sending query result to Air
13:26:49.560 [info] query_id=55fe4bc5-6add-459f-8eb0-d15c2b986a06 query result sent
cristianberneanu commented 6 years ago

AFAIK, sequential *-grouping was not included in release 18.1.* and it is present only in the current master. This seems to be a problem with the classical part of processing LCF-buckets, which is slow because of the need to partition the buckets and then merge the low-count ones.

sebastian commented 6 years ago

Correct, the sequential *-grouping only exists in the master version (i.e. the demo system, and not the attack system). And in the demo system (i.e. master) the "processing low count" users takes 80% of the total query execution time, whereas in the 18.1.* version it takes 68% of the total time.

So in other words: it takes a very high fraction of the overall time in both cases, but significantly more in the master version which has the sequential *-grouping.

cristianberneanu commented 6 years ago

Unfortunately, there isn't much that I can do here. Combining buckets is slow and the new *-grouping algorithm increases the amount of merges needed. I can revert the change, if you think it is not worth the cost.

sebastian commented 6 years ago

No, don’t revert. Let’s however keep it in mind as an area that would benefit from optimisations!

cristianberneanu commented 6 years ago

@sebastian once the above is merged, please check again if the performance for low-count aggregation is acceptable. On my machine, the time dropped from 21s to 6s. If this version is still too slow, the only option remaining is to limit the column depth at which the low-count checks are done.

sebastian commented 6 years ago

It seems the improvements are quite good. I think we might want to adjust the depth too, but for the moment I think we are fine with the way it is.

Before:

overhead-master

After:

overhead-branch