Closed: sebastian closed this issue 6 years ago
AFAIK, sequential *-grouping was not included in release 18.1.* and it is present only in the current master. This seems to be a problem with the classical part of processing LCF-buckets, which is slow because of the need to partition the buckets and then merge the low-count ones.
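To make the partition-then-merge step concrete, here is a minimal sketch. It is not the cloak's actual implementation: the bucket representation (a tuple of column values mapped to a set of user ids), the threshold value, and the function name `merge_low_count` are all assumptions made for illustration, and the merge simply replaces the last column with `*`.

```python
from collections import defaultdict

# Hypothetical threshold; the thread does not state the real value.
LOW_COUNT_THRESHOLD = 3


def merge_low_count(buckets, threshold=LOW_COUNT_THRESHOLD):
    """Partition buckets by user count, then merge the low-count ones.

    `buckets` maps a tuple of column values to the set of distinct user
    ids that contributed to that bucket (an assumed representation).
    """
    kept, low = {}, {}
    # Step 1: partition into buckets that pass the threshold and those that do not.
    for key, users in buckets.items():
        (kept if len(users) >= threshold else low)[key] = users

    # Step 2: merge low-count buckets that share all columns but the last,
    # replacing the last column with "*".
    merged = defaultdict(set)
    for key, users in low.items():
        merged[key[:-1] + ("*",)] |= users

    # Step 3: a merged bucket is only reported if it now passes the threshold.
    for key, users in merged.items():
        if len(users) >= threshold:
            kept[key] = users
    return kept


buckets = {
    ("a", "x"): {1, 2, 3},
    ("a", "y"): {4},
    ("a", "z"): {5, 6},
    ("b", "x"): {7},
}
result = merge_low_count(buckets)
```

In this toy input, the two low-count `a` buckets combine into `("a", "*")` with three users and survive, while the lone `b` bucket is still too small and is dropped. With many buckets, both the partitioning pass and the set unions in the merge pass grow with the bucket count, which is one plausible reason this stage dominates the query time.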
Correct, the sequential *-grouping only exists in the master version (i.e. the demo system, and not the attack system). And in the demo system (i.e. master) "processing low count users" takes 80% of the total query execution time, whereas in the 18.1.* version it takes 68% of the total time.
So in other words: it takes a very high fraction of the overall time in both cases, but significantly more in the master version, which has the sequential *-grouping.
Unfortunately, there isn't much that I can do here. Combining buckets is slow and the new *-grouping algorithm increases the number of merges needed. I can revert the change, if you think it is not worth the cost.
No, don’t revert. Let’s however keep it in mind as an area that would benefit from optimisations!
@sebastian once the above is merged, please check again if the performance for low-count aggregation is acceptable. On my machine, the time dropped from 21s to 6s. If this version is still too slow, the only option remaining is to limit the column depth at which the low-count checks are done.
It seems the improvements are quite good. I think we might want to adjust the depth too, but for the moment I think we are fine with the way it is.
Before:
After:
When there is a high number of buckets, we spend a disproportionate amount of time in the "processing" (well, anonymization) stage.
Take the following query on the Taxi database on the demo system:
My hunch is that this is due to the fact that we are attempting to regroup the results to avoid anonymizing away all the data. This needs some optimizing before release!
Here are some execution traces:
From our attack cloak (running 18.1.2): Total query time: 5:30 min, of which 3:45 min were spent in processing low count users.
From a demo cloak (running master): Total query time: 7:50 min, of which 6:12 min were spent in processing low count users.
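As a quick check of what these traces mean in relative terms, the following small sketch converts the mm:ss timings to seconds and computes the share spent in low-count processing. The helper names are made up for illustration; the timings are the ones quoted in the traces.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string such as '5:30' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def low_count_share(total, low_count):
    """Fraction of the total query time spent processing low-count users."""
    return to_seconds(low_count) / to_seconds(total)


traces = [
    ("18.1.2 (attack cloak)", "5:30", "3:45"),
    ("master (demo cloak)", "7:50", "6:12"),
]
for label, total, low_count in traces:
    print(f"{label}: {low_count_share(total, low_count):.0%}")
# prints:
# 18.1.2 (attack cloak): 68%
# master (demo cloak): 79%
```

This works out to roughly 68% for 18.1.2 and just under 80% for master.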