diffix / reference

Reference implementation for the Open Diffix anonymization mechanism.
https://www.open-diffix.org
Other
3 stars 0 forks source link

Minor unrelated tweaks #314

Closed pdobacz closed 2 years ago

pdobacz commented 2 years ago

A loose collection of tweaks I gathered during past several days.

If the one about collecting aggregators is not worth it for any reason I'm missing, please let me know, we'll drop it

pdobacz commented 2 years ago

I don't understand why we need to collect all aggregators in bulk from a query tree. When (or if) we will need to support pre-processing subqueries, we will need to treat pre-processing, anonymizing, post-processing aggregators differently. For now, I think we should make reference always anonymize the bottom most subqueries and worry about pre-processing sensitive data later (once we fully support all the clause in a single query).

This might be an answer to the question asked in the PR description. I'm not sure we need this at all. The question is, should queries like those added in the tests, be valid or not, given current implementation of reference?

I'm guessing, this particular validation pass we're discussing here, aims to protect the aggregation hooks from some pathological queries - are we OK with letting them through?

cristianberneanu commented 2 years ago

The question is, should queries like those added in the tests, be valid or not, given current implementation of reference?

I don't think they should be added yet, because reference is not yet designed to handle these cases, so they would be testing whatever behavior (correct or not) currently exists.

pdobacz commented 2 years ago

After sleeping on it, I think, the added test cases are testing correct behavior being: "Post-aggregation hooks are being protected from queries containing multiple LCF aggregators. Instead of producing (possibly) nonsense results, they reject the query and produce a meaningful error". The behavior was incomplete (didn't descend to subqueries/joins), and the change fills this gap.

Is this reasoning correct?

edongashi commented 2 years ago

The hooks also do a execution time check by calling AggregationContext.lowCountIndex which allows only 0 or 1 matches.

pdobacz commented 2 years ago

The hooks also do a execution time check by calling AggregationContext.lowCountIndex which allows only 0 or 1 matches.

oh, so do we need this check validateSingleLowCount here at all :) ?

edongashi commented 2 years ago

We need it if we want to fail early during analysis. Otherwise we have to wait until all the buckets have been built during execution which can take a while.

pdobacz commented 2 years ago

ok since this seems to require more discussion, I'm splitting of the validateSingleLowCount out to a separate branch and keeping only truly minor tweaks here