Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform

Plain floating (or probing in disguise) #2715

Open obrok opened 6 years ago

obrok commented 6 years ago

I'm creating this issue to discuss, and hopefully generate a useful definition of floating, which seems to be the baseline of what we need to do to improve anonymization.

obrok commented 6 years ago

@yoid2000 @sebastian How did you imagine the interaction between floating and GROUP BY? Let's take this query:

SELECT SUM(things)
FROM (
  SELECT uid, count(a) AS things
  FROM table
  WHERE b <> 3
  GROUP BY uid
) x

It seems to me that what would need to happen is we would float b and issue the query without the WHERE b <> 3. Next, if the users with b = 3 are low count we act as if the condition was never there (no noise layer, etc.). However, because there is an aggregation in the subquery, it seems like the only option to do this correctly would be to not aggregate in the database, but instead fetch all rows and compute count(a) in the cloak after deciding whether the b <> 3 condition is valid. This will probably have some heavy performance repercussions, beyond just fetching the rows where b = 3.
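
Concretely, the cloak might then fetch something like this instead (a sketch; both the condition and the aggregation are stripped, and b is floated):

-- hypothetical fetch: condition and aggregation removed, b floated
SELECT uid, a, b
FROM table

and decide in the cloak whether b <> 3 stands, filter rows accordingly, and compute count(a) per uid itself.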

sebastian commented 6 years ago

Your description of how it would have to happen seems to match my expectations... however you are absolutely right that the performance implications would be quite severe. @yoid2000 do you have any shortcuts in mind?

yoid2000 commented 6 years ago

One idea might be to borrow an idea from #2561 and use a CASE statement to pull up computations for both with and without the condition. An example is below (I haven't checked this for correctness, though all the examples in #2561 have been checked). But even if incorrect you get the idea.

SELECT SUM(things_without_cond),       -- use this sum if condition is LCF
       SUM(things_with_cond),          -- use this sum if condition is not LCF
       SUM(num_with_cond)              -- use this to decide if condition is LCF or not
FROM (
  SELECT uid, count(a) AS things_without_cond, 
         count(a_with_cond) AS things_with_cond,
         max(lcf) AS num_with_cond
  FROM (
    SELECT uid, a,
           CASE WHEN b = 3 THEN 1
                ELSE 0
           END AS lcf,
           CASE WHEN b <> 3 THEN a
                ELSE NULL
           END AS a_with_cond
    FROM table
   ) t1
   GROUP BY uid
) x
yoid2000 commented 6 years ago

Well, with the above, the outer SELECT would be computed in the cloak. The inner select would be sent to the database if SQL ....
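
Schematically, the cloak-side decision would then be along these lines (a sketch, untested; fetched_rows and lcf_threshold are placeholders):

SELECT CASE WHEN SUM(num_with_cond) >= lcf_threshold
            -- enough users are affected: the condition stands
            THEN SUM(things_with_cond)
            -- LCF: act as if the condition was never there
            ELSE SUM(things_without_cond)
       END AS final_sum
FROM fetched_rows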

sebastian commented 6 years ago

How does this scale for multiple conditions? I.e.

SELECT SUM(things)
FROM (
  SELECT uid, count(a) AS things
  FROM table
  WHERE b NOT IN (1, 2, 3, 4, 5, 6, 7, ...) and c <> 'bar'
  GROUP BY uid
) x

Would we create a count per condition? And then detect which are low count in the cloak and subsequently adjust the counts?

I.e. something like:

  SELECT 
    uid, count(a) AS things,
    sum(b1) as count_b1,
    sum(b2) as count_b2, 
    ...
    sum(b7) as count_b7, 
    sum(cbar) as count_cbar
  FROM (
    SELECT uid, a,
           CASE WHEN b = 1 THEN 1 ELSE 0 END AS b1,
           CASE WHEN b = 2 THEN 1 ELSE 0 END AS b2,
           ...
           CASE WHEN b = 7 THEN 1 ELSE 0 END AS b7,
           CASE WHEN c = 'bar' THEN 1 ELSE 0 END AS cbar
    FROM table
   ) t1
   GROUP BY uid
obrok commented 6 years ago

I'm wondering if we shouldn't write a prototype that would not try to push any operations into the DB (except maybe SAMPLE_USERS). It might turn out that doing everything in the cloak is faster or comparable after you factor in all the hoops we have to jump through to be able to push into DB.

sebastian commented 6 years ago

I'm wondering if we shouldn't write a prototype that would not try to push any operations into the DB (except maybe SAMPLE_USERS).

You mean specifically for floating only?

I suppose it really depends on the frequency at which we have to do floating at all. If we get good at detecting when it is not needed, then a "simpler approach" which performs it in the cloak might be fine, and may well be preferable due to the reduced complexity.

Of course it's hard to get a good sense of how well this works. We lack data to make good decisions:

DZ bank for example has about 6TB of compressed data. If they write an unlucky query we are in deep shit.

yoid2000 commented 6 years ago

When there are multiple negands, then you can do the CASE trick for each negand as you suggest, but you can't realistically compute the output for every possible combination of LCF negands. So if more than one negand is LCF, then you would need to drop those negands and re-query. This is how it would work in the no-uid approach as well.
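
For concreteness, taking the multi-negand example above and supposing the probes show that only the b = 3 and b = 4 negands are LCF, the re-query would simply be the original query with those negands dropped:

SELECT SUM(things)
FROM (
  SELECT uid, count(a) AS things
  FROM table
  -- negands b = 3 and b = 4 dropped, i.e. treated as if they were never there
  WHERE b NOT IN (1, 2, 5, 6, 7, ...) and c <> 'bar'
  GROUP BY uid
) x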

sebastian commented 6 years ago

When there are multiple negands, then you can do the CASE trick for each negand as you suggest, but you can't realistically compute the output for every possible combination of LCF negands.

Why do you need to do the combinations though? If you count each negand individually, as well as the total count, you should easily be able to see which conditions can be included and then adjust the count?

sebastian commented 6 years ago

Well, I guess it somehow multiplies if you have many nested subqueries..?

For example the subquery:

SELECT uid, aggregate(...)
FROM (
  SELECT uid, a, aggregate(...)
  FROM table
  WHERE b NOT IN (1, 2, 3)
  GROUP BY 1, 2
) t
WHERE a NOT IN (4, 5, 6)
GROUP BY uid
yoid2000 commented 6 years ago

sorry pushed the wrong button

obrok commented 6 years ago

When there are multiple negands, then you can do the CASE trick for each negand as you suggest, but you can't realistically compute the output for every possible combination of LCF negands. So if more than one negand is LCF, then you would need to drop those negands and re-query. This is how it would work in the no-uid approach as well.

What this seems to tell me is that it's not floating but probing that is the most basic thing we need? Perhaps with one giant probe that probes all conditions at the same time?
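
Something along these lines, maybe (a sketch, untested): a single query reporting, for each condition of the earlier example, how many distinct users it would exclude:

-- hypothetical single probe covering all conditions at once
SELECT
  count(distinct CASE WHEN b = 1 THEN uid END) AS users_b1,
  count(distinct CASE WHEN b = 2 THEN uid END) AS users_b2,
  -- ... one probe per negand ...
  count(distinct CASE WHEN c = 'bar' THEN uid END) AS users_cbar
FROM table

The cloak would compare each count against the LCF threshold and decide which conditions stand before computing (or re-issuing) the actual query.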

sebastian commented 6 years ago

What this seems to tell me is that it's not floating but probing that is the most basic thing we need? Perhaps with one giant probe that probes all conditions at the same time?

I think you are right. It's really a probe we want. The trickery (and complexity) above stems from the fact that probes can be prohibitively expensive to run (say for TeamBank, where a single query takes on the order of half an hour due to all the data being loaded into the cloak) and we therefore try to bake the probes into the normal query itself.

sebastian commented 6 years ago

Well... I guess we really want probes all the way... Whether they are satisfied by clever query magic, a shadow table, or other heuristics then becomes query and implementation dependent... and floating is really just a hack to achieve probing too.

obrok commented 6 years ago

The trickery (and complexity) above stems from the fact that probes can be prohibitively expensive to run (say for TeamBank, where a single query takes on the order of half an hour due to all the data being loaded into the cloak) and we therefore try to bake the probes into the normal query itself.

But if we run a more complex query to maybe avoid the probe and then have to probe anyway, then that's even worse - difficult to say in the abstract which case will win out in practice.

yoid2000 commented 6 years ago

I think sooner or later we'll need to be able to make multiple trips to the database to satisfy a single query. If nothing else, this is likely to come up with some of the more complex ML or statistical operations. So I think it makes sense to accept this early on and build it into the framework.

obrok commented 6 years ago

I think sooner or later we'll need to be able to make multiple trips to the database to satisfy a single query. If nothing else, this is likely to come up with some of the more complex ML or statistical operations. So I think it makes sense to accept this early on and build it into the framework.

If we need to make N trips to satisfy a single query, that suggests we would need to make N * M trips to satisfy an ML operation requiring M normal queries, exacerbating the problem.

sebastian commented 6 years ago

But if we run a more complex query to maybe avoid the probe and then have to probe anyway, then that's even worse - difficult to say in the abstract which case will win out in practice.

I do not understand the part about "... and then have to probe anyway". The complex query would include the probe already (or rather be designed with the explicit purpose of avoiding separate probes), so would not require further secondary queries?

Of course what performs better is hard to say generically and for all cases, but we certainly already know of cases where multiple sequentially executed queries are going to be detrimental to performance.

I think sooner or later we'll need to be able to make multiple trips to the database to satisfy a single query. If nothing else, this is likely to come up with some of the more complex ML or statistical operations. So I think it makes sense to accept this early on and build it into the framework.

I do not agree with this statement. Whether we need to do multiple queries for machine learning queries or not is altogether different from whether we need to make multiple queries for descriptive queries. ML and descriptive queries are two very distinct operations, and we should not confuse or mix them! One comes with the expectation of rapid results (namely the descriptive queries) whereas machine learning queries do not. I really do not want us to build multi-query support mechanisms for our descriptive queries under the unvalidated assumption that it might make it easier to do machine learning stuff down the line.

If we need to make N trips to satisfy a single query, that suggests we would need to make N * M trips to satisfy an ML operation requiring M normal queries, exacerbating the problem.

I believe @yoid2000 might be thinking of cases where we could integrate the machine learning aspects inside the cloak directly. However if so, then that is still some way out, and furthermore likely to require rather different mechanisms than what we have today, and in particular different mechanisms than we need to solve the problem at hand.

obrok commented 6 years ago

I do not understand the part about "... and then have to probe anyway".

@yoid2000 said:

So if more than one negand is LCF, then you would need to drop those negands and re-query. This is how it would work in the no-uid approach as well.

From this I gather that sometimes we would detect after floating that it wasn't enough and need to issue another probe.

I really do not want us to build multi-query support mechanisms for our descriptive queries under the un-validated assumption that it might make it easier to do machine learning stuff down the line.

I totally agree with this point. Trying to predict what will be useful for some nebulous future goal is a great way to overengineer and spend time spinning your wheels.

sebastian commented 6 years ago

Of the queries we have recorded for TeamBank there is only one that has a negand in a subquery with an aggregate:

https://github.com/Aircloak/aircloak/blob/master/cloak/test/supported_queries/teambank_test.exs#L274-L311:

    SELECT
      bucket(avg_betrag by 500 align middle) AS income_class,
      bucket(avg_kontostand by 500 align middle) AS kontostand,
      count(distinct inhaberId) AS num_Acc,
      count_noise(distinct inhaberId)
    FROM (
      SELECT
        konto.inhaberId,
        avg(betrag) as avg_betrag,
        avg(kontostand.betrag) as avg_kontostand
      FROM konto INNER JOIN bankzugang
        ON konto.inhaberId = bankzugang.inhaberId AND
          blz NOT IN ('90090042', '47110815') AND
          konto.bankzugangId = bankzugang._id
      INNER JOIN (
        SELECT
          inhaberId,
          kontoId,
          betrag
        FROM umsatz
        WHERE
          UPPER(umsatzeigenschaften.spezifizierung) IN
            ('LOHN', 'GEHALT', 'LOHN/GEHALT', 'BEZÜGE (BEAMTE)', 'BEZÜGE',
            'LOHN_GEHALT', 'BEZUEGE_BEAMTE', 'BEZUEGE') AND
          buchungsDatum BETWEEN '2017-10-29' AND '2017-11-28'
      ) as umsatz ON umsatz.inhaberId = konto.inhaberId AND
        umsatz.kontoId = konto._id
      WHERE konto.letzterUmsatz.buchungsDatum BETWEEN '2017-10-29' AND '2017-11-28'
      GROUP BY konto.inhaberId
    ) average_values
    GROUP BY income_class, kontostand
    ORDER BY income_class, kontostand

This is a case where a shadow table might in fact have helped (or maybe not...)


We really lack good data! This is a problem! And with the query stats we are collecting (and the ways we can query it) we cannot answer these types of questions at all...

yoid2000 commented 6 years ago

What is blz?

sebastian commented 6 years ago

It's the bank routing number. I.e. before IBAN came into effect, one used the bank account number in combination with the routing number, which identified the bank. Now it's all baked into the IBAN I think... but somehow it's still relevant.

yoid2000 commented 6 years ago

So in your query they decided to remove two banks from the answer? If there were more than 20 or so banks, then there is a good chance that the shadow table would have decided to include the negands.

sebastian commented 6 years ago

There are hundreds of distinct banks in this case. But the table would probably still have had entries for them.

cristianberneanu commented 6 years ago

I am not sure if they know, but if they always want to exclude these two banks, it might be a lot better for them to do it via a virtual table. But we need to test that solution first; I don't think virtual tables and projected tables interact with one another very well.
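
For illustration (a sketch; this assumes blz lives on bankzugang and that a virtual table is just a SQL definition), the virtual table could be something like:

-- virtual table that bakes the exclusion in once
SELECT *
FROM bankzugang
WHERE blz NOT IN ('90090042', '47110815')

Analyst queries against the virtual table would then never need the negands at all.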

sebastian commented 6 years ago

I am not sure if they know, but if they always want to exclude these two banks, it might be a lot better for them to do it via a virtual table. But we need to test that solution first; I don't think virtual tables and projected tables interact with one another very well.

Yes, they would be very well served by excluding them everywhere! We should test and tell them how to do it!

sebastian commented 6 years ago

Re CASE statements for probing purposes: I did some thinking on paper. I have to do more thinking across subquery boundaries etc., but I have noted down preliminary findings on how many case statements (columns) one would need to compute and load out of the database for POSORs or NEGANDs.

I am not sure what the conclusion here is at the moment, other than that it might be a feasible approach, and that it certainly would generate rather hairy SQL.

I'll think through how the probe count is affected by subqueries later...

yoid2000 commented 6 years ago

@sebastian this analysis assumes you want to case-probe for all possible LCF combinations right? (Versus having to make a second query after you find out which posors or negands are individually low count)

PF

sebastian commented 6 years ago

@sebastian this analysis assumes you want to case-probe for all possible LCF combinations right? (Versus having to make a second query after you find out which posors or negands are individually low count)

Either/or. If you look at the 4th version you'll see I can check for low effect with |a| probes, but that |ab| + |a| + |b| probes/case statements are needed if you want to avoid an additional subsequent query.

yoid2000 commented 6 years ago

So what this says to me is that we can't really avoid the need to sometimes make a second query, but as long as we are willing to do that, the number of case-probes scales linearly with the number of negands and posors?

sebastian commented 6 years ago

So what this says to me is that we can't really avoid the need to sometimes make a second query,

I think we really should try to avoid doing second queries! For a small number of filter terms it's absolutely viable to add more probes, rather than require second queries? Well, we should measure these things of course.

but as long as we are willing to do that, the number of case-probes scales linearly with the number of negands and posors?

Not necessarily true. I still haven't looked into sub-queries and how this works across aggregate boundaries in subqueries etc. I did the absolute simplest and easiest cases first so I could go into the weekend with some level of optimism.

yoid2000 commented 6 years ago

Not necessarily true. I still haven't looked into sub-queries and how this works across aggregate boundaries in subqueries etc. I did the absolute simplest and easiest cases first so I could go into the weekend with some level of optimism.

To make sure we aren't talking at cross purposes, let's think in terms of discovering low-effect conditions, and computing answers to account for low-effect conditions.

I believe that for discovery, it is sufficient to have one case-probe per condition, because for each condition we are independently trying to discover if it is low-effect. Sub-queries etc. shouldn't matter for just discovery.

If you want to do computing in one query, then you need to account for combinations of conditions in the case-probes. If you are willing to do computing in two queries, then you don't.

sebastian commented 6 years ago

I believe that for discovery, it is sufficient to have one case-probe per condition, because for each condition we are independently trying to discover if it is low-effect.

When you have multiple posors on distinct columns you need |ab| etc. probes just for discovery.

Sub-queries etc. shouldn't matter for just discovery.

The sub queries need probes too, particularly where the condition is in a sub-query! So I think discovery matters as much in sub queries as in top-level queries.

Take the following query as an example:

SELECT SUM(ind_c) FROM (
  SELECT uid, count(*) as ind_c
  FROM table
  WHERE c IN (1, 2, 3) and d IN (3, 4, 5)
  GROUP BY uid
) t

If you want to do computing in one query, then you need to account for combinations of conditions in the case-probes. If you are willing to do computing in two queries, then you don't.

True for some of the condition types, but unfortunately not generally true. See the posor case.

sebastian commented 6 years ago

Hm... how does this generalize for negands and posors in JOIN-conditions? Maybe it's just a matter of pushing the conditions into the table/subquery that is being joined?

Example from the TeamBank query above in simplified form:

SELECT count(*) 
FROM tableA ta INNER JOIN tableB tb on ta.uid = tb.uid and ta.col1 NOT IN (1, 2, 3)

Could become:

SELECT count(*) 
FROM (
  SELECT uid
  FROM tableA
  WHERE col1 NOT IN (1, 2, 3)
) ta INNER JOIN tableB tb on ta.uid = tb.uid
sebastian commented 6 years ago

It's fiddly, but not that hard to extract values for when a simple condition applies and when it does not. For a global aggregate it could be done along the lines of:

SELECT count(*), sum(length(name))
FROM demographics
WHERE age NOT IN (29, 30)

becoming:

SELECT
  count(*),
  sum(length(name)),
  count(distinct CASE WHEN age = 29 THEN uid ELSE null END) as age_29_exclusion,
  count(distinct CASE WHEN age = 30 THEN uid ELSE null END) as age_30_exclusion,
  SUM(CASE WHEN age = 29 THEN length(name) ELSE 0 END) as age_29_exclusion_sum_length_name,
  SUM(CASE WHEN age = 30 THEN length(name) ELSE 0 END) as age_30_exclusion_sum_length_name
FROM demographics

However the interesting case for the time being is when we are still loading out the per-user rows so we can perform the anonymization inside the cloak. A query like:

SELECT SUM(ind_count)
FROM (
  SELECT uid, count(*) as ind_count
  FROM demographics
  WHERE age <> 30
  GROUP BY uid
) t;

could result in us loading out the data for a query such as:

SELECT
  uid,
  count(*) as ind_count,
  SUM(CASE WHEN age = 30 THEN 1 ELSE 0 END) as age_30_exclusion
FROM demographics
GROUP BY uid

We would then have to check in the cloak how many individual users were affected by the exclusion filter. If enough distinct users are affected, the condition stands and we have to subtract the per-user exclusion counts from the aggregates; if there aren't enough users excluded, we do nothing and the condition is effectively removed... Now this is all doable, but I am very skeptical about the queries we would have to generate to support this kind of query. Effectively we have to rid the query of the WHERE-conditions in order to compute both the cases with and without the condition. It seems like this effectively leads to full table scans and horrible query performance? One would think this would be the case in the no-uid approach too, where the difference is that the subsequent post-processing and aggregation has to happen in the cloak as well? It would however still most likely require us to compute over substantial parts of the tables.
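
To spell out the cloak-side step (a sketch, untested; fetched_rows and lcf_threshold are placeholders):

-- cloak-side post-processing over the fetched per-user rows
WITH affected AS (
  SELECT count(distinct uid) AS n
  FROM fetched_rows
  WHERE age_30_exclusion > 0
)
SELECT SUM(CASE WHEN (SELECT n FROM affected) >= lcf_threshold
                -- enough users affected: the condition stands, apply it retroactively
                THEN ind_count - age_30_exclusion
                -- too few users: drop the condition entirely
                ELSE ind_count
           END) AS final_sum
FROM fetched_rows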

sebastian commented 6 years ago

This does get hairy... If the condition is in a subquery, then we have to float it up one level and perform the same style of exclusion tests as above.

I.e. analyst query:

SELECT SUM(ind_count)
FROM (
  SELECT uid, count(*) as ind_count
  FROM (
    SELECT uid
    FROM demographics
    WHERE age <> 30
  ) d
  GROUP BY uid
) t;

would have to result in the condition being dropped and the column floated up to the top level, where we can do the condition checking, resulting in the cloak query:

SELECT
  uid,
  count(*) as ind_count,
  -- The number of rows that would have been excluded if the condition was in place.
  -- If there are enough distinct users with this condition, then we have to subtract
  -- `age_30_exclusion` from this user's `ind_count`
  SUM(CASE WHEN age = 30 THEN 1 ELSE 0 END) as age_30_exclusion
FROM (
  SELECT uid, age
  FROM demographics
) d
GROUP BY uid;

Now if there is an aggregate in a sub-sub-query that itself contains a NEGAND, then we need to create the same conditional checks in the sub-sub-query and then propagate these further back up to the cloak. I.e. the following user query:

SELECT SUM(ind_count)
FROM (
  SELECT uid, SUM(name_count) as ind_count
  FROM (
    SELECT uid, name, count(*) as name_count
    FROM demographics
    WHERE age <> 30
    GROUP BY uid, name
  ) d
  GROUP BY uid
) t;

would have to result in something along the following lines for the cloak?

SELECT
  uid,
  SUM(name_count) as ind_count,
  -- The number of total rows that would have been excluded for this user
  -- if the filter was applied. If there are enough distinct users with the filter,
  -- the `ind_count` should be decreased by `total_name_count_age_30_exclusion`
  -- for this user
  SUM(name_count_age_30_exclusion) as total_name_count_age_30_exclusion
FROM (
  SELECT
    uid,
    name,
    count(*) as name_count,
    -- The number of rows that would have been excluded if the condition was in place.
    -- It needs to be floated up to the `cloak`
    SUM(CASE WHEN age = 30 THEN 1 ELSE 0 END) as name_count_age_30_exclusion
  FROM demographics
  GROUP BY uid, name
) d
GROUP BY uid
sebastian commented 6 years ago

All this seems easy enough for counts, but substantially more tricky when it comes to other aggregates, where we end up having to float a significantly larger and distinct set of values up to the cloak for it to understand what needs to happen...
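
For example, sum is still additive, so a per-user correction column per condition would do (a sketch, untested):

SELECT
  uid,
  sum(length(name)) AS ind_sum,
  -- the amount the condition would have excluded, per user
  sum(CASE WHEN age = 30 THEN length(name) ELSE 0 END) AS age_30_excluded_sum
FROM demographics
GROUP BY uid

But for a non-additive aggregate like median there is no per-user correction value; we would have to float the excluded values themselves (or with- and without-condition variants of the aggregate per user) up to the cloak.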

sebastian commented 6 years ago

Moving this out of the current release. This release will contain fixes that solve the challenge problems.