Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform
2 stars 0 forks source link

Stats based standard deviation #3863

Open sebastian opened 5 years ago

sebastian commented 5 years ago

A design was present in the original no-uid approach issue. I am unable to link directly to the section, but you get there by searching for "Standard Deviation of a column".

Excerpt:

Analyst query:

SELECT stddev(amount)
FROM transactions

Stats-based query:

SELECT sqrt(sum(sm)/sum(cnt)) AS sd,
       count(DISTINCT client_id) AS duids,
       max(sqrt(sm/cnt)), min(sqrt(sm/cnt)),
       avg(sqrt(sm/cnt)), stddev(sqrt(sm/cnt))
FROM
   (SELECT client_id, sum(diff) AS sm, count(*) AS cnt
    FROM
      (SELECT client_id, pow(abs(amount - (SELECT avg(amount)
                                           FROM transactions)), 2) AS diff
       FROM transactions
      ) t1
    GROUP BY client_id
   ) t3

Note that the average needed for the standard deviation computation is computed as a SELECT embedded in the abs function. My postgres is smart enough to store the computed value and re-use, but not every implementation might.

There is a small difference in the answers of the analyst and no-uid queries here. The analyst query returns 9470.0803, while the no-uid query returns 9470.0766. So not sure I'm really doing the computation right here.

sd duids max min avg stddev
9470.0765 5369 27470.2953 3618.6551 8036.1940 4500.4135

Note that the amount of noise added here would be proportional to the max or avg divided by the number of distinct users (so max 5.1 and avg 1.5).


Further note: the SQL with-feature might be useful to extract the calculation of the average?

sebastian commented 4 years ago

Note: this is the last aggregate we need to support in order to be able to completely delete the uid-based anonymization pipeline. This will allow us to simplify and standardize the cloak internals (a worthy goal if you ask me!)

cristianberneanu commented 4 years ago

There is also the problem of min/max over date/time values.