Stats based standard deviation

sebastian commented 5 years ago

A design was present in the original no-uid approach issue. I am unable to link directly to the section, but you get there by searching for "Standard Deviation of a column".

Excerpt:

Analyst query:

SELECT stddev(amount)
FROM transactions

Stats-based query:

SELECT sqrt(sum(sm)/sum(cnt)) AS sd,
       count(DISTINCT client_id) AS duids,
       max(sqrt(sm/cnt)), min(sqrt(sm/cnt)),
       avg(sqrt(sm/cnt)), stddev(sqrt(sm/cnt))
FROM
   (SELECT client_id, sum(diff) AS sm, count(*) AS cnt
    FROM
      (SELECT client_id, pow(abs(amount - (SELECT avg(amount)
                                           FROM transactions)), 2) AS diff
       FROM transactions
      ) t1
    GROUP BY client_id
   ) t3

Note that the average needed for the standard deviation computation is computed as a SELECT embedded in the abs function. My postgres is smart enough to store the computed value and re-use, but not every implementation might.

There is a small difference in the answers of the analyst and no-uid queries here. The analyst query returns 9470.0803, while the no-uid query returns 9470.0766. So not sure I'm really doing the computation right here.

sd	duids	max	min	avg	stddev
9470.0765	5369	27470.2953	3618.6551	8036.1940	4500.4135

Note that the amount of noise added here would be proportional to the max or avg divided by the number of distinct users (so max 5.1 and avg 1.5).

Further note: the SQL with-feature might be useful to extract the calculation of the average?

sebastian commented 4 years ago

Note: this is the last aggregate we need to support in order to be able to completely delete the uid-based anonymization pipeline. This will allow us to simplify and standardize the cloak internals (a worthy goal if you ask me!)

cristianberneanu commented 4 years ago

There is also the problem of min/max over date/time values.

Aircloak / aircloak

Stats based standard deviation #3863