Qbeast-io / qbeast-spark

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
https://qbeast.io/qbeast-our-tech/
Apache License 2.0
210 stars 19 forks

Implement the table tolerance function #12

Open cugni opened 3 years ago

cugni commented 3 years ago

What is Tolerance?

Tolerance bounds the error a user allows in an aggregation, within a confidence interval. That means that, given a CI of 95% for example, in 95 out of 100 runs of the same query the answer would have a relative error in $[0, tolerance]$.

How do we calculate Tolerance?

The idea is that, with a user-provided tolerance value, we can estimate the sample size required to answer a query that computes the mean with a predefined level of certainty.

Given a confidence level of, say, 95%, we want to determine a confidence interval such that 95% of all our mean estimations fall within the target range. That is, given a sample of size $n$ drawn from a population (with $\mu$ and $\sigma^2$ as population mean and variance, respectively), determine the confidence interval of the sample mean so that it has a 95% chance of containing $\mu$.

In other words:

$$P\left(\bar{x} - z \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z \frac{\sigma}{\sqrt{n}}\right) = 0.95$$

where $z$ is the z-score associated with the chosen confidence level ($z \approx 1.96$ for 95%).

Here, the Central Limit Theorem is taken into account:

Regardless of the distribution of the population (as long as $\mu$ and $\sigma^2$ are finite), the distribution of the sample means is approximately normal for large $n$.
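As a quick, hypothetical illustration of the CLT claim above (not part of qbeast-spark): means of samples drawn from a skewed Exponential population still cluster around the population mean, with spread shrinking like $\sigma / \sqrt{n}$.

```python
# Hypothetical simulation, not qbeast-spark code: sample means of a skewed
# population behave as the Central Limit Theorem predicts.
import random
import statistics

random.seed(42)

# Exponential(lambda=1) has population mean 1 and standard deviation 1.
n = 200
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(2000)
]

# The sample means center on the population mean (1.0)...
print(statistics.fmean(sample_means))
# ...and their standard deviation is close to sigma / sqrt(n) = 1 / sqrt(200).
print(statistics.stdev(sample_means))
```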

As well as the notion of Standard Error of the Mean:

Given a single sample of size $n$, how can we determine how far its mean $\bar{x}$ is from the population mean $\mu$? The answer, $SE(\bar{x}) = \sigma / \sqrt{n}$, reflects the standard deviation of the sample means and can be estimated as $s / \sqrt{n}$, with $s$ being the standard deviation of the sample.
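A minimal sketch of that estimate (the data values are made up for illustration):

```python
# Hypothetical sketch: estimating the Standard Error of the Mean (SEM)
# from a single sample as s / sqrt(n).
import math
import statistics

sample = [12.1, 9.8, 11.4, 10.7, 9.5, 10.9, 11.8, 10.2]
n = len(sample)
s = statistics.stdev(sample)   # sample standard deviation
sem = s / math.sqrt(n)         # estimated standard error of the mean
print(sem)
```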

Tolerance is the Relative Standard Error (RSE) of the distribution of the sample means. The formula of the RSE can be expressed in terms of the Standard Error (SE) and the Estimated Mean ($\bar{x}$).

Consequently, the RSE can be estimated from the Standard Error ($s / \sqrt{n}$) of the Sample Mean and the Estimated Mean ($\bar{x}$) with the formula $RSE = \frac{s}{\bar{x}\sqrt{n}}$.

Another way to put it is: "we want the error of the mean to be less than the tolerance applied to the estimated mean ($z \cdot SE(\bar{x}) \le tolerance \cdot \bar{x}$)".

Both ways lead to the same equation, which allows determining the sample size as follows:

$$n \ge \left(\frac{z \cdot s}{tolerance \cdot \bar{x}}\right)^2$$

The Standard Error of the Mean, $\sigma / \sqrt{n}$, can be estimated as $s / \sqrt{n}$, with $s$ being the standard deviation of the sample. This is possible because of the assumption of normality.

The deviation of the sample mean from the population mean is the SEM, and we want the percentage of error with respect to the mean, which should have tolerance as an upper bound ($\frac{z \cdot s}{\bar{x}\sqrt{n}} \le tolerance$). This gives us:

$$n \ge \left(\frac{z \cdot s}{tolerance \cdot \bar{x}}\right)^2$$
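The derivation above can be sketched as a small helper. This is only an illustration of the formula, not the actual qbeast-spark API; the function name and parameters are hypothetical.

```python
# Hypothetical sketch of the sample-size estimate derived above:
# n >= (z * s / (tolerance * mean))^2.
import math

def required_sample_size(s: float, mean: float, tolerance: float,
                         z: float = 1.96) -> int:
    """Smallest n whose relative standard error stays within `tolerance`
    at the confidence level implied by the z-score (1.96 ~ 95%)."""
    return math.ceil((z * s / (tolerance * mean)) ** 2)

# Example: s = 50, mean = 200, 1% tolerance at 95% confidence.
print(required_sample_size(s=50.0, mean=200.0, tolerance=0.01))  # prints 2401
```

Note how the required sample size grows quadratically as the tolerance shrinks: halving the tolerance quadruples the sample.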

The scope of this issue is to collect all the information about the table tolerance and guide future development a bit. The missing steps still need to be listed.

alexeiakimov commented 3 years ago

Here are some issues found while working on the tolerance feature.

  1. Right now the tolerance is defined for the mean (or `avg`) function only. A similar concept for other types of aggregate functions like `min`, `max`, etc. can have a different name and a different range of admissible values (for tolerance it is [0, 1]).
  2. The tolerance is defined as `sampleDeviation * zScore / mean / sqrt(sampleSize)`. It is not clear whether the tolerance remains meaningful if the mean is 0 or close to 0.
  3. The current implementation just extracts the column with the `avg` function and calculates the mean using samples of the whole table. Suppose the user specified `val df = spark.read.format("qbeast").load("...").where("value > 100").agg(avg("value")).tolerance(0.01)`. The sampling should apply the specified `where` condition, otherwise the returned average can be wrong.
  4. The zScore is hardcoded; should it be a user-specified parameter?
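Item 2 above can be made concrete with a small, hypothetical numeric illustration (names are illustrative, not the qbeast-spark implementation): holding the sample fixed, the achieved tolerance blows up as the mean approaches 0.

```python
# Hypothetical illustration of the near-zero-mean problem: the relative
# standard error z * s / (mean * sqrt(n)) diverges as the mean -> 0.
import math

def achieved_tolerance(s: float, mean: float, n: int, z: float = 1.96) -> float:
    """Relative standard error of the sample mean for a sample of size n."""
    return z * s / (abs(mean) * math.sqrt(n))

s, n = 10.0, 10_000
for mean in (100.0, 1.0, 0.01):
    print(mean, achieved_tolerance(s, mean, n))
```

With `s = 10` and `n = 10000`, a mean of 100 gives a tolerance of about 0.2%, but a mean of 0.01 gives a tolerance of 19.6, far outside [0, 1], so no practical sample size satisfies a 1% tolerance for a zero-centered column.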
alexeiakimov commented 3 years ago

We should also define more precisely what kinds of queries we want to support, so that the user can have a clear understanding of whether a given query is supported or not. Can we define it in terms of a SQL syntax tree or something similar, maybe a bit informally?