Qbeast-io / qbeast-spark

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
https://qbeast.io/qbeast-our-tech/
Apache License 2.0
210 stars 19 forks

Implement the table tolerance function #12

Open cugni opened 3 years ago

cugni commented 3 years ago

What is Tolerance?

Tolerance bounds the error a user allows in an aggregation, within a confidence interval. That means that, given a CI of 95% for example, in 95 out of 100 runs of the same query the answer would have a relative error in $[0, tolerance]$.

How do we calculate Tolerance?

The idea is that, with a user-provided tolerance value, we can estimate the sample size required to answer a query that computes the mean with a predefined level of certainty.

Given a confidence level of, say, 95%, we want to determine a confidence interval such that 95% of all our mean estimations fall within the target range. That is, given a sample of size $n$ drawn from a population (with $\mu$ and $\sigma^2$ as population mean and variance, respectively), determine the confidence interval of the sample mean so that it has a 95% chance of containing $\mu$.

In other words:

$$P\left(\bar{x} - z \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z \frac{\sigma}{\sqrt{n}}\right) = 0.95$$

where $z$ is the z-score associated with the chosen confidence level ($z \approx 1.96$ for 95%).

Here, the Central Limit Theorem is taken into account:

Regardless of the distribution of the population (as long as $\mu$ and $\sigma^2$ are finite), the distribution of the sample means is approximately normal for large $n$.
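As a quick, hypothetical illustration of the CLT claim above (not part of qbeast-spark): means of samples drawn from a skewed Exponential population still cluster around the population mean, with spread shrinking like $\sigma / \sqrt{n}$.

```python
# Hypothetical simulation, not qbeast-spark code: sample means of a skewed
# population behave as the Central Limit Theorem predicts.
import random
import statistics

random.seed(42)

# Exponential(lambda=1) has population mean 1 and standard deviation 1.
n = 200
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(2000)
]

# The sample means center on the population mean (1.0)...
print(statistics.fmean(sample_means))
# ...and their standard deviation is close to sigma / sqrt(n) = 1 / sqrt(200).
print(statistics.stdev(sample_means))
```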

As well as the notion of Standard Error of the Mean:

Given a single sample of size $n$, how can we determine how far its mean $\bar{x}$ is from the population mean $\mu$? The answer, $SE(\bar{x}) = \sigma / \sqrt{n}$, reflects the standard deviation of the sample means and can be estimated as $s / \sqrt{n}$, with $s$ being the standard deviation of the sample.
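A minimal sketch of that estimate (the data values are made up for illustration):

```python
# Hypothetical sketch: estimating the Standard Error of the Mean (SEM)
# from a single sample as s / sqrt(n).
import math
import statistics

sample = [12.1, 9.8, 11.4, 10.7, 9.5, 10.9, 11.8, 10.2]
n = len(sample)
s = statistics.stdev(sample)   # sample standard deviation
sem = s / math.sqrt(n)         # estimated standard error of the mean
print(sem)
```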

Tolerance is the Relative Standard Error (RSE) of the distribution of the sample means. The formula of the RSE can be expressed in terms of the Standard Error (SE) and the Estimated Mean ($\bar{x}$).

Consequently, the RSE can be estimated from the Standard Error ($s / \sqrt{n}$) of the Sample Mean and the Estimated Mean ($\bar{x}$) with the formula $RSE = \frac{s}{\bar{x}\sqrt{n}}$.

Another way to put it is: "we want the error of the mean to be less than the tolerance applied to the estimated mean ($z \cdot SE(\bar{x}) \le tolerance \cdot \bar{x}$)".

Both ways lead to the same equation, which allows determining the sample size as follows:

$$n \ge \left(\frac{z \cdot s}{tolerance \cdot \bar{x}}\right)^2$$

The Standard Error of the Mean, $\sigma / \sqrt{n}$, can be estimated as $s / \sqrt{n}$, with $s$ being the standard deviation of the sample. This is possible because of the assumption of normality.

The deviation of the sample mean from the population mean is the SEM, and we want the percentage of error with respect to the mean, which should have tolerance as an upper bound ($\frac{z \cdot s}{\bar{x}\sqrt{n}} \le tolerance$). This gives us:

$$n \ge \left(\frac{z \cdot s}{tolerance \cdot \bar{x}}\right)^2$$
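The derivation above can be sketched as a small helper. This is only an illustration of the formula, not the actual qbeast-spark API; the function name and parameters are hypothetical.

```python
# Hypothetical sketch of the sample-size estimate derived above:
# n >= (z * s / (tolerance * mean))^2.
import math

def required_sample_size(s: float, mean: float, tolerance: float,
                         z: float = 1.96) -> int:
    """Smallest n whose relative standard error stays within `tolerance`
    at the confidence level implied by the z-score (1.96 ~ 95%)."""
    return math.ceil((z * s / (tolerance * mean)) ** 2)

# Example: s = 50, mean = 200, 1% tolerance at 95% confidence.
print(required_sample_size(s=50.0, mean=200.0, tolerance=0.01))  # prints 2401
```

Note how the required sample size grows quadratically as the tolerance shrinks: halving the tolerance quadruples the sample.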

The scope of this issue is to collect all the information about the table tolerance and guide future development a bit. The missing steps still need to be listed.

alexeiakimov commented 3 years ago

Here are some issues found while working on the tolerance feature.

  1. Right now the tolerance is defined for the mean (or `avg`) function only. A similar concept for other types of aggregate functions like `min`, `max`, etc. can have a different name and a different range of admissible values (for tolerance it is [0, 1]).
  2. The tolerance is defined as `sampleDeviation * zScore / mean / sqrt(sampleSize)`. It is not clear whether the tolerance remains meaningful if the mean is 0 or close to 0.
  3. The current implementation just extracts the column with the `avg` function and calculates the mean using samples of the whole table. Suppose the user specified `val df = spark.read.format("qbeast").load("...").where("value > 100").agg(avg("value")).tolerance(0.01)`. The sampling should apply the specified `where` condition, otherwise the returned average can be wrong.
  4. The zScore is hardcoded; should it be a user-specified parameter?
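Item 2 above can be made concrete with a small, hypothetical numeric illustration (names are illustrative, not the qbeast-spark implementation): holding the sample fixed, the achieved tolerance blows up as the mean approaches 0.

```python
# Hypothetical illustration of the near-zero-mean problem: the relative
# standard error z * s / (mean * sqrt(n)) diverges as the mean -> 0.
import math

def achieved_tolerance(s: float, mean: float, n: int, z: float = 1.96) -> float:
    """Relative standard error of the sample mean for a sample of size n."""
    return z * s / (abs(mean) * math.sqrt(n))

s, n = 10.0, 10_000
for mean in (100.0, 1.0, 0.01):
    print(mean, achieved_tolerance(s, mean, n))
```

With `s = 10` and `n = 10000`, a mean of 100 gives a tolerance of about 0.2%, but a mean of 0.01 gives a tolerance of 19.6, far outside [0, 1], so no practical sample size satisfies a 1% tolerance for a zero-centered column.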
alexeiakimov commented 3 years ago

We should also define more precisely what kinds of queries we want to support, so that the user can have a clear understanding of whether a given query is supported or not. Can we define it in terms of a SQL syntax tree or something similar, maybe a bit informally?