emanueledellavalle / streaming-data-analytics

Apache License 2.0
30 stars 11 forks source link

Add to SDE theory Computing Approximate Quantile in streaming #27

Open emanueledellavalle opened 6 months ago

emanueledellavalle commented 6 months ago

Quantile is not a streaming operator because it needs to sort the data, so it has a space complexity of O(n). It can be distributed only by splitting the dataset into its columns and then merging the results.

Approximate Quantile returns an approximation of the quantile value of the columns. There are two implementations, P2 [1] computes the quantile keeping in the state a fixed number of marks, and GK [2], which has a bounded error, needs to store a fraction of the dataset. P2 is streaming but can only be distributed by splitting the columns of the dataset. GK has a space complexity of O( 1/ε log(εn)) but can be distributed independently from the number of columns.

[1] Raj Jain and Imrich Chlamtac. The P2 algorithm for dynamic calculation of quantiles and histograms without storing observations. Commun. ACM, 28(10):1076–1085, oct 1985. [2] Michael Greenwald and Sanjeev Khanna. Space-efficient online computation of quantile summaries. ACM SIGMOD Record, 30(2):58–66, 2001.