Closed osopardo1 closed 1 week ago
I am changing the title from Histogram to Approximate Quantile. This method is more appropriate than the histogram for inferring new statistics from the table and for transforming the rows at write time.
Before implementing, I will prepare a document surveying the existing algorithms for approximate quantiles.
Spark already has an `approxQuantile` method (in `DataFrameStatFunctions`), but unfortunately it only works with numerical columns.
Leaving this issue On Hold: we need to review the design document before implementing. Some concerns regarding the algorithms and the handling of Strings still need to be addressed.
This issue only covers a simple implementation of the API. It does not include any logic for updating the quantiles and so on.
Same as we did for Histograms (#230), we will add a different type of transformation for quantiles.
The quantiles could be provided through `columnStats`, computed under the hood in the first write, or taken from a default set of percentiles that ranges from `<numberType>.min` to `<numberType>.max`.
Since the issue has diverged from the original scope, I will open a new one.
Closing it because we changed to #416
After analyzing the efficiency of distribution functions for indexing (see issue #336), we can start implementing the `QuantileTransformation`.
The idea is to build it as another type of transformation, and eventually turn it into the default.
The API can be something like:
Or we can use the word approx:
And under the hood:
We would take advantage of the first step of `OTreeDataAnalyzer` and compute approximate quantiles of the specified columns.
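To make the idea concrete, here is a minimal sketch in plain Python (not the project's Scala code) of what a quantile-based transformation could look like. The names `approx_quantiles` and `quantile_transform` are hypothetical and do not come from the Qbeast code base, and sampling stands in for whatever approximate-quantile algorithm is finally chosen. The sketch relies only on ordering, so it would also apply to Strings, which was one of the concerns raised above.

```python
from bisect import bisect_right
import random


def approx_quantiles(values, probabilities, sample_size=1000, seed=42):
    """Estimate quantiles from a random sample instead of the full column.

    Assumption: plain sampling is used here only for illustration.
    """
    rng = random.Random(seed)
    if len(values) <= sample_size:
        sample = values
    else:
        sample = rng.sample(values, sample_size)
    ordered = sorted(sample)
    last = len(ordered) - 1
    # Pick the sample element closest to each requested probability.
    return [ordered[round(p * last)] for p in probabilities]


def quantile_transform(value, quantiles):
    """Map a value to [0, 1] using its bucket among sorted quantile boundaries."""
    if len(quantiles) < 2:
        raise ValueError("need at least two quantile boundaries")
    # Clamp values that fall outside the known boundaries.
    if value <= quantiles[0]:
        return 0.0
    if value >= quantiles[-1]:
        return 1.0
    # Index of the bucket [quantiles[i], quantiles[i + 1]) containing the value.
    i = bisect_right(quantiles, value) - 1
    return i / (len(quantiles) - 1)


if __name__ == "__main__":
    column = list(range(101))
    boundaries = approx_quantiles(column, [0.0, 0.25, 0.5, 0.75, 1.0])
    print(boundaries)
    print(quantile_transform(60, boundaries))
    print(quantile_transform("g", ["a", "f", "m", "z"]))
```

Because the mapping depends on the boundaries rather than on the raw minimum and maximum, skewed columns would spread more evenly over [0, 1] than with a linear transformation. A numeric-only variant could additionally interpolate inside each bucket.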