Closed maxime-petitjean closed 1 month ago
Attention: Patch coverage is 61.90476%
with 8 lines
in your changes are missing coverage. Please review.
Project coverage is 85.21%. Comparing base (
d153668
) to head (fabaf39
).:exclamation: Current head fabaf39 differs from pull request most recent head 01fcf73. Consider uploading reports for the commit 01fcf73 to get more accurate results
Hi @maxime-petitjean let me know when this PR is ready for a re-review!
Hi @maxime-petitjean let me know when this PR is ready for a re-review!
I might need help with some parts.
I would like to create a column of arrays when using approx_percentile([0.2, 0.4])
and a column of floats when using approx_percentile(0.2)
. Right now I always create a column of arrays (maybe it's what we want).
I also handle python objects (to convert the q
parameter) and I'm sure it can be done much nicer (and probably in high level code).
I added a conversion function from Vec<Option<Vec<Option<f64>>>>
to a ListArray
because I didn't found an helper and it either already exists and I missed it or it should be rewrite in a more generic way.
So if you could have a look it would be very helpful!
Here is an example with the last push (it handles multi percentiles computation 🎉):
from daft import col, from_pydict
print(
from_pydict(
{
"numbers": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
"groups": ["g1", "g1", "g1", "g1", "g1", "g2", "g2", "g2", "g2", "g2"],
}
)
.groupby("groups")
.agg([col("numbers").approx_percentile([0.25, 0.5]).alias("percentiles")])
.sort("groups")
.collect()
)
╭────────┬────────────────────────────────╮
│ groups ┆ percentiles │
│ --- ┆ --- │
│ Utf8 ┆ List[Float64] │
╞════════╪════════════════════════════════╡
│ g1 ┆ [1.993661701417351, 2.9742334… │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ g2 ┆ [7.028793021534831, 7.9249737… │
╰────────┴────────────────────────────────╯
Hi @maxime-petitjean let me know when this PR is ready for a re-review!
I might need help with some parts. I would like to create a column of arrays when using
approx_percentile([0.2, 0.4])
and a column of floats when usingapprox_percentile(0.2)
. Right now I always create a column of arrays (maybe it's what we want). I also handle python objects (to convert theq
parameter) and I'm sure it can be done much nicer (and probably in high level code). I added a conversion function fromVec<Option<Vec<Option<f64>>>>
to aListArray
because I didn't found an helper and it either already exists and I missed it or it should be rewrite in a more generic way.So if you could have a look it would be very helpful!
Here is an example with the last push (it handles multi percentiles computation 🎉):
from daft import col, from_pydict print( from_pydict( { "numbers": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], "groups": ["g1", "g1", "g1", "g1", "g1", "g2", "g2", "g2", "g2", "g2"], } ) .groupby("groups") .agg([col("numbers").approx_percentile([0.25, 0.5]).alias("percentiles")]) .sort("groups") .collect() )
╭────────┬────────────────────────────────╮ │ groups ┆ percentiles │ │ --- ┆ --- │ │ Utf8 ┆ List[Float64] │ ╞════════╪════════════════════════════════╡ │ g1 ┆ [1.993661701417351, 2.9742334… │ ├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ │ g2 ┆ [7.028793021534831, 7.9249737… │ ╰────────┴────────────────────────────────╯
Nice and great work! Looks like it's working quite well already :)
PercentileEvaluator::to_field
we need to dynamically resolve the output field's type to either a Float64 or a List[Float64], depending on the input q
type.
b. We can split StructArray::sketch_percentile
into a sketch_percentile_single
and a sketch_percentile_multi
which returns a FloatArray
and a ListArray
respectively. Then we can dynamically decide which variant to call, depending on the incoming datatype of q
here.Vec<Option<Vec<Option<f64>>>>
to a ListArray
" - I think this is fine for now.... But we need a better way to iteratively construct Arrays from Rust primitives (e.g. an Appendable
trait like how we already have a Growable
trait).Let me know when (1) is done and we can probably do a round of reviews to get this through!
Edit: for (3), one way you could do it is to grow an arrow2
array first, and then use ListArray::from_arrow
to convert it into a ListArray
. Some useful arrow2 utilities to do this: MutableListArray and MutablePrimitiveArray
One more idea for (3):
We can also use our FixedSizeList
type instead, since we can actually determine the size of each list up-front! (this is just the length of the incoming list of percentiles)
FixedSizeList::new()
is also quite an easy API. It takes in:
length = n * len_of_array
i.e. data is padded for any nulls in the column)MutableBitmap
)Let me know what you think!
Merged as #2179 !
Description
Here is a proposal of adding approximative quantile aggregation. This new aggregation should be fully distributed and fast to execute. I would love to have some feedback on the approach and how to make it better!
I have added a dependency on a DDSketch library to be able to distribute approximative quantile computation. I chose this library because it's very simple and coded in Rust, but there are alternatives and more advanced libraries that should be considered (for example UDDSketch).
Sketches are stored in a
Binary
column and it should be considered to add a datatype to be more specific on schema checks (right now every Binary column can be considered as a Sketch column) or maybe use a struct datatype. To convert a sketch to binary, I'm usingserde_json
(because I didn't want to add a new dependency) which is probably not the best serialisation for a sketch.New features
ApproxSketch
aggregation (published toapprox_sketch
) which computes an aggregated sketch from a numeric column.MergeSketch
aggregation (private) which merges multiple sketches into a single one.sketch_quantile
function which extracts quantile from a sketch.Of course, all names should be reworked to be more in line with
Daft
names.Example
Note
It's the first time I'm playing with Rust (I'm a C++/Javascript developer) so any feedback, tip or good practice is welcome!