Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
1.79k stars 108 forks source link

[FEAT] Add approximative quantile aggregation #2076

Closed maxime-petitjean closed 1 month ago

maxime-petitjean commented 1 month ago

Description

Here is a proposal of adding approximative quantile aggregation. This new aggregation should be fully distributed and fast to execute. I would love to have some feedback on the approach and how to make it better!

I have added a dependency on a DDSketch library to be able to distribute approximative quantile computation. I chose this library because it's very simple and coded in Rust, but there are alternatives and more advanced libraries that should be considered (for example UDDSketch).

Sketches are stored in a Binary column and it should be considered to add a datatype to be more specific on schema checks (right now every Binary column can be considered as a Sketch column) or maybe use a struct datatype. To convert a sketch to binary, I'm using serde_json (because I didn't want to add a new dependency) which is probably not the best serialisation for a sketch.

New features

Of course, all names should be reworked to be more in line with Daft names.

Example

from daft import from_pydict, col

print(
    from_pydict(
        {
            "numbers": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
            "groups": ["g1", "g1", "g1", "g1", "g1", "g2", "g2", "g2", "g2", "g2"],
        }
    )
    .groupby("groups")
    .agg([col("numbers").approx_sketch().alias("sketches")])
    .with_column("first_quartile", col("sketches").sketch_quantile(0.25))
    .with_column("median", col("sketches").sketch_quantile(0.5))
    .with_column("third_quartile", col("sketches").sketch_quantile(0.75))
    .select("groups", "first_quartile", "median", "third_quartile")
    .sort("groups")
    .collect()
)
╭────────┬───────────────────┬────────────────────┬───────────────────╮                                                                                                           
│ groups ┆ first_quartile    ┆ median             ┆ third_quartile    │                                                                                                           
│ ---    ┆ ---               ┆ ---                ┆ ---               │                                                                                                           
│ Utf8   ┆ Float64           ┆ Float64            ┆ Float64           │                                                                                                           
╞════════╪═══════════════════╪════════════════════╪═══════════════════╡                                                                                                           
│ g1     ┆ 1.993661701417351 ┆ 2.9742334234767167 ┆ 4.014835333028612 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ g2     ┆ 7.028793021534831 ┆ 7.924973703917148  ┆ 8.935418643763665 │
╰────────┴───────────────────┴────────────────────┴───────────────────╯

Note

It's the first time I'm playing with Rust (I'm a C++/Javascript developer) so any feedback, tip or good practice is welcome!

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 61.90476% with 8 lines in your changes are missing coverage. Please review.

Project coverage is 85.21%. Comparing base (d153668) to head (fabaf39).

:exclamation: Current head fabaf39 differs from pull request most recent head 01fcf73. Consider uploading reports for the commit 01fcf73 to get more accurate results

Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/2076/graphs/tree.svg?width=650&height=150&src=pr&token=J430QVFE89&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc)](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/2076?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc) ```diff @@ Coverage Diff @@ ## main #2076 +/- ## ========================================== + Coverage 85.14% 85.21% +0.07% ========================================== Files 68 68 Lines 7358 7279 -79 ========================================== - Hits 6265 6203 -62 + Misses 1093 1076 -17 ``` | [Files](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/2076?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc) | Coverage Δ | | |---|---|---| | [daft/expressions/expressions.py](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/2076?src=pr&el=tree&filepath=daft%2Fexpressions%2Fexpressions.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-ZGFmdC9leHByZXNzaW9ucy9leHByZXNzaW9ucy5weQ==) | `91.97% <85.71%> (-0.94%)` | :arrow_down: | | [daft/dataframe/dataframe.py](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/2076?src=pr&el=tree&filepath=daft%2Fdataframe%2Fdataframe.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-ZGFmdC9kYXRhZnJhbWUvZGF0YWZyYW1lLnB5) | `88.73% <71.42%> (-0.19%)` | :arrow_down: | | [daft/series.py](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/2076?src=pr&el=tree&filepath=daft%2Fseries.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-ZGFmdC9zZXJpZXMucHk=) | `91.98% <28.57%> (-1.25%)` | :arrow_down: | ... and [19 files with indirect coverage changes](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/2076/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc)
jaychia commented 1 month ago

Hi @maxime-petitjean let me know when this PR is ready for a re-review!

maxime-petitjean commented 1 month ago

Hi @maxime-petitjean let me know when this PR is ready for a re-review!

I might need help with some parts. I would like to create a column of arrays when using approx_percentile([0.2, 0.4]) and a column of floats when using approx_percentile(0.2). Right now I always create a column of arrays (maybe it's what we want). I also handle python objects (to convert the q parameter) and I'm sure it can be done much nicer (and probably in high level code). I added a conversion function from Vec<Option<Vec<Option<f64>>>> to a ListArray because I didn't found an helper and it either already exists and I missed it or it should be rewrite in a more generic way.

So if you could have a look it would be very helpful!

Here is an example with the last push (it handles multi percentiles computation 🎉):

from daft import col, from_pydict

print(
    from_pydict(
        {
            "numbers": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
            "groups": ["g1", "g1", "g1", "g1", "g1", "g2", "g2", "g2", "g2", "g2"],
        }
    )
    .groupby("groups")
    .agg([col("numbers").approx_percentile([0.25, 0.5]).alias("percentiles")])
    .sort("groups")
    .collect()
)
╭────────┬────────────────────────────────╮                                                                                                                                       
│ groups ┆ percentiles                    │                                                                                                                                       
│ ---    ┆ ---                            │                                                                                                                                       
│ Utf8   ┆ List[Float64]                  │                                                                                                                                       
╞════════╪════════════════════════════════╡                                                                                                                                       
│ g1     ┆ [1.993661701417351, 2.9742334… │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ g2     ┆ [7.028793021534831, 7.9249737… │
╰────────┴────────────────────────────────╯
jaychia commented 1 month ago

Hi @maxime-petitjean let me know when this PR is ready for a re-review!

I might need help with some parts. I would like to create a column of arrays when using approx_percentile([0.2, 0.4]) and a column of floats when using approx_percentile(0.2). Right now I always create a column of arrays (maybe it's what we want). I also handle python objects (to convert the q parameter) and I'm sure it can be done much nicer (and probably in high level code). I added a conversion function from Vec<Option<Vec<Option<f64>>>> to a ListArray because I didn't found an helper and it either already exists and I missed it or it should be rewrite in a more generic way.

So if you could have a look it would be very helpful!

Here is an example with the last push (it handles multi percentiles computation 🎉):

from daft import col, from_pydict

print(
    from_pydict(
        {
            "numbers": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
            "groups": ["g1", "g1", "g1", "g1", "g1", "g2", "g2", "g2", "g2", "g2"],
        }
    )
    .groupby("groups")
    .agg([col("numbers").approx_percentile([0.25, 0.5]).alias("percentiles")])
    .sort("groups")
    .collect()
)
╭────────┬────────────────────────────────╮                                                                                                                                       
│ groups ┆ percentiles                    │                                                                                                                                       
│ ---    ┆ ---                            │                                                                                                                                       
│ Utf8   ┆ List[Float64]                  │                                                                                                                                       
╞════════╪════════════════════════════════╡                                                                                                                                       
│ g1     ┆ [1.993661701417351, 2.9742334… │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ g2     ┆ [7.028793021534831, 7.9249737… │
╰────────┴────────────────────────────────╯

Nice and great work! Looks like it's working quite well already :)

  1. To "create a column of arrays" when a list is provided, and "a column of floats" when a float is provided: a. In PercentileEvaluator::to_field we need to dynamically resolve the output field's type to either a Float64 or a List[Float64], depending on the input q type. b. We can split StructArray::sketch_percentile into a sketch_percentile_single and a sketch_percentile_multi which returns a FloatArray and a ListArray respectively. Then we can dynamically decide which variant to call, depending on the incoming datatype of q here.
  2. "I also handle python objects" - I didn't quite catch this, where is this happening?
  3. "I added a conversion function from Vec<Option<Vec<Option<f64>>>> to a ListArray" - I think this is fine for now.... But we need a better way to iteratively construct Arrays from Rust primitives (e.g. an Appendable trait like how we already have a Growable trait).

Let me know when (1) is done and we can probably do a round of reviews to get this through!

Edit: for (3), one way you could do it is to grow an arrow2 array first, and then use ListArray::from_arrow to convert it into a ListArray. Some useful arrow2 utilities to do this: MutableListArray and MutablePrimitiveArray

jaychia commented 1 month ago

One more idea for (3):

We can also use our FixedSizeList type instead, since we can actually determine the size of each list up-front! (this is just the length of the incoming list of percentiles)

FixedSizeList::new() is also quite an easy API. It takes in:

  1. The field (name + datatype)
  2. The child Series (this will have type Float64, and have length = n * len_of_array i.e. data is padded for any nulls in the column)
  3. The validity of the array (this is just an arrow2 bitmask, which you can grow as well using: MutableBitmap)

Let me know what you think!

jaychia commented 1 month ago

Merged as #2179 !