apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.22k stars 1.18k forks source link

Epic: Ordered Set Aggregate Functions #12824

Open jayzhan211 opened 3 weeks ago

jayzhan211 commented 3 weeks ago

Is your feature request related to a problem or challenge?

DataFusion doesn't support ordered-set aggregate functions yet.

Those functions are supported in Postgres, DuckDB, so I think we should support them in this repo as the core functions

Screenshot 2024-10-09 at 8 49 19 AM

All the aggregates listed in Table 9-51 ignore null values in their sorted input. For those that take a fraction parameter, the fraction value must be between 0 and 1; an error is thrown if not. However, a null fraction value simply produces a null result.

Related PR, https://github.com/apache/datafusion/issues/11732.

We have approx_percentile_cont and approx_percentile_cont_with_weight already, maybe we can start from these two functions

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

Garamda commented 3 weeks ago

Hi, I'm new here. I want to take the issue or participate in developing one of those functions. Would it be possible?

jayzhan211 commented 3 weeks ago

Hi, I'm new here. I want to take the issue or participate in developing one of those functions. Would it be possible?

Sure, just type take then you will be assigned

Garamda commented 3 weeks ago

Alright. Before I take the issue, let me be clear. What I understood is that the things to be implemented are those five functions below,

Screenshot 2024-10-09 at 1 42 05 PM

not including those two functions, is that correct?

Screenshot 2024-10-09 at 1 25 45 PM

Let me double check because there are 7 tasks in total in this issue checklist.

jayzhan211 commented 3 weeks ago

I think the first step is to solve #11732, fix the syntax of approx_percentile_cont and approx_percentile_cont _with_weight since there is no WITHIN GROUP currently. And the next step is to support order by and other functions.

Garamda commented 3 weeks ago

I see. So there are 3 sub tasks, I think.

  1. fix the syntax of approx_percentile_cont and approx_percentile_cont_with_weight to support WITHIN GROUP (#11732)
  2. fix the same functions to support ORDER BY sort_expression
  3. After finishing step 1 & 2, implement new functions below (since WITHIN GROUP & ORDER BY will be reused)
    • mode() WITHIN GROUP (ORDER BY sort_expression)
    • percentile_cont(fraction) WITHIN GROUP (ORDER BY sort_expression)
    • percentile_cont(fractions) WITHIN GROUP (ORDER BY sort_expression)
    • percentile_disc(fraction) WITHIN GROUP (ORDER BY sort_expression)
    • percentile_disc(fractions) WITHIN GROUP (ORDER BY sort_expression)

If there is nothing that I missed, I can start from step 1. So I think I can make a take comment on #11732 and get assigned, right?

(or if you want to reorganize sub tasks and make new issues for each one since it was changed into EPIC, I can wait.)

jonathanc-n commented 3 weeks ago

@Garamda @jayzhan211 After the WITHIN GROUP is added, I would like to work on this for the ORDER BY impelmentations for approx_percentile_cont and approx_percentile_cont_with_weight

Garamda commented 2 days ago

This task has taken longer than I initially expected, because recently I haven’t been able to fully focus on it. Nevertheless, I believe I can complete it successfully.

However, if this is a delay for the datafusion roadmap or has any negative effect on the ecosystem, I am open to passing this issue to a more experienced committer and shifting my focus to a other smaller issue.

If there’s still room for me to continue, I would like to see this issue through to completion.

I’m sharing this update to inform the current status, especially since some may be waiting for this issue’s completion.

cc. @jayzhan211 @jonathanc-n