StarRocks / starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
https://starrocks.io
Apache License 2.0
8.66k stars 1.75k forks source link

support mann_whitney_u_test aggregate function #45513

Open jixxiong opened 4 months ago

jixxiong commented 4 months ago

Feature request

Is your feature request related to a problem? Please describe.

Please refer to Whitney_U_test-wikipedia.

Describe the solution you'd like

The Mann-Whitney U test is a non-parametric test that can be used to determine if two populations were selected from the same distribution. The Whitney U (Wilcoxon-Mann-Whitney) test is a non-parametric test used to compare two independent samples. The calculation of the test statistic involves the following steps:

Combine all observations from both groups into a single dataset. Rank the combined dataset from smallest to largest, assigning the lowest rank to the smallest observation and so on. Sum the ranks of the observations in one group (group A). This sum is denoted as ( $R_A$ ). Calculate the sum of the ranks for the other group (group B), denoted as ( $R_B$ ). The sum of all ranks is ( $N(N+1)/2$ ), where ( $N$ ) is the total number of observations. Thus, ( $R_B = N(N+1)/2 - R_A$ ). Calculate the test statistic ( $U$ ) for each group using the following formulas: For group A: [ $U_A = R_A - \frac{n_A(n_A + 1)}{2}$ ]

For group B: [ $U_B = R_B - \frac{n_B(n_B + 1)}{2}$ ]

Where ( $n_A$ ) and ( $n_B$ ) are the sample sizes of group A and group B, respectively.

The smaller value of ( $U_A$ ) and ( $U_B$ ) is the test statistic ( U ). If this value is less than or equal to the critical value from the U distribution table, the null hypothesis is rejected, indicating a significant difference between the two groups. Otherwise, the null hypothesis is not rejected. By calculating the test statistic and comparing it to the critical values, the Whitney U test determines whether there is a statistically significant difference between the two independent samples, without assuming normality.

In Starrocks, the MANN_WHITNEY_U_TEST agg function is expected to perform the Mann-Whitney rank test on samples derived from two populations and return the Mann-Whitney U statistic as well as p-value associated with the test.

MANN_WHITNEY_U_TEST (sample_data, sample_treatment[, alternative[, continuity_correction]])

The MANN_WHITNEY_U_TEST function takes a column sample_data and column sample_treatment as input, where sample_data should be of numeric type and sample_treatment should be of boolean type, with false representing the first group and true representing the second group. Two optional parameters are: alternative to specify the alternative hypothesis (can be 'two-sided', 'less', or 'greater'), and continuity_correction to specify whether to apply a continuity correction (default is True).

The MANN_WHITNEY_U_TEST function should return the Mann-Whitney U statistic as well as p-value associated with the test in a json array of two elements.

This function should ignores NULLs.

Here are some examples.

create table testing_data (
    id int, 
    score int, 
    treatment boolean
)
properties(
    "replication_num" = "1"
);

insert into testing_data values 
    (1, 80, false), 
    (2, 100, false), 
    (3, NULL, false), 
    (4, 60, true), 
    (5, 70, true), 
    (6, 85, true);

Query:

SELECT MANN_WHITNEY_U_TEST(score, treatment) FROM testing_data;

Result:

+---------------------------------------+
| mann_whitney_u_test(score, treatment) |
+---------------------------------------+
| [5, 0.38647623077123283]              |
+---------------------------------------+

Query:

SELECT MANN_WHITNEY_U_TEST(score, treatment, 'less') FROM testing_data;

Result:

+-----------------------------------------------+
| mann_whitney_u_test(score, treatment, 'less') |
+-----------------------------------------------+
| [5, 0.9255426634106172]                       |
+-----------------------------------------------+

Query:

SELECT MANN_WHITNEY_U_TEST(score, treatment, 'two-sided', 0) FROM testing_data;

Result:

+-------------------------------------------------------+
| mann_whitney_u_test(score, treatment, 'two-sided', 0) |
+-------------------------------------------------------+
| [5, 0.2482130789899235]                               |
+-------------------------------------------------------+

Describe alternatives you've considered

Additional context

wangsimo0 commented 4 months ago

Thanks @jixxiong !, could you pls add more informtaion so we can understand this function better? Scenario: please add your scenario as detailed as possible, for example: get the closest timestamped data or the earliest timestamped data in grouped statistics, or to create a candlestick chart of a stock. How to use it? the input parameter, out put parameter, their data type and how you want to deal with specific scenario like nulls... and also, If you have used it in other systems, we are more than glad to see you share your experience :)

jixxiong commented 3 months ago

Thanks @jixxiong !, could you pls add more informtaion so we can understand this function better? Scenario: please add your scenario as detailed as possible, for example: get the closest timestamped data or the earliest timestamped data in grouped statistics, or to create a candlestick chart of a stock. How to use it? the input parameter, out put parameter, their data type and how you want to deal with specific scenario like nulls... and also, If you have used it in other systems, we are more than glad to see you share your experience :)

Thanks for your reply.

In Starrocks, the MANN_WHITNEY_U_TEST agg function is expected to perform the Mann-Whitney rank test on samples derived from two populations and return the Mann-Whitney U statistic as well as p-value associated with the test.

MANN_WHITNEY_U_TEST (sample_data, sample_treatment[, alternative[, continuity_correction]])

The MANN_WHITNEY_U_TEST function takes a column sample_data and column sample_treatment as input, where sample_data should be of numeric type and sample_treatment should be of boolean type, with false representing the first group and true representing the second group. Two optional parameters are: alternative to specify the alternative hypothesis (can be 'two-sided', 'less', or 'greater'), and continuity_correction to specify whether to apply a continuity correction (default is True).

The MANN_WHITNEY_U_TEST function should return the Mann-Whitney U statistic as well as p-value associated with the test in a json array of two elements.

This function should ignores NULLs.

Here are some examples.

create table testing_data (
    id int, 
    score int, 
    treatment boolean
)
properties(
    "replication_num" = "1"
);

insert into testing_data values 
    (1, 80, false), 
    (2, 100, false), 
    (3, NULL, false), 
    (4, 60, true), 
    (5, 70, true), 
    (6, 85, true);

Query:

SELECT MANN_WHITNEY_U_TEST(score, treatment) FROM testing_data;

Result:

+---------------------------------------+
| mann_whitney_u_test(score, treatment) |
+---------------------------------------+
| [5, 0.38647623077123283]              |
+---------------------------------------+

Query:

SELECT MANN_WHITNEY_U_TEST(score, treatment, 'less') FROM testing_data;

Result:

+-----------------------------------------------+
| mann_whitney_u_test(score, treatment, 'less') |
+-----------------------------------------------+
| [5, 0.9255426634106172]                       |
+-----------------------------------------------+

Query:

SELECT MANN_WHITNEY_U_TEST(score, treatment, 'two-sided', 0) FROM testing_data;

Result:

+-------------------------------------------------------+
| mann_whitney_u_test(score, treatment, 'two-sided', 0) |
+-------------------------------------------------------+
| [5, 0.2482130789899235]                               |
+-------------------------------------------------------+