Open jixxiong opened 4 months ago
Thanks @jixxiong !, could you pls add more informtaion so we can understand this function better? Scenario: please add your scenario as detailed as possible, for example: get the closest timestamped data or the earliest timestamped data in grouped statistics, or to create a candlestick chart of a stock. How to use it? the input parameter, out put parameter, their data type and how you want to deal with specific scenario like nulls... and also, If you have used it in other systems, we are more than glad to see you share your experience :)
Thanks @jixxiong !, could you pls add more informtaion so we can understand this function better? Scenario: please add your scenario as detailed as possible, for example: get the closest timestamped data or the earliest timestamped data in grouped statistics, or to create a candlestick chart of a stock. How to use it? the input parameter, out put parameter, their data type and how you want to deal with specific scenario like nulls... and also, If you have used it in other systems, we are more than glad to see you share your experience :)
Thanks for your reply.
In Starrocks, the MANN_WHITNEY_U_TEST
agg function is expected to perform the Mann-Whitney rank test on samples derived from two populations and return the Mann-Whitney U statistic as well as p-value associated with the test.
MANN_WHITNEY_U_TEST (sample_data, sample_treatment[, alternative[, continuity_correction]])
The MANN_WHITNEY_U_TEST
function takes a column sample_data
and column sample_treatment
as input, where sample_data
should be of numeric type and sample_treatment
should be of boolean type, with false
representing the first group and true
representing the second group. Two optional parameters are: alternative
to specify the alternative hypothesis (can be 'two-sided', 'less', or 'greater'), and continuity_correction
to specify whether to apply a continuity correction (default is True).
The MANN_WHITNEY_U_TEST
function should return the Mann-Whitney U statistic as well as p-value associated with the test in a json array of two elements.
This function should ignores NULLs.
Here are some examples.
create table testing_data (
id int,
score int,
treatment boolean
)
properties(
"replication_num" = "1"
);
insert into testing_data values
(1, 80, false),
(2, 100, false),
(3, NULL, false),
(4, 60, true),
(5, 70, true),
(6, 85, true);
Query:
SELECT MANN_WHITNEY_U_TEST(score, treatment) FROM testing_data;
Result:
+---------------------------------------+
| mann_whitney_u_test(score, treatment) |
+---------------------------------------+
| [5, 0.38647623077123283] |
+---------------------------------------+
Query:
SELECT MANN_WHITNEY_U_TEST(score, treatment, 'less') FROM testing_data;
Result:
+-----------------------------------------------+
| mann_whitney_u_test(score, treatment, 'less') |
+-----------------------------------------------+
| [5, 0.9255426634106172] |
+-----------------------------------------------+
Query:
SELECT MANN_WHITNEY_U_TEST(score, treatment, 'two-sided', 0) FROM testing_data;
Result:
+-------------------------------------------------------+
| mann_whitney_u_test(score, treatment, 'two-sided', 0) |
+-------------------------------------------------------+
| [5, 0.2482130789899235] |
+-------------------------------------------------------+
Feature request
Is your feature request related to a problem? Please describe.
Please refer to Whitney_U_test-wikipedia.
Describe the solution you'd like
The Mann-Whitney U test is a non-parametric test that can be used to determine if two populations were selected from the same distribution. The Whitney U (Wilcoxon-Mann-Whitney) test is a non-parametric test used to compare two independent samples. The calculation of the test statistic involves the following steps:
Combine all observations from both groups into a single dataset. Rank the combined dataset from smallest to largest, assigning the lowest rank to the smallest observation and so on. Sum the ranks of the observations in one group (group A). This sum is denoted as ( $R_A$ ). Calculate the sum of the ranks for the other group (group B), denoted as ( $R_B$ ). The sum of all ranks is ( $N(N+1)/2$ ), where ( $N$ ) is the total number of observations. Thus, ( $R_B = N(N+1)/2 - R_A$ ). Calculate the test statistic ( $U$ ) for each group using the following formulas: For group A: [ $U_A = R_A - \frac{n_A(n_A + 1)}{2}$ ]
For group B: [ $U_B = R_B - \frac{n_B(n_B + 1)}{2}$ ]
Where ( $n_A$ ) and ( $n_B$ ) are the sample sizes of group A and group B, respectively.
The smaller value of ( $U_A$ ) and ( $U_B$ ) is the test statistic ( U ). If this value is less than or equal to the critical value from the U distribution table, the null hypothesis is rejected, indicating a significant difference between the two groups. Otherwise, the null hypothesis is not rejected. By calculating the test statistic and comparing it to the critical values, the Whitney U test determines whether there is a statistically significant difference between the two independent samples, without assuming normality.
In Starrocks, the
MANN_WHITNEY_U_TEST
agg function is expected to perform the Mann-Whitney rank test on samples derived from two populations and return the Mann-Whitney U statistic as well as p-value associated with the test.The
MANN_WHITNEY_U_TEST
function takes a columnsample_data
and columnsample_treatment
as input, wheresample_data
should be of numeric type andsample_treatment
should be of boolean type, withfalse
representing the first group andtrue
representing the second group. Two optional parameters are:alternative
to specify the alternative hypothesis (can be 'two-sided', 'less', or 'greater'), andcontinuity_correction
to specify whether to apply a continuity correction (default is True).The
MANN_WHITNEY_U_TEST
function should return the Mann-Whitney U statistic as well as p-value associated with the test in a json array of two elements.This function should ignores NULLs.
Here are some examples.
Query:
Result:
Query:
Result:
Query:
Result:
Describe alternatives you've considered
Additional context