[Feature] Support for Correlation Function

SabrinaZhaozyf commented 2 years ago

Add support for calculating Pearson's coefficient corr(x,y) as part of the effort in #8493. Can leverage implementation for COVAR_SAMP and COVAR_POP as mentioned in #9236.

VenkatDatta commented 2 years ago

I'm new to the community and would love to start contributing with this task.

Can i work on the task?

siddharthteotia commented 2 years ago

Yes definitely. We can help you

@jasperjiaguo / @SabrinaZhaozyf - can you please share any information, previous PRs etc that can help @VenkatDatta get started.

SabrinaZhaozyf commented 2 years ago

Hi @VenkatDatta, thank you for taking this up! Hopefully, the following information can help you get started:)

Definition https://en.wikipedia.org/wiki/Correlation. In Pinot, correlation can be used to describe the dependence/association of two columns.

Support in Existing DBs Presto: https://prestodb.io/docs/current/functions/aggregate.html#statistical-aggregate-functions Postgres: https://www.postgresql.org/docs/9.4/functions-aggregate.html Pinot should follow the same syntax: corr(x, y) -> DOUBLE

Calculation Formula for correlation can be found in https://en.wikipedia.org/wiki/Correlation. You could also think of it as a normalized covariance. corr(x, y) = cov(x, y) / (std(x) * std(y))

Related PRs

Background: #8493
Covariance: #9236
Histogram: #8724

Good Starting Point

Run/step through unit tests for aggregation functions under src/test/java/org/apache/pinot/queries to familiarize with how aggregation happens distributedly
- See how aggregate and merge update the states we are keeping track and how are they different?
- Helpful breakpoints org/apache/pinot/core/query/reduce/AggregationDataTableReducer.java org/apache/pinot/core/operator/combine/BaseCombineOperator.java org/apache/pinot/core/operator/query/AggregationOperator.java
Extend CovarianceTuple to add fields(sum of squares?) that help calculate the standard deviation for the 2 columns
Also helpful to look at other classes in org/apache/pinot/segment/local/customobject
Also good to look at other aggregation functions under org/apache/pinot/core/query/aggregation/function and see how they are wired in

Testing

Should have very similar structure as CovarianceQueriesTest.java
Want to test on both individual segments and inter-segments to make sure both intermediate results and reduced results are correct
Make sure to test on distinct servers
- Please feel free to ask me any questions! This part can be a bit tricky.

Please let me / @jasperjiaguo know if you have any questions!

VenkatDatta commented 2 years ago

Got it, i will go through the steps mentioned.

Thanks @SabrinaZhaozyf , @siddharthteotia for sharing the resources :)

subkanthi commented 1 year ago

working on this one, should have a PR in a few days with tests.

siddharthteotia commented 1 year ago

Hi @subkanthi - just curious if you are planning to put out a PR for this.

subkanthi commented 1 year ago

Hi @subkanthi - just curious if you are planning to put out a PR for this.

Hi @siddharthteotia , should have the PR up in a day, finishing the last test.

subkanthi commented 1 year ago

@siddharthteotia while testing, noticed that with the following implementation get a lot of NaN values, but with apache commons math PearsonCorrelation class its not NaN, just like the skewness and kurtosis implementation trying to change to use the apache commons function. https://github.com/subkanthi/pinot/pull/1

            double sumX = correlationTuple.getSumX();
            double sumY = correlationTuple.getSumY();
            double sumXY = correlationTuple.getSumXY();
            double squareSumX = correlationTuple.getSquareSumX();
            double squareSumY = correlationTuple.getSquareSumY();

            double bottom = Math.sqrt((count * squareSumX
                    - sumX * sumX) * (count * squareSumY - sumY * sumY));

            if (bottom == 0) {
                return 0d;
            }
            double top = count * sumXY - sumX * sumY;
            return top / bottom;
        }

siddharthteotia commented 1 year ago

May be you are running into divide by 0 problem ? Did you try step into the code to understand where it is turning into NaN ?

cc @jasperjiaguo / @SabrinaZhaozyf

jasperjiaguo commented 1 year ago

NaN could be due to Math.sqrt(negative_number) or 0.0/0.0 We have recently discovered this by-definition impl of covariance/correlation has numerical stability issue when E[x^2] ~ E[x]^2 >> 0 (see 1 2 3). Could we also use similar implementations? I'm not sure if the online algorithm is already available as a library, but feel free to use if apache common has it.

subkanthi commented 1 year ago

Thanks @jasperjiaguo , will compare the Trino implementation, Apache commons does expose PearsonCorrelation. will update the PR in a day or two.

apache / pinot

[Feature] Support for Correlation Function #9277