apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

[Feature] Support for Correlation Function #9277

Open SabrinaZhaozyf opened 2 years ago

SabrinaZhaozyf commented 2 years ago

Add support for calculating Pearson's coefficient corr(x,y) as part of the effort in #8493. Can leverage implementation for COVAR_SAMP and COVAR_POP as mentioned in #9236.

VenkatDatta commented 2 years ago

I'm new to the community and would love to start contributing with this task.

Can I work on this task?

siddharthteotia commented 2 years ago

Yes, definitely. We can help you.

@jasperjiaguo / @SabrinaZhaozyf - can you please share any information, previous PRs, etc. that can help @VenkatDatta get started?

SabrinaZhaozyf commented 2 years ago

Hi @VenkatDatta, thank you for taking this up! Hopefully the following information can help you get started :)

Definition

https://en.wikipedia.org/wiki/Correlation. In Pinot, correlation can be used to describe the dependence/association of two columns.

Support in Existing DBs

Presto: https://prestodb.io/docs/current/functions/aggregate.html#statistical-aggregate-functions
Postgres: https://www.postgresql.org/docs/9.4/functions-aggregate.html
Pinot should follow the same syntax: corr(x, y) -> DOUBLE

Calculation

The formula for correlation can be found at https://en.wikipedia.org/wiki/Correlation. You can also think of it as a normalized covariance:
corr(x, y) = cov(x, y) / (std(x) * std(y))
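As an illustration of the normalized-covariance formula above, here is a minimal two-pass sketch in Java. The class name `PearsonTwoPass` and its `corr` method are hypothetical and not part of Pinot; returning NaN for empty or zero-variance input is an assumption made for this sketch, not something the thread specifies.

```java
/** Hypothetical helper (not Pinot code): Pearson's r computed in two passes. */
final class PearsonTwoPass {
    private PearsonTwoPass() {}

    static double corr(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        int n = x.length;
        if (n == 0) {
            return Double.NaN; // assumption: no rows means no defined correlation
        }
        // Pass 1: means of both columns.
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        // Pass 2: centered sums. The 1/n factors of cov(x, y), std(x) and
        // std(y) cancel in the ratio, so they are never applied.
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        double denom = Math.sqrt(sxx * syy);
        // Assumption: a constant column yields NaN rather than 0.
        return denom == 0.0 ? Double.NaN : sxy / denom;
    }
}
```

For example, `PearsonTwoPass.corr(new double[] {1, 2, 3}, new double[] {2, 4, 6})` evaluates to 1.0, since y is an exact positive multiple of x. A two-pass version like this needs all values at once, so it is only a reference for checking results, not a drop-in aggregation function.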

Related PRs

Good Starting Point

Testing

Please let me / @jasperjiaguo know if you have any questions!

VenkatDatta commented 2 years ago

Got it, I will go through the steps mentioned.

Thanks @SabrinaZhaozyf, @siddharthteotia for sharing the resources :)

subkanthi commented 1 year ago

Working on this one; I should have a PR in a few days, with tests.

siddharthteotia commented 1 year ago

Hi @subkanthi - just curious if you are planning to put out a PR for this.

subkanthi commented 1 year ago

> Hi @subkanthi - just curious if you are planning to put out a PR for this.

Hi @siddharthteotia, I should have the PR up in a day; I'm finishing the last test.

subkanthi commented 1 year ago

@siddharthteotia While testing, I noticed that the following implementation returns a lot of NaN values, whereas the Apache Commons Math PearsonsCorrelation class does not return NaN for the same data. As with the skewness and kurtosis implementations, I am trying to switch to the Apache Commons function. https://github.com/subkanthi/pinot/pull/1

            double sumX = correlationTuple.getSumX();
            double sumY = correlationTuple.getSumY();
            double sumXY = correlationTuple.getSumXY();
            double squareSumX = correlationTuple.getSquareSumX();
            double squareSumY = correlationTuple.getSquareSumY();

            double bottom = Math.sqrt((count * squareSumX
                    - sumX * sumX) * (count * squareSumY - sumY * sumY));

            if (bottom == 0) {
                return 0d;
            }
            double top = count * sumXY - sumX * sumY;
            return top / bottom;
        }

siddharthteotia commented 1 year ago

Maybe you are running into a divide-by-zero problem? Did you try stepping through the code to see where the value turns into NaN?

cc @jasperjiaguo / @SabrinaZhaozyf
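One way to narrow this down is to feed the formula hand-picked sums. The sketch below is a hypothetical standalone class, not Pinot code: `asPosted` copies the arithmetic of the snippet above with the sums passed in as parameters, and `guarded` is an illustrative variant. It shows that a slightly negative radicand (for example from rounding) makes `Math.sqrt` return NaN, and that NaN slips past a `bottom == 0` check because `NaN == 0` is false. Returning 0 in the guarded variant simply mirrors the snippet's existing choice; it is an assumption, not a recommendation.

```java
/** Hypothetical demo (not Pinot code) of how the sum-based formula yields NaN. */
final class NaiveCorrDemo {
    private NaiveCorrDemo() {}

    /** Same arithmetic as the posted snippet, with the sums passed in directly. */
    static double asPosted(long count, double sumX, double sumY, double sumXY,
                           double squareSumX, double squareSumY) {
        double bottom = Math.sqrt((count * squareSumX - sumX * sumX)
                * (count * squareSumY - sumY * sumY));
        if (bottom == 0) {
            return 0d;
        }
        double top = count * sumXY - sumX * sumY;
        return top / bottom;
    }

    /** Illustrative variant: test the radicand before taking the square root. */
    static double guarded(long count, double sumX, double sumY, double sumXY,
                          double squareSumX, double squareSumY) {
        double radicand = (count * squareSumX - sumX * sumX)
                * (count * squareSumY - sumY * sumY);
        // !(radicand > 0) is true for zero, negative, and NaN radicands.
        if (!(radicand > 0)) {
            return 0d;
        }
        return (count * sumXY - sumX * sumY) / Math.sqrt(radicand);
    }

    public static void main(String[] args) {
        // squareSumX is slightly too small, as if rounding error had crept in,
        // so count * squareSumX - sumX * sumX is negative.
        System.out.println(asPosted(2, 2.0, 2.0, 2.5, 1.9999999, 3.0));
        System.out.println(guarded(2, 2.0, 2.0, 2.5, 1.9999999, 3.0));
    }
}
```

With these inputs the first call evaluates to NaN and the second to 0.0. The guard only hides the symptom; it does not recover the precision lost when the sums are subtracted.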

jasperjiaguo commented 1 year ago

NaN could be due to Math.sqrt(negative_number) or 0.0/0.0. We recently discovered that this by-definition implementation of covariance/correlation has numerical stability issues when E[x^2] ~ E[x]^2 >> 0 (see 1 2 3). Could we use a similar implementation here as well? I'm not sure whether the online algorithm is already available as a library, but feel free to use it if Apache Commons has it.
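For reference, one common online formulation keeps running means and centered co-moments (Welford-style updates) and merges partial states pairwise, which avoids subtracting two large, nearly equal sums. The Java sketch below is an assumption-laden illustration: the class `StreamingCorr` and its methods are hypothetical, it is not the implementation used by Pinot, Trino, or Apache Commons, and returning NaN for fewer than two rows or zero variance is a choice made only for this example.

```java
/** Hypothetical sketch (not Pinot/Trino code): mergeable online Pearson correlation. */
final class StreamingCorr {
    private long n;        // rows seen so far
    private double meanX;  // running mean of x
    private double meanY;  // running mean of y
    private double m2X;    // sum of (x - meanX)^2
    private double m2Y;    // sum of (y - meanY)^2
    private double cXY;    // sum of (x - meanX) * (y - meanY)

    /** Welford-style update with one (x, y) row. */
    void add(double x, double y) {
        n++;
        double dx = x - meanX;  // deviation from the old mean
        double dy = y - meanY;
        meanX += dx / n;
        meanY += dy / n;
        m2X += dx * (x - meanX); // old-mean deviation times new-mean deviation
        m2Y += dy * (y - meanY);
        cXY += dx * (y - meanY);
    }

    /** Combine another partial state into this one (pairwise merge). */
    void merge(StreamingCorr other) {
        if (other.n == 0) {
            return;
        }
        if (n == 0) {
            n = other.n;
            meanX = other.meanX;
            meanY = other.meanY;
            m2X = other.m2X;
            m2Y = other.m2Y;
            cXY = other.cXY;
            return;
        }
        long total = n + other.n;
        double dx = other.meanX - meanX;
        double dy = other.meanY - meanY;
        double weight = (double) n * other.n / total;
        m2X += other.m2X + dx * dx * weight;
        m2Y += other.m2Y + dy * dy * weight;
        cXY += other.cXY + dx * dy * weight;
        meanX += dx * other.n / total;
        meanY += dy * other.n / total;
        n = total;
    }

    /** Pearson's r, or NaN (assumption) if it is undefined for the data seen. */
    double result() {
        double radicand = m2X * m2Y;
        if (n < 2 || !(radicand > 0)) {
            return Double.NaN;
        }
        return cXY / Math.sqrt(radicand);
    }
}
```

A typical use would be one `StreamingCorr` per segment or thread, calling `add` per row, then `merge` across partial states before calling `result()`. Because m2X and m2Y are sums of products built from deviations, the radicand stays small relative to the data spread even when the values carry a large common offset.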

subkanthi commented 1 year ago

Thanks @jasperjiaguo, I will compare with the Trino implementation. Apache Commons does expose PearsonsCorrelation. I will update the PR in a day or two.