Open SabrinaZhaozyf opened 2 years ago
I'm new to the community and would love to start contributing with this task.
Can i work on the task?
Yes definitely. We can help you
@jasperjiaguo / @SabrinaZhaozyf - can you please share any information, previous PRs etc that can help @VenkatDatta get started.
Hi @VenkatDatta, thank you for taking this up! Hopefully, the following information can help you get started:)
Definition https://en.wikipedia.org/wiki/Correlation. In Pinot, correlation can be used to describe the dependence/association of two columns.
Support in Existing DBs
Presto: https://prestodb.io/docs/current/functions/aggregate.html#statistical-aggregate-functions
Postgres: https://www.postgresql.org/docs/9.4/functions-aggregate.html
Pinot should follow the same syntax: corr(x, y)
-> DOUBLE
Calculation
Formula for correlation can be found in https://en.wikipedia.org/wiki/Correlation. You could also think of it as a normalized covariance. corr(x, y) = cov(x, y) / (std(x) * std(y))
Related PRs
Good Starting Point
src/test/java/org/apache/pinot/queries
to familiarize with how aggregation happens distributedly
org/apache/pinot/core/query/reduce/AggregationDataTableReducer.java
org/apache/pinot/core/operator/combine/BaseCombineOperator.java
org/apache/pinot/core/operator/query/AggregationOperator.java
CovarianceTuple
to add fields(sum of squares?) that help calculate the standard deviation for the 2 columnsorg/apache/pinot/segment/local/customobject
org/apache/pinot/core/query/aggregation/function
and see how they are wired inTesting
CovarianceQueriesTest.java
Please let me / @jasperjiaguo know if you have any questions!
Got it, i will go through the steps mentioned.
Thanks @SabrinaZhaozyf , @siddharthteotia for sharing the resources :)
working on this one, should have a PR in a few days with tests.
Hi @subkanthi - just curious if you are planning to put out a PR for this.
Hi @subkanthi - just curious if you are planning to put out a PR for this.
Hi @siddharthteotia , should have the PR up in a day, finishing the last test.
@siddharthteotia while testing, noticed that with the following implementation get a lot of NaN values, but with apache commons math PearsonCorrelation class its not NaN, just like the skewness
and kurtosis
implementation trying to change to use the apache commons function.
https://github.com/subkanthi/pinot/pull/1
double sumX = correlationTuple.getSumX();
double sumY = correlationTuple.getSumY();
double sumXY = correlationTuple.getSumXY();
double squareSumX = correlationTuple.getSquareSumX();
double squareSumY = correlationTuple.getSquareSumY();
double bottom = Math.sqrt((count * squareSumX
- sumX * sumX) * (count * squareSumY - sumY * sumY));
if (bottom == 0) {
return 0d;
}
double top = count * sumXY - sumX * sumY;
return top / bottom;
}
May be you are running into divide by 0 problem ? Did you try step into the code to understand where it is turning into NaN ?
cc @jasperjiaguo / @SabrinaZhaozyf
NaN could be due to Math.sqrt(negative_number) or 0.0/0.0 We have recently discovered this by-definition impl of covariance/correlation has numerical stability issue when E[x^2] ~ E[x]^2 >> 0 (see 1 2 3). Could we also use similar implementations? I'm not sure if the online algorithm is already available as a library, but feel free to use if apache common has it.
Thanks @jasperjiaguo , will compare the Trino implementation, Apache commons does expose PearsonCorrelation. will update the PR in a day or two.
Add support for calculating Pearson's coefficient
corr(x,y)
as part of the effort in #8493. Can leverage implementation for COVAR_SAMP and COVAR_POP as mentioned in #9236.