AlaSQL / alasql

AlaSQL.js - JavaScript SQL database for browser and Node.js. Handles both traditional relational tables and nested JSON data (NoSQL). Export, store, and import data from localStorage, IndexedDB, or Excel.
http://alasql.org
MIT License
7.01k stars 651 forks source link

Trying to implement CORR() using custom function logic, but it is not working #1966

Open daviddabingu opened 3 weeks ago

daviddabingu commented 3 weeks ago

Hello. I am trying to implement CORR() using custom function logic provided here: https://github.com/alasql/alasql/wiki/User-Defined-Functions

This is my code.

alasql.aggr.CORR = function(valueX, valueY, accumulator, stage) {
    if (stage === 1) {
        // Initialize the accumulator object
        accumulator = {
            sumX: 0,
            sumY: 0,
            sumXY: 0,
            sumX2: 0,
            sumY2: 0,
            count: 0
        };
        console.log('Stage 1: Initialized accumulator');
    }

    if (stage === 1 || stage === 2) {
        // Check if valueX and valueY are valid numbers
        if (typeof valueX === 'number' && typeof valueY === 'number') {
            accumulator.sumX += valueX;
            accumulator.sumY += valueY;
            accumulator.sumXY += valueX * valueY;
            accumulator.sumX2 += valueX * valueX;
            accumulator.sumY2 += valueY * valueY;
            accumulator.count++;
            console.log('Stage 1/2: Updated accumulator with values:', valueX, valueY);
        } else {
            console.log('Stage 1/2: Skipped non-numeric values:', valueX, valueY);
        }
    }

    if (stage === 3) {
        console.log('Stage 3: Final accumulator state:', accumulator);

        // Calculate the Pearson correlation coefficient
        const count = accumulator.count;
        const sumX = accumulator.sumX;
        const sumY = accumulator.sumY;
        const sumXY = accumulator.sumXY;
        const sumX2 = accumulator.sumX2;
        const sumY2 = accumulator.sumY2;

        const numerator = (count * sumXY) - (sumX * sumY);
        const denominatorX = Math.sqrt((count * sumX2) - (sumX * sumX));
        const denominatorY = Math.sqrt((count * sumY2) - (sumY * sumY));
        const denominator = denominatorX * denominatorY;

        console.log('Stage 3: Calculated values:');

        // Check if the denominator is zero or if there are no valid data points
        if (denominator === 0 || count === 0) {
            console.log('Stage 3: Undefined correlation');
            return null; // Return null to indicate undefined correlation
        }

        const result = numerator / denominator;
        console.log('Stage 3: Correlation coefficient:', result);
        return result;
    }

    return accumulator;
};

But, somehow this returned {0 : {corr: 3}} , which signifies the calculation is not being done properly, considering pearson correlation always lies between -1 and 1... Any guess or help to figure out what to do? Thank you in advance.

mathiasrw commented 2 weeks ago

If you provide some example data and expected output it makes it easier to help out.


From the AlaSQL Bot

Your implementation of the CORR() function logic looks well-structured, but there seems to be an issue with the Pearson correlation coefficient calculation, possibly due to either the data or the implementation of the accumulation logic.

Here are a few potential issues to check:

  1. Data Points: Make sure that the data passed to CORR() contains valid numeric values and there are no undefined or non-numeric values. If valueX or valueY is non-numeric or undefined, it should not contribute to the calculation.

  2. Numerator and Denominator Calculation:

    • Numerator: Ensure that the numerator calculation (count * sumXY) - (sumX * sumY) is correct and makes sense given the data.
    • Denominator: Ensure that the denominator (denominatorX * denominatorY) does not lead to a division by zero. If denominatorX or denominatorY is zero, the correlation should be undefined, hence returning null (which you already handle).
  3. Edge Cases: Ensure your function handles edge cases, such as when there is only one data point or when all valueX or valueY are the same, which would result in a zero denominator.

Here's an improved implementation suggestion:

alasql.aggr.CORR = function(valueX, valueY, accumulator, stage) {
    if (stage === 1) {
        // Initialize the accumulator object
        accumulator = {
            sumX: 0,
            sumY: 0,
            sumXY: 0,
            sumX2: 0,
            sumY2: 0,
            count: 0
        };
    }

    if (stage === 1 || stage === 2) {
        if (typeof valueX === 'number' && typeof valueY === 'number') {
            accumulator.sumX += valueX;
            accumulator.sumY += valueY;
            accumulator.sumXY += valueX * valueY;
            accumulator.sumX2 += valueX * valueX;
            accumulator.sumY2 += valueY * valueY;
            accumulator.count++;
        }
    }

    if (stage === 3) {
        const { count, sumX, sumY, sumXY, sumX2, sumY2 } = accumulator;

        const numerator = (count * sumXY) - (sumX * sumY);
        const denominatorX = Math.sqrt((count * sumX2) - (sumX * sumX));
        const denominatorY = Math.sqrt((count * sumY2) - (sumY * sumY));
        const denominator = denominatorX * denominatorY;

        if (denominator === 0 || count === 0) {
            return null;
        }

        const result = numerator / denominator;
        return result;
    }

    return accumulator;
};

Debugging Steps:

If the issue persists, you might want to inspect the input data to confirm it's as expected.

daviddabingu commented 2 weeks ago

I confirmed that there is no null or undefined in the dataset. When I try the same calculation in R, it gives me corr: 0.48.

This is the test dataset that we are using. Global Superstore_latest.csv

Thank you.

daviddabingu commented 2 weeks ago

I've checked the source file src/55functions.js where I was able to have a look at how aggr functions are defined. However, I figured all the aggr functions only take a single variable as an input, such as below.

alasql.aggr.median = alasql.aggr.MEDIAN = function (v, s, stage) {
    if (stage === 2) {
        if (v !== null) {
            s.push(v);
        }
        return s;
    }

    if (stage === 1) {
        if (v === null) {
            return [];
        }
        return [v];
    }

    if (!s.length) {
        return null;
    }

    let r = s.sort((a, b) => {
        if (a > b) return 1;
        if (a < b) return -1;
        return 0;
    });

    let middle = (r.length + 1) / 2;
    let middleFloor = middle | 0;
    let el = r[middleFloor - 1];

    if (middle === middleFloor || (typeof el !== 'number' && !(el instanceof Number))) {
        return el;
    } else {
        return (el + r[middleFloor]) / 2;
    }
};

So, it seems it only takes a single column(v), whereas calculating correlation is always a matter of two columns... just like this mathematical equation. image

Any clever way to take a detour on this?

mathiasrw commented 2 weeks ago

Yep. A .aggr function only takes one col as input. I would try to do the calculation outside of alasql if you need two.