data-forge / data-forge-ts

The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
http://www.data-forge-js.com/
MIT License
1.31k stars 76 forks source link

DataFrame distinct with multiple columns in the selector is not working if the name of the columns is variable #155

Open marcodisa opened 1 year ago

marcodisa commented 1 year ago

Hi, I'm not sure if this is the proper section to post this question, First of all I’d like to congratulate you on the amazing work you’ve done with Data Forge npm and Data Forge Notebook. It’s a very extraordinary package and the Notebook is an extremely useful tool. I’m seriously considering the upgrade to the pro version for my further projects.

Anyway, I’m stuck on a DataFrame distinct with two parameters in the selector.

If I run the code below it perfectly does what I want, “filiale” and “CAP” are two columns of df.

const safeJoinChar = '$';
var distinct_join = df.distinct(value => [value.filiale, value.CAP].join(safeJoinChar));
console.log(distinct_join.toString());

Actually I need to put the execution of the select distinct into a loop. I need the select distinct of every column with all the other columns, two columns at a time. Long story short, I have a loop that in every iteration assigns the value to two variables serie1_name and serie2_name, in every iteration I correctly get my two variables like the code below. Now I need the dataframe distinct to work with variable names not with the explicit names of the columns. How the code below should be to work fine with variable column names?

var serie1_name = "filiale";
var serie2_name = "CAP";
const safeJoinChar = '$';

var distinct_join = df.distinct(value => [value.serie1_name, value.serie2_name].join(safeJoinChar));

console.log(distinct_join.toString());

I’ve tried many ways to make it work but I wasn’t able to handle it myself. Is there something I’m missing out? Can I get any hints or suggestions?

Thank you very much!

marcodisa commented 1 year ago

After many attempts I solved the issue using the javascript function constructor to create a function from a string

var serie1_name = "filiale";
var serie2_name = "CAP";
const safeJoinChar = '$';

var distinctFormula = new Function(serie1_name, serie2_name, safeJoinChar, "return value => [value."+ serie1_name + ", value." + serie2_name + "].join("+safeJoinChar+")");

var distinct_join = df.distinct(distinctFormula());

console.log(distinct_join.toString());

Now it works perfectly!

Anyway, is there a more clever solution?

ashleydavis commented 1 year ago

That's strange. You shouldn't have to use a string.

Are you able to publish a runnable code example in GitHub of the broken version? Then I can take a look at it.

marcodisa commented 1 year ago

Hi, thanks for the reply. Well, I need to run an automatic exploratory analysis on datasets that I don't know, no human will look at the dataset but it all has to be handled by an algorithm. For this reason I loop through all the columns and due to the fact I don't know their names in advance I need them to be variable. Moreover I need to compare the distinct count of every columns with the join distinct count of each columns with all the others, two columns at a time.

Here is an example code and how I solved the "issue"

const dataForge = require('data-forge');
require('data-forge-fs');

csvString = `col0,col1,col2,col3
0,Aa,Ba,Aa
1,Ab,Bb,Bb
2,Ac,Bc,Cc
3,Ad,Bd,Dd
4,Ae,Be,Ee`

const df = dataForge.fromCSV(csvString);
var columns = df.getColumns();
var series_distincts = [];
const safeJoinChar = '$';

for (const column of columns) {            //loop through all the columns
    var name = column.name;
    const count_distinct = column.series.distinct().filter(value => value != "").count();     //single column distinct count

    let element = {"name": name, "count_distinct": count_distinct};
    series_distincts.push(element);
}
//console.log(series_distincts);

for (var i = 0; i < series_distincts.length; i++) {         //loop to compare the single distinct count and the join distinct count

    var serie1_name = series_distincts[i].name;
    var distinct1 = series_distincts[i].count_distinct;
    console.log("Column to compare with all the others in this iteration: " + serie1_name);

    for (var j = 0; j < series_distincts.length; j++) {
        if (i == j) {
            continue;           //don't need to compare a column with itself
        } else {
            var serie2_name = series_distincts[j].name;
            var distinct2 = series_distincts[j].count_distinct;
            console.log("Column compared: " + serie2_name);

            var distinctFormula = new Function(serie1_name, serie2_name, safeJoinChar, "return value => [value." + serie1_name + ", value." + serie2_name + "].join(" + safeJoinChar + ")");         //solution I found after many attempts

            var distinct_join = df.distinct(distinctFormula()).count();

            console.log(serie1_name + " count distinct: " + distinct1);
            console.log(serie2_name + " count distinct: " + distinct2);
            console.log(serie1_name + " + " + serie2_name + " join distinct count: " + distinct_join);
        }
    }

}

Obviously this dataset doesn't make sense, it's just an example of what the algorithm should do. In fact, the algorithm then reads the join distinct count and goes on with its tasks.

Another "issue" that I experienced is that before executing a join distinct I have to remove all the spaces from the column names. (This step is not present in the example above but I do it in my algorithm). Is there a way to bypass dataframe renameSeries()?

Hope this example helps to clarify what I meant in the original question, anyway now I solved it using the javascript function constructor but I'd like to know if there is a more clever solution.

ashleydavis commented 1 year ago

Hey @marcodisa any chance you could fork the repo then get your code working as a failing test? Then I can work towards fixing the code for you and making your test pass.