Gmousse / dataframe-js

No Maintenance Intended
https://gmousse.gitbooks.io/dataframe-js/
MIT License
460 stars 38 forks

[FEATURE] Merge dataframes with different columns #112

Open ThatIsAPseudo opened 3 years ago

ThatIsAPseudo commented 3 years ago

Is your feature request related to a problem? Please describe.
I'd like to merge DataFrames with different columns.

Describe the solution you'd like
I'd like a df1.merge(df2) method that automatically merges two dataframes, even if a column is in df1 but not in df2, filling it with

Describe alternatives you've considered
Here is a snippet from @lmeyerov that I found (and completed) in issue #15, which does just what I want:

function unionDFs(a, b, fill='n/a') {
    // Merge two dataframes with different columns
    const aCols = a.listColumns(); // this line was missing on lmeyerov's original snippet
    const bCols = b.listColumns(); // this line was missing on lmeyerov's original snippet

    // Columns present in b but missing from a, and vice versa
    const aNeeds = bCols.filter((v) => !aCols.includes(v));
    const bNeeds = aCols.filter((v) => !bCols.includes(v));

    // Add each missing column to the dataframe that lacks it, filled with the placeholder
    const a2 = aNeeds.reduce((df, name) => df.withColumn(name, () => fill), a);
    const b2 = bNeeds.reduce((df, name) => df.withColumn(name, () => fill), b);

    return a2.union(b2);
}

Additional context
Current behaviour: [screenshot, 2020-09-27 at 16:13:57]

What I'd like: [screenshot, 2020-09-27 at 16:16:06]
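
Since the screenshots are not reproduced here, the requested behaviour can be sketched on plain row objects. This is only an illustration under stated assumptions: `unionRows` is a hypothetical helper, it works on arrays of objects rather than on DataFrame instances, and it does not call any dataframe-js API.

```javascript
// Hypothetical helper: works on plain arrays of row objects, not on DataFrame instances.
function unionRows(aRows, bRows, fill = null) {
    const allRows = [...aRows, ...bRows];

    // Collect every column name seen in either input, in first-seen order.
    const columns = [...new Set(allRows.flatMap((row) => Object.keys(row)))];

    // Rebuild each row with the full column list, filling the gaps with `fill`.
    const normalize = (row) =>
        Object.fromEntries(
            columns.map((c) => [
                c,
                Object.prototype.hasOwnProperty.call(row, c) ? row[c] : fill,
            ])
        );

    return allRows.map(normalize);
}

const merged = unionRows(
    [{ id: 1, name: 'a' }],
    [{ id: 2, city: 'Paris' }],
    'n/a'
);
// merged[0] is { id: 1, name: 'a', city: 'n/a' }
// merged[1] is { id: 2, name: 'n/a', city: 'Paris' }
```

Every output row carries the same three columns, which is the shape a union of the two inputs needs.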

ThatIsAPseudo commented 3 years ago

A better implementation of the unionDFs snippet:

DataFrame.prototype.merge = function(df2, fill = null) {
    // Merge two dataframes with different columns
    const otherCols = df2.listColumns();
    const thisCols = this.listColumns();

    // Columns df2 is missing, and columns this dataframe is missing
    const otherNeeds = thisCols.filter((v) => !otherCols.includes(v));
    const thisNeeds = otherCols.filter((v) => !thisCols.includes(v));

    // Add each missing column to the dataframe that lacks it, filled with the placeholder
    const other2 = otherNeeds.reduce((df, name) => df.withColumn(name, () => fill), df2);
    const this2 = thisNeeds.reduce((df, name) => df.withColumn(name, () => fill), this);

    // As in the original snippet, union is called on the padded df2
    return other2.union(this2);
}

lachisis commented 3 years ago

This bug can be particularly insidious: if one dataframe's columns are a subset of the other's, the behavior is inconsistent.

This error is due to the use of an incorrect column comparison. It is still an issue in master: https://github.com/Gmousse/dataframe-js/blob/aebcd1b233e6c4f1bfe59d904d378c22bebde3a8/src/reusables.js#L15
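
To illustrate how a column comparison can give inconsistent results for subsets, here is a minimal sketch. It assumes the check amounts to a one-way "every element of a is in b" test, which is an assumption on my part and not a copy of the linked line; `isSubset` and `sameColumns` are hypothetical names.

```javascript
// Hypothetical names; not taken from the library source.

// One-way test: true when every element of `a` is also in `b`.
function isSubset(a, b) {
    const bSet = new Set(b);
    return a.every((x) => bSet.has(x));
}

// Order-insensitive equality: both lists contain the same distinct names.
function sameColumns(a, b) {
    return isSubset(a, b) && isSubset(b, a);
}

const narrow = ['id', 'name'];
const wide = ['id', 'name', 'city'];

// The one-way test depends on which list comes first.
const oneWay = [isSubset(narrow, wide), isSubset(wide, narrow)]; // [true, false]

// The two-way test gives the same answer in both directions.
const twoWay = [sameColumns(narrow, wide), sameColumns(wide, narrow)]; // [false, false]
```

With a one-way test, the same pair of dataframes would pass or fail depending on which one is the receiver, which would match the subset case described above.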