data-forge / data-forge-ts

The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
http://www.data-forge-js.com/
MIT License

Adding series matched by index #1

Closed lgomez closed 6 years ago

lgomez commented 6 years ago

Hi,

Thank you for a great library. I was looking for something like this and read about it in the latest issue of Node Weekly. Started playing with it but haven't been able to get the result I'd like. I hope you don't mind if I ask...

I have the following dataframes:

__index__  A     |    __index__  B
---------  --    |    ---------  --
0          A1    |    2          B1
1          A2    |    3          B2
2          A3    |    4          B3
3          A4    |    5          B4
4          A5    |    6          B5

I need to end up with:

__index__  A   B
---------  --  --
0          A1
1          A2
2          A3  B1
3          A4  B2
4          A5  B3
5              B4
6              B5

The data comes from a bunch of files that contain one 2D array each structured like this:

// A.json     |    // B.json
[             |    [
  [0, A1],    |      [2, B1],
  [1, A2],    |      [3, B2],
  [2, A3],    |      [4, B3],
  [3, A4],    |      [5, B4],
  [4, A5]     |      [6, B5]
]             |    ]

Notice how I need the resulting DataFrame to use the file names as the column titles.

I tried using concat and joins but don't quite get this result. Would you mind pointing me in the right direction?

Thank you,

ashleydavis commented 6 years ago

Thanks for logging an issue.

There are always many ways to do something like this; here's one.

This is a bit tricky, because your data isn't really stored in the Data-Forge style, but I'll give you a solution where you load the data manually.

First, though, I had to modify your data slightly to make it valid JSON (the string values need to be quoted):

// A.json
[
  [0, "A1"],
  [1, "A2"],
  [2, "A3"],
  [3, "A4"],
  [4, "A5"] 
]
// B.json
[
  [2, "B1"],
  [3, "B2"],
  [4, "B3"],
  [5, "B4"],
  [6, "B5"]
]
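As a quick illustration of why the quotes matter, here is a minimal sketch using only the built-in JSON.parse. The helper name isValidJson is hypothetical and exists only for this example.

```javascript
// Hypothetical helper: returns true when the text parses as JSON.
function isValidJson(text) {
    try {
        JSON.parse(text);
        return true;
    } catch (err) {
        return false; // JSON.parse throws a SyntaxError on invalid input.
    }
}

const unquoted = '[[0, A1], [1, A2]]';   // Bare values, as written in the question.
const quoted = '[[0, "A1"], [1, "A2"]]'; // String values quoted, as in the files above.

console.log(isValidJson(unquoted)); // false
console.log(isValidJson(quoted));   // true
```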

Your data can't be loaded directly into a dataframe because it doesn't contain any column names.

So instead I load and parse the files manually and pass the parsed arrays into a Series:

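const dataForge = require("data-forge");
const fs = require("fs");

// Read each file as UTF-8 text, parse it, and wrap the array of [index, value] pairs in a Series.
let a = new dataForge.Series(JSON.parse(fs.readFileSync("A.json", "utf8")));
let b = new dataForge.Series(JSON.parse(fs.readFileSync("B.json", "utf8")));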

Now I inflate each series to a dataframe and separate out the index and A and B columns:

let aDF = a.inflate(row => ({ index: row[0], A: row[1] }));
let bDF = b.inflate(row => ({ index: row[0], B: row[1] }));
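The column names A and B are hard-coded in the selectors above. Since the question asks for the file names to become the column titles, one possible variation is to build the selector from the file name. This is only a sketch in plain Node.js; columnNameFromFile and makeRowSelector are hypothetical helpers, not part of Data-Forge.

```javascript
const path = require("path");

// Hypothetical helper: "data/A.json" -> "A".
function columnNameFromFile(filePath) {
    return path.basename(filePath, path.extname(filePath));
}

// Hypothetical helper: builds a row selector that names the value column after the file.
function makeRowSelector(filePath) {
    const column = columnNameFromFile(filePath);
    return row => ({ index: row[0], [column]: row[1] });
}

const selectA = makeRowSelector("A.json");
console.log(selectA([0, "A1"])); // { index: 0, A: 'A1' }
```

A selector built this way could be passed to inflate in place of the hand-written arrow functions, for example a.inflate(makeRowSelector("A.json")).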

At this point I print both dataframes to check what I have:

console.log("a:");
console.log(aDF.toString());

console.log("b:");
console.log(bDF.toString());

I see the following output:

a:                  
__index__  index  A 
---------  -----  --
0          0      A1
1          1      A2
2          2      A3
3          3      A4
4          4      A5

b:                  
__index__  index  B 
---------  -----  --
0          2      B1
1          3      B2
2          4      B3
3          5      B4
4          6      B5

Now I'm ready to join these two dataframes by connecting their index columns and merging their A and B columns:

const final = aDF.joinOuter(bDF, 
    rowA => rowA.index, // Column from dataframe a to merge on.
    rowB => rowB.index, // Column from dataframe b to merge on.
    (rowA, rowB) => {  // Selector function to merge rows from a and b.
        return {
            index: rowA ? rowA.index : rowB.index, // Merge column 0 as the index.
            A: rowA ? rowA.A : undefined,  // Note that we are merging column 1 from a and sometimes the value doesn't exist.
            B: rowB ? rowB.B : undefined   // Note that we are merging column 1 from b and sometimes the value doesn't exist.
        };
    }
);

Then print the result to check:

console.log("final:");
console.log(final.toString()); 

I see this result:

final:
__index__  index  A   B
---------  -----  --  --
0          0      A1
1          1      A2
2          2      A3  B1
3          3      A4  B2
4          4      A5  B3
5          5          B4
6          6          B5
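To make the merge logic concrete, here is a plain-JavaScript sketch of an outer join on the index value. It is not how Data-Forge implements joinOuter; the function name outerJoinByIndex is hypothetical, and it assumes each index appears at most once per input.

```javascript
// Plain-JavaScript sketch of an outer join on "index"; not Data-Forge's implementation.
// Assumes each index value appears at most once in each input array.
function outerJoinByIndex(rowsA, rowsB) {
    const merged = new Map(); // index -> merged row

    for (const row of rowsA) {
        merged.set(row.index, { index: row.index, A: row.A, B: undefined });
    }

    for (const row of rowsB) {
        const existing = merged.get(row.index);
        if (existing) {
            existing.B = row.B; // Index present on both sides.
        } else {
            merged.set(row.index, { index: row.index, A: undefined, B: row.B }); // Only in b.
        }
    }

    // Sort by index so the result order does not depend on insertion order.
    return Array.from(merged.values()).sort((x, y) => x.index - y.index);
}

const rowsA = [[0, "A1"], [1, "A2"], [2, "A3"], [3, "A4"], [4, "A5"]]
    .map(([index, A]) => ({ index, A }));
const rowsB = [[2, "B1"], [3, "B2"], [4, "B3"], [5, "B4"], [6, "B5"]]
    .map(([index, B]) => ({ index, B }));

const joined = outerJoinByIndex(rowsA, rowsB);
console.log(joined.length); // 7
console.log(joined[2]);     // { index: 2, A: 'A3', B: 'B1' }
```

Rows that exist on only one side keep undefined in the other column, which matches the blank cells in the printed table.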

If you really want exactly the same result as you proposed you simply need to promote the "index" column to be the actual index of the dataframe, then drop the "index" column, as follows:

const indexed = final.setIndex("index").dropSeries("index");
console.log(indexed.toString());

That gives the following output:

__index__  A   B
---------  --  --
0          A1
1          A2
2          A3  B1
3          A4  B2
4          A5  B3
5              B4
6              B5

This is the full code:

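const dataForge = require("data-forge");
const fs = require("fs");

// Read each file as UTF-8 text, parse it, and wrap the array of [index, value] pairs in a Series.
let a = new dataForge.Series(JSON.parse(fs.readFileSync("A.json", "utf8")));
let b = new dataForge.Series(JSON.parse(fs.readFileSync("B.json", "utf8")));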

let aDF = a.inflate(row => ({ index: row[0], A: row[1] }));
let bDF = b.inflate(row => ({ index: row[0], B: row[1] }));

console.log("a:");
console.log(aDF.toString());

console.log("b:");
console.log(bDF.toString());

const final = aDF.joinOuter(bDF, 
    rowA => rowA.index, // Column from dataframe a to merge on.
    rowB => rowB.index, // Column from dataframe b to merge on.
    (rowA, rowB) => {  // Selector function to merge rows from a and b.
        return {
            index: rowA ? rowA.index : rowB.index, // Merge column 0 as the index.
            A: rowA ? rowA.A : undefined,  // Note that we are merging column 1 from a and sometimes the value doesn't exist.
            B: rowB ? rowB.B : undefined   // Note that we are merging column 1 from b and sometimes the value doesn't exist.
        };
    }
);

console.log("final:");
console.log(final.toString());

const indexed = final.setIndex("index").dropSeries("index");
console.log(indexed.toString());

ashleydavis commented 6 years ago

Please be sure to star the repo!

lgomez commented 6 years ago

Ashley,

Thank you so much for this reply. Very useful. Will try it a bit later.

  1. Repo starred!
  2. Data Wrangling with JavaScript (eBook) purchased!

Thank you

ashleydavis commented 6 years ago

Hey Luis,

Just wondering if you've seen my new library yet?

It's called Data-Forge Plot and integrates plotting/charting with Data-Forge.

Please check out my blog post on it.

It's early days yet, but I'm trying to collect feedback on it.