Gmousse / dataframe-js

No Maintenance Intended
https://gmousse.gitbooks.io/dataframe-js/
MIT License
460 stars 38 forks source link

[FEATURE] Pivot dataframe #99

Open Irio opened 5 years ago

Irio commented 5 years ago

Is your feature request related to a problem? Please describe. Pivoting is a common task in data processing. It is used when we want to calculate statistics of specific attributes.

From Pandas' documentation:

Screen Shot 2019-10-24 at 13 04 20 Screen Shot 2019-10-24 at 13 04 32

Describe the solution you'd like I've been looking to libraries capable of pivoting dataframes. Pandas, in my view, has a very complete implementation of the method.

Describe alternatives you've considered This is a need I had at work. Since I could only find one library – AlaSQL – and with only a simple implementation, that does not pivot over multiple columns or allows pivoting without explicit index, I've written my own version of the function. To derive the algorithm, I replicated the behavior seen in the pivot_table function of Pandas.

Additional context I have already written the function, and it works for a few test cases I listed:

✓ with one value, one index, and one column (130ms)
✓ without index, aggregates values only by columns (61ms)
✓ without columns, aggregates values only by columns (69ms)
✓ with multiple values, one index, and one column (67ms)
✓ with one value, one index, and multiple columns (97ms)
✓ with one value, multiple index, and one column (84ms)
✓ with one value, multiple index, and multiple columns (177ms)

My function is not integrated into dataframe-js, but it works on dataframes generated with dataframe-js. Would you be interested in bringing this feature in? If so, I may need some help on reviewing it to ensure it follows the conventions of the project – of source code and unit tests. I could open a WIP pull request and we discuss the necessary changes on the go.

function pivotTable(dataframe, values = [], index = [], columns = []) {
  if (values.length > 1) {
    throw 'Not implemented';
  }
  const value = values[0];
  const hasSelectedIndex = index && index.length;
  const hasSelectedColumns = columns && columns.length;

  let agg = new Map();
  const aggKeys = index.concat(columns).filter((x) => x);
  const groupedDF = dataframe['groupBy'](...aggKeys);
  let rowKey = value;
  let colKey = value;
  groupedDF.aggregate(group => {
    const row = group.getRow(0);
    if (hasSelectedIndex) {
      rowKey = row.select(...index).toArray().join('_');
    }
    const baseAttrs = Object.fromEntries(index.map((i) => [i, row.get(i)]));
    agg.set(rowKey, agg.get(rowKey) || baseAttrs);
    if (hasSelectedColumns) {
      colKey = row.select(...columns).toArray().join('_');
    }
    agg.get(rowKey)[colKey] = group.stat.sum(value);
  })
  return new DataFrame(Array.from(agg.values()));
}
Gmousse commented 4 years ago

Hi @Irio, Thank you for your sugesstion. Sorry for the delay, I wasn't active these days.

Have you look the GroupedDataFrame .pivot method (https://github.com/Gmousse/dataframe-js/blob/develop/src/group.js#L156) ? It should works similarly.

Irio commented 4 years ago

Hi @Gmousse,

Yes, I have look at them, but they don't seem to cover all the cases I needed and Pandas supports. There's always the chance they are too specific, so in this case, let me know and we can close this issue.

My major pain-point was doing operations with multiple attributes, either in index, columns, or the attribute to aggregate on.

Here are all the cases I implemented using dataframe-js and are supported by Pandas:

https://colab.research.google.com/drive/1tBXUtOzTZiTli2JR1VYFODYf7wxO6PNF

And here, my attempt to reproduce them using dataframe-js@1.4.3. The only case that can be fully reproduced, at least with my understanding of the docs, is the first.

https://gist.github.com/Irio/c23af8d00768d48acecb87a41d62905e