Gmousse / dataframe-js

No Maintenance Intended
https://gmousse.gitbooks.io/dataframe-js/
MIT License
460 stars, 38 forks

Are all data stored Row-Major? Meaning computation on columns is slower? #18

Closed dragoljub closed 7 years ago

dragoljub commented 7 years ago

Hi. This code looks very interesting and exciting. :+1:

Are there any plans to implement a column-based data store as the underlying data container, i.e. Col.js vs. Row.js? I'm curious what motivated you to store data in row-based structures rather than column-based ones, which generally have fixed types and could offer more compression if needed, for example via categorical columns.
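To make the question concrete, here is a minimal sketch of what a column-based layout with a categorical column might look like in plain JavaScript. It does not use or describe dataframe-js internals; the sample data, the `toColumns` helper, and the field names are all hypothetical.

```javascript
// Hypothetical sample data in a row-based layout (one object per row).
const rows = [
  { city: "Paris", temp: 21.5 },
  { city: "Lyon", temp: 24.0 },
  { city: "Paris", temp: 19.0 },
];

// Convert to a column-based layout: one fixed-type array per column.
// The string column is stored as a dictionary plus small integer codes
// (a "categorical" column). Uint8Array limits this sketch to 256 categories.
function toColumns(rows) {
  const dictionary = []; // distinct city names, in order of first appearance
  const codeOf = new Map(); // city name -> index into dictionary
  const codes = new Uint8Array(rows.length);
  const temp = new Float64Array(rows.length);

  rows.forEach((row, i) => {
    if (!codeOf.has(row.city)) {
      codeOf.set(row.city, dictionary.length);
      dictionary.push(row.city);
    }
    codes[i] = codeOf.get(row.city);
    temp[i] = row.temp;
  });

  return { city: { dictionary, codes }, temp };
}

const columns = toColumns(rows);
```

Each repeated string is stored once in `dictionary`, and the numeric column sits in a single typed array, which is where the compression and fixed-type benefits would come from.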

For UI tasks you probably want fast access to the full row context, for tooltips and the like.

Maybe all we need is a transpose function.
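As a rough illustration of that idea, a transpose over a plain array-of-rows could look like the sketch below. The `transpose` function here is a hypothetical standalone helper, not the library's own method, and it assumes a rectangular input (every row has the same length).

```javascript
// Turn an array of rows into an array of columns (and vice versa).
// Assumes every row has the same number of cells.
function transpose(matrix) {
  if (matrix.length === 0) return [];
  return matrix[0].map((_, col) => matrix.map((row) => row[col]));
}

// Hypothetical sample: three rows of [id, label].
const byRow = [
  [1, "a"],
  [2, "b"],
  [3, "c"],
];
const byColumn = transpose(byRow);
```

Applying `transpose` twice returns the original row layout, so the same helper could serve as the bridge in both directions.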

Gmousse commented 7 years ago

Hi @dragoljub,

Your question is really interesting. As you said, the DataFrame object is row-based rather than column-based.

To write this codebase I had two options:

I have chosen Row-based because:

However Row-based has some disadvantages:

To conclude, both column-based and row-based storage have advantages and disadvantages. I chose the row-based solution, but it could be interesting to improve column manipulations or to add new features. Why not create a MutableDataFrame (as Scala does for some data structures) that offers an API and column-based operations similar to those of R or pandas? It could be interesting, but it's not among my short-term goals.

Indeed, DataFrame can be slower in some column manipulations, but it's also faster in map and reduce tasks (which I use about 70% of the time). I am (slowly) working on a new DataFrame version with significant performance optimizations in speed and memory consumption (I hope to make it 10x faster). I will try to improve column operations and maybe build some bridges between rows and columns (like a better .transpose() method, as you suggested).
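The trade-off can be sketched in plain JavaScript, without reference to how DataFrame is actually implemented. All names and data below are hypothetical. With rows stored together, a per-row map reads each row as one unit, while a column aggregate has to visit every row to pick out a single cell; with columns stored together, the aggregate reads one contiguous array.

```javascript
// The same hypothetical data in two layouts.
const rowStore = [
  [1, 10],
  [2, 20],
  [3, 30],
];
const colStore = [
  [1, 2, 3],
  [10, 20, 30],
];

// Row layout: a row-wise map receives each complete row directly.
function mapRows(rows, fn) {
  return rows.map(fn);
}

// Row layout: summing one column touches every row to extract one cell.
function sumColumnFromRows(rows, col) {
  return rows.reduce((acc, row) => acc + row[col], 0);
}

// Column layout: summing one column reads a single array.
function sumColumnFromColumns(cols, col) {
  return cols[col].reduce((acc, value) => acc + value, 0);
}

const rowTotals = mapRows(rowStore, (row) => row[0] + row[1]);
const sumFromRows = sumColumnFromRows(rowStore, 1);
const sumFromColumns = sumColumnFromColumns(colStore, 1);
```

Both sum functions return the same value; they differ only in how much of the structure they must traverse, which is the cost being discussed here.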

I hope I have answered your question. If you have any ideas for improving column-based operations (without breaking the code base or the API), feel free to open a PR.

dragoljub commented 7 years ago

Thanks for the detailed response! I'll play with the code some more to better understand the usage patterns.