DataBrewery / cubes

[NOT MAINTAINED] Light-weight Python OLAP framework for multi-dimensional data analysis
http://cubes.databrewery.org

Compute Engine / Efficient Data Model #104

Open jcrubino opened 11 years ago

jcrubino commented 11 years ago

While searching for Cubes information I ran across a discussion about Pandas integration. I am not sure what your plans are, but you might want to look at PyTables. It aims to model very large data sets in compute-efficient ways via HDF5, NumPy and Cython, all with a CPython interface.

http://www.pytables.org/moin

Stiivi commented 11 years ago

@jcrubino thanks for pointing this out. I definitely would like to have one efficient backend in the future. Concerning Pandas: at the time I was exploring the possibility of using it, it didn't provide enough functionality for multi-dimensional modeling, as Pandas was (still is?) focused mostly on financial/scientific data. The mostly numerical and non-categorical nature of that data is reflected in Pandas' functionality. Another disadvantage of Pandas is that, despite now supporting joins, it still works only with datasets that fit into memory.

We had a very intensive talk with @FrancescAlted (author of PyTables) late last October about a pure "joins framework", or just "fast join functionality", to be implemented at a low level on top of another framework of his called carray, which handles big datasets that do not fit into memory very nicely. The solution is still on paper only, due to lack of time and/or resources for implementing it.

Note that we are considering table-based storage forming stars/snowflakes, not other kinds of stores such as trees or set dictionaries. The libraries around NumPy that you mention can work very efficiently with numerical data stored in a single table. Despite the large number of rows, the operations are pretty straightforward: filter + iterate + aggregate. Also, the datasets these libraries work on are mostly implicitly sequential, ordered by time or sample number.
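The single-table "filter + iterate + aggregate" pattern can be sketched with plain NumPy (the data below is made up for illustration; the sequential ordering by sample number is exactly the implicit order mentioned above):

```python
import numpy as np

# Hypothetical single-table dataset: one measure column, implicitly
# ordered by sample number (no joins needed).
values = np.array([10.0, 12.5, 9.0, 14.2, 11.1, 8.3])
samples = np.arange(len(values))

# filter: boolean mask selecting samples from t >= 2 onward
mask = samples >= 2

# aggregate: reductions over the filtered slice
total = values[mask].sum()
mean = values[mask].mean()
```

Everything happens in a single pass over one contiguous array, which is why such libraries stay fast even with a large number of rows.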

With multidimensional modeling (ROLAP, with tables) you have more challenges:

  1. joins: there is not just one table to be analyzed; multiple tables need to be joined together to form the "illusion" of a single table
  2. order: the dataset is not implicitly ordered; the order can be arbitrary, depending on the user's needs, and does not have to be expressed in a numerical way

The most expensive operation in multi-dimensional modeling is not the arithmetic computation itself, but the joins, mostly dimensions to facts, which have to be very fast. With really huge datasets you might not even fit the index of one table into memory, not to mention the tables themselves. Sometimes you need to join 10-30 dimension tables.
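A dimension-to-fact join is essentially a hash lookup from each fact row's foreign key into a dimension table. A minimal sketch (all table and column names are hypothetical) shows why the dimension indexes have to be memory-resident to keep this fast:

```python
# Hypothetical star schema: one fact table with foreign keys into two
# dimension tables. Each join is a hash lookup per fact row.
dim_product = {1: {"category": "books"}, 2: {"category": "music"}}
dim_store = {10: {"region": "east"}, 20: {"region": "west"}}

facts = [
    {"product_id": 1, "store_id": 10, "amount": 100},
    {"product_id": 2, "store_id": 10, "amount": 50},
    {"product_id": 1, "store_id": 20, "amount": 70},
]

# Join each fact row with its dimension attributes, then aggregate
# amount by product category.
totals = {}
for row in facts:
    category = dim_product[row["product_id"]]["category"]
    totals[category] = totals.get(category, 0) + row["amount"]

print(totals)  # {'books': 170, 'music': 50}
```

The lookup is cheap per row, but with 10-30 dimensions and indexes too big for memory, every lookup risks becoming a disk seek, which is where the cost explodes.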

What cubes needs is a framework that can do very fast joins on datasets that do not fit into memory.
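One common compromise, sketched here under the assumption that the dimension tables are small enough to keep in memory while the fact table is not, is to stream the facts in chunks and hash-join each chunk against the in-memory dimensions:

```python
# Sketch: out-of-core join by streaming the fact table in chunks.
# Only the small dimension lookup table stays resident in memory.
def chunks(rows, size):
    """Yield successive fixed-size chunks, simulating reads from disk."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

dim_region = {10: "east", 20: "west"}   # small dimension, kept in memory

fact_stream = [(10, 5), (20, 3), (10, 7), (20, 1)]  # (store_id, amount)

totals = {}
for chunk in chunks(fact_stream, 2):    # each chunk would be one disk read
    for store_id, amount in chunk:
        region = dim_region[store_id]   # hash join against the dimension
        totals[region] = totals.get(region, 0) + amount

print(totals)  # {'east': 12, 'west': 4}
```

This breaks down as soon as the dimensions themselves (or their indexes) outgrow memory, which is the case the carray-based design discussed above was meant to address.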

Or a non-tabular multidimensional store... However, I am sticking with the tabular store for the time being, as it is easier for users to understand.