Quantco / tabmat

Efficient matrix representations for working with tabular data
https://tabmat.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
112 stars 6 forks

Consider supporting ELLPACK format #359

Open howsiyu opened 6 months ago

howsiyu commented 6 months ago

Many feature matrices in practice have only a small number of non-zero entries per row. For example, data that comes from one-hot encoding has exactly one non-zero entry per row. Such data is handled nicely by CategoricalMatrix when all the non-zero entries are one. That is not always the case, though: the output of sklearn.preprocessing.SplineTransformer, for example, has a few non-zero entries per row with values other than one. Matrices like these would be well served by the ELLPACK format, which is a natural generalization of CategoricalMatrix.
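To make the proposal concrete, here is a minimal NumPy sketch of the ELLPACK layout: every row stores the same number of (column index, value) pairs, padded with zeros. The function names `dense_to_ell`, `ell_matvec` and `ell_rmatvec` are hypothetical and are not part of tabmat's API; this is only an illustration of the storage scheme, not a suggested implementation.

```python
import numpy as np


def dense_to_ell(dense):
    """Convert a dense 2-D array to ELLPACK (indices, values) arrays.

    Both outputs have shape (n_rows, width), where width is the largest
    number of non-zeros in any row. Shorter rows are padded with
    column index 0 and value 0.0, which contributes nothing to products.
    """
    dense = np.asarray(dense, dtype=float)
    n_rows = dense.shape[0]
    nnz_per_row = (dense != 0).sum(axis=1)
    width = int(nnz_per_row.max()) if n_rows else 0
    indices = np.zeros((n_rows, width), dtype=np.intp)
    values = np.zeros((n_rows, width), dtype=float)
    for i in range(n_rows):
        cols = np.flatnonzero(dense[i])
        indices[i, : cols.size] = cols
        values[i, : cols.size] = dense[i, cols]
    return indices, values


def ell_matvec(indices, values, x):
    """Compute A @ x: gather x at the stored columns, then sum each row."""
    return (values * x[indices]).sum(axis=1)


def ell_rmatvec(indices, values, y, n_cols):
    """Compute A.T @ y by scattering the weighted values into columns."""
    weights = (values * y[:, None]).ravel()
    return np.bincount(indices.ravel(), weights=weights, minlength=n_cols)


# Small example: at most two non-zeros per row, not all equal to one.
dense = np.array(
    [
        [0.0, 2.0, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5],
        [1.0, 0.0, 0.0, 0.0],
    ]
)
indices, values = dense_to_ell(dense)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 3.0])
Ax = ell_matvec(indices, values, x)
ATy = ell_rmatvec(indices, values, y, n_cols=dense.shape[1])
```

With a width of one and all values equal to one, the `indices` array plays the same role as the category codes of a one-hot encoded column, which is the sense in which ELLPACK generalizes the categorical case.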

Another option is to support the Sliced ELLPACK (SELL) format, which handles general sparse matrices relatively well, and to make SplitMatrix consist of just a dense matrix and a SELL matrix.
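As a rough illustration of the sliced variant, the sketch below splits the rows into slices of a fixed height and pads each slice only to its own widest row, so a few dense rows do not inflate the padding of the whole matrix. The names `dense_to_sell` and `sell_matvec`, the list-of-tuples layout and the `slice_height` parameter are assumptions made for this example; real SELL implementations usually pack the slices into flat arrays and often sort rows by length first.

```python
import numpy as np


def dense_to_sell(dense, slice_height=2):
    """Convert a dense 2-D array to a list of per-slice ELLPACK blocks.

    Each entry is (start_row, indices, values), where indices and values
    have shape (rows_in_slice, slice_width) and slice_width is the
    largest number of non-zeros among the rows of that slice only.
    """
    dense = np.asarray(dense, dtype=float)
    slices = []
    for start in range(0, dense.shape[0], slice_height):
        block = dense[start : start + slice_height]
        width = int((block != 0).sum(axis=1).max())
        indices = np.zeros((block.shape[0], width), dtype=np.intp)
        values = np.zeros((block.shape[0], width), dtype=float)
        for i, row in enumerate(block):
            cols = np.flatnonzero(row)
            indices[i, : cols.size] = cols
            values[i, : cols.size] = row[cols]
        slices.append((start, indices, values))
    return slices


def sell_matvec(slices, x, n_rows):
    """Compute A @ x one slice at a time."""
    out = np.zeros(n_rows)
    for start, indices, values in slices:
        stop = start + indices.shape[0]
        out[start:stop] = (values * x[indices]).sum(axis=1)
    return out


# Rows with very different densities: one row has four non-zeros.
dense = np.array(
    [
        [1.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 2.0, 0.0, 0.0, 0.0],
        [1.0, 2.0, 3.0, 4.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 5.0],
    ]
)
slices = dense_to_sell(dense, slice_height=2)
x = np.arange(1.0, 6.0)
Ax = sell_matvec(slices, x, n_rows=dense.shape[0])

# Stored cells including padding, versus plain ELLPACK padded to the
# widest row of the whole matrix.
sell_cells = sum(values.size for _, _, values in slices)
ell_cells = dense.shape[0] * int((dense != 0).sum(axis=1).max())
```

In this example the first slice is padded to width one and the second to width four, so the sliced layout stores fewer padded cells than plain ELLPACK would for the same matrix.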