blaze / datashape

Language defining a data description protocol
BSD 2-Clause "Simplified" License
183 stars 65 forks

Categorical optimization #229

Open gbrener opened 6 years ago

gbrener commented 6 years ago

As the odo issue https://github.com/blaze/odo/issues/561 mentions, datashape.Categorical instantiation has become a bottleneck, traced to this line, where the constructor coerces the input categories into a tuple. I wonder whether we should relax this constraint for Categorical objects so that the underlying categories can be represented as numpy arrays, i.e. Series.cat.categories.values, which would speed up datashape during pandas/dask categorical discovery. cc @jbednar @teoliphant
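
To make the proposal concrete, here is a minimal sketch of the two behaviors. Both classes are hypothetical stand-ins written for illustration, not datashape's actual implementation: EagerCategorical mimics the tuple coercion described above, and LazyCategorical stores the input as given and builds a tuple only when equality or hashing needs one. The sketch uses a plain list so that it runs without dependencies; the assumption is that a numpy array such as Series.cat.categories.values could be passed in the same way.

```python
class EagerCategorical:
    """Hypothetical stand-in: coerce categories to a tuple on construction."""

    def __init__(self, categories):
        # O(n) copy on every instantiation, even if the tuple is never needed.
        self.categories = tuple(categories)


class LazyCategorical:
    """Hypothetical alternative: keep the input as-is, coerce only on demand."""

    def __init__(self, categories):
        self._raw = categories   # stored by reference, no copy here
        self._as_tuple = None    # filled in the first time it is needed

    @property
    def categories(self):
        return self._raw

    def _key(self):
        # Build (and cache) the tuple only for equality and hashing.
        if self._as_tuple is None:
            self._as_tuple = tuple(self._raw)
        return self._as_tuple

    def __eq__(self, other):
        return isinstance(other, LazyCategorical) and self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


cats = ["a", "b", "c"]
eager = EagerCategorical(cats)
lazy = LazyCategorical(cats)
```

One trade-off in this sketch: because the tuple is cached, it assumes the categories are not mutated after construction, which the eager tuple copy guarantees for free.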