failed to store DataFrame with column multi-index

orbeckst commented 8 years ago

With MDS 0.5.1 the following fails:

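import numpy as np
import pandas as pd
import mdsynthesis as mds

# tuple keys in the dict give the DataFrame a MultiIndex on its columns
df = pd.DataFrame({('R1', 'NZ1'): np.arange(3), ('R1', 'NZ2'): np.arange(3, 0, -1),
                   ('T2', 'OG1'): np.arange(3) * 0.5,
                   # note: the key ('Q3', 'OE1') appears twice, so the second value wins
                   ('Q3', 'OE1'): np.arange(3) * 2, ('Q3', 'OE1'): np.arange(3) * (-2),
                   })

sim = mds.Sim('boba')
sim.data.add('multi', df)  # raises the ValueError shown below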

with the error


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-105-5af881236448> in <module>()
----> 1 sim.data.add('multi', df)

/tmp/src/datreant/datreant/aggregators.py in inner(self, handle, *args, **kwargs)
    609 
    610             try:
--> 611                 out = func(self, handle, *args, **kwargs)
    612             finally:
    613                 del self._datafile

/tmp/src/datreant/datreant/aggregators.py in add(self, handle, data)
    688 
    689         """
--> 690         self._datafile.add_data('main', data)
    691 
    692     def remove(self, handle, **kwargs):

/tmp/src/datreant/datreant/persistence.py in add_data(self, key, data)
   1380                 os.path.join(self.datadir, pydatafile), logger=self.logger)
   1381 
-> 1382         self.datafile.add_data(key, data)
   1383 
   1384         # dereference

/tmp/src/datreant/datreant/persistence.py in inner(self, *args, **kwargs)
    292                 self.handle = self._open_file_w()
    293                 try:
--> 294                     out = func(self, *args, **kwargs)
    295                 finally:
    296                     self.handle.close()

/tmp/src/datreant/datreant/persistence.py in add_data(self, key, data)
   1567             self.handle.put(
   1568                 key, data, format='table', data_columns=True, complevel=5,
-> 1569                 complib='blosc')
   1570         except AttributeError:
   1571             self.handle.put(

/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.pyc in put(self, key, value, format, append, **kwargs)
    812             format = get_option("io.hdf.default_format") or 'fixed'
    813         kwargs = self._validate_format(format, kwargs)
--> 814         self._write_to_group(key, value, append=append, **kwargs)
    815 
    816     def remove(self, key, where=None, start=None, stop=None):

/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.pyc in _write_to_group(self, key, value, format, index, append, complib, encoding, **kwargs)
   1250 
   1251         # write the object
-> 1252         s.write(obj=value, append=append, complib=complib, **kwargs)
   1253 
   1254         if s.is_table and index:

/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.pyc in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, **kwargs)
   3755         self.create_axes(axes=axes, obj=obj, validate=append,
   3756                          min_itemsize=min_itemsize,
-> 3757                          **kwargs)
   3758 
   3759         for a in self.axes:

/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   3357             axis, axis_labels = self.non_index_axes[0]
   3358             data_columns = self.validate_data_columns(
-> 3359                 data_columns, min_itemsize)
   3360             if len(data_columns):
   3361                 mgr = block_obj.reindex_axis(

/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.pyc in validate_data_columns(self, data_columns, min_itemsize)
   3220         if info.get('type') == 'MultiIndex' and data_columns:
   3221             raise ValueError("cannot use a multi-index on axis [{0}] with "
-> 3222                              "data_columns {1}".format(axis, data_columns))
   3223 
   3224         # evaluate the passed data_columns, True == use all columns

ValueError: cannot use a multi-index on axis [1] with data_columns True

It is quite likely that this is a problem that I (or MDS) have with pandas; any insights are welcome.

orbeckst commented 8 years ago

The StackOverflow question "Pandas HDFStore select from nested columns" was not overly helpful...

dotsdl commented 8 years ago

Ah...this is not something we can fix, as it's a limitation of the pandas.HDFStore object, which handles conversion of pandas objects into PyTables objects. What it's saying is that having a multi-index on the columns isn't something it can handle.

It can, however, handle a multi-index on the rows, so doing:

sim.data.add('multi', df.transpose())

works just fine. Does that help you, though?
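A minimal sketch of what the transpose does, using plain pandas only (no MDS or HDF5 involved). The frame below is a stand-in for the one in the report, with float values chosen for illustration. Transposing moves the MultiIndex from the columns to the rows, and transposing again after loading restores the original layout. One caveat: a frame with mixed dtypes is upcast to a common dtype by the transpose, which is why the sketch sticks to floats.

```python
import numpy as np
import pandas as pd

# Stand-in for the DataFrame in the report: MultiIndex on the columns.
columns = pd.MultiIndex.from_tuples(
    [("R1", "NZ1"), ("R1", "NZ2"), ("T2", "OG1"), ("Q3", "OE1")]
)
df = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4), columns=columns)

# Transpose before storing: the MultiIndex now sits on the rows.
stored = df.transpose()

# Transpose again after loading to get the original layout back.
restored = stored.transpose()

print(type(stored.index).__name__)    # index type of the transposed frame
print(type(restored.columns).__name__)  # column type after the round trip
```

In practice `stored` is what would be passed to `sim.data.add`, and the second transpose would be applied to whatever is read back.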

A problem you might have despite this is that appends can only be done on the rows, not on columns. So if the code you're writing to generate the data needs to keep appending to the already-stored DataFrame then you'll have to drop the multi-index on the columns entirely. :/
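One way to drop the column multi-index without losing the labels is to join each tuple into a single string and split it again after loading. This is only a sketch under assumptions: the `SEP` separator and the stand-in frame are hypothetical, and the separator must not occur in any of the labels.

```python
import numpy as np
import pandas as pd

# Stand-in frame with a MultiIndex on the columns.
columns = pd.MultiIndex.from_tuples(
    [("R1", "NZ1"), ("R1", "NZ2"), ("T2", "OG1"), ("Q3", "OE1")]
)
df = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4), columns=columns)

SEP = "__"  # hypothetical separator; must not appear in any column label

# Flatten: ('R1', 'NZ1') becomes 'R1__NZ1', giving a plain single-level Index.
flat = df.copy()
flat.columns = [SEP.join(labels) for labels in df.columns]

# Rebuild the MultiIndex from the flattened names after loading.
restored = flat.copy()
restored.columns = pd.MultiIndex.from_tuples(
    [tuple(name.split(SEP)) for name in flat.columns]
)

print(list(flat.columns))
```

With single-level columns like those in `flat`, rows can keep being appended to the stored table, at the cost of rebuilding the hierarchy by hand when reading.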

If you don't care about appends, then you shouldn't lose anything from this workaround. However, depending on the dimensions (number of columns) read/write performance may not be great. DataFrame storage tends to be better when there are < 1000 columns.

orbeckst commented 8 years ago

Thanks. I suppose I could live with the workaround; in the meantime I've been aggregating the data before converting to a DataFrame.

But it is an inconvenience (on pandas' side) that one cannot store everything that one can build. (I could probably redesign my data structures, but part of the appeal here is that I can do relatively quick-and-dirty analysis in a sane framework.)

I am closing this issue because MDS/datreant cannot do anything about it.

dotsdl commented 8 years ago

Yeah, I don't think any persistence format available in pandas can store all of pandas' own data structures, unfortunately.

datreant / MDSynthesis

failed to store DataFrame with column multi-index #46