peter-t-fox opened this issue 4 years ago
Are the `std::string`s getting turned into `bytes` rather than `str` when being moved into Python?

At the moment they come through as `bytes`: `py::list l = py::cast(table.get_column<std::string>(colname));`
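As a stopgap on the Python side, the `bytes` entries coming out of the cast above could be decoded back to `str`. A minimal sketch, where `raw` is a stand-in for the list produced by `py::cast` (the values are illustrative, not from the real pipeline):

```python
# raw stands in for the list the C++ layer returns; entries may be bytes.
raw = [b"alpha", b"beta"]

# Decode bytes entries to str, leaving any non-bytes entries untouched.
decoded = [v.decode("utf-8") if isinstance(v, bytes) else v for v in raw]
```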
I tried `to_hdf()` to save the table, and it is possible to write the columns that way. When `to_records()` is used, the string column produces an error. It may be worth having a look at the C++ code; see readme.md for more details.
Heads up @mrow84 @bobturneruk - the "data pipeline api" label was applied to this issue.
The error message, `runtime error: TypeError: Object dtype dtype('O') has no native HDF5 equivalent`, suggests that the data type is "object", i.e. it's a boxed representation in the dataframe. This post suggests that strings in dataframes, being variable length, are always stored as objects. We can convert them to a fixed-length string in the C++ wrapper, i.e. the C++ equivalent of

```python
df['column'] = df['column'].astype('|S80')  # where the max length is set at 80 bytes
```
or the Python code could detect that a column is object dtype and holds strings, and convert it to a fixed-length string. The underlying problem appears to be that the `numpy.recarray` type, produced by `dataframe.to_records()` and used as input to h5py, doesn't support variable-length strings. I expect that tables containing strings will be essential for most models. Maybe it is cleanest for the Python code to convert all object (string) columns in the dataframe to fixed-length strings. It could either scan the data to find the longest string, or we could set a maximum length in the spec. The latter might be better if we need to do streamed writes in future.
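A sketch of the "scan for the longest string" option, applied at the record-array level rather than via `Series.astype` (which can leave the column as object dtype). Every object field in the record dtype is replaced with a fixed-width `S` dtype sized to the longest value present; the DataFrame contents here are illustrative:

```python
import numpy as np
import pandas as pd

# Example data; the real pipeline would receive an arbitrary DataFrame.
df = pd.DataFrame({"name": ["alice", "bob"], "score": [1.5, 2.0]})

records = df.to_records(index=False)

# Build a new dtype in which every object field becomes a fixed-width
# byte string sized to the longest encoded value in that column.
new_dtype = []
for field in records.dtype.names:
    if records.dtype[field] == np.dtype("O"):
        max_len = max(len(str(v).encode("utf-8")) for v in records[field])
        new_dtype.append((field, f"S{max_len}"))
    else:
        new_dtype.append((field, records.dtype[field]))

# Rebuild the record array with the HDF5-friendly dtype.
fixed = np.array([tuple(row) for row in records], dtype=new_dtype)
```

With a maximum length fixed in the spec instead, the scan for `max_len` disappears and streamed writes become possible, at the cost of truncating (or rejecting) over-long strings.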
The table implementation has problems when writing a table containing a column of strings: a `TypeError` is produced on the Python side. The following notes have been pulled from the README file: the Python pipeline API converts the DataFrame into records and then writes them to HDF5. Data tables are kept in tabular format, but `std::string` as a column data type is not supported.