ScottishCovidResponse / SCRCIssueTracking

Central issue tracking repository for all repos in the consortium
6 stars 0 forks source link

Table implementation does not support column of strings #555

Open peter-t-fox opened 4 years ago

peter-t-fox commented 4 years ago

The table implementation has problems when writing a table containing a column of strings. A TypeError is produced on the Python side. The following notes have been pulled from the README file.

The python pipeline API, converts DataFrame into records then write to hdf5, data table are kept in tabular data format, but std::string as column data format is not supported.

runtime error: TypeError: Object dtype dtype('O') has no native HDF5 equivalent

mrow84 commented 4 years ago

Are the std::string getting turned into bytes rather than str when being moved into python?

qingfengxia commented 4 years ago

no, yet turned into byte. py::list l = py::cast(table.get_column<std::string>(colname));

https://github.com/ScottishCovidResponse/data_pipeline_api/blob/cppbinding-dev/bindings/cpp/datapipeline.cc#L160

I tried "to_hdf()" save table, it is possible to write column, but not in tabular format. meanwhile, once to_record() is used, string col has error, it may worth have a look C++ code readme.md for more details.

github-actions[bot] commented 4 years ago

Heads up @mrow84 @bobturneruk - the "data pipeline api" label was applied to this issue.

ianhinder commented 4 years ago

The error message, "runtime error: TypeError: Object dtype dtype('O') has no native HDF5 equivalent", suggests that the data type is "object", i.e. it's a boxed representation in the dataframe. This post suggests that strings in dataframes, being variable length, are always stored as objects. We can convert them to a fixed-length string in the C++ wrapper, i.e. the C++ equivalent of

df['column'] = df['column'].astype('|S80') #where the max length is set at 80 bytes,

or the Python code could detect the object, that the object is a string, and convert it to a fixed length string. The underlying problem appears to be that the numpy.recarray type, used by dataframe.to_records, as input to h5py, doesn't support variable-length strings. I expect that tables containing strings will be essential for most models. Maybe it is cleanest for the Python code to convert all object(string)s in the dataframe to fixed-length strings. It could look and see what the longest string is, or we could set a maximum length in the spec. The latter might be good, in the case of possibly needing to do streamed writes in future.