Open torshind opened 1 year ago
Related to: #686
I think for dictionary types (and other future encoded types, such as REE), we should develop some facility for mapping to a canonical "logical type". So Dictionary(Int8, Utf8) should map to Utf8 (and so should LargeUtf8), which we know is supported.
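To make the idea concrete, here is a minimal sketch of such a mapping. It deliberately uses small stand-in classes rather than the real Arrow or delta-rs types, so Primitive, Dictionary, CANONICAL and logical_type are all hypothetical names chosen for illustration, and the LargeBinary entry is an assumed extra example.

```python
from dataclasses import dataclass


# Hypothetical, simplified stand-ins for Arrow data types; not the real Arrow API.
@dataclass(frozen=True)
class Primitive:
    name: str  # e.g. "Utf8", "LargeUtf8", "Int8"


@dataclass(frozen=True)
class Dictionary:
    index_type: Primitive
    value_type: "Primitive | Dictionary"


# Physical variants that share one logical type (assumed mapping for illustration).
CANONICAL = {"LargeUtf8": "Utf8", "LargeBinary": "Binary"}


def logical_type(dtype):
    """Strip encodings and size variants to reach a canonical logical type."""
    if isinstance(dtype, Dictionary):
        # The dictionary encoding is a storage detail; only the value type matters.
        return logical_type(dtype.value_type)
    return Primitive(CANONICAL.get(dtype.name, dtype.name))


print(logical_type(Dictionary(Primitive("Int8"), Primitive("Utf8"))))
print(logical_type(Primitive("LargeUtf8")))
```

Both calls resolve to the same Utf8 logical type. A run-end encoded type could presumably be handled by one more branch that unwraps its value type in the same way.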
When can we expect this to be implemented?
I tried reproducing the given script, and it no longer errors. It does in fact write out table_test. The resulting columns have different types though. E.g. the chV1_DD_RTwaveExists column starts out as a categorical with values "0" and "1". Once you write it out using the reproducer script above and then read the resulting table, you get a column type of string.
So, in some sense this is fixed. However, the ideal would be to preserve categoricals as categoricals, rather than strings. Is there a technical reason this can't be done, or is it just a matter of someone having the time? If the latter, what specifically needs doing?
This is not possible. Categorical is not a supported primitive type in the delta protocol.
If you would like it to be a supported type, you need to post an RFC in the main delta repo. Only once it has been introduced in the protocol there can we add support.
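In the meantime, a possible workaround on the reading side is to recast the affected columns after loading. The sketch below assumes pandas is in use and that the caller already knows which columns were categorical. The DataFrame is a stand-in for a table read back from Delta, not actual output from the reproducer.

```python
import pandas as pd

# Stand-in for a table read back from Delta: the categorical column arrives as strings.
df = pd.DataFrame({"chV1_DD_RTwaveExists": ["0", "1", "1", "0"]})

# Columns known to be categorical in the source data (an assumption made by the caller).
categorical_columns = ["chV1_DD_RTwaveExists"]

# Recast each one; the categories are inferred from the values present.
for col in categorical_columns:
    df[col] = df[col].astype("category")

print(df.dtypes)
```

Note that categories are rebuilt from the values actually present, so any category that was declared but unused in the original data is not recovered this way.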
Thank you, that's good to know.
Given the reproducer no longer fails, can this issue be closed if that matches other people's results?
Environment
Delta-rs version: 0.8.1
Binding: python
Environment:
Bug
What happened: write_deltalake fails writing a simple dataset with categorical columns
What you expected to happen: write_deltalake not to fail
How to reproduce it: Minimal test case to reproduce it: