delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
1.98k stars 365 forks source link

write_deltalake fails writing a simple dataset with categorical columns #1326

Open torshind opened 1 year ago

torshind commented 1 year ago

Environment

Delta-rs version: 0.8.1

Binding: python

Environment:


Bug

What happened: write_deltalake fails writing a simple dataset with categorical columns

Traceback (most recent call last):
  File "/home/mvm/gitlab/lakehouse/bug.py", line 5, in <module>
    write_deltalake("./table_test", data=df, mode="overwrite")
  File "/home/mvm/.pyenv/versions/studio-py3.10/lib/python3.10/site-packages/deltalake/writer.py", line 312, in write_deltalake
    _write_new_deltalake(
deltalake.PyDeltaTableError: Schema error: Invalid data type for Delta Lake: Dictionary(Int8, Utf8)

What you expected to happen: write_deltalake not to fail

How to reproduce it: Minimal test case to reproduce it:

from deltalake.writer import write_deltalake
from sklearn.datasets import fetch_openml

df = fetch_openml(name="arrhythmia", version=1, as_frame=True)["data"]
write_deltalake("./table_test", data=df, mode="overwrite")
wjones127 commented 1 year ago

Related to: #686

wjones127 commented 1 year ago

I think for dictionary types (and other future encoded types, such as REE), we should develop some facility for mapping to a canonical "logical type". So Dictionary(Int8, Utf8) should map to Utf8 (and so should LargeUtf8), which we know is supported.

kangshung commented 8 months ago

When can we expect this to be implemented?

itamarst commented 1 month ago

I tried reproducing the given script, and it no longer errors. It does in fact write out table_test. The resulting columns have different types though. E.g. the chV1_DD_RTwaveExists column starts out as a categorical with values "0" and "1". Once you write it out using the reproducer script above and then read the resulting table, you get a column type of string.

itamarst commented 1 month ago

So, in some sense this is fixed. However, the ideal would be to preserve categoricals as categoricals, rather than strings. Is there a technical reason this can't be done, or is it just a matter of someone having the time? If the latter, what specifically needs doing?

ion-elgreco commented 1 month ago

So, in some sense this is fixed. However, the ideal would be to preserve categoricals as categoricals, rather than strings. Is there a technical reason this can't be done, or is it just a matter of someone having the time? If the latter, what specifically needs doing?

This is not possible. Categorical is not a supported primitive type in the delta protocol.

If you would like to have it be a supported type, you need to post an RFC in the main delta repo. Only once it's introduced there in the protocol, we can add support

itamarst commented 1 month ago

Thank you, that's good to know.

itamarst commented 1 month ago

Given the reproducer no longer fails, if that matches other people's results then this issue can be closed?