Open ritchie46 opened 1 week ago
@ritchie46 That's the pyarrow writer engine; we don't really have control there to change this. Our Rust writer relies on the arrow_cast crate, which supports utf8_view: https://docs.rs/arrow-cast/52.0.0/src/arrow_cast/cast/mod.rs.html#709
Can you check with delta_write_options = {"engine":"rust"}? This should at least run, but it will probably cast utf8_view to utf8 in the rust writer since our delta schema to arrow schema naively translates primitive string to arrow utf8.
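For reference, a minimal sketch of the suggested check, assuming the data is a Polars DataFrame and the installed deltalake version still accepts an `engine` writer option. The helper name `write_with_rust_engine` is hypothetical; only `delta_write_options = {"engine": "rust"}` comes from the comment above.

```python
from typing import Any

# Option suggested above; Polars forwards this dict to the deltalake writer.
RUST_ENGINE_OPTIONS: dict[str, Any] = {"engine": "rust"}


def write_with_rust_engine(df: Any, path: str) -> None:
    """Write `df` (assumed to be a polars.DataFrame) to a Delta table at `path`.

    Hypothetical helper: it only forwards the suggested option to
    DataFrame.write_delta, asking for the Rust writer instead of the
    pyarrow engine.
    """
    df.write_delta(path, delta_write_options=dict(RUST_ENGINE_OPTIONS))
```

Called as `write_with_rust_engine(df, "./some_table")`, this should exercise the Rust writer; whether a utf8_view column gets through without a cast is the open question here.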
Also which pyarrow version did you use here?
Pyarrow 16
Hmm.. the whole cast is the thing I want to circumvent. :/ Are there still impediments on the Rust side?
I understand :) On the Rust side, we should allow utf8_view to be passed through. But right now we always cast record batches to the arrow schema that is derived from the delta schema.
There we don't yet have a way to let a delta string be either arrow utf8, large utf8, or utf8 view. It will always be arrow utf8, which is also a problem today if your source has a large arrow array that's too big for utf8's 32-bit offsets.
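To make the limitation concrete, here is a rough Python sketch of the schema resolution described above, next to a pass-through variant. Everything in it is illustrative: `naive_arrow_type`, `passthrough_arrow_type`, and `fits_in_utf8` are hypothetical names, not delta-rs functions, and types are modeled as plain strings. The one fact it relies on is that arrow utf8 uses 32-bit offsets, so a single array can hold at most 2**31 - 1 bytes of string data.

```python
# Arrow types that could all represent a delta "string" column.
STRING_LIKE = {"utf8", "large_utf8", "utf8_view"}

# Arrow utf8 stores int32 offsets, so one array holds at most this many bytes.
UTF8_MAX_BYTES = 2**31 - 1


def naive_arrow_type(delta_type: str) -> str:
    """Behavior as described above: delta string always becomes utf8."""
    if delta_type == "string":
        return "utf8"
    raise NotImplementedError(delta_type)


def passthrough_arrow_type(delta_type: str, source_type: str) -> str:
    """Hypothetical alternative: keep the source's string-like type as-is."""
    if delta_type == "string" and source_type in STRING_LIKE:
        return source_type  # no cast needed for this column
    return naive_arrow_type(delta_type)


def fits_in_utf8(total_string_bytes: int) -> bool:
    """Whether a column's string data fits in a single utf8 array."""
    return total_string_bytes <= UTF8_MAX_BYTES
```

With the pass-through variant, a utf8_view source column would stay utf8_view and a large utf8 column would not need to be squeezed into 32-bit offsets. How this would fit into the actual schema conversion is left open.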
Arrow has adopted the (IMO) much better binary and string view types. Supporting these would mean we could move data zero-copy to delta-rs. Currently it fails: