delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Pyarrow path encoding slightly different than our rust writers #2181

Closed ion-elgreco closed 3 months ago

ion-elgreco commented 8 months ago

Environment

PyArrow seems to encode partition paths differently than our Rust writers. This becomes a bit problematic when you write with the PyArrow engine and then merge with Rust: the data ends up in a different partition folder :P

liamphmurphy commented 8 months ago

@ion-elgreco I'm looking at switching over to the rust engine for my Lambda that handles the deltalake writes. We partition by date, then hour, and never by a full timestamp, e.g. one of our partitions is date=2020-01-01/17 for hour 17. In other words, there are no spaces in our partitions. I'm guessing my case wouldn't be affected by this...?

roeap commented 5 months ago

@liamphmurphy - it should not be affected, as the characters there should not get encoded.
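As a rough illustration of why such values pass through untouched, here is a minimal sketch using Python's standard-library `urllib.parse.quote`. This is generic percent-encoding, not the actual encoder used by either the PyArrow or the Rust writer, and the timestamp value is a made-up example.

```python
from urllib.parse import quote

# Digits and hyphens are "unreserved" characters in percent-encoding,
# so values like these come out unchanged.
for value in ["2020-01-01", "17"]:
    assert quote(value, safe="") == value

# A full timestamp (hypothetical value) contains a space and colons,
# which do get encoded.
encoded_timestamp = quote("2020-01-01 17:00:00", safe="")
print(encoded_timestamp)  # 2020-01-01%2017%3A00%3A00
```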

In addition - and I haven't tested this - the way the path gets encoded should not matter so long as we can decode it properly from the metadata files, i.e. there is only a problem if the path we read from the metadata file for some reason does not match the path on disk.

The issue @ion-elgreco mentions is more of an inconvenience, as delta does not use folders / filenames for filtering based on partition values, but only the metadata stored in the log.
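To make the two points above concrete, here is a minimal sketch in plain Python. The log entry is hypothetical and hand-written for illustration; the field names `path` and `partitionValues` follow the `add` action of the Delta protocol, but the helper `matches_partition` is an assumption and not part of any library. It also simplifies by assuming the path needs exactly one decoding step.

```python
import json
from urllib.parse import unquote

# Hypothetical commit log line containing one `add` action.
log_line = json.dumps({
    "add": {
        "path": "ts=2020-01-01%2017%3A00%3A00/part-00000.parquet",
        "partitionValues": {"ts": "2020-01-01 17:00:00"},
        "size": 1024,
        "modificationTime": 0,
        "dataChange": True,
    }
})

add = json.loads(log_line)["add"]


def matches_partition(add_action, column, value):
    """Filter on the partition values stored in the log, not on folder names."""
    return add_action["partitionValues"].get(column) == value


# Partition filtering only looks at the metadata.
selected = matches_partition(add, "ts", "2020-01-01 17:00:00")

# The path is only decoded to locate the file; it has to match what is on disk.
file_path = unquote(add["path"])
print(selected, file_path)
```

In this picture, a differently encoded folder name is harmless for filtering, and only becomes a problem when the decoded path no longer points at the file that was actually written.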

Willem-J-an commented 4 months ago

The encoding of partition values in the rust engine is not readable by Spark. Supposedly Spark expects spaces to be encoded as %20, but they are written double-encoded as %2520.

Reading results in an error like: java.net.URISyntaxException: Illegal character in path at index 36. The index points to the space character.

I suppose it is more than just an inconvenience!
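For anyone unsure what double encoding means here, a minimal sketch with Python's standard-library `urllib.parse` shows how `%20` turns into `%2520`. This only demonstrates generic percent-encoding; it is not the code path of the rust engine or of Spark, and the value `"a b"` is a made-up example.

```python
from urllib.parse import quote, unquote

value = "a b"  # hypothetical partition value containing a space

once = quote(value)   # the space becomes %20
twice = quote(once)   # the "%" of "%20" is itself encoded as %25

print(once)   # a%20b
print(twice)  # a%2520b

# A reader that decodes only once ends up with "%20" instead of the space.
decoded_once = unquote(twice)
print(decoded_once == value)  # False
```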

ion-elgreco commented 3 months ago

@Willem-J-an a fix will be released in 0.18.3