Closed ion-elgreco closed 3 months ago
@ion-elgreco I'm looking at switching over to the rust engine for my Lambda that handles the deltalake writes. We partition by date and then hour, never by a full timestamp, e.g. one of our partitions is date=2020-01-01/17 for hour 17. In other words, there are no spaces in our partitions. I'm guessing my case wouldn't be affected by this?
@liamphmurphy - it should not be affected, as the characters there should not get encoded.
In addition - and I haven't tested this - the way the path gets encoded should not matter so long as we can decode it properly from the metadata files, i.e. there is only a problem if the path we read from the metadata file for some reason does not match the path on disk.
The issue @ion-elgreco mentions is more of an inconvenience, as delta does not use folders / filenames for filtering based on partition values, but only the metadata stored in the log.
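To illustrate the point about filtering on log metadata rather than folder names, here is a minimal sketch. The log entry is made up for illustration: the file name, size, and timestamp are hypothetical, and it assumes the `path` field is percent-encoded exactly once. It is not taken from a real table or from the delta-rs code.

```python
import json
from urllib.parse import unquote

# Hypothetical "add" action as it might appear in a _delta_log commit file.
# Assumption: the path is percent-encoded exactly once.
log_line = json.dumps({
    "add": {
        "path": "date=2020-01-01%2017%3A00%3A00/part-0000.parquet",
        "partitionValues": {"date": "2020-01-01 17:00:00"},
        "size": 1024,
        "modificationTime": 0,
        "dataChange": True,
    }
})

add = json.loads(log_line)["add"]

# Partition filtering uses the values recorded in the log, not the folder name.
partition_values = add["partitionValues"]
matches_filter = partition_values["date"] == "2020-01-01 17:00:00"

# The path only has to decode back to the location of the file.
file_path = unquote(add["path"])
```

Under these assumptions the folder name never takes part in the filter; it only matters that `file_path` points at the file that was actually written.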
The encoding of partition values written by the rust engine is not readable by Spark. Spark apparently expects spaces to be encoded as %20, but they are written double-encoded as %2520.
Reading results in an error like: java.net.URISyntaxException: Illegal character in path at index 36. The index points to the space character.
I suppose it is more than just an inconvenience!
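For anyone wondering where %2520 comes from, here is a small sketch using Python's standard `urllib.parse`. This is not the encoder either engine actually uses, and the timestamp value is just an example. It only shows how percent-encoding an already encoded string turns `%` into `%25`.

```python
from urllib.parse import quote, unquote

value = "2020-01-01 17:00:00"   # example partition value containing a space

once = quote(value)             # space -> %20, colon -> %3A
twice = quote(once)             # every "%" is encoded again as %25

# A reader that decodes a single time gets the encoded string back,
# not the original value.
decoded_once = unquote(twice)

# Values without special characters pass through unchanged.
plain = quote("2020-01-01")
```

Here `once` is `2020-01-01%2017%3A00%3A00` and `twice` is `2020-01-01%252017%253A00%253A00`, while a value such as `2020-01-01` or `17` has nothing to encode, which is consistent with the date/hour layout above not being affected.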
@Willem-J-an a fix will be released in 0.18.3
Environment
PyArrow seems to encode partition paths differently than our rust writers. This becomes a bit problematic when you write with the PyArrow engine and then merge with Rust: data ends up in a different partition folder :P