delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
2.14k stars 380 forks source link

Rust engine doesn't correctly seralize path for partions on timestamp on Windows #2382

Open thomasfrederikhoeck opened 5 months ago

thomasfrederikhoeck commented 5 months ago

Environment

Delta-rs version: main

Binding: python

Environment:


Bug

What happened: When using the rust engine timestamps are serialized with colon (:) it the file path. This does not work on Windows. OSError: Generic LocalFileSystem error: Unable to open file C:\projects\delta-rs\mytable\time=2021-01-02 03:04:06.000003\part-00001-2be14fa0-e4f4-4fc0-bf61-6779b08cf550-c000.snappy.parquet#1: The filename, directory name, or volume label syntax is incorrect. (os error 123)

What you expected to happen:

That timewas seralized like in pyarrow: time=2021-01-01%2003%3A04%3A06.000003 How to reproduce it: Run the following on windows

import pandas as pd
from deltalake import write_deltalake

dates = pd.date_range(datetime(2021,1,1,3,4,6,3),datetime(2021,1,3,3,4,6))
df = pd.DataFrame({"time":dates, "a":[i for i in range(len(dates))]})
write_deltalake("mytable",df, partition_by="time", mode="overwrite",engine="rust")

More details:

ion-elgreco commented 5 months ago

In python we use the FileSystemHandler from src/filesystem.rs, this always normalizes the path:

    fn normalize_path(&self, path: String) -> PyResult<String> {
        let suffix = if path.ends_with('/') { "/" } else { "" };
        let path = Path::parse(path).unwrap();
        Ok(format!("{path}{suffix}"))
    }

Encode In theory object stores support any UTF-8 character sequence, however, certain character sequences cause compatibility problems with some applications and protocols. Additionally some filesystems may impose character restrictions, see [LocalFileSystem]. As such the naming guidelines for [S3], [GCS] and [Azure Blob Storage] all recommend sticking to a limited character subset.

A string containing potentially problematic path segments can therefore be encoded to a [Path] using [Path::from]or [Path::from_iter]. This will percent encode any problematic segments according to [RFC 1738].

thomasfrederikhoeck commented 5 months ago

~Isn't that quote saying that Path::from should be used?~ Just tried - didn't work. This doesn't mention colon and Windows so I don't know if it is missing in object store? https://docs.rs/object_store/latest/object_store/local/struct.LocalFileSystem.html#path-semantics-1

Windows forbids certain ASCII characters, e.g. < or |

thomasfrederikhoeck commented 5 months ago

@ion-elgreco I just tried out some different examples. It looks like Path::from handles illegal Windows characters such as < and | while Path::parse doesn't. But it appears that : is not handle in any case.

use object_store::path::{Path};

fn main() {
    let path: String = r"C:\table\time=2021-01-02 03:04:06.000003\file.parquet".to_string();
    // let path: Result<Path, object_store::path::Error> = Path::parse(path);
    let path_from = Path::from(path);
    println!( "{:?}", path_from);

    let path: String = r"C:\table\time=2021-01-02 03:04:06.000003\file.parquet".to_string();
    let path_parse: Result<Path, object_store::path::Error> = Path::parse(path);
    println!( "{:?}", path_parse);

    let path: String = r"C:\table\time=2021-01-02 03:04:06.000003\<file|.parquet".to_string();
    // let path: Result<Path, object_store::path::Error> = Path::parse(path);
    let path_from = Path::from(path);
    println!( "{:?}", path_from);

    let path: String = r"C:\table\time=2021-01-02 03:04:06.000003\<file|.parquet".to_string();
    let path_parse: Result<Path, object_store::path::Error> = Path::parse(path);
    println!( "{:?}", path_parse);
}
Path { raw: "C:%5Ctable%5Ctime=2021-01-02 03:04:06.000003%5Cfile.parquet" }
Ok(Path { raw: "C:\\table\\time=2021-01-02 03:04:06.000003\\file.parquet" })
Path { raw: "C:%5Ctable%5Ctime=2021-01-02 03:04:06.000003%5C%3Cfile%7C.parquet" }
Ok(Path { raw: "C:\\table\\time=2021-01-02 03:04:06.000003\\<file|.parquet" })
thomasfrederikhoeck commented 1 month ago

Closed upstream in https://github.com/apache/arrow-rs/issues/5830 so when a new object store is released an used it should be fixed ✌🏻