apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Different schemas when inferring from local system in different OS #5779

Closed tiago-ssantos closed 8 months ago

tiago-ssantos commented 1 year ago

Describe the bug

When inferring a schema, list_all_files uses an object store to list the files, and no sort order is specified. When the object store is a LocalFileSystem, the order of the returned files is not guaranteed (the list returned on macOS is ordered differently from the one returned on Windows). This means that the inferred schema can differ for the same set of files.

We contacted the object store maintainers (https://github.com/apache/arrow-rs/issues/3975), who pointed out that the fix should be implemented in the caller of the method, by applying a sort of some kind, to keep results consistent across file systems.
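To illustrate why a caller-side sort helps, here is a minimal, standard-library-only sketch. It does not use DataFusion's API: `ListedFile`, `merge_columns`, and `infer_columns_sorted` are hypothetical names, and the first-seen merge is only an assumed stand-in for how an inferred schema can end up depending on listing order.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a listed file: its path plus its column names in file order.
#[derive(Debug, Clone)]
pub struct ListedFile {
    pub path: String,
    pub columns: Vec<String>,
}

/// Merge column names in first-seen order.
/// The result depends on the order of `files`, which is the problem described above.
pub fn merge_columns(files: &[ListedFile]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut merged = Vec::new();
    for file in files {
        for col in &file.columns {
            // `insert` returns true only the first time a name is seen.
            if seen.insert(col.as_str()) {
                merged.push(col.clone());
            }
        }
    }
    merged
}

/// Sort the listing by path before merging, so the result no longer
/// depends on the order in which the file system returned the files.
pub fn infer_columns_sorted(mut files: Vec<ListedFile>) -> Vec<String> {
    files.sort_by(|a, b| a.path.cmp(&b.path));
    merge_columns(&files)
}
```

With this sketch, passing the same files in any order to `infer_columns_sorted` yields the same column order, whereas `merge_columns` alone returns whatever order the first listed file dictates.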

To Reproduce

Having two parquet files in the filesystem with the schema:

and executing:

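use std::sync::Arc;

use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::{ListingOptions, ListingTableUrl};
use datafusion::prelude::SessionContext;

#[tokio::test]
async fn infer_schema() {
    // Directory containing the Parquet files.
    let path = ListingTableUrl::parse("./files").unwrap();
    let ctx = SessionContext::new();
    let state = ctx.state();
    let options = ListingOptions::new(Arc::new(ParquetFormat::default()));

    // The inferred schema depends on the order in which the files are listed.
    let schema = options.infer_schema(&state, &path).await.unwrap();

    // Print the field names in schema order.
    schema.fields.iter().for_each(|field| println!("{}", field.name()));
}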

The result on macOS Ventura:

description
code
year

The first file picked up was file3.parquet. On Windows:

year
code
description

The first file picked up was file1.parquet.

Expected behavior

The same schema regardless of the OS on which the code is run. A sort should be enforced, or at least there should be a way to pass in a sort function.
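One possible shape for "enforce a sort, or let the caller pass one" is sketched below. This is an assumption for illustration only: `FileOrder` and `order_paths` are hypothetical names and are not part of DataFusion or object_store.

```rust
use std::cmp::Ordering;

/// Hypothetical ordering option; not part of DataFusion's API.
pub enum FileOrder {
    /// Lexicographic by path, used as a deterministic default.
    ByPath,
    /// Caller-supplied comparator over file paths.
    Custom(Box<dyn Fn(&str, &str) -> Ordering>),
}

/// Return the listed paths in a deterministic order chosen by the caller.
pub fn order_paths(mut paths: Vec<String>, order: &FileOrder) -> Vec<String> {
    match order {
        FileOrder::ByPath => paths.sort(),
        FileOrder::Custom(cmp) => paths.sort_by(|a, b| cmp(a.as_str(), b.as_str())),
    }
    paths
}
```

A caller that is happy with the default would pass `&FileOrder::ByPath`; one that needs, say, reverse order could pass `FileOrder::Custom(Box::new(|a: &str, b: &str| b.cmp(a)))`.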

Additional context

No response

thomas-k-cameron commented 1 year ago

I just created a PR for this issue. There is still some work to do (e.g. tests), but I hope it works!