When inferring a schema, list_all_files uses an object store to list the files, and no sort order is applied.
When the object store is a LocalFileSystem, there is no guarantee of any file ordering (the list returned on macOS is ordered differently from the one returned on Windows). This means that the inferred schema can differ for the same set of files.
We contacted the object store maintainers (https://github.com/apache/arrow-rs/issues/3975), who pointed out that the fix should be implemented in the caller of the method, by applying a sort of some kind, to keep results consistent across file systems.
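As a rough illustration of sorting in the caller, here is a minimal standard-library sketch. It does not use the object store API; `list_sorted` is a hypothetical helper name, and it only assumes that a directory listing arrives in an arbitrary order (which is what `std::fs::read_dir` documents).

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Hypothetical helper (not part of DataFusion or object_store):
// list a directory and return its entries sorted by path.
fn list_sorted(dir: &Path) -> io::Result<Vec<PathBuf>> {
    // read_dir yields entries in a platform- and filesystem-dependent order.
    let mut paths = fs::read_dir(dir)?
        .map(|entry| entry.map(|e| e.path()))
        .collect::<io::Result<Vec<PathBuf>>>()?;
    // Sorting in the caller makes the order the same on every OS.
    paths.sort();
    Ok(paths)
}
```

Any total order would do here; sorting by path is simply the cheapest one that needs no extra metadata.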
To Reproduce
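Having two Parquet files in the filesystem with the schema:
and executing:
use std::sync::Arc;
use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::{ListingOptions, ListingTableUrl};
use datafusion::prelude::SessionContext;

#[tokio::test]
async fn infer_schema() {
    // Directory containing the Parquet files.
    let path = ListingTableUrl::parse("./files").unwrap();
    let ctx = SessionContext::new();
    let state = ctx.state();
    let options = ListingOptions::new(Arc::new(ParquetFormat::default()));
    let schema = options.infer_schema(&state, &path).await.unwrap();
    // Print the inferred field names in schema order.
    schema.fields.iter().for_each(|field| println!("{}", field.name()));
}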
The result on macOS Ventura:
description
code
year
The first file picked up was file3.parquet.
On Windows:
year
code
description
The first file picked up was file1.parquet.
Expected behavior
The same schema regardless of the OS on which the code runs. A sort should be enforced, or at least it should be possible to pass a sort function.
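To show why the listing order matters, here is a simplified sketch of order-dependent schema merging. It is a toy model, not DataFusion's implementation: `ListedFile`, `merge_columns` and `infer_columns_sorted` are hypothetical names, and the merge simply keeps columns in first-seen order.

```rust
// Hypothetical stand-in for a listed file: its path and its column names.
#[derive(Clone, Debug)]
struct ListedFile {
    path: String,
    columns: Vec<String>,
}

// Merge column names in first-seen order.
// Simplified model of schema merging, assumed for illustration only.
fn merge_columns(files: &[ListedFile]) -> Vec<String> {
    let mut merged: Vec<String> = Vec::new();
    for file in files {
        for col in &file.columns {
            if !merged.contains(col) {
                merged.push(col.clone());
            }
        }
    }
    merged
}

// Sort the listing by path before merging, so the result does not depend
// on the order in which the file system returned the entries.
fn infer_columns_sorted(mut files: Vec<ListedFile>) -> Vec<String> {
    files.sort_by(|a, b| a.path.cmp(&b.path));
    merge_columns(&files)
}
```

With the sort in place, passing the same files in either order yields the same column order, whereas calling `merge_columns` directly reproduces the OS-dependent behavior described above. Accepting a caller-supplied comparator instead of the fixed path comparison would cover the "pass a sort function" variant.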