apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Regression: `DataFrameWriteOptions::with_single_file_output` produces a directory #13323

Open sergiimk opened 1 week ago

sergiimk commented 1 week ago

Describe the bug

Consider a snippet like this:

df.write_parquet(
  "dir/data",
  DataFrameWriteOptions::new().with_single_file_output(true),
  None
).await

Before v43 this would write a single file called `data`, but v43 creates `data` as a directory containing one or more randomly named files.

This seems to be related to #13079 (cc @dhegberg) that added an extension-based heuristic.

I see this as a regression: single file output is requested explicitly, so I don't want a heuristic to be applied.

We are using Parquet files with a content-addressable file system and our files don't have extensions.
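To make the failure mode concrete, here is a minimal sketch of an extension-based heuristic and of an explicit flag taking precedence over it. Both `has_extension` and `is_single_file` are hypothetical names used for illustration, as is modelling the flag as `Option<bool>`. This is not DataFusion's actual code, only an approximation of the behavior described above.

```rust
use std::path::Path;

/// Hypothetical stand-in for the extension-based heuristic:
/// a path counts as a single file only if its last component has an extension.
fn has_extension(path: &str) -> bool {
    // A trailing slash always marks a directory.
    !path.ends_with('/') && Path::new(path).extension().is_some()
}

/// Hypothetical decision function: an explicit request (Some(..)) wins,
/// and the heuristic is consulted only when nothing was requested (None).
fn is_single_file(path: &str, explicit: Option<bool>) -> bool {
    explicit.unwrap_or_else(|| has_extension(path))
}
```

Under these assumptions, `is_single_file("dir/data", None)` falls back to the heuristic and treats the extensionless path as a directory, which mirrors the v43 behavior, while `is_single_file("dir/data", Some(true))` honors the explicit request.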

To Reproduce

See above

Expected behavior

Considering the introduction of the extension-based heuristic I would suggest the following behavior:

Additional context

-

alamb commented 5 days ago

I agree this is a regression. Thank you for the callout @sergiimk

I think this is a pretty good first issue for someone as the description is clear and the need is well defined.

irenjj commented 5 days ago

take

irenjj commented 5 hours ago

It seems hard to control the behavior of `write_parquet` through `single_file_output` (which, as far as I can tell, is never used). What actually determines whether a single file is written is the suffix check in `start_demuxer_task()`. I can think of a couple of ways to handle this:

  1. Add a suffix such as .single to paths that must produce a single file, and recognize that suffix in start_demuxer_task().
  2. Drop single_file_output from DataFrameWriteOptions and use FileSinkConfig to control single-file behavior instead.
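Option 1 could look roughly like the sketch below. The `.single` marker comes from the proposal above; `OutputTarget`, `mark_single_file`, and `resolve_target` are hypothetical names, and the fallback to an extension check is an assumption about how the existing suffix logic behaves rather than a copy of `start_demuxer_task()`.

```rust
use std::path::Path;

/// Marker suffix from option 1; not an existing DataFusion convention.
const SINGLE_MARKER: &str = ".single";

/// Hypothetical result of resolving an output path.
#[derive(Debug, PartialEq)]
enum OutputTarget {
    SingleFile(String),
    Directory(String),
}

/// Caller side: tag a path when single file output was requested.
fn mark_single_file(path: &str) -> String {
    format!("{}{}", path, SINGLE_MARKER)
}

/// Writer side: strip the marker if present, otherwise fall back
/// to the extension check (assumed here to be the current heuristic).
fn resolve_target(path: &str) -> OutputTarget {
    if let Some(real) = path.strip_suffix(SINGLE_MARKER) {
        return OutputTarget::SingleFile(real.to_string());
    }
    if Path::new(path).extension().is_some() {
        OutputTarget::SingleFile(path.to_string())
    } else {
        OutputTarget::Directory(path.to_string())
    }
}
```

One drawback of this design is that the marker overloads the path: a file whose real name ends in `.single` could no longer be written as named, which may be an argument for carrying the flag explicitly, as in option 2.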

cc @alamb @sergiimk @dhegberg