Open sergiimk opened 1 week ago
I agree this is a regression. Thank you for the callout @sergiimk
I think this is a pretty good first issue for someone as the description is clear and the need is well defined.
take
It seems hard to control the behavior of `write_parquet` through `single_file_output` (and I've noticed that it's never used). What really controls whether a single file is generated is the path suffix check in `start_demuxer_task()`. There are several methods I can think of to handle this issue:
- Append `.single` to the paths that require generating a single file, and then recognize this suffix in `start_demuxer_task()`.
- Instead of `single_file_output` in `DataFrameWriteOptions`, use `FileSinkConfig` to control single file behavior.

cc @alamb @sergiimk @dhegberg
Describe the bug
Consider a snippet like this:
Before v43 this would write a single file called `data`, but in v43 this creates `data` as a directory with one or more randomly named files in it.

This seems to be related to #13079 (cc @dhegberg), which added an extension-based heuristic.
I see this as a regression, as single file output is requested explicitly, and I don't want a heuristic to be applied.
We are using Parquet files with a content-addressable file system and our files don't have extensions.
To Reproduce
See above
Expected behavior
Considering the introduction of the extension-based heuristic I would suggest the following behavior:
- `with_single_file_output` is not called (`single_file_output == None`) - apply the heuristic
- `with_single_file_output(true)` - produce a single file at the exact path specified
- `with_single_file_output(false)` - create a directory under the specified path if it doesn't exist and write one or many files

Additional context
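A minimal sketch of the three cases listed under Expected behavior, assuming the option is carried as an `Option<bool>`. The names `OutputLayout` and `resolve_layout` are hypothetical and are not part of DataFusion. The extension check is only a simplified stand-in for the heuristic added in #13079, not the actual implementation.

```rust
use std::path::Path;

/// Hypothetical type for illustration: how the writer lays out its output.
#[derive(Debug, PartialEq)]
enum OutputLayout {
    /// Write exactly one file at the given path.
    SingleFile,
    /// Treat the path as a directory and write one or many files into it.
    Directory,
}

/// Hypothetical helper: resolve the layout from a tri-state option.
/// `None` means `with_single_file_output` was never called.
fn resolve_layout(path: &str, single_file_output: Option<bool>) -> OutputLayout {
    match single_file_output {
        // An explicit request always wins over the heuristic.
        Some(true) => OutputLayout::SingleFile,
        Some(false) => OutputLayout::Directory,
        // No explicit request: fall back to an extension-based guess.
        // Assumption: a trailing '/' marks a directory, and any file
        // extension marks a single file.
        None => {
            let has_extension = Path::new(path).extension().is_some();
            if !path.ends_with('/') && has_extension {
                OutputLayout::SingleFile
            } else {
                OutputLayout::Directory
            }
        }
    }
}
```

Under these assumptions, `resolve_layout("data", Some(true))` resolves to `SingleFile` even though `data` has no extension, which is the case that matters for content-addressable paths, while `resolve_layout("data", None)` still falls back to the heuristic and resolves to `Directory`.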
-