universalmind303 opened 2 weeks ago
Just noting that this would be a departure from Spark behavior, I believe, and would involve coalescing the data into 1 partition... Should we maybe add a kwarg like `single_file=True` instead?
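A minimal sketch of what that could look like (the `single_file` kwarg is hypothetical and not part of Daft's current API; the file names are illustrative):

```python
import daft

df = daft.read_csv("some_file")

# Hypothetical kwarg: opt in to coalescing the output into one file.
df.write_parquet("my_file_new.parquet", single_file=True)
```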
It may deviate from Spark, but all other popular data tools (Polars, PyArrow, pandas, DuckDB) return a single file. For example:
```python
import duckdb
import pandas as pd
import polars as pl

# polars -- single file
pl.read_csv("some_file").write_parquet("./data.parquet")

# pyarrow via polars -- single file
pl.read_csv("some_file").write_parquet("./data.parquet", use_pyarrow=True)

# pandas -- single file
pd.read_csv("some_file").to_parquet("./data.parquet")

# duckdb -- single file
duckdb.read_csv("some_file").write_parquet("./data.parquet")
```
Just wanted to point out that the standard behavior of pyarrow is not necessarily to return a single file if you enable partitioning. It's also worth noting that having multiple smaller parquet files can speed up reads when using Spark (maybe also Daft?). @universalmind303 maybe you can expand a bit on the use case for having a single parquet file?
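For illustration, a minimal sketch of pyarrow's partitioned-write behavior (the table, paths, and column names here are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"year": [2022, 2022, 2023], "value": [1, 2, 3]})

# With partition_cols, pyarrow writes a directory tree with one or more
# files per partition, e.g. ./data/year=2022/<uuid>.parquet -- not a single file.
pq.write_to_dataset(table, "./data", partition_cols=["year"])
```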
Also, if you want to mimic how Spark does this, you can force the number of output files with the `repartition` command.
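For example, in PySpark (a sketch; the paths are illustrative, and note the result is still a directory containing a single part file rather than a bare file):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("some_file", header=True)

# Collapsing to one partition forces a single part file inside the
# output directory; coalesce(1) does the same without a full shuffle.
df.repartition(1).write.parquet("./data.parquet")
```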
**Is your feature request related to a problem? Please describe.**
I want to write to a single parquet file.

**Describe the solution you'd like**
Currently this writes to `my_file_new.parquet/<uuid>.parquet`. If the user specifies a directory, we should write as above; but if they specify an exact single file like `my_file_new.parquet`, then the output should be coalesced into that single file.
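A sketch of the proposed semantics (`daft.read_csv` and `write_parquet` exist today; dispatching on whether the path looks like a directory or a file is the proposal, not current behavior):

```python
import daft

df = daft.read_csv("some_file")

# Directory target: keep today's behavior, one file per partition under the dir.
df.write_parquet("my_output_dir/")

# Exact-file target: proposed behavior, coalesce everything into this one file.
df.write_parquet("my_file_new.parquet")
```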