Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

ability to write to single parquet file #2359

Open universalmind303 opened 2 weeks ago

universalmind303 commented 2 weeks ago

Is your feature request related to a problem? Please describe.

I want to write to a single parquet file.

Describe the solution you'd like

daft.read_parquet("./my_file.parquet").write_parquet('my_file_new.parquet')

Currently this writes to my_file_new.parquet/<uuid>.parquet

If the user specifies a directory, we should keep writing as above, but if they specify an exact file path like my_file_new.parquet, the data should be coalesced and written to that single file.
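
For concreteness, the proposed path-dependent behavior could look something like the sketch below (out_dir is just an illustrative path, and the single-file case is not implemented today):

import daft

df = daft.read_parquet("./my_file.parquet")

# Directory target: keep today's behavior, i.e. write out_dir/<uuid>.parquet
df.write_parquet("out_dir")

# Exact file target (proposed): coalesce the data and write exactly this file
df.write_parquet("my_file_new.parquet")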

jaychia commented 2 weeks ago

Just noting that I believe this would be a departure from Spark behavior, and it would involve coalescing the data into 1 partition...

Should we maybe add a kwarg like single_file=True instead?
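
For illustration, the call might look something like this (single_file is just a hypothetical name here, not an existing kwarg):

import daft

daft.read_parquet("./my_file.parquet").write_parquet(
    "my_file_new.parquet",
    single_file=True,  # hypothetical kwarg: coalesce to one partition and write exactly this path
)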

universalmind303 commented 2 weeks ago

It may deviate from Spark, but all of the other popular data tools (Polars, PyArrow, pandas, DuckDB) write a single file.

For example:


import polars as pl
import pandas as pd
import duckdb

# polars -- single file
pl.read_csv("some_file").write_parquet('./data.parquet') 

# pyarrow (via polars) -- single file
pl.read_csv("some_file").write_parquet('./data.parquet', use_pyarrow=True) 

# pandas -- single file
pd.read_csv("some_file").to_parquet('./data.parquet')

# duckdb -- single file
duckdb.read_csv("some_file").write_parquet('./data.parquet')

Hanspagh commented 1 week ago

Just wanted to point out that the standard behavior of pyarrow is not necessarily to write a single file either: if you enable partitioning, it will also produce multiple files. It is also worth noting that having multiple smaller parquet files can speed up reads when using Spark (maybe also Daft?). @universalmind303, could you expand a bit on your use case for needing a single parquet file?
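
For instance, a minimal pyarrow sketch (the table and column names here are made up):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"year": [2023, 2023, 2024], "value": [1, 2, 3]})

# A partitioned write produces a directory tree with one or more files per partition,
# e.g. out/year=2023/<...>.parquet and out/year=2024/<...>.parquet, not a single file.
pq.write_to_dataset(table, root_path="out", partition_cols=["year"])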

Also, if you want to mimic how Spark does this, you can force the number of output files with the repartition command.
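
Something along these lines, assuming Daft's repartition accepts a target partition count (the output will still be a generated file name under the given path, not an exact file):

import daft

df = daft.read_parquet("./my_file.parquet")

# Collapse everything into a single partition so the writer emits one file,
# which still ends up as my_file_new.parquet/<uuid>.parquet rather than an exact filename.
df.repartition(1).write_parquet("my_file_new.parquet")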