Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

Write schema for empty parquet files #2373

Open Hanspagh opened 2 weeks ago

Hanspagh commented 2 weeks ago

Is your feature request related to a problem? Please describe. I am trying out Daft as an alternative to Spark. In our current use of Spark, we rely on the fact that even if a dataframe is empty, Spark will still create an empty Parquet file with the schema of the dataframe. I would like Daft to do the same.
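Roughly the gap I mean, as a sketch (API names per the current Daft docs as I understand them; the output path is just a placeholder):

```python
import daft

df = daft.from_pydict({"id": [1, 2, 3], "name": ["a", "b", "c"]})
empty = df.where(daft.col("id") > 100)  # the filter removes every row

# Spark would still write a part file carrying the schema for this empty
# result; with Daft today, no Parquet file appears to be written at all
# (the behavior this issue is about).
empty.write_parquet("out/")
```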

Describe the solution you'd like I found the following issue in pyarrow which seems to indicate this is a "bug" in Arrow. Either we wait for the upstream fix, or we work around it by using pyarrow.parquet.write_table when writing Parquet.
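For reference, a minimal sketch of the kind of workaround I had in mind (the schema and path here are just placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder schema; in practice this would come from the DataFrame's schema.
schema = pa.schema([("id", pa.int64()), ("name", pa.string())])

# Schema.empty_table() gives a zero-row table that still carries the schema,
# and pq.write_table writes it out as a valid (empty) Parquet file.
pq.write_table(schema.empty_table(), "empty.parquet")

# The schema round-trips even though there are zero rows.
print(pq.read_table("empty.parquet").schema)
```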

Let me know what you think; I might also be able to help drive this change.

jaychia commented 1 week ago

Hi @Hanspagh !

Are you referring specifically to the df.write_parquet(...) API, and that you'd want an empty df to write an empty Parquet file?

We could likely corner-case this to work, but it might get a little messy because we do distributed writes. Each partition writes its own Parquet file(s), so if you have N partitions and all of them are empty, we'd end up with N empty Parquet files.
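To make that concrete, a rough sketch (into_partitions and the output path are just illustrative):

```python
import daft

df = daft.from_pydict({"id": list(range(8))})

# Each of the 4 partitions is written out independently, so the output
# directory typically ends up with one file per non-empty partition.
written = df.into_partitions(4).write_parquet("out_many/")
print(written)  # the returned DataFrame lists the written file paths
```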

Hanspagh commented 1 week ago

Yes, I am referring to df.write_parquet(..). Maybe I can explain a bit more about our use case and why we 'need' empty Parquet files.

In our data transformation repo, we automatically smoke test all our transformation jobs with random sample data based on the input schemas for the job, and then we validate that the output of the job matches an output schema. This works very well in Spark because, no matter the filters etc., Spark will always produce a Parquet file with a schema based on the query plan.

We are experimenting with using Daft as a replacement for Spark and would be very sad to lose this possibility for automated smoke tests.

I see the problem with ending up with several empty Parquet files because of partitions. For our use case it would not really matter, since this is only used for testing anyway, but maybe we need to make this behavior optional?

jaychia commented 1 week ago

That makes sense. A couple of follow-up questions for you here:

  1. Does Spark have the behavior of writing many empty Parquet files, or does it somehow just write one empty file?
  2. We've tested Spark behavior before and when writing Parquet files it always writes multiple files (at least one file per executor). Is this consistent with what you're observing as well?

There is a separate but perhaps related thread on getting Daft to write to a single Parquet file (instead of outputting multiple files): #2359

Perhaps in "single-file" mode it could be expected to output an empty Parquet file, but in the "many-file" mode it would output an empty directory.

Hanspagh commented 1 week ago

I just did a bit of experimenting. As you said, Spark will always write a Parquet directory (a folder with 1 or more parts in it), and I don't think there is a way to get Spark to write a single file.

I tried to play around with Spark locally, and it seems that no matter the number of executors, I only get a single part when the output is empty. I think the official docs say that Spark will write at least one file per partition.

Even if I force my query to use multiple partitions with repartition, the final number of partitions for my empty df will be 0, hence I get a single part in my Parquet output. If I instead change the partitions for a df with data, I get one part for each partition.
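For reference, roughly what I ran locally (the schema and output path are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("id", LongType())])
empty = spark.createDataFrame([], schema)

# Despite asking for 4 partitions, the write for the empty DataFrame ended up
# as a single schema-only part file in the output directory in my test.
empty.repartition(4).write.mode("overwrite").parquet("/tmp/empty_out")
```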

That being said, for our use case it does not really matter, since this is purely for testing purposes, but if you want to align with Spark you might want to adhere to the above :).

I hope this helps, please reach out if you need more information :).