dask / fastparquet

Python implementation of the Parquet columnar file format.
Apache License 2.0

Merge multiple parquets #567

Closed: yan-hic closed this issue 3 years ago

yan-hic commented 3 years ago

Hello @martindurant! I am planning on writing a simple merger service (internal only), as Google BigQuery exports to several shards - a lot actually, since each worker creates a shard - and we need a single file streamed over HTTP. A simple Google Cloud Function will come in handy for that.

Before I get started on it though, in order to optimize speed and especially memory usage, would you recommend fastavro, pyarrow, or pyspark for that?

martindurant commented 3 years ago

The avro format is meant for a stream of messages, i.e., it is packaged row-wise, so you don't get the benefits of parquet's per-column encoding and compression. I would only recommend it for real-time data streams, not batches.

pyspark will be very heavy on memory usage, since you need to have both the JVM and Python running, each with its own version of all the dataframe machinery.

I don't know if pyarrow can do this specifically. fastparquet certainly could: either loading the parts into pandas dataframes, concatenating them, and writing one dataframe out (if it is important that there be only one output row-group), or taking the binary pieces and forming them into a single file.
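
A minimal sketch of the first route (reading each shard with fastparquet and concatenating in pandas), assuming the shards sit on local disk with matching schemas; the file names and glob pattern are made up for illustration:

```python
# Sketch only: merge several parquet shards into one file with fastparquet.
# The shard pattern and output name are assumptions, not from this thread.
import glob

import pandas as pd
from fastparquet import ParquetFile, write

shard_paths = sorted(glob.glob("export-shard-*.parquet"))

# Read each shard into a pandas dataframe and concatenate them.
frames = [ParquetFile(path).to_pandas() for path in shard_paths]
merged = pd.concat(frames, ignore_index=True)

# Write a single output file; row_group_offsets controls how the rows
# are split into row-groups within it.
write("merged.parquet", merged)
```

If holding all shards in memory at once is a concern, fastparquet's `write(..., append=True)` could instead add each shard to the output file as additional row-groups, one at a time.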

Before going any further, though: have you considered simply sending a .zip or .tar archive?

yan-hic commented 3 years ago

Sorry, I meant fastparquet as an option, not fastavro.

It turns out there is a bug on the BQ side, as a single export file is possible through the bq CLI, just not through SQL - https://issuetracker.google.com/issues/181016197

The consuming client is MS PowerBI/PowerQuery, which recently gained the ability to read parquet from a URL. It does not support Avro, or gzipped NDJSON except from file systems, and we don't want an intermediary local copy.

Closing this as I have a workaround on the source-side.