apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Parquet][R] Efficiently combine parquet files #39671

Open r2evans opened 5 months ago

r2evans commented 5 months ago

Describe the enhancement requested

I recognize that appending to parquet files is not on the roadmap. Is it possible to do an efficient concatenation of two parquet files, writing the result to a new parquet file? Brute-force methods exist (read all of "A", read all of "B", and row-concatenate them however the language allows), but they require loading all of the data into memory. (I'm specifically targeting R, where it's perhaps more difficult to use the lower-level API.)
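For reference, here is a minimal sketch of that brute-force approach using the arrow and dplyr R packages (the file names are placeholders); it works, but both inputs are fully materialized in memory:

```r
library(arrow)
library(dplyr)

# Brute-force combine: read both inputs entirely into memory,
# row-bind them, and write the result back out as a new file.
a <- read_parquet("A.parquet")
b <- read_parquet("B.parquet")
write_parquet(bind_rows(a, b), "AB.parquet")
```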

Part of the suggested alternative to the "append" request (e.g., https://github.com/apache/arrow/issues/32708) is this comment, https://github.com/apache/arrow/issues/32708#issuecomment-1378120110:

the pattern that Arrow enables is writing multiple files and then using open_dataset() to query them lazily

This works fine in concept, though as the number of files grows there is eventually a performance penalty. The penalty can be mitigated (e.g., unify_schemas=FALSE), but at some point there may be a desire to reduce the number of files by combining them. The brute-force read of all of them works, but it would be very nice to have a simple function that takes one or more input filenames and one (previously non-existent) output filename and concatenates the data as efficiently as possible (handling metadata, of course). I'm guessing there would need to be assumptions/requirements with regard to the schemas of the files; a first guess would be to require "effectively identical" schemas (where "effectively" might allow differences such as numeric vs. integer), but I'd still be very happy with "perfectly identical".
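As a hedged sketch of a partial workaround available today (assuming identical schemas and that writing into a new directory is acceptable), the dataset API can stream the inputs into fresh output files instead of materializing them in R. Note that it still decodes and re-encodes the data, so it avoids the memory cost but not the CPU cost, and it is not the row-group-level copy being requested here:

```r
library(arrow)

# Stream record batches from many small parquet files into the
# directory "combined/" without loading everything into R memory.
# "part-1.parquet", "part-2.parquet", and "combined" are placeholders.
open_dataset(c("part-1.parquet", "part-2.parquet"), unify_schemas = FALSE) |>
  write_dataset("combined", format = "parquet")
```

The output lands as one or more part files inside that directory rather than as a single named file, so it reduces the file count but does not give the exact "one output filename" result described above.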

I'm specifically targeting R in my usage, though I guess that other languages might also take advantage of this.

Thanks!

Component(s)

Parquet, R

mapleFU commented 5 months ago

cc @wgtmac Can the existing Java tool do that?

wgtmac commented 5 months ago

cc @wgtmac Can the existing Java tool do that?

Yes, please check https://github.com/apache/parquet-mr/tree/master/parquet-cli for the rewrite command.

r2evans commented 5 months ago

That's an interesting utility, thank you for the pointer to it.

I had been thinking of a capability within a particular language, perhaps something baked into arrow.so or similar. Frankly, I don't have Java installed where this would be used, and I'm not eager to install it just for this utility.

Is it safe to infer that, since it exists there in Java, there is no immediate desire for a compiled binary (with no JRE required) that does simple concatenation/rewrite?

Thank you again for the fast reply.

mapleFU commented 5 months ago

Emmm, arrow-rs also has one. You can regard it as a command-line tool: https://github.com/apache/arrow-rs/blob/master/parquet/src/bin/parquet-concat.rs

Currently we don't have a .so like this 🤔

char101 commented 5 months ago

Fastparquet can also append to an existing file by rewriting the footer:

https://github.com/dask/fastparquet/blob/fb545a5d8147eb111eded0d5ac11eda03c574134/fastparquet/writer.py#L966-L984

Unfortunately fastparquet can't do delta encoding yet, and I find that with time-series data, delta encoding can reduce the compressed size by a further 30%. It would be great if this could be added to pyarrow too.