apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.35k stars 3.49k forks source link

[Python][Parquet] Support LZ4_RAW for parquet writing #41863

Open douglas-raillard-arm opened 4 months ago

douglas-raillard-arm commented 4 months ago

Describe the enhancement requested

pyarrow.dataset.write_dataset(compression='lz4_raw') currently fails with:

Traceback (most recent call last):
  File "/work/projects/lisa/testpyarrow.py", line 3, in <module>
    _reencode_parquet('sched_switch.lz4.parquet', 'updated.parquet', compression='lz4_raw')#, row_group_size=128*1024*1024, compression='LZ4')
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "x.py", line 1, in my_write_parquet
    options = pyarrow.dataset.ParquetFileFormat().make_write_options(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset_parquet.pyx", line 206, in pyarrow._dataset_parquet.ParquetFileFormat.make_write_options
  File "pyarrow/_dataset_parquet.pyx", line 594, in pyarrow._dataset_parquet.ParquetFileWriteOptions.update
  File "pyarrow/_dataset_parquet.pyx", line 599, in pyarrow._dataset_parquet.ParquetFileWriteOptions._set_properties
  File "pyarrow/_parquet.pyx", line 1855, in pyarrow._parquet._create_writer_properties
  File "pyarrow/_parquet.pyx", line 1369, in pyarrow._parquet.check_compression_name
pyarrow.lib.ArrowException: Unsupported compression: lz4_raw

And indeed, no mention of lz4_raw is to be found in python/pyarrow/_parquet.pyx.

Would it be possible to add support for LZ4_RAW codec when writing parquet files, particularly using the dataset API ?

Component(s)

Python

pitrou commented 4 months ago

IIUC you just have to pass compression='lz4'.

douglas-raillard-arm commented 4 months ago

Ok I think I confused myself, pyarrow is indeed consistent and uses "lz4" everywhere with the meaning of "lz4_raw", same as polars it seems. Are you aware of any reader library that would require the legacy lz4 in their current version ?

pitrou commented 4 months ago

I think recent versions of parquet-java support reading LZ4_RAW data. @wgtmac Can you confirm?

I think parquet-rs also supports it. @tustvold Am I right?

DuckDB might also support it: https://duckdb.org/docs/data/parquet/overview.html#writing-to-parquet-files

tustvold commented 4 months ago

parquet-rs supports all 3 LZ4 flavored encodings

pitrou commented 4 months ago

That said, perhaps it would be easy and convenient to allow "lz4_raw" as a synonym of "lz4" in the PyArrow bindings for Parquet. What do you think, @AlenkaF @jorisvandenbossche ?

AlenkaF commented 4 months ago

That said, perhaps it would be easy and convenient to allow "lz4_raw" as a synonym of "lz4" in the PyArrow bindings for Parquet

I agree, we can allow "lz4_raw" as a synonym with a note in the user guide.

douglas-raillard-arm commented 4 months ago

Thanks everyone, so I'll go ahead and assume that the legacy lz4 is now more-or-less out of the picture nowadays and that "lz4" these days usually means "lz4_raw" (at least in the Python and Rust ecosystems)

tustvold commented 4 months ago

The Rust ecosystem uses "lz4" to refer to the Hadoop codec, as per the parquet specification - https://github.com/apache/parquet-format/blob/master/Compression.md

I think it would be quite confusing to use "lz4" to refer to the "lz4_raw" codec, but if parquet-cpp already diverges from the specification here, perhaps that ship has sailed.

pitrou commented 4 months ago

I think it would be quite confusing to use "lz4" to refer to the "lz4_raw" codec, but if parquet-cpp already diverges from the specification here, perhaps that ship has sailed.

We want to discourage the use of the old and ill-specified "Hadoop LZ4" codec, and for that we have to ensure that asking for "lz4" selects the new "Raw LZ4" codec.

tustvold commented 4 months ago

Perhaps you could remove as an option "lz4" and replace it with "lz4_hadoop", that way it would still be unambiguous which encoding is being used.

pitrou commented 4 months ago

Perhaps you could remove as an option "lz4" and replace it with "lz4_hadoop", that way it would still be unambiguous which encoding is being used.

I'm rather lukewarm towards this. 1) We want to ensure backwards compatibility with code that passes "lz4". 2) We don't want to stifle user-friendliness. "lz4" is easy to remember, while "lz4_raw" and "lz4_hadoop" probably have to be looked up every time by non-specialists.

wgtmac commented 4 months ago

I think recent versions of parquet-java support reading LZ4_RAW data. @wgtmac Can you confirm?

Yes, I can confirm that it is supported since 1.13.0.