Open douglas-raillard-arm opened 4 months ago
IIUC you just have to pass compression='lz4'.
Ok, I think I confused myself: pyarrow is indeed consistent and uses "lz4" everywhere with the meaning of "lz4_raw", same as polars, it seems. Are you aware of any reader library that would require the legacy lz4 in its current version?
I think recent versions of parquet-java support reading LZ4_RAW data. @wgtmac Can you confirm?
I think parquet-rs also supports it. @tustvold Am I right?
DuckDB might also support it: https://duckdb.org/docs/data/parquet/overview.html#writing-to-parquet-files
parquet-rs supports all 3 LZ4 flavored encodings
That said, perhaps it would be easy and convenient to allow "lz4_raw" as a synonym of "lz4" in the PyArrow bindings for Parquet. What do you think, @AlenkaF @jorisvandenbossche ?
> That said, perhaps it would be easy and convenient to allow "lz4_raw" as a synonym of "lz4" in the PyArrow bindings for Parquet
I agree, we can allow "lz4_raw" as a synonym with a note in the user guide.
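A rough sketch of what such a synonym could look like, assuming a simple name-normalization step. The alias table and the function normalize_codec are hypothetical and are not taken from the PyArrow source; they only illustrate that "lz4" and "lz4_raw" would resolve to the same codec.

```python
# Hypothetical alias table: both spellings select the raw LZ4 codec.
_CODEC_ALIASES = {
    "lz4": "LZ4_RAW",
    "lz4_raw": "LZ4_RAW",
    "snappy": "SNAPPY",
    "zstd": "ZSTD",
}


def normalize_codec(name):
    """Map a user-supplied codec name to a canonical codec identifier.

    Illustrative only; the real bindings may resolve names differently.
    """
    key = name.strip().lower()
    try:
        return _CODEC_ALIASES[key]
    except KeyError:
        raise ValueError(f"Unsupported compression: {name!r}") from None


if __name__ == "__main__":
    # Both spellings resolve to the same canonical name.
    print(normalize_codec("lz4"), normalize_codec("lz4_raw"))
```

With a table like this, existing code that passes "lz4" keeps working, and "lz4_raw" becomes an accepted spelling of the same thing.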
Thanks everyone, so I'll go ahead and assume that the legacy lz4 is more or less out of the picture nowadays and that "lz4" usually means "lz4_raw" (at least in the Python and Rust ecosystems).
The Rust ecosystem uses "lz4" to refer to the Hadoop codec, as per the parquet specification - https://github.com/apache/parquet-format/blob/master/Compression.md
I think it would be quite confusing to use "lz4" to refer to the "lz4_raw" codec, but if parquet-cpp already diverges from the specification here, perhaps that ship has sailed.
> I think it would be quite confusing to use "lz4" to refer to the "lz4_raw" codec, but if parquet-cpp already diverges from the specification here, perhaps that ship has sailed.
We want to discourage the use of the old and ill-specified "Hadoop LZ4" codec, and for that we have to ensure that asking for "lz4" selects the new "Raw LZ4" codec.
Perhaps you could remove "lz4" as an option and replace it with "lz4_hadoop"; that way it would still be unambiguous which encoding is being used.
> Perhaps you could remove "lz4" as an option and replace it with "lz4_hadoop"; that way it would still be unambiguous which encoding is being used.
I'm rather lukewarm towards this.
1) We want to ensure backwards compatibility with code that passes "lz4".
2) We don't want to stifle user-friendliness. "lz4" is easy to remember, while "lz4_raw" and "lz4_hadoop" probably have to be looked up every time by non-specialists.
> I think recent versions of parquet-java support reading LZ4_RAW data. @wgtmac Can you confirm?
Yes, I can confirm that it is supported since 1.13.0.
Describe the enhancement requested
pyarrow.dataset.write_dataset(compression='lz4_raw') currently fails with:
And indeed, no mention of lz4_raw is to be found in python/pyarrow/_parquet.pyx.
Would it be possible to add support for the LZ4_RAW codec when writing Parquet files, particularly using the dataset API?
Component(s)
Python