Closed: pwais closed this issue 5 years ago
Can you open a JIRA issue about this?
LZ4 in parquet-cpp is broken, and has been for a while AFAIK. There was a discussion on the mailing list or JIRA as I recall
@xhochy @majetideepak we should disable LZ4 until we can run integration tests to check for compatibility. Thoughts?
Yes, we should disable LZ4. I think this problem already surfaced some time ago: one of the implementations uses the framed format and the other the non-framed one, so the two are incompatible.
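For illustration, here is a minimal sketch (not from this thread) of why the two formats cannot read each other's output, assuming the Python lz4 package, which exposes both:

```python
# Sketch only: show that LZ4 framed and block formats are not interchangeable.
# Assumes the Python lz4 package (pip install lz4).
import lz4.block
import lz4.frame

payload = b"some parquet page bytes " * 100

framed = lz4.frame.compress(payload)    # framed format: magic number + frame header + blocks
blocked = lz4.block.compress(payload)   # block format: raw block with a 4-byte size prefix

# A reader expecting the block format cannot decode a framed payload, and vice versa.
try:
    lz4.block.decompress(framed)
except Exception as exc:
    print("block decoder rejected framed data:", exc)

try:
    lz4.frame.decompress(blocked)
except Exception as exc:
    print("frame decoder rejected block data:", exc)
```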
+1. I also find the lack of compatibility checks among the various writers disturbing. We ran into another (statistics) issue last week with a file written by newer versions of Impala not being backward compatible with a slightly older parquet-cpp reader. Some details are at the very bottom of this commit: https://github.com/apache/impala/commit/9270346825b0bbc7708d458be076d7c26038edfc
Seems there are multiple JIRA issues that need to be associated with this
Could someone make sure these issues are opened and then we can close this issue? Thanks
I will take care of the issue tracking.
The LZ4 discussion is on this JIRA: https://issues.apache.org/jira/browse/PARQUET-1241. The JIRA to disable the LZ4 codec: https://issues.apache.org/jira/browse/PARQUET-1515
The JIRA https://issues.apache.org/jira/browse/PARQUET-1118 aims to build a corpus of files that different Parquet implementations can validate against. I think this is the easiest way to achieve compatibility.
Thank you @majetideepak! Hope the docker example is helpful for a unit test. FWIW, here's the Dockerfile that sets up working Spark, Hadoop, & friends: https://github.com/pwais/au2018/blob/14313dd5195a3b516d019edf0c42c672cfca0a76/docker/Dockerfile
I've also tried to get zstd working, but couldn't manage that even in Spark / Hadoop. I hope whatever effort fixes lz4 might also have a moment to look into zstd.
(bump) I tried pyarrow 0.14 and lz4 support still appears broken. There was a suggestion to "Please either add a new codec or add an option to Lz4Codec to use the framed format" ( https://issues.apache.org/jira/browse/PARQUET-1241 ). Would that bring us closer to a fix? (i.e. feature parity with pyspark / hadoop, which has "supported" lz4 for a while now).
I believe that PARQUET-1241 ("[C++] Use LZ4 frame format") does not directly address the issue that was reported here, although there is relevant discussion in the comments on that issue.
The stack trace in the bug report shows an exception thrown by the Spark class org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader, which uses the parquet-mr class org.apache.parquet.hadoop.ParquetFileReader, which in turn uses the Hadoop class org.apache.hadoop.io.compress.Lz4Codec.
As discussed in HADOOP-12990, the Hadoop Lz4Codec uses the lz4 block format, and it prepends 8 extra bytes before the compressed data. I believe the lz4 implementation used by pyarrow.parquet also uses the lz4 block format, but it does not prepend these 8 extra bytes. Reconciling this incompatibility does not require implementing the framed format.
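As a rough illustration of that layout, here is a sketch of stripping the prefix before handing the payload to a block decoder. It assumes the Python lz4 package and that, per HADOOP-12990, the 8 bytes are two big-endian 32-bit lengths (uncompressed size, then compressed size); real Hadoop output may split a block into several length-prefixed chunks, which this does not handle:

```python
# Sketch only: decode a single Hadoop-style LZ4 chunk by stripping the 8-byte prefix.
# Assumes the Python lz4 package and the layout described in HADOOP-12990.
import struct
import lz4.block

def decode_hadoop_lz4_chunk(buf: bytes) -> bytes:
    # Two big-endian int32 values precede the compressed bytes.
    uncompressed_len, compressed_len = struct.unpack(">ii", buf[:8])
    raw_block = buf[8:8 + compressed_len]
    # The raw block carries no embedded size, so pass the expected size explicitly.
    return lz4.block.decompress(raw_block, uncompressed_size=uncompressed_len)
```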
Can you please open a JIRA issue?
Thanks guys for finally closing this one up! Not having proper lz4 support is the main reason I don't use pyarrow directly today.
Looks like a bug to me. Here's a simple script:
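(The original script is not preserved in this thread; the following is a minimal sketch of that kind of repro, assuming pyarrow and pyspark are installed and an arbitrary output path.)

```python
# Not the original script; a minimal sketch of the same kind of repro.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyspark.sql import SparkSession

# Write a tiny table with pyarrow (parquet-cpp) using LZ4 compression.
table = pa.Table.from_pandas(pd.DataFrame({"x": [1, 2, 3]}))
pq.write_table(table, "/tmp/lz4_test.parquet", compression="lz4")

# Read it back through Spark (parquet-mr + Hadoop Lz4Codec), where the
# incompatibility surfaces as a read error.
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.read.parquet("/tmp/lz4_test.parquet").show()
```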
Running it gives this error:
If we change lz4 to snappy then the output is as expected:
There's clearly a disagreement here of some sort. Since I believe Spark uses the official Java distro of Parquet, I'm inclined to report the issue here as a bug, though I could certainly understand if by now Arrow is regarded as HEAD and Java Parquet is behind. FWIW au2018/env:v1.2 is a public image.