Wes McKinney / @wesm:
[~chairmank] can you also send an e-mail to dev@parquet.apache.org about this? We've been going around in circles on this LZ4 stuff and I think it's time that we fix this up once and for all across the implementations.
cc @pitrou @fsaintjacques @xhochy
Antoine Pitrou / @pitrou: In https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/src/codec/SnappyCodec.cc , it seems that Hadoop uses the same length prefix with Snappy. Can someone explain why the problem only occurs with LZ4?
Antoine Pitrou / @pitrou: Responding to myself: parquet-mr doesn't use the Hadoop Snappy codec because:
* Snappy compression codec for Parquet. We do not use the default hadoop
* one since that codec adds a blocking structure around the base snappy compression
* algorithm. This is useful for hadoop to minimize the size of compression blocks
* for their file formats (e.g. SequenceFile) but is undesirable for Parquet since
* we already have the data page which provides that.
but for some reason it didn't bother to do the same for LZ4. Uh oh :-/
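For context on the framing being discussed: Hadoop's block codecs (BlockCompressorStream / BlockDecompressorStream) wrap each chunk with big-endian length fields, which for a typical page-sized buffer means an 8-byte prefix (uncompressed size, then compressed size) in front of a single raw LZ4 block. The sketch below is written directly against liblz4 with illustrative helper names, not against any Parquet or Hadoop API; it only shows what that framing looks like on the wire.

```cpp
#include <lz4.h>

#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Append a 32-bit value in big-endian byte order (the order Hadoop writes).
static void AppendBigEndian32(std::vector<char>* out, uint32_t value) {
  out->push_back(static_cast<char>((value >> 24) & 0xFF));
  out->push_back(static_cast<char>((value >> 16) & 0xFF));
  out->push_back(static_cast<char>((value >> 8) & 0xFF));
  out->push_back(static_cast<char>(value & 0xFF));
}

// Compress `input` the way Hadoop frames a single chunk:
// [uncompressed size][compressed size][raw LZ4 block], both sizes big-endian.
// Plain parquet-cpp LZ4 emitted only the raw block, hence the 8-byte mismatch.
std::vector<char> HadoopLz4Compress(const std::string& input) {
  std::vector<char> raw(LZ4_compressBound(static_cast<int>(input.size())));
  const int compressed_len = LZ4_compress_default(
      input.data(), raw.data(), static_cast<int>(input.size()),
      static_cast<int>(raw.size()));
  if (compressed_len <= 0) throw std::runtime_error("LZ4 compression failed");

  std::vector<char> framed;
  AppendBigEndian32(&framed, static_cast<uint32_t>(input.size()));    // bytes 0-3
  AppendBigEndian32(&framed, static_cast<uint32_t>(compressed_len));  // bytes 4-7
  framed.insert(framed.end(), raw.begin(), raw.begin() + compressed_len);
  return framed;
}
```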
Wes McKinney / @wesm: Issue resolved by pull request 7789 https://github.com/apache/arrow/pull/7789
mario luzi: Hello, we just tried the latest Apache Arrow version 3.0.0 and the write example included in the low-level API examples, but LZ4 still seems incompatible with Hadoop. We got this error reading a Parquet file produced with the 3.0.0 library through Hadoop:
[leal@sulu parquet]$ ./hadoop-3.2.2/bin/hadoop jar apache-parquet-1.11.1/parquet-tools/target/parquet-tools-1.11.1.jar head --debug parquet_2_0_example2.parquet
2021-02-04 21:24:36,354 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 1500001 records.
2021-02-04 21:24:36,355 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
2021-02-04 21:24:36,397 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
2021-02-04 21:24:36,410 INFO hadoop.InternalParquetRecordReader: block read in memory in 55 ms. row count = 434436
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/home/leal/parquet/parquet_2_0_example2.parquet
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:255)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
    at org.apache.parquet.tools.command.HeadCommand.execute(HeadCommand.java:87)
    at org.apache.parquet.tools.Main.main(Main.java:223)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
Caused by: java.lang.IllegalArgumentException
    at java.nio.Buffer.limit(Buffer.java:275)
    at org.apache.hadoop.io.compress.lz4.Lz4Decompressor.decompress(Lz4Decompressor.java:232)
    at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:88)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
    at java.io.DataInputStream.readFully(DataInputStream.java:195)
Any advice? We need to write LZ4 files from C++ and read them in Hadoop jobs, but we are still stuck on this problem.
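A hedged sketch for the C++ writer side: recent Arrow C++ releases expose a Hadoop-framed LZ4 codec as Compression::LZ4_HADOOP, and selecting it in the writer properties should produce pages that Hadoop's Lz4Codec can decode. The function name below is illustrative, and the availability of that enum value in any particular release is an assumption; this is not a claim about what 3.0.0 does.

```cpp
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>
#include <parquet/properties.h>

// Write `table` to `path` with the Hadoop-framed LZ4 codec (assumed to be
// exposed as Compression::LZ4_HADOOP in the Arrow C++ release in use).
arrow::Status WriteHadoopReadableLz4(const std::shared_ptr<arrow::Table>& table,
                                     const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  std::shared_ptr<parquet::WriterProperties> props =
      parquet::WriterProperties::Builder()
          .compression(parquet::Compression::LZ4_HADOOP)  // assumption: enum available
          ->build();
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                    /*chunk_size=*/64 * 1024, props);
}
```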
Antoine Pitrou / @pitrou: [~mario.luzi] I suggest you open a new issue for the Java implementation and attach the file there.
mario luzi: @pitrou would you be so kind as to tell me the right Java project for this? Are you sure the problem is there and no longer here? Looking at this Arrow Parquet code I found the LZ4_hadoop codec, so I suppose someone added it and saw it working before this ticket was marked resolved... but I am still not able to read an LZ4 file from Hadoop.
Antoine Pitrou / @pitrou: [~mario.luzi] Just create a new issue on this tracker, but be sure to select "parquet-mr" as the component.
As described in HADOOP-12990, the Hadoop Lz4Codec uses the lz4 block format, and it prepends 8 extra bytes before the compressed data. I believe that the lz4 implementation in parquet-cpp also uses the lz4 block format, but it does not prepend these 8 extra bytes. Using Java parquet-mr, I wrote a Parquet file with lz4 compression:
When I attempted to read this file with parquet-cpp, I got the following error:
https://github.com/apache/arrow/issues/3491 reported incompatibility in the other direction, using Spark (which uses the Hadoop lz4 codec) to read a parquet file that was written with parquet-cpp.
Given that the Hadoop lz4 codec has long been in use, and users have accumulated Parquet files that were written with this implementation, I propose changing parquet-cpp to match the Hadoop implementation.
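To make the proposed compatibility concrete on the read side, a decoder matching the Hadoop implementation has to strip the two big-endian length fields before handing the raw block to LZ4. This is a minimal sketch against liblz4 with illustrative helper names, not parquet-cpp code.

```cpp
#include <lz4.h>

#include <cstdint>
#include <stdexcept>
#include <string>

// Read a 32-bit big-endian value, as written by Hadoop's Lz4Codec.
static uint32_t ReadBigEndian32(const char* p) {
  return (static_cast<uint32_t>(static_cast<unsigned char>(p[0])) << 24) |
         (static_cast<uint32_t>(static_cast<unsigned char>(p[1])) << 16) |
         (static_cast<uint32_t>(static_cast<unsigned char>(p[2])) << 8) |
         static_cast<uint32_t>(static_cast<unsigned char>(p[3]));
}

// Decompress a single Hadoop-framed LZ4 chunk:
// bytes 0-3 = uncompressed size, bytes 4-7 = compressed size, rest = raw block.
std::string HadoopLz4Decompress(const std::string& framed) {
  if (framed.size() < 8) throw std::runtime_error("missing Hadoop LZ4 length prefix");
  const uint32_t uncompressed_len = ReadBigEndian32(framed.data());
  const uint32_t compressed_len = ReadBigEndian32(framed.data() + 4);
  if (framed.size() < 8 + compressed_len) throw std::runtime_error("truncated LZ4 block");

  std::string output(uncompressed_len, '\0');
  const int n = LZ4_decompress_safe(framed.data() + 8, &output[0],
                                    static_cast<int>(compressed_len),
                                    static_cast<int>(uncompressed_len));
  if (n < 0 || static_cast<uint32_t>(n) != uncompressed_len) {
    throw std::runtime_error("LZ4 decompression failed");
  }
  return output;
}
```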
See also:
Reporter: Steve M. Kim
Assignee: Patrick Pai
Related issues:
PRs and other links:
Note: This issue was originally created as PARQUET-1878. Please see the migration documentation for further details.