apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

LZ4 decoding is not working over hadoop #2571

Open asfimport opened 3 years ago

asfimport commented 3 years ago

Hello, we just tried the latest Apache Arrow release, 3.0.0, with the write example included in the low-level API examples, but LZ4 still does not seem to be compatible with Hadoop. We got this error when reading, through Hadoop, a Parquet file produced with the 3.0.0 library:

    [leal@sulu parquet]$ ./hadoop-3.2.2/bin/hadoop jar apache-parquet-1.11.1/parquet-tools/target/parquet-tools-1.11.1.jar head --debug parquet_2_0_example2.parquet
    2021-02-04 21:24:36,354 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 1500001 records.
    2021-02-04 21:24:36,355 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
    2021-02-04 21:24:36,397 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
    2021-02-04 21:24:36,410 INFO hadoop.InternalParquetRecordReader: block read in memory in 55 ms. row count = 434436
    org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/home/leal/parquet/parquet_2_0_example2.parquet
        at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:255)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
        at org.apache.parquet.tools.command.HeadCommand.execute(HeadCommand.java:87)
        at org.apache.parquet.tools.Main.main(Main.java:223)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
    Caused by: java.lang.IllegalArgumentException
        at java.nio.Buffer.limit(Buffer.java:275)
        at org.apache.hadoop.io.compress.lz4.Lz4Decompressor.decompress(Lz4Decompressor.java:232)
        at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:88)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)

Any advice? We need to write LZ4 files from C++ and read them in Hadoop jobs, but we are still stuck on this problem.
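The `IllegalArgumentException` raised from `Buffer.limit` inside `Lz4Decompressor` would be consistent with the reader interpreting the first bytes of a page as a length prefix that is not actually there, although the thread does not confirm this. As a rough diagnostic, the sketch below classifies a compressed buffer by its leading bytes. It assumes (this is not stated in the issue) that Hadoop's block stream prefixes each chunk with two 4-byte big-endian integers, the uncompressed and the compressed length, and that the standard LZ4 frame format starts with the little-endian magic number `0x184D2204`. The class name `Lz4FramingProbe` and the method `classify` are hypothetical, and obtaining the raw page bytes from a Parquet file is left out.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of parquet-java or Hadoop.
final class Lz4FramingProbe {
    // Magic number of the LZ4 frame format, stored little-endian on disk.
    static final int LZ4_FRAME_MAGIC = 0x184D2204;

    /** Guesses the framing of a compressed buffer from its first bytes. Heuristic only. */
    static String classify(byte[] page) {
        if (page.length >= 4) {
            int le = ByteBuffer.wrap(page, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
            if (le == LZ4_FRAME_MAGIC) {
                return "lz4-frame";
            }
        }
        if (page.length >= 8) {
            ByteBuffer be = ByteBuffer.wrap(page, 0, 8).order(ByteOrder.BIG_ENDIAN);
            int uncompressedLen = be.getInt();
            int compressedLen = be.getInt();
            // Assumed Hadoop-style prefix: two positive lengths, and the
            // compressed chunk fits in the bytes that follow the prefix.
            if (uncompressedLen > 0 && compressedLen > 0 && compressedLen <= page.length - 8) {
                return "hadoop-block-candidate";
            }
        }
        // Anything else: possibly a raw LZ4 block, possibly something unrelated.
        return "raw-or-unknown";
    }

    public static void main(String[] args) {
        byte[] framed = {0x04, 0x22, 0x4D, 0x18, 0x40};
        byte[] prefixed = {0, 0, 0, 10, 0, 0, 0, 2, 1, 2};
        System.out.println(classify(framed));
        System.out.println(classify(prefixed));
    }
}
```

A result of `raw-or-unknown` or `lz4-frame` for pages of a file that Hadoop cannot read would point toward a framing mismatch between writer and reader; the check cannot prove it, since a raw block can start with bytes that happen to look like valid lengths.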

Reporter: mario luzi


Note: This issue was originally created as PARQUET-1974. Please see the migration documentation for further details.

asfimport commented 3 years ago

Gabor Szadovszky / @gszadovszky: I'm afraid I've lost track of this LZ4 compatibility issue. I can see a couple of resolved issues (e.g. ARROW-9177, PARQUET-1878) and a couple that are still open (e.g. PARQUET-1241, PARQUET-1515). All of them are for parquet-cpp/arrow; this is the first one created for parquet-mr. @wesm, @pitrou, what is the status of LZ4 compatibility between parquet-cpp and parquet-mr? There were discussions about extending parquet-format to properly specify LZ4, and maybe adding another compression option for framed LZ4. I did not find any jiras about these. I am also not sure whether the parquet-mr LZ4 implementation is considered correct or whether we want to update something on that side as well.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: @gszadovszky The status is that the Parquet-C++ developers did all they could to try and achieve compatibility.

Now it's up to parquet-mr to pick up the ball. [~mario.luzi] Can you post the Parquet file so that they can take a look?

If there is no resolution on the Java side I will ask for LZ4 to be dropped from the Parquet spec. Right now Parquet is partly a proprietary standard due to under-specification.

asfimport commented 3 years ago

mario luzi: @pitrou I have attached a file produced with parquet-cpp version 3.0.0 ...