
Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

RunLengthBitPackingHybridDecoder: Reading past RLE/BitPacking stream. #1603

Open asfimport opened 10 years ago

asfimport commented 10 years ago

I am using Avro and Crunch 0.11 to write data into Hadoop CDH 4.6 in Parquet format. This works fine for a few gigabytes but blows up in the RunLengthBitPackingHybridDecoder when reading a few thousand gigabytes.

parquet.io.ParquetDecodingException: Can not read value at 19453 in block 0 in file hdfs://nn-ix01.se-ix.delta.prod:8020/user/stoffe/parquet/dogfight/2014/09/29/part-m-00153.snappy.parquet
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:177)
    at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
    at org.apache.crunch.impl.mr.run.CrunchRecordReader.nextKeyValue(CrunchRecordReader.java:157)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: parquet.io.ParquetDecodingException: Can't read value in column [action] BINARY at value 697332 out of 872236, 96921 out of 96921 in currentPage. repetition level: 0, definition level: 1
    at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:466)
    at parquet.column.impl.ColumnReaderImpl.getBinary(ColumnReaderImpl.java:414)
    at parquet.filter.ColumnPredicates$1.apply(ColumnPredicates.java:64)
    at parquet.filter.ColumnRecordFilter.isMatch(ColumnRecordFilter.java:69)
    at parquet.io.FilteredRecordReader.skipToMatch(FilteredRecordReader.java:71)
    at parquet.io.FilteredRecordReader.read(FilteredRecordReader.java:57)
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:173)
    ... 13 more
Caused by: java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
    at parquet.Preconditions.checkArgument(Preconditions.java:47)
    at parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:80)
    at parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:62)
    at parquet.column.values.dictionary.DictionaryValuesReader.readBytes(DictionaryValuesReader.java:73)
    at parquet.column.impl.ColumnReaderImpl$2$7.read(ColumnReaderImpl.java:311)
    at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)
    ... 19 more
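For context on what this exception means: dictionary indices (as in the `DictionaryValuesReader` frame above) and repetition/definition levels are stored with Parquet's RLE/bit-packing hybrid encoding, and the decoder throws exactly this message when asked for a value after the encoded byte stream is exhausted. Below is a minimal sketch of that decoding logic, simplified to bit widths ≤ 8; it is not the real implementation, just an illustration of the failing precondition.

```java
/**
 * Minimal sketch (not the real implementation) of Parquet's RLE/bit-packing
 * hybrid decoding for bit widths <= 8. It shows where the
 * "Reading past RLE/BitPacking stream." check fires: readInt() needs a new
 * run, but the underlying byte stream is already exhausted, i.e. more values
 * were requested than the encoded stream actually holds.
 */
public class HybridDecoderSketch {
    private final int bitWidth;
    private final byte[] data;   // the encoded RLE/bit-packed stream
    private int pos;             // next unread byte
    private int currentCount;    // values left in the current run
    private boolean rleRun;      // true: RLE run, false: bit-packed run
    private int rleValue;        // repeated value of an RLE run
    private int[] unpacked;      // decoded values of a bit-packed run
    private int unpackedIndex;

    public HybridDecoderSketch(int bitWidth, byte[] data) {
        this.bitWidth = bitWidth;
        this.data = data;
    }

    public int readInt() {
        if (currentCount == 0) {
            readNext();
        }
        currentCount--;
        return rleRun ? rleValue : unpacked[unpackedIndex++];
    }

    private void readNext() {
        // The precondition that fails in the stack traces above.
        if (pos >= data.length) {
            throw new IllegalArgumentException("Reading past RLE/BitPacking stream.");
        }
        int header = readUnsignedVarInt();
        rleRun = (header & 1) == 0;       // LSB 0: RLE run; LSB 1: bit-packed run
        if (rleRun) {
            currentCount = header >>> 1;  // run length
            rleValue = data[pos++] & 0xFF; // value stored in ceil(bitWidth/8) = 1 byte here
        } else {
            int groups = header >>> 1;    // each group packs 8 values
            currentCount = groups * 8;
            unpacked = new int[currentCount];
            unpackedIndex = 0;
            int mask = (1 << bitWidth) - 1;
            for (int g = 0; g < groups; g++) {
                long bits = 0;            // little-endian bit buffer, LSB-first packing
                for (int i = 0; i < bitWidth; i++) {
                    bits |= (long) (data[pos++] & 0xFF) << (8 * i);
                }
                for (int v = 0; v < 8; v++) {
                    unpacked[g * 8 + v] = (int) (bits >>> (v * bitWidth)) & mask;
                }
            }
        }
    }

    private int readUnsignedVarInt() {
        int value = 0, shift = 0, b;
        do {
            b = data[pos++] & 0xFF;
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
```

A truncated or mis-sized page therefore surfaces as this `IllegalArgumentException` rather than a cleaner "stream too short" error, which is why the reports here all bottom out in `readNext`.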

Environment: Java 1.7, Debian Linux
Reporter: Kristoffer Sjögren / @krisskross
Assignee: Reuben Kuhnert / @sircodesalotOfTheRound

Note: This issue was originally created as PARQUET-112. Please see the migration documentation for further details.

asfimport commented 10 years ago

Kristoffer Sjögren / @krisskross: I should add that data is written using AvroParquetFileTarget and SNAPPY compression. Data is read using AvroParquetFileSource with UnboundRecordFilter and includeField.

asfimport commented 10 years ago

Kristoffer Sjögren / @krisskross: Seems unrelated to compression and field inclusion.

But if I remove the UnboundRecordFilter, the job finishes successfully.

asfimport commented 10 years ago

Kristoffer Sjögren / @krisskross:

  // Uses parquet.filter.{UnboundRecordFilter, RecordFilter, ColumnRecordFilter,
  // ColumnPredicates} and parquet.column.ColumnReader (the old parquet.*
  // package names, matching the stack trace above).
  public static class ActionFilter implements UnboundRecordFilter {

    private final UnboundRecordFilter filter;

    public ActionFilter() {
      filter = ColumnRecordFilter.column("action", ColumnPredicates.equalTo("bid"));
    }

    @Override
    public RecordFilter bind(Iterable<ColumnReader> readers) {
      return filter.bind(readers);
    }
  }

asfimport commented 4 years ago

Jan Morlock: Any news here? We sometimes face the same problem with Parquet 1.5.0.

asfimport commented 4 years ago

Francisco Guerrero: I am facing the same issue with Parquet. It arises when a column that has a filter applied contains null fields.
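Francisco's observation about nulls is consistent with where this stream sits: for an optional column, nulls are recorded as definition levels, and those levels are written with the same RLE/bit-packing hybrid encoding that the decoder reads. The toy encoder below illustrates that layout; it is a sketch only, emitting RLE runs exclusively, whereas the real writer also switches to bit-packed runs for short, varied sequences.

```java
import java.io.ByteArrayOutputStream;

/**
 * Toy RLE-only encoder for definition levels (bit width <= 8), sketching how
 * an optional column's null/non-null pattern is laid out in the RLE/bit-packing
 * hybrid stream. Not the real Parquet writer: that one also emits bit-packed
 * runs, while this sketch always uses RLE runs.
 */
public class DefLevelRleSketch {
    public static byte[] encode(int[] levels) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < levels.length) {
            int value = levels[i], count = 0;
            while (i < levels.length && levels[i] == value) { i++; count++; }
            writeUnsignedVarInt(out, count << 1);  // LSB 0 marks an RLE run
            out.write(value);                      // value in ceil(bitWidth/8) = 1 byte
        }
        return out.toByteArray();
    }

    private static void writeUnsignedVarInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }
}
```

For a max definition level of 1, the levels `1,1,1,0,1` (a null in the third position) encode to three runs: `(3 × 1)(1 × 0)(1 × 1)`. A reader that mis-tracks this stream's length would fail exactly as reported above.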

asfimport commented 4 years ago

Tristan Davolt: I am facing the same issue with Parquet 1.10.0. Data is written using AvroParquetWriter and Snappy compression. Occasionally and at random, one of the many files we write the same way throws an error similar to the one above when read by any Parquet reader. I have not yet found a workaround. The exception is thrown for the final value of a random column, and it does not only occur with null fields: our schema defines every field as optional.


java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
    at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:53)
    at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:80)
    at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:62)
    at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridValuesReader.readInteger(RunLengthBitPackingHybridValuesReader.java:53)
    at org.apache.parquet.column.impl.ColumnReaderBase$ValuesReaderIntIterator.nextInt(ColumnReaderBase.java:733)
    at org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:568)
    at org.apache.parquet.column.impl.ColumnReaderBase.consume(ColumnReaderBase.java:705)
    at org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:30)
    at org.apache.parquet.tools.command.DumpCommand.dump(DumpCommand.java:358)
    at org.apache.parquet.tools.command.DumpCommand.dump(DumpCommand.java:231)
    at org.apache.parquet.tools.command.DumpCommand.execute(DumpCommand.java:148)
    at org.apache.parquet.tools.Main.main(Main.java:223)