apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[CH] Parquet read error: `Length spanned by list offsets (51135) larger than values array (length 51030)` #8014

Open KevinyhZou opened 3 days ago

KevinyhZou commented 3 days ago

Backend

CH (ClickHouse)

Bug description

Reading a Parquet file with the following schema fails on the CH backend:

message schema {
  optional binary id (UTF8);
  optional binary id1 (UTF8);
  optional binary id2 (UTF8);
  optional binary id3 (UTF8);
  optional group map1 (MAP) {
    repeated group key_value (MAP_KEY_VALUE) {
      required binary key (UTF8);
      optional binary value (UTF8);
    }
  }
  optional int64 time1;
}
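
For context, the only nested column in this schema is `map1`. In Arrow's in-memory layout a map column is stored as one offsets array plus flattened key and value arrays, which is where "list offsets" in the error comes from. The following is a minimal plain-Python sketch of that layout, not the reader's actual code; the helper name `flatten_map_column` and the sample rows are made up for illustration.

```python
def flatten_map_column(rows):
    """Flatten per-row maps into (offsets, keys, values, validity).

    Illustrative only: mimics the offsets + child-arrays layout used for
    list-like columns. `rows` is a list of dicts, with None for a null map.
    """
    offsets = [0]
    keys, values, validity = [], [], []
    for row in rows:
        if row is None:
            # Null map: contributes no entries, so the offset repeats.
            validity.append(False)
        else:
            validity.append(True)
            for k, v in row.items():
                keys.append(k)      # keys are required (never None)
                values.append(v)    # values are optional (may be None)
        offsets.append(len(keys))
    return offsets, keys, values, validity


# Hypothetical sample: a two-entry map, a null map, an empty map, a one-entry map.
rows = [{"a": "1", "b": None}, None, {}, {"c": "3"}]
offsets, keys, values, validity = flatten_map_column(rows)
```

In a consistent column the last offset equals the number of flattened entries, so `offsets[-1] == len(keys) == len(values)` holds here.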

The read fails with the following error:

Caused by: org.apache.gluten.exception.GlutenException: Error while reading Parquet data: Invalid: Length spanned by list offsets (51135) larger than values array (length 51030): (): While executing SubstraitFileSource
0. Poco::Exception::Exception(String const&, int) @ 0x00000000140ef2f9
1. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000bd507d9
2. DB::Exception::Exception(PreformattedMessage&&, int) @ 0x000000000677f76c
3. DB::Exception::Exception<String>(int, FormatStringHelperImpl<std::type_identity<String>::type>, String&&) @ 0x000000000678724b
4. DB::ParquetBlockInputFormat::decodeOneChunk(unsigned long, std::unique_lock<std::mutex>&)::$_3::operator()() const @ 0x000000001141436c
5. DB::ParquetBlockInputFormat::decodeOneChunk(unsigned long, std::unique_lock<std::mutex>&) @ 0x00000000114135b7
6. DB::ParquetBlockInputFormat::read() @ 0x00000000114146f1
7. DB::IInputFormat::generate() @ 0x00000000113986d6
8. local_engine::NormalFileReader::pull(DB::Chunk&) @ 0x000000000c43e6c1
9. local_engine::SubstraitFileSource::generate() @ 0x000000000c43c1eb
10. DB::ISource::tryGenerate() @ 0x0000000011374e37
11. DB::ISource::work() @ 0x0000000011374c05
12. DB::ExecutionThreadContext::executeTask() @ 0x000000001138cc82
13. DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x0000000011381f3f
14. DB::PipelineExecutor::executeStep(std::atomic<bool>*) @ 0x00000000113819a9
15. DB::PullingPipelineExecutor::pull(DB::Chunk&) @ 0x0000000011393734
16. DB::PullingPipelineExecutor::pull(DB::Block&) @ 0x0000000011393899
17. local_engine::LocalExecutor::hasNext() @ 0x000000000c120331
18. Java_org_apache_gluten_vectorized_BatchIterator_nativeHasNext @ 0x0000000006765c77
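
The message reads like a consistency check between a list-like column's offsets and its child values array: the offsets claim 51135 entries while only 51030 values were decoded, a shortfall of 105. The following is a rough sketch of such a check in plain Python, assuming that is what the message means; `check_list_offsets` is a hypothetical helper, not the Arrow or ClickHouse implementation.

```python
def check_list_offsets(offsets, values_length):
    """Raise ValueError if the offsets span more entries than the values array holds.

    Hypothetical helper for illustration; the error text is copied from the
    message reported in this issue.
    """
    if not offsets:
        return
    # Offsets must be non-decreasing for the spans to be well defined.
    for prev, cur in zip(offsets, offsets[1:]):
        if cur < prev:
            raise ValueError(f"Offsets decrease from {prev} to {cur}")
    spanned = offsets[-1] - offsets[0]
    if spanned > values_length:
        raise ValueError(
            f"Length spanned by list offsets ({spanned}) "
            f"larger than values array (length {values_length})"
        )


# Reproduce the numbers from the report with a single oversized span.
try:
    check_list_offsets([0, 51135], 51030)
    message = None
except ValueError as exc:
    message = str(exc)
```

If the check really has this shape, the failure would mean the offsets and the values of `map1` were decoded to different lengths within the same chunk.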

Spark version

Spark-3.3.x

Spark configurations

No response

System information

No response

Relevant logs

No response