StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0

[Bug] Paimon catalog: querying a field of type array<struct> fails for both ORC and Parquet storage formats #49800

Open dsanww opened 3 months ago

dsanww commented 3 months ago

Steps to reproduce the behavior (Required)

1. Paimon on Flink SQL (Paimon version: 0.8+, Flink version: 1.18):

```sql
CREATE TABLE paimon_catalog.default.t3 (
    theme_id string NOT NULL COMMENT 'primary key, globally unique id',
    category_ids array<row<c_id string, c_name string>> COMMENT 'category ids',
    update_time TIMESTAMP(0) COMMENT 'update time',
    PRIMARY KEY (theme_id) NOT ENFORCED
) WITH (
    'bucket' = '1',
    'merge-engine' = 'partial-update',
    'fields.update_time.sequence-group' = 'category_ids',
    'changelog-producer' = 'lookup',
    'fields.category_ids.aggregate-function' = 'collect',
    'fields.category_ids.distinct' = 'true'
);

insert into t3 values ('1', array[row('id1', 'name1')], cast('2024-08-14 10:00:00' as TIMESTAMP));
insert into t3 values ('1', array[row('id2', 'name2')], cast('2024-08-14 10:00:00' as TIMESTAMP));
insert into t3 values ('1', array[row('id3', 'name3')], cast('2024-08-14 10:00:00' as TIMESTAMP));
insert into t3 values ('1', array[row('id4', 'name4')], cast('2024-08-14 10:00:00' as TIMESTAMP));
```

2. StarRocks (3.3.1):

```sql
select * from paimon08_catalog_fs.default.t3;
```
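Step 1 above assumes a Paimon catalog is already registered in the Flink SQL session. For completeness, a minimal registration sketch, following the standard Paimon-on-Flink catalog DDL (the warehouse path is a placeholder, not taken from the report):

```sql
-- Flink SQL: register the Paimon catalog referenced by step 1.
-- 'file:/tmp/paimon' is a placeholder warehouse path.
CREATE CATALOG paimon_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 'file:/tmp/paimon'
);
USE CATALOG paimon_catalog;
```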

Expected behavior (Required)

no error

Real behavior (Required)

```
Caused by: java.lang.IndexOutOfBoundsException: Index 1 out of bounds for length 1
    at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
    at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
    at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
    at java.base/java.util.Objects.checkIndex(Objects.java:372)
    at java.base/java.util.ArrayList.get(ArrayList.java:459)
    at org.apache.paimon.shade.org.apache.parquet.schema.GroupType.getType(GroupType.java:216)
    at org.apache.paimon.format.parquet.reader.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:345)
    at org.apache.paimon.format.parquet.reader.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:295)
    at org.apache.paimon.format.parquet.ParquetReaderFactory.createWritableVectors(ParquetReaderFactory.java:204)
    at org.apache.paimon.format.parquet.ParquetReaderFactory.createReaderBatch(ParquetReaderFactory.java:194)
    at org.apache.paimon.format.parquet.ParquetReaderFactory.createPoolOfBatches(ParquetReaderFactory.java:186)
    at org.apache.paimon.format.parquet.ParquetReaderFactory.createReader(ParquetReaderFactory.java:105)
    at org.apache.paimon.format.parquet.ParquetReaderFactory.createReader(ParquetReaderFactory.java:67)
    at org.apache.paimon.io.FileRecordReader.<init>(FileRecordReader.java:82)
    at org.apache.paimon.io.KeyValueFileReaderFactory.createRecordReader(KeyValueFileReaderFactory.java:144)
    at org.apache.paimon.io.KeyValueFileReaderFactory.createRecordReader(KeyValueFileReaderFactory.java:107)
    at org.apache.paimon.io.KeyValueFileReaderFactory.createRecordReader(KeyValueFileReaderFactory.java:98)
    at org.apache.paimon.mergetree.MergeTreeReaders.lambda$readerForRun$2(MergeTreeReaders.java:86)
    at org.apache.paimon.mergetree.compact.ConcatRecordReader.create(ConcatRecordReader.java:51)
    at org.apache.paimon.mergetree.MergeTreeReaders.readerForRun(MergeTreeReaders.java:88)
    at org.apache.paimon.mergetree.MergeTreeReaders.lambda$readerForSection$1(MergeTreeReaders.java:76)
    at org.apache.paimon.mergetree.MergeSorter.mergeSort(MergeSorter.java:124)
    at org.apache.paimon.mergetree.MergeTreeReaders.readerForSection(MergeTreeReaders.java:78)
    at org.apache.paimon.operation.MergeFileSplitRead.lambda$createMergeReader$2(MergeFileSplitRead.java:267)
    at org.apache.paimon.mergetree.compact.ConcatRecordReader.create(ConcatRecordReader.java:51)
    at org.apache.paimon.operation.MergeFileSplitRead.createMergeReader(MergeFileSplitRead.java:277)
    at org.apache.paimon.operation.MergeFileSplitRead.createReader(MergeFileSplitRead.java:237)
    at org.apache.paimon.table.source.splitread.MergeFileSplitReadProvider.lambda$create$1(MergeFileSplitReadProvider.java:51)
    at org.apache.paimon.operation.SplitRead$1.createReader(SplitRead.java:78)
    at org.apache.paimon.table.source.KeyValueTableRead.reader(KeyValueTableRead.java:118)
    at org.apache.paimon.table.source.AbstractDataTableRead.createReader(AbstractDataTableRead.java:82)
    at com.starrocks.paimon.reader.PaimonSplitScanner.initReader(PaimonSplitScanner.java:106)
    at com.starrocks.paimon.reader.PaimonSplitScanner.open(PaimonSplitScanner.java:116)
```

StarRocks version (Required)

3.3.1

chenminghua8 commented 3 months ago

There is no error when running under StarRocks 3.3. Are you sure you are using StarRocks 3.3?

dsanww commented 3 months ago

> There is no error when running under StarRocks 3.3. Are you sure you are using StarRocks 3.3?

@chenminghua8 yes, version info:

```
./show_fe_version.sh
Build version: 3.3.1
Commit hash: 2b87854
Build type: RELEASE
Build time: 2024-07-18 05:32:09
Build distributor id: centos
Build user: StarRocks@localhost (CentOS Linux 7 (Core))
Java compile version: openjdk full version "11.0.21+9-LTS"

./show_be_version.sh
3.3.1-2b87854 BuildType: RELEASE Build distributor id: centos Built on 2024-07-18 05:28:36 by StarRocks@localhost (CentOS Linux 7 (Core))
```

I have tested it again and the problem persists:

1. Create the external catalog in StarRocks:

```sql
CREATE EXTERNAL CATALOG paimon08_catalog_fs
PROPERTIES (
    "paimon.catalog.type" = "filesystem",
    "type" = "paimon",
    "paimon.catalog.warehouse" = "***"
);
```

Inserting the first three rows causes no problems:

```sql
insert into t3 values ('1', array[row('id1', 'name1')], cast('2024-08-14 10:00:00' as TIMESTAMP));
insert into t3 values ('1', array[row('id2', 'name2')], cast('2024-08-14 10:00:00' as TIMESTAMP));
insert into t3 values ('1', array[row('id3', 'name3')], cast('2024-08-14 10:00:00' as TIMESTAMP));
```

Only once the fourth row is inserted does the query start to fail:

```sql
insert into t3 values ('1', array[row('id4', 'name4')], cast('2024-08-14 10:00:00' as TIMESTAMP));
```

Error (querying Paimon 0.8):

```
FAILED: Failed to call the nextChunkOffHeap method of off-heap table scanner. java exception details:
java.io.IOException: Failed to get the next off-heap table chunk of paimon.
    at com.starrocks.paimon.reader.PaimonSplitScanner.getNext(PaimonSplitScanner.java:163)
    at com.starrocks.jni.connector.ConnectorScanner.getNextOffHeapChunk(ConnectorScanner.java:101)
Caused by: java.lang.ClassCastException: class org.apache.paimon.data.NestedRow cannot be cast to class org.apache.paimon.data.columnar.ColumnarRow (org.apache.paimon.data.NestedRow and org.apache.paimon.data.columnar.ColumnarRow are in unnamed module of loader com.starrocks.utils.loader.ChildFirstClassLoader @3c321bdb)
    at com.starrocks.paimon.reader.PaimonColumnValue.unpackStruct(PaimonColumnValue.java:113)
    at com.starrocks.jni.connector.OffHeapColumnVector.appendValue(OffHeapColumnVector.java:568)
    at com.starrocks.jni.connector.OffHeapColumnVector.appendArray(OffHeapColumnVector.java:426)
    at com.starrocks.jni.connector.OffHeapColumnVector.appendValue(OffHeapColumnVector.java:556)
    at com.starrocks.jni.connector.OffHeapTable.appendData(OffHeapTable.java:95)
    at com.starrocks.jni.connector.ConnectorScanner.appendData(ConnectorScanner.java:86)
    at com.starrocks.paimon.reader.PaimonSplitScanner.getNext(PaimonSplitScanner.java:153)
    ... 1 more
```

Here is my query:

```sql
set catalog paimon08_catalog_fs;
use default;
-- show tables;
select * from t3;
```

In addition, I also found that the error differs across Paimon versions: with a higher Paimon version (0.9+), the IndexOutOfBoundsException I quoted at the beginning is thrown instead.
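One way to check that the array<struct> column is the trigger (my assumption, not verified in this thread): project it away and rerun the scan. The struct unpacking in the trace above only runs when category_ids is materialized, so a scalar-only projection should succeed if that column is the culprit:

```sql
-- StarRocks: if this succeeds while `select * from t3` fails, the
-- array<struct> column category_ids is what trips the JNI reader.
set catalog paimon08_catalog_fs;
use default;
select theme_id, update_time from t3;
```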

dsanww commented 3 months ago

I have a rough idea of what the likely root cause is:

  1. When the Paimon table is stored in ORC format, the query reports an error after several rows have been added: `FAILED: Failed to call the nextChunkOffHeap method of off-heap table scanner. java exception details: java.io.IOException: Failed to get the next off-heap table chunk of paimon.`

  2. When the table is stored in Parquet format, once more rows are added the query throws the IndexOutOfBoundsException instead. I wonder whether the read paths of these two formats each have a bug (see the format-pinning sketch below).
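To test the two formats independently, Paimon's documented `file.format` table option can pin the storage format per table. A sketch against the same schema; the table name `t3_orc` is made up for this experiment, and I have not confirmed which format the original t3 defaulted to:

```sql
-- Flink SQL: same schema and merge settings as t3, with ORC pinned.
-- Create a second table with 'file.format' = 'parquet' to compare the two errors.
CREATE TABLE paimon_catalog.default.t3_orc (
    theme_id string NOT NULL,
    category_ids array<row<c_id string, c_name string>>,
    update_time TIMESTAMP(0),
    PRIMARY KEY (theme_id) NOT ENFORCED
) WITH (
    'bucket' = '1',
    'merge-engine' = 'partial-update',
    'fields.update_time.sequence-group' = 'category_ids',
    'fields.category_ids.aggregate-function' = 'collect',
    'fields.category_ids.distinct' = 'true',
    'file.format' = 'orc'
);
```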