StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0

【BUG】The CN of a shared-data cluster aborts when reading a Paimon table written by Flink #51509

Open mlbzssk opened 1 month ago

mlbzssk commented 1 month ago

We use Flink to write data to a Paimon catalog whose warehouse is located on COS. When we then use StarRocks to read the data, the CN aborts. StarRocks is running in shared-data mode.

Steps to reproduce the behavior (Required)

  1. Create a Paimon catalog in Flink
    CREATE CATALOG paimon_new WITH (
    'type' = 'paimon',
    'warehouse' = 'cosn://xxxx/paimon_new'
    );
  2. Create the table in Flink
    use catalog paimon_new;
    use `default`;
    CREATE TABLE pk_table_paimon_new (
    id INT,
    name STRING,
    age INT,
    PRIMARY KEY (id) NOT ENFORCED
    );
  3. Write data to the table in Flink
    SET execution.checkpointing.interval=10s;
    CREATE TABLE default_catalog.default_database.source_table (
    id INT,
    name STRING,
    age INT
    ) WITH (
    'connector' = 'datagen',
    'fields.id.min' = '0',
    'fields.id.max' = '100',
    'fields.name.length' = '8',
    'fields.age.min' = '10',
    'fields.age.max' = '80'
    );
    INSERT INTO pk_table_paimon_new
    SELECT * FROM default_catalog.default_database.source_table;
  4. Query the table in StarRocks
    CREATE EXTERNAL CATALOG paimon
    PROPERTIES
    (
    "type" = "paimon",
    "paimon.catalog.type" = "filesystem",
    "paimon.catalog.warehouse" = 'cosn://xxxx/paimon_new'
    );
    set catalog paimon;
    use `default`;
    select * from pk_table_paimon_new;

    Then the CN crashes and the FE reports: ERROR 1064 (HY000): Backend node not found. Check if any backend node is down.

cn.out contains no message about the crash. If you run explain analyze on the query, you get ERROR 1064 (HY000): Unknown error, and the FE throws an NPE. It should be fine if the table is not a primary key table, or once the table has been compacted (I have not tested this yet).

Flink version: 1.16.1, Paimon version: 0.8.2

Expected behavior (Required)

The query succeeds.

Real behavior (Required)

The query fails and the CN aborts.

StarRocks version (Required)

3.3.2

mlbzssk commented 1 month ago

The explain analyze problem may be related to this code: [image] It recurses indefinitely, which causes the thread to exit, so coord stays null and an NPE is thrown. [image]
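The failure chain described above (unbounded recursion, thread exit, null coord, NPE) can be illustrated with a small standalone sketch. This is not StarRocks FE code: the class `RecursionNpeSketch`, the `Coordinator` type, the `coord` holder, and the `describe` and `explainAnalyze` methods are all hypothetical names invented for the illustration. It only shows how a worker thread that dies from a StackOverflowError can leave a shared reference unset, so that the caller later hits a NullPointerException.

```java
import java.util.concurrent.atomic.AtomicReference;

public class RecursionNpeSketch {
    // Hypothetical stand-in for the coordinator the caller expects to exist.
    static final class Coordinator {
        String profile() { return "profile"; }
    }

    // Hypothetical holder that the worker thread is supposed to fill in.
    static final AtomicReference<Coordinator> coord = new AtomicReference<>();

    // Recursion with no base case: it always ends in a StackOverflowError.
    static int describe(int depth) {
        return describe(depth + 1) + 1;
    }

    static String explainAnalyze() {
        coord.set(null);
        Thread worker = new Thread(() -> {
            describe(0);                   // throws StackOverflowError
            coord.set(new Coordinator());  // never reached
        });
        // Swallow the error so the worker exits quietly, as in the report.
        worker.setUncaughtExceptionHandler((t, e) -> { });
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        try {
            // coord was never set, so dereferencing it throws an NPE.
            return coord.get().profile();
        } catch (NullPointerException e) {
            return "NPE";
        }
    }

    public static void main(String[] args) {
        System.out.println(explainAnalyze());
    }
}
```

In this sketch the NPE is only a symptom. The underlying defect is the recursion that kills the worker before it can publish its result, which matches the observation that the thread exits first and coord is null afterwards.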

Smith-Cruise commented 1 month ago

@miomiocat Are you aware of this problem?

Smith-Cruise commented 1 month ago

> The explain analyze problem may be related to this code: [image] It recurses indefinitely, which causes the thread to exit, so coord stays null and an NPE is thrown.

An FE NPE can't cause the CN to crash.

mlbzssk commented 1 month ago

Yes, there are two problems. First, without explain analyze, the CN aborts. Second, when I use explain analyze to investigate the first problem, the FE throws an NPE. I have found the cause of the second problem by debugging, but I have no idea about the first.