apache / amoro

Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
https://amoro.apache.org/
Apache License 2.0

[Bug]: Mixed-Format Unkeyed Table will read the full amount of data even if configured with 'scan.startup.mode'='latest' #2977

Closed lklhdu closed 21 hours ago

lklhdu commented 6 days ago

What happened?

When I read an unkeyed mixed-format Amoro table through Flink, the full table data is read even though 'scan.startup.mode'='latest' is configured. This is not the expected behavior.

Affects Versions

master

What table formats are you seeing the problem on?

Mixed-Iceberg, Mixed-Hive

What engines are you seeing the problem on?

Flink

How to reproduce

  1. Create an unkeyed mixed-format table:
    create table test_db.test_table (
    id int,
    name string, 
    age int
    ) using mixed_hive;
  2. Write some initial data:
    insert into test_db.test_table values (1,'name1',10);
    insert into test_db.test_table values (2,'name2',10);
  3. In Flink, set 'scan.startup.mode' to 'latest' so that reading starts from the current latest snapshot:
    select * from test_db.test_table
    /*+ OPTIONS('streaming'='true','arctic.read.mode'='file','source.parallelism' = '1','table.format'='MIXED_HIVE','scan.startup.mode'='latest') */;

    The query still reads the full table data, which is not expected.
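
One way to confirm the expected behavior is to leave the streaming query from step 3 running and then write one more row. The following is only a sketch: it reuses the table from the steps above, the third row is an illustrative value that does not appear in the original report, and it assumes the insert is issued from the same engine used in step 2.

```sql
-- Run while the streaming query from step 3 is still active.
-- The row values are illustrative, not taken from the original report.
insert into test_db.test_table values (3,'name3',10);

-- Expected with 'scan.startup.mode'='latest':
--   the streaming query emits only the new row (3,'name3',10).
-- Behavior described in this report:
--   the rows written in step 2 (ids 1 and 2) are also emitted.
```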

Relevant log output

No response

Anything else

No response
