apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Bug] Branch can only read data in latest schema #4407

Closed. gmdfalk closed this issue 2 weeks ago

gmdfalk commented 3 weeks ago

Paimon version

1.0-SNAPSHOT

Compute Engine

Flink

Minimal reproduce step

  1. Create table with schema version 1
  2. Insert some values
  3. Alter table to create schema version 2 (but don't insert any new values)
  4. Create tag & branch
  5. Select * from branch

The select job fails because the branch only contains schema version 2, while the existing data files were written with schema version 1 and the reader tries to load that schema. Example stacktrace: https://gist.github.com/gmdfalk/802eb18c912a4d85e17f206820a0c55a

What doesn't meet your expectations?

I expect the branch to be able to read all of its data, not just data written in the latest schema. Currently the branch only knows schema 2 and cannot read entries written in schema 1.

Anything else?

No response

liming30 commented 3 weeks ago

@gmdfalk Thanks for your report! I would like to add more detailed reproduction steps:

CREATE TABLE T (
  pt INT,
  k INT,
  v STRING,
  PRIMARY KEY (pt, k) NOT ENFORCED
) PARTITIONED BY (pt) WITH (
  'bucket' = '2'
)
  1. INSERT INTO T VALUES (1, 10, 'apple'), (1, 20, 'banana')
  2. ALTER TABLE T ADD (v2 INT)
  3. INSERT INTO T VALUES (2, 10, 'cat', 2), (2, 20, 'dog', 2)
  4. CALL sys.create_tag('default.T', 'tag1', 2)
  5. CALL sys.create_branch('default.T', 'test', 'tag1')
  6. SELECT * FROM T$branch_test

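The reason for this bug is that only the schema of the tag (or the latest schema) is copied when creating a branch. Data files written under an earlier schema still reference that schema, which the branch does not have, so reading them fails.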
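
To make the cause concrete, here is a minimal toy model in Python. It is not Paimon code: the `Table` class, the `create_branch` function, and the `copy_all_schemas` flag are all hypothetical names made up for illustration, and the schema ids follow the numbering used in this issue (1 and 2). It assumes that a branch shares the data files but gets its own copy of the schema files, and it sketches one possible fix (copying every schema up to the tag's schema), which may differ from how the project actually resolves the bug.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Toy table: schema files keyed by id, plus data files that each
    record the id of the schema they were written with."""
    schemas: dict = field(default_factory=dict)      # schema id -> column names
    data_files: list = field(default_factory=list)   # (schema id, rows)

    def read_all(self):
        rows = []
        for schema_id, file_rows in self.data_files:
            # A reader needs the writing schema to interpret each file.
            if schema_id not in self.schemas:
                raise KeyError(f"schema {schema_id} not found")
            rows.extend(file_rows)
        return rows


def create_branch(main, tag_schema_id, copy_all_schemas):
    """Hypothetical branch creation: data files are shared, schemas are copied."""
    if copy_all_schemas:
        # Possible fix: copy every schema up to and including the tag's schema.
        schemas = {i: s for i, s in main.schemas.items() if i <= tag_schema_id}
    else:
        # Behaviour described in this issue: only the tag's schema is copied.
        schemas = {tag_schema_id: main.schemas[tag_schema_id]}
    return Table(schemas=schemas, data_files=list(main.data_files))


# Mirror the reproduction steps: write under schema 1, alter, write under schema 2.
main = Table()
main.schemas[1] = ["pt", "k", "v"]
main.data_files.append((1, [(1, 10, "apple"), (1, 20, "banana")]))
main.schemas[2] = ["pt", "k", "v", "v2"]
main.data_files.append((2, [(2, 10, "cat", 2), (2, 20, "dog", 2)]))

broken = create_branch(main, tag_schema_id=2, copy_all_schemas=False)
fixed = create_branch(main, tag_schema_id=2, copy_all_schemas=True)
```

In this model `broken.read_all()` raises a `KeyError` for schema 1, while `fixed.read_all()` returns all four rows. A real reader would also map old rows onto the new columns, which the sketch leaves out.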