[Bug] Branch can only read data in latest schema

apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

Apache License 2.0

Search before asking

[X] I searched in the issues and found nothing similar.

Paimon version

1.0-SNAPSHOT

Compute Engine

Flink

Minimal reproduce step

Create table with schema version 1
Insert some values
Alter table to create schema version 2 (but don't insert any new values)
Create tag & branch
Select * from branch

The select job fails because the branch only contains schema version 2, while the existing data was written with schema version 1 and the reader tries to load that schema. Example stacktrace: https://gist.github.com/gmdfalk/802eb18c912a4d85e17f206820a0c55a

What doesn't meet your expectations?

I expect the branch to be able to read any data, not just data written in the latest schema. Currently the branch only knows schema 2 and cannot read entries written in schema 1.

Anything else?

No response

Are you willing to submit a PR?

[ ] I'm willing to submit a PR!

@gmdfalk Thanks for your report! Here are more detailed reproduction steps:

CREATE TABLE T (
  pt INT,
  k INT,
  v STRING,
  PRIMARY KEY (pt, k) NOT ENFORCED
) PARTITIONED BY (pt) WITH (
  'bucket' = '2'
);

INSERT INTO T VALUES (1, 10, 'apple'), (1, 20, 'banana');
ALTER TABLE T ADD (v2 INT);
INSERT INTO T VALUES (2, 10, 'cat', 2), (2, 20, 'dog', 2);
CALL sys.create_tag('default.T', 'tag1', 2);
CALL sys.create_branch('default.T', 'test', 'tag1');
SELECT * FROM T$branch_test;

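The reason for this bug is that creating a branch copies only one schema, either the tag's schema or the latest one. Data written with an earlier schema version therefore cannot be resolved on the branch.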
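
To make the failure mode concrete, here is a minimal toy model in Python. It is only a sketch of the behaviour described above, not Paimon's actual code: `Table`, `DataFile`, `create_branch`, `MissingSchemaError` and the `copy_all_schemas` flag are all hypothetical names, and the schema numbering (1 and 2) follows the wording of this issue.

```python
# Toy model only: these names are hypothetical and do not mirror Paimon's classes.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


class MissingSchemaError(Exception):
    """Raised when a data file references a schema the table does not have."""


@dataclass
class DataFile:
    name: str
    schema_id: int  # schema version the file was written with


@dataclass
class Table:
    schemas: Dict[int, List[str]] = field(default_factory=dict)
    files: List[DataFile] = field(default_factory=list)

    def read(self) -> List[Tuple[str, List[str]]]:
        """Resolve each file's columns through the schema it was written with."""
        resolved = []
        for f in self.files:
            if f.schema_id not in self.schemas:
                raise MissingSchemaError(
                    f"schema {f.schema_id} not found for file {f.name}"
                )
            resolved.append((f.name, self.schemas[f.schema_id]))
        return resolved


def create_branch(source: Table, copy_all_schemas: bool) -> Table:
    """Create a branch that shares the source's files.

    copy_all_schemas=False models the reported behaviour (one schema copied);
    copy_all_schemas=True models one possible fix (every schema copied).
    """
    if copy_all_schemas:
        schemas = dict(source.schemas)
    else:
        latest = max(source.schemas)
        schemas = {latest: source.schemas[latest]}
    return Table(schemas=schemas, files=list(source.files))


# Main table after the steps above: one file per schema version.
main_table = Table(
    schemas={1: ["pt", "k", "v"], 2: ["pt", "k", "v", "v2"]},
    files=[DataFile("data-1", schema_id=1), DataFile("data-2", schema_id=2)],
)

broken_branch = create_branch(main_table, copy_all_schemas=False)
fixed_branch = create_branch(main_table, copy_all_schemas=True)
```

In this model, `broken_branch.read()` raises `MissingSchemaError` for `data-1`, while `fixed_branch.read()` resolves both files. Copying every schema up to the tag's schema is just one way to fix it; the real fix may differ.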
