[Feature] Support for schema evolution commit snapshot

apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

https://paimon.apache.org/

Apache License 2.0

[Feature] Support for schema evolution commit snapshot #1267

Closed thexiay closed 1 year ago

thexiay commented 1 year ago

Search before asking

[X] I searched in the issues and found nothing similar.

Motivation

Paimon currently does not support committing a snapshot of the schema evolution type, which can cause problems. At present, just before a snapshot is committed, the latest schema file in the schema directory is looked up and its id is used as the schema id of that snapshot.

The moment at which the snapshot is committed may not be the moment at which the schema change actually took place. This is especially noticeable during database table synchronization. For example:

schema:
(
    id int,
    name string,
    age int
)
->
(
    id int,
    name string
)

input data:
(1, 'xixi', 20),
(2, 'haha', 21),
(3, 'ooo')

The schema evolution actually occurs between the second and third records, but the snapshot commit may happen only after the third record. The age field should still be queryable for the first two records, but because the latest schema file is used when the snapshot is committed, the age field is hidden and cannot be queried.
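
The effect can be reproduced with a small toy model. This is only a sketch of the behavior described above, not Paimon code: `SCHEMAS`, `commit_snapshot` and `read_snapshot` are hypothetical names, and rows are plain dicts.

```python
# Toy model only: these names are illustrative, not Paimon's API.
SCHEMAS = {
    1: ["id", "name", "age"],
    2: ["id", "name"],
}

def commit_snapshot(rows, schemas):
    """Commit all buffered rows as one snapshot, tagged with the
    newest schema id visible at commit time."""
    return {"schema_id": max(schemas), "rows": list(rows)}

def read_snapshot(snapshot, schemas):
    """Project every row onto the snapshot's schema; other fields are hidden."""
    fields = schemas[snapshot["schema_id"]]
    return [tuple(row.get(f) for f in fields) for row in snapshot["rows"]]

# Rows 1 and 2 arrive under schema-1, row 3 after the change to schema-2.
rows = [
    {"id": 1, "name": "xixi", "age": 20},
    {"id": 2, "name": "haha", "age": 21},
    {"id": 3, "name": "ooo"},
]

snapshot = commit_snapshot(rows, SCHEMAS)
print(read_snapshot(snapshot, SCHEMAS))
# [(1, 'xixi'), (2, 'haha'), (3, 'ooo')]
```

The age values 20 and 21 are still in the stored rows, but they are not returned because the single snapshot points at schema-2.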

Other data lakes such as Iceberg and Hudi all have this kind of snapshot: schema evolution.

Solution

No response

Anything else?

No response

Are you willing to submit a PR?

[ ] I'm willing to submit a PR!

JingsongLi commented 1 year ago

What is the behavior of schema evolution snapshot?

JingsongLi commented 1 year ago

Can you just create two commits for this case?

thexiay commented 1 year ago

What is the behavior of schema evolution snapshot?

A schema evolution snapshot only modifies the schema. The data and the manifests do not change.

thexiay commented 1 year ago

Can you just create two commits for this case?

Input data:

(1, 'xixi', 20),
(2, 'haha', 21),
(schema change: id, name, age -> id, name)
(3, 'ooo')

There are two schemas:

schema-1:
(id int, name string, age int)
schema-2:
(id int, name string)

Currently, the committed snapshot may be:

snapshot-1(with schema-2)(CommitKind.APPEND)

So if you query snapshot-1, the output looks like:

(1, 'xixi'),
(2, 'haha'),
(3, 'ooo')

The age field disappears.

But with a schema evolution snapshot, it may look like:

snapshot-1(with schema-1)(CommitKind.APPEND)
snapshot-2(with schema-2)(CommitKind.SCHEMA_CHANGE)
snapshot-3(with schema-2)(CommitKind.APPEND)

So if you query snapshot-1, the output looks like:

(1, 'xixi', 20),
(2, 'haha', 21)
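
The proposed timeline can be sketched as a small data structure. This is a toy model under stated assumptions: `SCHEMA_CHANGE` is the commit kind proposed in this issue and is hypothetical, and `Snapshot`, `TIMELINE` and `query` are illustrative names rather than Paimon classes.

```python
from dataclasses import dataclass, field
from enum import Enum

class CommitKind(Enum):
    APPEND = "APPEND"
    SCHEMA_CHANGE = "SCHEMA_CHANGE"  # proposed in this issue, hypothetical

@dataclass
class Snapshot:
    id: int
    schema_id: int
    kind: CommitKind
    new_rows: list = field(default_factory=list)  # rows added by this snapshot

SCHEMAS = {1: ["id", "name", "age"], 2: ["id", "name"]}

TIMELINE = [
    Snapshot(1, 1, CommitKind.APPEND, [
        {"id": 1, "name": "xixi", "age": 20},
        {"id": 2, "name": "haha", "age": 21},
    ]),
    Snapshot(2, 2, CommitKind.SCHEMA_CHANGE),  # schema only, no new data
    Snapshot(3, 2, CommitKind.APPEND, [{"id": 3, "name": "ooo"}]),
]

def query(snapshot_id):
    """Read all rows visible at snapshot_id, projected onto its schema."""
    visible = [s for s in TIMELINE if s.id <= snapshot_id]
    fields = SCHEMAS[visible[-1].schema_id]
    return [tuple(r.get(f) for f in fields)
            for s in visible for r in s.new_rows]

print(query(1))  # [(1, 'xixi', 20), (2, 'haha', 21)]
print(query(3))  # [(1, 'xixi'), (2, 'haha'), (3, 'ooo')]
```

In this model snapshot-2 carries no rows at all; it only moves the schema id forward, which matches the idea that the data and manifests stay unchanged.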

JingsongLi commented 1 year ago

Paimon just controls the schema separately. You can take a look at: https://paimon.apache.org/docs/master/concepts/file-operations/

So here we can just create two commits: snapshot-1(with schema-1) and snapshot-2(with schema-2).
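
A minimal sketch of the two-commit idea, assuming the sync job commits its buffered rows before it registers the new schema. `SyncWriter` and its methods are hypothetical and only illustrate the ordering; they are not Paimon's writer API.

```python
class SyncWriter:
    """Toy writer: buffers rows and commits them before any schema change."""

    def __init__(self, fields):
        self.schemas = {1: list(fields)}
        self.buffer = []
        self.snapshots = []

    def write(self, row):
        self.buffer.append(row)

    def commit(self):
        # Tag the snapshot with the newest schema id at commit time.
        if self.buffer:
            self.snapshots.append(
                {"id": len(self.snapshots) + 1,
                 "schema_id": max(self.schemas),
                 "rows": self.buffer})
            self.buffer = []

    def change_schema(self, fields):
        self.commit()  # force a commit under the old schema first
        self.schemas[max(self.schemas) + 1] = list(fields)

writer = SyncWriter(["id", "name", "age"])
writer.write((1, "xixi", 20))
writer.write((2, "haha", 21))
writer.change_schema(["id", "name"])
writer.write((3, "ooo"))
writer.commit()

print([(s["id"], s["schema_id"]) for s in writer.snapshots])
# [(1, 1), (2, 2)]
```

Because the first two rows are committed while schema-1 is still the latest schema, snapshot-1 keeps the age field without needing a new commit kind.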

thexiay commented 1 year ago

Paimon just controls the schema separately. You can take a look at: https://paimon.apache.org/docs/master/concepts/file-operations/

So here we can just create two commits: snapshot-1(with schema-1) and snapshot-2(with schema-2).

Do you mean that a new snapshot will be forced to be committed after the schema file is generated? I'm going to try this case.

JingsongLi commented 1 year ago

Yes