Closed thexiay closed 1 year ago
What is the behavior of a schema evolution snapshot?
Can you just create two commits for this case?
A schema evolution snapshot only modifies the schema. The data files and manifests do not change.
Can you just create two commits for this case?
Input data:
(1, 'xixi', 20),
(2, 'haha', 21),
(schema change: id, name, age -> id, name)
(3, 'ooo')
There are two schemas:
schema-1:
(id int, name string, age int)
schema-2:
(id int, name string)
Currently, the committed snapshot may be:
snapshot-1(with schema-2)(CommitKind.APPEND)
So if you query snapshot-1, the output looks like:
(1, 'xixi'),
(2, 'haha'),
(3, 'ooo')
The age field disappears.
But with a schema evolution snapshot, it may look like:
snapshot-1(with schema-1)(CommitKind.APPEND)
snapshot-2(with schema-2)(CommitKind.SCHEMA_CHANGE)
snapshot-3(with schema-2)(CommitKind.APPEND)
So if you query snapshot-1, the output looks like:
(1, 'xixi', 20),
(2, 'haha', 21)
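The difference between the two layouts can be sketched with a toy model. This is only an illustration of the example above, not Paimon's API: `Snapshot`, `SCHEMAS`, and `query` are hypothetical names, and rows are kept in memory rather than in data files.

```python
from dataclasses import dataclass

# Toy model only: these names are illustrative and are not Paimon's API.
SCHEMAS = {
    1: ("id", "name", "age"),
    2: ("id", "name"),
}


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    schema_id: int
    kind: str         # "APPEND" or "SCHEMA_CHANGE"
    rows: tuple = ()  # rows added by this snapshot, stored as dicts


def query(snapshots, snapshot_id):
    """Read every row visible at snapshot_id, projected onto that snapshot's schema."""
    target = next(s for s in snapshots if s.snapshot_id == snapshot_id)
    fields = SCHEMAS[target.schema_id]
    visible = [r for s in snapshots if s.snapshot_id <= snapshot_id for r in s.rows]
    return [tuple(r.get(f) for f in fields) for r in visible]


old_rows = (
    {"id": 1, "name": "xixi", "age": 20},
    {"id": 2, "name": "haha", "age": 21},
)
new_rows = ({"id": 3, "name": "ooo"},)

# Current behavior: a single APPEND snapshot stamped with the latest schema.
single = [Snapshot(1, 2, "APPEND", old_rows + new_rows)]

# Proposed behavior: the schema change gets its own snapshot.
split = [
    Snapshot(1, 1, "APPEND", old_rows),
    Snapshot(2, 2, "SCHEMA_CHANGE"),
    Snapshot(3, 2, "APPEND", new_rows),
]

print(query(single, 1))  # age is projected away for every row
print(query(split, 1))   # age is still visible for the first two rows
print(query(split, 3))   # same result as the single-snapshot case
```

In this model the schema-change snapshot carries no rows, which matches the statement that data and manifests do not change.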
Paimon manages the schema separately. You can take a look at: https://paimon.apache.org/docs/master/concepts/file-operations/
So here we can just create two commits: snapshot-1 (with schema-1) and snapshot-2 (with schema-2).
Do you mean that a new snapshot will be forced to be committed after the schema file is generated? I'm going to try this case.
Yes
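A minimal sketch of the two-commit idea, assuming a writer that buffers rows and flushes them whenever the schema changes. `ToyWriter` and its methods are hypothetical and do not correspond to Paimon's writer or committer classes.

```python
class ToyWriter:
    """Buffers rows and commits them as snapshots; illustrative only, not Paimon's writer."""

    def __init__(self, schema_id):
        self.schema_id = schema_id
        self.pending = []
        self.snapshots = []  # list of (snapshot_id, schema_id, rows)

    def write(self, row):
        self.pending.append(row)

    def commit(self):
        # Nothing to commit if no rows were written since the last snapshot.
        if not self.pending:
            return
        snapshot_id = len(self.snapshots) + 1
        self.snapshots.append((snapshot_id, self.schema_id, tuple(self.pending)))
        self.pending = []

    def change_schema(self, new_schema_id):
        # Force a commit so rows written under the old schema keep the old schema id.
        self.commit()
        self.schema_id = new_schema_id


writer = ToyWriter(schema_id=1)
writer.write((1, "xixi", 20))
writer.write((2, "haha", 21))
writer.change_schema(2)  # flushes snapshot-1 with schema-1
writer.write((3, "ooo"))
writer.commit()          # snapshot-2 with schema-2

for snapshot in writer.snapshots:
    print(snapshot)
```

With this ordering no dedicated schema-change commit kind is needed: each snapshot is stamped with the schema that was current when its rows were written.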
Motivation
Currently Paimon does not support committing a snapshot of the schema evolution type, and this can cause problems. Before a snapshot is committed, the latest file under the schema directory is looked up, and its id is used as the schema id of that snapshot.
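The lookup described above can be sketched roughly as follows. This is an assumption-laden illustration, not Paimon's code: it assumes schema files are named `schema-<id>`, and `latest_schema_id` and `commit_snapshot` are hypothetical helpers.

```python
import re

# Assumption: schema files are named "schema-<id>". This is a sketch, not Paimon's code.
SCHEMA_FILE = re.compile(r"^schema-(\d+)$")


def latest_schema_id(file_names):
    """Return the highest schema id among the given file names."""
    ids = [int(m.group(1)) for name in file_names if (m := SCHEMA_FILE.match(name))]
    if not ids:
        raise ValueError("no schema files found")
    return max(ids)


def commit_snapshot(snapshot_id, schema_files):
    """Stamp the snapshot with whatever schema is newest at commit time."""
    return {
        "id": snapshot_id,
        "schema_id": latest_schema_id(schema_files),
        "commit_kind": "APPEND",
    }


# schema-2 was written before the commit, so the snapshot gets schema id 2
# even if some of its rows were produced under schema-1.
print(commit_snapshot(1, ["schema-1", "schema-2"]))
```

The ids are compared numerically rather than as strings, so `schema-10` is treated as newer than `schema-9`.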
The moment at which the snapshot is committed may not be the moment at which the schema change actually happened. This is especially noticeable during database table synchronization. For example:
In fact, the schema evolution occurs between the second and third records, but the snapshot commit may happen only after the third record. The age field should still be queryable for the first two records, but because the latest schema file is used when the snapshot is committed, the age field is hidden and cannot be queried.
Other data lakes such as Iceberg and Hudi all have this kind of schema evolution snapshot.