apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Hudi schema evolution, Null for oldest values #6496

Closed: Armelabdelkbir closed this issue 1 year ago

Armelabdelkbir commented 2 years ago

Hi everyone, I'm trying to test schema evolution for my CDC pipeline (Debezium + Kafka) with Hudi 0.11.0 and Spark Structured Streaming, following this documentation: https://hudi.apache.org/docs/0.11.0/schema_evolution. Does Hudi handle schema evolution well? It is necessary to restart the job, and once that is done all the old values become null: only the values from the latest commits are kept, so the data no longer matches my Postgres source. In my schema registry I can see both V1 and V2.

Any ideas? Thanks.

A clear and concise description of the problem.
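
For context, here is a minimal sketch of the kind of Structured Streaming writer such a pipeline typically uses, with the schema-on-read switch from the linked 0.11 docs enabled. Every name, path, and key field below is a placeholder rather than something taken from this report:

    // A hedged sketch, not taken from this issue: a streaming Hudi sink with the
    // schema-on-read flag from the 0.11 schema evolution docs. All values (table name,
    // paths, key fields) are placeholders, and Debezium deserialization is elided.
    val cdc: org.apache.spark.sql.DataFrame = ???   // deserialized Debezium records from Kafka

    cdc.writeStream
      .format("hudi")
      .option("hoodie.table.name", "evolution")
      .option("hoodie.datasource.write.recordkey.field", "id")   // assumed record key column
      .option("hoodie.datasource.write.precombine.field", "ts")  // assumed ordering column
      .option("hoodie.schema.on.read.enable", "true")            // schema evolution flag per the 0.11 docs
      .option("checkpointLocation", "/checkpoints/evolution")    // placeholder checkpoint path
      .start("/data/hudi/evolution")                             // placeholder base path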

To Reproduce

Steps to reproduce the behavior:

  1. Stop the Hudi streams and drop the Hive tables.
  2. Add some columns: ALTER TABLE ADD COLUMN character varying(50) DEFAULT 'toto' ;
  3. Restart the Hudi Spark jobs.
  4. Select * from the Hudi _ro / _rt tables (or read the Hudi parquet files with Spark).

Expected behavior

When I select my data, I expect to see the default value in the added column, not null values.

Data on the Postgres source:

    cdc_hudi=> select test, test2, test3 from hudipart
    ;
     test | test2 | test3 
    ------+-------+-------
     toto | f     | Toto
     test | t     | Toto
     test | t     | Toto
     test | t     | Toto
     toto | f     | Toto
     toto | f     | Toto
     toto | f     | Toto
     toto | f     | Toto
     toto | f     | Toto
     toto | f     | Toto
     toto | f     | Toto
     test | t     | Toto
     test | t     | Toto
     test | t     | test3
     test | t     | test3
     toto | f     | Toto
     toto | f     | Toto
     toto | f     | Toto
     test | t     | test3
     toto | f     | Toto
     toto | f     | Toto
     toto | f     | Toto

Data in the Hudi parquet files / Hive tables:

    spark.sql("select _hoodie_commit_time as commitTime, test, test2, test3 from evolution ").show()
    ---------------------------------------
    +-----------------+----+-----+-----+
    |       commitTime|test|test2|test3|
    +-----------------+----+-----+-----+
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824132039517|null| null| null|
    |20220824132113066|null| null| null|
    |20220824132113066|null| null| null|
    |20220824132934016|test| true| null|
    |20220824135050368|test| true| null|
    |20220824135411903|test| true| null|
    |20220824135446080|test| true| null|
    |20220824135921176|test| true|test3|
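
One hedged way to narrow this down (not part of the original report) is to read the table directly with Spark, bypassing the Hive _ro/_rt views, and check the evolved columns per commit; the base path below is assumed:

    // Hedged read-side check: load the Hudi table directly with the same schema-on-read
    // flag to confirm whether the older file slices really store nulls for the evolved
    // columns or only show up that way through Hive. The base path is a placeholder.
    val df = spark.read.format("hudi")
      .option("hoodie.schema.on.read.enable", "true")
      .load("/data/hudi/evolution")

    df.select("_hoodie_commit_time", "test", "test2", "test3")
      .orderBy("_hoodie_commit_time")
      .show(25, false)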

Environment Description

• Hudi version: 0.11.0

• Spark version: 3.1.4

• Hive version: 1.2.1000

• Hadoop version: 2.7.3

• Storage: HDFS

• Schema Registry

• Kafka

• Debezium

xiarixiaoyao commented 2 years ago

@Armelabdelkbir Spark currently does not support default values; maybe https://github.com/apache/spark/pull/36672/files can help you. Thanks.
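
As a stopgap, and separate from the patch mentioned above, a hedged query-side workaround is to backfill the missing defaults when reading. The path, column names, and default values below are assumptions based on the examples in this issue:

    // Not the referenced patch, just a hedged query-side workaround: nulls in rows written
    // before the ALTER are backfilled with the intended defaults at read time. Path,
    // column names, and defaults mirror this issue's examples and are assumptions.
    import org.apache.spark.sql.functions.{coalesce, col, lit}

    val patched = spark.read.format("hudi")
      .load("/data/hudi/evolution")                              // placeholder base path
      .withColumn("test3", coalesce(col("test3"), lit("Toto")))  // assumed Postgres default for test3
      .na.fill(Map("test" -> "toto"))                            // same idea for the other added column

    patched.select("_hoodie_commit_time", "test", "test2", "test3").show(false)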

xiarixiaoyao commented 2 years ago

@Armelabdelkbir If you need this for Spark 3.1.x, please raise a PR and I will fix it as soon as possible.

codope commented 1 year ago

Closing as the issue has been triaged and we have a patch, as suggested in https://github.com/apache/hudi/issues/6496#issuecomment-1228023484