apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Hudi schema evolution, Null for oldest values #6496

Closed: Armelabdelkbir closed this issue 1 year ago

Armelabdelkbir commented 2 years ago

Hi everyone, I'm trying to test schema evolution for my CDC pipeline (Debezium + Kafka) with Hudi 0.11.0 and Spark Structured Streaming, following this documentation: https://hudi.apache.org/docs/0.11.0/schema_evolution. Does Hudi handle schema evolution well? It is necessary to restart the job, and once that is done all the old values become null: only the values from the latest commits are kept, so the data no longer matches my Postgres source. In my schema registry I can see both V1 and V2.

Any ideas? Thanks.

A clear and concise description of the problem.
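
For context, here is a minimal sketch of the kind of Structured Streaming writer such a pipeline typically uses, with the schema-on-read switch from the linked 0.11 docs enabled. Every name, path, and key field below is a placeholder rather than something taken from this report:

    // A hedged sketch, not taken from this issue: a streaming Hudi sink with the
    // schema-on-read flag from the 0.11 schema evolution docs. All values (table name,
    // paths, key fields) are placeholders, and Debezium deserialization is elided.
    val cdc: org.apache.spark.sql.DataFrame = ???   // deserialized Debezium records from Kafka

    cdc.writeStream
      .format("hudi")
      .option("hoodie.table.name", "evolution")
      .option("hoodie.datasource.write.recordkey.field", "id")   // assumed record key column
      .option("hoodie.datasource.write.precombine.field", "ts")  // assumed ordering column
      .option("hoodie.schema.on.read.enable", "true")            // schema evolution flag per the 0.11 docs
      .option("checkpointLocation", "/checkpoints/evolution")    // placeholder checkpoint path
      .start("/data/hudi/evolution")                             // placeholder base path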

To Reproduce

Steps to reproduce the behavior:

  1. Stop the Hudi streams and drop the Hive tables.
  2. Add some columns: ALTER TABLE ADD COLUMN character varying(50) DEFAULT 'toto' ;
  3. Restart the Hudi Spark jobs.
  4. Select * from the Hudi _ro / _rt tables (or read the Hudi parquet files with Spark).

Expected behavior

When I select my data, I expect to see the default value in the added column, not null values.

Data on the Postgres source:

    cdc_hudi=> select test, test2, test3 from hudipart
    ;
     test | test2 | test3 
    ------+-------+-------
     toto | f     | Toto
     test | t     | Toto
     test | t     | Toto
     test | t     | Toto
     toto | f     | Toto
     toto | f     | Toto
     toto | f     | Toto
     toto | f     | Toto
     toto | f     | Toto
     toto | f     | Toto
     toto | f     | Toto
     test | t     | Toto
     test | t     | Toto
     test | t     | test3
     test | t     | test3
     toto | f     | Toto
     toto | f     | Toto
     toto | f     | Toto
     test | t     | test3
     toto | f     | Toto
     toto | f     | Toto
     toto | f     | Toto

Data in the Hudi parquet files / Hive tables:

    spark.sql("select _hoodie_commit_time as commitTime, test, test2, test3 from evolution ").show()
    ---------------------------------------
    +-----------------+----+-----+-----+
    |       commitTime|test|test2|test3|
    +-----------------+----+-----+-----+
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824102514494|null| null| null|
    |20220824132039517|null| null| null|
    |20220824132113066|null| null| null|
    |20220824132113066|null| null| null|
    |20220824132934016|test| true| null|
    |20220824135050368|test| true| null|
    |20220824135411903|test| true| null|
    |20220824135446080|test| true| null|
    |20220824135921176|test| true|test3|
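
One hedged way to narrow this down (not part of the original report) is to read the table directly with Spark, bypassing the Hive _ro/_rt views, and check the evolved columns per commit; the base path below is assumed:

    // Hedged read-side check: load the Hudi table directly with the same schema-on-read
    // flag to confirm whether the older file slices really store nulls for the evolved
    // columns or only show up that way through Hive. The base path is a placeholder.
    val df = spark.read.format("hudi")
      .option("hoodie.schema.on.read.enable", "true")
      .load("/data/hudi/evolution")

    df.select("_hoodie_commit_time", "test", "test2", "test3")
      .orderBy("_hoodie_commit_time")
      .show(25, false)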

Environment Description

• Hudi version: 0.11.0

• Spark version: 3.1.4

• Hive version: 1.2.1000

• Hadoop version: 2.7.3

• Storage: HDFS

• Schema Registry

• Kafka

• Debezium

xiarixiaoyao commented 2 years ago

@Armelabdelkbir Spark currently does not support default values; maybe https://github.com/apache/spark/pull/36672/files can help you. Thanks.
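
As a stopgap, and separate from the patch mentioned above, a hedged query-side workaround is to backfill the missing defaults when reading. The path, column names, and default values below are assumptions based on the examples in this issue:

    // Not the referenced patch, just a hedged query-side workaround: nulls in rows written
    // before the ALTER are backfilled with the intended defaults at read time. Path,
    // column names, and defaults mirror this issue's examples and are assumptions.
    import org.apache.spark.sql.functions.{coalesce, col, lit}

    val patched = spark.read.format("hudi")
      .load("/data/hudi/evolution")                              // placeholder base path
      .withColumn("test3", coalesce(col("test3"), lit("Toto")))  // assumed Postgres default for test3
      .na.fill(Map("test" -> "toto"))                            // same idea for the other added column

    patched.select("_hoodie_commit_time", "test", "test2", "test3").show(false)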

xiarixiaoyao commented 2 years ago

@Armelabdelkbir If you need this for Spark 3.1.x, please raise a PR and I will fix it as soon as possible.

codope commented 1 year ago

Closing as the issue has been triaged and we have a patch, as suggested in https://github.com/apache/hudi/issues/6496#issuecomment-1228023484