apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Upgrade from 0.6.0 to 0.15.0 #11738

Open wangjunjie-lnnf opened 2 months ago

wangjunjie-lnnf commented 2 months ago

We have many Hudi tables created with version 0.6.0 and want to upgrade them to version 0.14.1 or 0.15.0, so we ran some tests. When we write to a 0.6.0 table with the 0.15.0 client, an error occurs.

To Reproduce

Steps to reproduce the behavior:

  1. create a table with 0.6.0
  2. write to that table with 0.15.0 (a minimal sketch of such a write follows below)
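
For illustration, a minimal write with the 0.15.0 client against an existing table might look roughly like the following; the base path, table name, and field names are hypothetical placeholders, not the reporter's actual table:

import org.apache.spark.sql.SaveMode

// df holds a batch of new/updated rows for the existing table
// (hypothetical schema with record key "id", precombine field "x", partition field "y")
df.write.format("hudi")
  .option("hoodie.table.name", "old_table")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "x")
  .option("hoodie.datasource.write.partitionpath.field", "y")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save("/path/to/table_created_with_0.6.0")

Against a table created with 0.6.0, this kind of write fails with the stack trace below.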

Expected behavior

The upgrade succeeds.


Stacktrace

Exception in thread "main" org.apache.hudi.exception.HoodieException: Config conflict(key   current value   existing value):
RecordKey:  id  null
    at org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:229)
    at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:232)
    at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97)

The reason is that tables created with 0.6.0 do not record the field info in hoodie.properties, i.e. the following keys are missing:

hoodie.table.precombine.field=x
hoodie.table.partition.fields=y
hoodie.table.recordkey.fields=x

When writing with the 0.15.0 client, however, these values are validated against the write options, and that validation is where the error occurs. The validation should be skipped when the table version is too low to have the needed info in hoodie.properties:

object HoodieSparkSqlWriter {

  private def writeInternal(sqlContext: SQLContext, 
                            mode: SaveMode,
                            optParams: Map[String, String],
                            ...) {

    var tableConfig = getHoodieTableConfig(sparkContext, path, mode, ...)

    // validate that the write options are consistent with hoodie.properties
    // (hoodie.properties from older releases does not record the field info)
    validateTableConfig(sqlContext.sparkSession, optParams, tableConfig, mode == SaveMode.Overwrite)
  }
}
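
A rough sketch of the guard being suggested (this is not the actual Hudi code; it assumes tableConfig.getProps exposes the raw hoodie.properties entries and uses the key names shown above):

// sketch only: skip the option-vs-hoodie.properties validation when the table
// predates the field info, e.g. a hoodie.properties written by 0.6.0
val recordsFieldInfo = tableConfig != null &&
  tableConfig.getProps.containsKey("hoodie.table.recordkey.fields")
if (recordsFieldInfo) {
  validateTableConfig(sqlContext.sparkSession, optParams, tableConfig, mode == SaveMode.Overwrite)
}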

When we filled in the field info in the 0.6.0 table's hoodie.properties ourselves, the upgrade succeeded.

By the way: is it safe to upgrade directly from 0.6.0 to 0.15.0 or 0.14.1?

ad1happy2go commented 2 months ago

@wangjunjie-lnnf We should not upgrade directly from 0.6 to 0.14. Normally we should upgrade step by step, based on the table version.

wangjunjie-lnnf commented 2 months ago

https://hudi.apache.org/releases/release-0.14.0#migration-guide

When running a Hudi job with version 0.14.0 on a table with an older table version, an automatic upgrade process is triggered to bring the table up to version 6. This upgrade is a one-time occurrence for each Hudi table, as the hoodie.table.version is updated in the property file upon completion of the upgrade.

The migration guide says the upgrade happens automatically, but the error above occurs before the automatic upgrade runs. When we bypass HoodieWriterUtils$.validateTableConfig(...), the automatic upgrade succeeds.
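
As a quick sanity check that the automatic upgrade actually ran, the table version recorded in hoodie.properties can be compared before and after the write; the path below is a hypothetical local-filesystem base path (a table on HDFS/S3 would need a Hadoop FileSystem instead):

import java.io.FileInputStream
import java.util.Properties

// read .hoodie/hoodie.properties under the table base path
val props = new Properties()
val in = new FileInputStream("/path/to/table_created_with_0.6.0/.hoodie/hoodie.properties")
try props.load(in) finally in.close()

// an old table reports a low value (or may lack other keys entirely);
// after a successful write with 0.14.x/0.15.x it should read 6
println(props.getProperty("hoodie.table.version"))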

ad1happy2go commented 2 months ago

@wangjunjie-lnnf Technically it can upgrade, but we have never tested upgrading a table from such a low version to a much higher one, so you might hit unforeseen issues, including ones related to data consistency.

ad1happy2go commented 2 months ago

@wangjunjie-lnnf I suggest you upgrade gradually: to 0.10.1 first, then 0.12.3, and then 0.14.1.

Gatsby-Lee commented 2 months ago

@wangjunjie-lnnf Hudi supports the version upgrade when the table version increments by one. I documented how I upgraded from 0.10.1 to 0.12.2: https://medium.com/@life-is-short-so-enjoy-it/aws-hudi-upgrade-to-0-12-2-from-0-10-1-emr-on-eks-48c300aa2c53

Each release has different bugs that might affect you. If you want to be safe, you should go through each release, or at least every release that changes the table version; then you can skip some minor versions.

If you don't have time and you know everything works in 0.14.1 (figuring out the many new and deprecated configs is important), then you can just rewrite the whole dataset with the latest version; it is not too bad.
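
For reference, such a full rewrite with the latest client might look roughly like this; the paths, table name, and field names are hypothetical and would have to match the original table's key, precombine, and partition layout:

import org.apache.spark.sql.SaveMode

// read the old table with the new client and strip the Hudi metadata columns
val old = spark.read.format("hudi").load("/path/to/table_created_with_0.6.0")
val data = old.drop(old.columns.filter(_.startsWith("_hoodie")): _*)

// rewrite everything into a fresh table created by the latest version
data.write.format("hudi")
  .option("hoodie.table.name", "rewritten_table")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "x")
  .option("hoodie.datasource.write.partitionpath.field", "y")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .mode(SaveMode.Overwrite)
  .save("/path/to/rewritten_table")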