apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Different keygen class assigned by Hudi in 0.11.1 and 0.12.1 while creating a table with multiple primary keys #7294

Open nikhilindikuzha opened 1 year ago

nikhilindikuzha commented 1 year ago

Hi Team, whenever I try to create a Hudi table with multiple primary keys in Hudi 0.12.1, it generates hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator, but the same create statement in Hudi 0.11.1 assigns the complex key generator class. Any idea why?

Sample code:

spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle_2.12:0.12.1 \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.QuickstartUtils._
import org.apache.spark.sql.DataFrame

spark.sql("""create table .f_test5(
id string,
name string,
age string,
salary string,
upd_ts string,
job string
)
using hudi
partitioned by (job)
location 'gs:///HUDI/f_test5/'
options (
type = 'cow',
primaryKey = 'id,name',
preCombineField = 'upd_ts'
)""")

Discussion in Hudi Slack channel: https://apache-hudi.slack.com/archives/C4D716NPQ/p1668596344997409
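One way to sidestep a version-dependent default (a workaround sketch, not a confirmed fix) is to pin the key generator explicitly at create time, so that neither 0.11.x nor 0.12.x has to infer it. `hoodie.datasource.write.keygenerator.class` is the standard Hudi write config for this; whether the SQL `options` clause honors it in your exact version is an assumption worth verifying against `hoodie.properties` afterwards:

```scala
// Same shape of create statement as above, but with the key generator
// pinned explicitly instead of letting Hudi pick Simple vs Complex.
// Table name and local path here are illustrative.
spark.sql("""create table f_test5_pinned(
id string,
name string,
age string,
salary string,
upd_ts string,
job string
)
using hudi
partitioned by (job)
location 'file:///tmp/f_test5_pinned/'
options (
type = 'cow',
primaryKey = 'id,name',
preCombineField = 'upd_ts',
'hoodie.datasource.write.keygenerator.class' = 'org.apache.hudi.keygen.ComplexKeyGenerator'
)""")
```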

codope commented 1 year ago

Looks like the behavior changed in https://github.com/apache/hudi/commit/6fee77b76fa25be677b324804eae9801ca6b9f4c. I think we should keep the complex keygen while overriding timestamp-type handling. cc @alexeykudinkin @xiarixiaoyao Any idea why we removed ComplexKeyGenerator as the base class?

nsivabalan commented 1 year ago

While we are at it, can you also explore how this change would work for existing tables? For example, with 0.11.0, if a user did not explicitly set the key generator, ComplexKeyGenerator would be picked. If the user then upgrades to 0.12.1, wouldn't the default be chosen as SimpleKeyGenerator? Or is that already taken care of?
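The default everyone in this thread expects can be sketched as a tiny function. This is only an illustration of the intended behavior, not Hudi's actual selection code, and the function name is made up:

```scala
// Illustration of the default discussed above (not Hudi's code):
// more than one record-key field should imply ComplexKeyGenerator,
// exactly one field SimpleKeyGenerator.
def expectedDefaultKeyGen(recordKeyFields: Seq[String]): String =
  if (recordKeyFields.length > 1) "org.apache.hudi.keygen.ComplexKeyGenerator"
  else "org.apache.hudi.keygen.SimpleKeyGenerator"
```

Under this rule, `primaryKey = 'id,name'` should always land on the complex key generator, which is what 0.11.1 did and 0.12.1 reportedly does not.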

codope commented 1 year ago

I attempted to reproduce with the script below (note that I created the table with 0.11.1 and did the insert with 0.12.1). The issue described by the OP is reproducible, but the upgrade does not override the keygen property in hoodie.properties.

import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.QuickstartUtils._
import org.apache.spark.sql.DataFrame

spark.sql("""create table f_test111(
id string,
name string,
age string,
salary string,
upd_ts string,
job string
)
using hudi
partitioned by (job)
location 'file:///tmp/f_test111/'
options (
type = 'cow',
primaryKey = 'id,name',
preCombineField = 'upd_ts'
)""")

spark.sql("""insert into f_test111 values('a1', 'sagar', '32', '1000', '100', 'se')""")
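A quick way to check which key generator actually got recorded, independent of Spark, is to read the table's `.hoodie/hoodie.properties` directly (it is a plain Java properties file). A minimal plain-Scala sketch; `keyGeneratorOf` is a hypothetical helper name, and the base path is the `/tmp/f_test111` location from the script above:

```scala
import java.io.FileInputStream
import java.util.Properties

// Read the key generator class recorded in a Hudi table's metadata.
// `basePath` should point at the table location, e.g. /tmp/f_test111;
// returns None if the property is absent.
def keyGeneratorOf(basePath: String): Option[String] = {
  val props = new Properties()
  val in = new FileInputStream(s"$basePath/.hoodie/hoodie.properties")
  try props.load(in) finally in.close()
  Option(props.getProperty("hoodie.table.keygenerator.class"))
}
```

For the table above, `println(keyGeneratorOf("/tmp/f_test111"))` shows whether the table kept ComplexKeyGenerator or was switched to SimpleKeyGenerator.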

jonvex commented 1 year ago

I put this test in TestCreateTable.scala and it succeeds without error on both the 0.12.1 release branch and the current master.

  test("Test Multiple Primary Key Default Keygen") {
    withTempDir { tmp =>
      val tableName = generateTableName
      spark.sql(
        s"""
           |create table $tableName (
           |  id int,
           |  name string,
           |  price double,
           |  ts long
           |) using hudi
           | partitioned by (name)
           | tblproperties (
           |  primaryKey = 'id,price',
           |  preCombineField = 'ts',
           |  type = 'cow'
           | )
           | location '${tmp.getCanonicalPath}'
       """.stripMargin)
      val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier(tableName))
      val tablePath = table.storage.properties("path")
      val metaClient = HoodieTableMetaClient.builder()
        .setBasePath(tablePath)
        .setConf(spark.sessionState.newHadoopConf())
        .build()
      val tableConfig = metaClient.getTableConfig.getProps.asScala.toMap
      assertResult("org.apache.hudi.keygen.ComplexKeyGenerator")(tableConfig(HoodieTableConfig.KEY_GENERATOR_CLASS_NAME.key()))
      val source = scala.io.Source.fromFile(tmp.getCanonicalPath + "/.hoodie/hoodie.properties")
      val lines = try source.mkString finally source.close()
      assertResult(lines.contains("hoodie.table.recordkey.fields=id,price"))(true)
      assertResult(lines.contains("hoodie.table.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator"))(true)
    }
  }

jonvex commented 1 year ago

I just tried my test with options instead of tblproperties and it still passed, so I'm not sure what else there is to try.

nsivabalan commented 1 year ago

@nikhilindikuzha: it looks like we could not reproduce this. Could you give us a reproducible script/runbook? Feel free to close the issue if you cannot reproduce it either.

nsivabalan commented 1 year ago

@nikhilindikuzha: any updates here, please?