apache / hudi


[SUPPORT] Hudi primary key config is case-sensitive #11776

Open stream2000 opened 1 month ago

stream2000 commented 1 month ago


Describe the problem you faced

8350 [task-result-getter-0] WARN org.apache.spark.scheduler.TaskSetManager [] - Lost task 0.0 in stage 9.0 (TID 8) (30.221.115.93 executor driver): org.apache.hudi.exception.HoodieKeyException: recordKey value: "null" for field: "ID" cannot be null or empty.

Currently, the primary key config must be in lower case.
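The failure mode can be sketched outside Hudi: the configured key field `'ID'` is looked up case-sensitively against a record whose schema spells the field `id`, so the lookup returns null and the key generator throws. A minimal Java sketch of that mechanism (hypothetical helper names, not Hudi's actual internals):

```java
import java.util.HashMap;
import java.util.Map;

public class CaseSensitiveLookup {
    // Hypothetical stand-in for a key generator's field lookup; a real Avro
    // record lookup is likewise case-sensitive on field names.
    static Object getRecordKey(Map<String, Object> record, String keyField) {
        Object value = record.get(keyField); // case-sensitive map lookup
        if (value == null) {
            throw new IllegalStateException(
                "recordKey value: \"null\" for field: \"" + keyField + "\" cannot be null or empty.");
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, Object> record = new HashMap<>();
        record.put("id", 1); // the schema stores the field in lower case
        System.out.println(getRecordKey(record, "id")); // lower-case config works
        try {
            getRecordKey(record, "ID"); // primaryKey = 'ID' from tblproperties
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // same shape as the reported error
        }
    }
}
```

The exception message mirrors the `HoodieKeyException` in the report: the field exists in the table, but the upper-case spelling in the config never matches it.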

To Reproduce

  test("Test primary key case sensitive") {
    withTempDir { tmp =>
      val tableName = generateTableName
      // Create a partitioned table
      spark.sql(
        s"""
           |create table $tableName (
           |  id int,
           |  name string,
           |  price double,
           |  ts long,
           |  dt string
           |) using hudi
           | tblproperties (primaryKey = 'ID')
           | partitioned by (dt)
           | location '${tmp.getCanonicalPath}'
       """.stripMargin)
      spark.sql(
        s"""
           | insert into $tableName
           | select 1 as id, 'a1' as name, 10 as price, 1000 as ts, '2021-01-05' as dt
        """.stripMargin)
      checkAnswer(s"select id, name, price, ts, dt from $tableName")(
        Seq(1, "a1", 10.0, 1000 , "2021-01-05")
      )
    }
  }

Expected behavior

The primary key config should be case-insensitive.


Stacktrace

8350 [task-result-getter-0] WARN org.apache.spark.scheduler.TaskSetManager [] - Lost task 0.0 in stage 9.0 (TID 8) (30.221.115.93 executor driver): org.apache.hudi.exception.HoodieKeyException: recordKey value: "null" for field: "ID" cannot be null or empty.
    at org.apache.hudi.keygen.KeyGenUtils.getRecordKey(KeyGenUtils.java:205)
    at org.apache.hudi.keygen.SimpleAvroKeyGenerator.getRecordKey(SimpleAvroKeyGenerator.java:50)
    at org.apache.hudi.keygen.SimpleKeyGenerator.getRecordKey(SimpleKeyGenerator.java:64)
    at org.apache.hudi.keygen.BaseKeyGenerator.getKey(BaseKeyGenerator.java:70)
    at org.apache.spark.sql.hudi.command.SqlKeyGenerator.$anonfun$getRecordKey$1(SqlKeyGenerator.scala:79)
    at scala.Option.map(Option.scala:230)
    at org.apache.spark.sql.hudi.command.SqlKeyGenerator.getRecordKey(SqlKeyGenerator.scala:79)
    at org.apache.hudi.HoodieCreateRecordUtils$.getHoodieKeyAndMaybeLocationFromAvroRecord(HoodieCreateRecordUtils.scala:206)
    at org.apache.hudi.HoodieCreateRecordUtils$.$anonfun$createHoodieRecordRdd$5(HoodieCreateRecordUtils.scala:133)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:224)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:352)
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1508)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1418)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1482)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1305)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

danny0405 commented 1 month ago

Nice finding. What release of Hudi is this?

stream2000 commented 1 month ago

> Nice finding. What release of Hudi is this?

I tried it on the latest branch.

danny0405 commented 1 month ago

It would be great if you could file a fix for it.

stream2000 commented 1 month ago

> It would be great if you could file a fix for it.

Sorry, I'm a bit busy these days. It would be great if another contributor could take it over.

Gatsby-Lee commented 1 month ago

@danny0405 @stream2000 Why should the primary key be lower case? Shouldn't it be case-sensitive?

stream2000 commented 1 month ago

> Why should the primary key be lower case? Shouldn't it be case-sensitive?

The field names in SQL should be case-insensitive IMO.

Gatsby-Lee commented 1 month ago

> Why should the primary key be lower case? Shouldn't it be case-sensitive?
>
> The field names in SQL should be case-insensitive IMO.

Hmm, I am curious. Wouldn't it be better to make it case-sensitive and give the user the option to normalize the key? (To keep things simple, I think it's better to keep it case-sensitive.)

danny0405 commented 1 month ago

I kind of think we should follow these criteria:

  1. If case-insensitivity is enabled and the primary key is defined via SQL, the primary key should also be case-insensitive.
  2. Otherwise (e.g. when the key is defined via DataFrame write options), the primary key should be case-sensitive.

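Criterion 1 can be sketched with a hypothetical resolver (an illustration of the idea, not Hudi's actual API): match the configured key against the schema's field names, ignoring case only when case-insensitivity is enabled, and return the schema's own spelling so downstream lookups succeed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class KeyFieldResolver {
    // Hypothetical helper: resolve the configured primary key against the
    // table schema. When caseInsensitive is true (the SQL path), 'ID'
    // resolves to the schema's spelling 'id'; when false (e.g. DataFrame
    // options), the configured spelling must match exactly.
    static Optional<String> resolve(List<String> schemaFields,
                                    String configuredKey,
                                    boolean caseInsensitive) {
        for (String field : schemaFields) {
            boolean match = caseInsensitive
                ? field.equalsIgnoreCase(configuredKey)
                : field.equals(configuredKey);
            if (match) {
                return Optional.of(field); // return the schema's spelling
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("id", "name", "price", "ts", "dt");
        System.out.println(resolve(fields, "ID", true));  // Optional[id]
        System.out.println(resolve(fields, "ID", false)); // Optional.empty
    }
}
```

Returning the schema's spelling (rather than lower-casing blindly) keeps the resolved key valid even for schemas with mixed-case field names.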
ad1happy2go commented 3 weeks ago

@stream2000 @Gatsby-Lee I made this change some time back and even added test cases for it. Do you see this issue with Hudi 0.15 as well?

https://github.com/apache/hudi/pull/9020

Gatsby-Lee commented 3 weeks ago

@Gatsby-Lee I don't see this issue because the primary key I use is already normalized to lower case.

stream2000 commented 1 week ago

> I made this change some time back and even added test cases for it. Do you see this issue with Hudi 0.15 as well?

@ad1happy2go Yes, we can see this issue in 0.15 too, because the above PR only deals with the config key, not the config value (which could be in upper case).
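The key/value distinction above can be illustrated with a minimal sketch (field names taken from the repro table; not Hudi code): normalizing only the property *name* leaves the upper-case *value* `'ID'` unmatched against the schema field `id`, while resolving the value case-insensitively makes the match succeed.

```java
import java.util.Arrays;
import java.util.List;

public class ConfigValueNormalization {
    public static void main(String[] args) {
        List<String> schemaFields = Arrays.asList("id", "name", "price", "ts", "dt");
        String configuredValue = "ID"; // value of the primaryKey table property

        // Normalizing only the property name (the config key) leaves the
        // value untouched, so a case-sensitive match still fails:
        System.out.println(schemaFields.contains(configuredValue)); // false

        // Resolving the value against the schema case-insensitively works:
        String resolved = schemaFields.stream()
            .filter(f -> f.equalsIgnoreCase(configuredValue))
            .findFirst()
            .orElseThrow(() -> new IllegalStateException(
                "field not found: " + configuredValue));
        System.out.println(resolved); // id
    }
}
```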

ad1happy2go commented 1 week ago

@stream2000 Yeah, right. I understand now. Thanks.

ad1happy2go commented 1 week ago

@stream2000 Created a JIRA to track this improvement: https://issues.apache.org/jira/browse/HUDI-8172