Open stream2000 opened 1 month ago
Nice findings, what is the release of Hudi?
I tried it in latest branch
it's great if you can fire a fix for it.
Sorry, I'm a bit busy nowadays. It would be great if another contributor could take it over.
@danny0405 @stream2000 why the primary key should be lower case? Shouldn't it be case sensitive?
The field names in SQL should be case-insensitive IMO.
Hmm, I am curious. Wouldn't it be better to make it case-sensitive and give the user the option to normalize the key? (To keep things simple, I think it's better to leave it case-sensitive.)
I kind of think we should follow this criteria:
@stream2000 @Gatsby-Lee I had done this change some time back and even have test cases for the same. Do you see this issue with 0.15 hudi version also ?
@Gatsby-Lee I don't see this issue because the primary key I use is already normalized to lower case.
@ad1happy2go Yes, we can see this issue in 0.15 too, because the above PR only deals with the config key, but not with the config value (which could be in upper case).
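The fix being discussed can be sketched roughly like this — a hypothetical helper (not Hudi's actual code; class and method names are illustrative) that resolves the configured record-key *value* against the schema case-insensitively, so a configured `ID` still matches the schema field `id`:

```java
import java.util.List;
import java.util.Optional;

public class KeyFieldResolver {
    // Hypothetical sketch: match the configured record-key field name against
    // the table's schema field names case-insensitively, returning the
    // schema's own spelling of the field if found.
    public static Optional<String> resolve(String configured, List<String> schemaFields) {
        return schemaFields.stream()
                           .filter(f -> f.equalsIgnoreCase(configured))
                           .findFirst();
    }

    public static void main(String[] args) {
        // A config value of "ID" resolves to the actual schema field "id".
        System.out.println(resolve("ID", List.of("id", "name", "ts")).orElse("<missing>"));
    }
}
```

Normalizing at resolution time (rather than forcing users to write lower-case config) keeps the behavior consistent with SQL's usual case-insensitive identifier semantics.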
@stream2000 Yeah, right. I understand now. Thanks.
@stream2000 Created jira for the same to track this improvement - https://issues.apache.org/jira/browse/HUDI-8172
Tips before filing an issue
Have you gone through our FAQs?
Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced
8350 [task-result-getter-0] WARN org.apache.spark.scheduler.TaskSetManager [] - Lost task 0.0 in stage 9.0 (TID 8) (30.221.115.93 executor driver): org.apache.hudi.exception.HoodieKeyException: recordKey value: "null" for field: "ID" cannot be null or empty.
The primary key config must be in lower case now.
To Reproduce
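A minimal sketch of the kind of DDL that is likely to hit this error (table and column names are illustrative, not taken from the original report), based on the `SimpleKeyGenerator` path in the stack trace below:

```sql
-- Hypothetical reproduce sketch: configure the primary key with an
-- upper-case column name. The key generator then fails to find the
-- field and reports recordKey value "null" for field "ID".
CREATE TABLE t1 (ID INT, name STRING, ts BIGINT)
USING hudi
TBLPROPERTIES (primaryKey = 'ID', preCombineField = 'ts');

INSERT INTO t1 VALUES (1, 'a', 1000);
```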
Expected behavior
The primary key config should be case-insensitive.
Environment Description
Hudi version : latest master branch
Spark version : 3.2.0
Hive version :
Hadoop version :
Storage (HDFS/S3/GCS..) :
Running on Docker? (yes/no) :
Additional context
Stacktrace
8350 [task-result-getter-0] WARN org.apache.spark.scheduler.TaskSetManager [] - Lost task 0.0 in stage 9.0 (TID 8) (30.221.115.93 executor driver): org.apache.hudi.exception.HoodieKeyException: recordKey value: "null" for field: "ID" cannot be null or empty.
	at org.apache.hudi.keygen.KeyGenUtils.getRecordKey(KeyGenUtils.java:205)
	at org.apache.hudi.keygen.SimpleAvroKeyGenerator.getRecordKey(SimpleAvroKeyGenerator.java:50)
	at org.apache.hudi.keygen.SimpleKeyGenerator.getRecordKey(SimpleKeyGenerator.java:64)
	at org.apache.hudi.keygen.BaseKeyGenerator.getKey(BaseKeyGenerator.java:70)
	at org.apache.spark.sql.hudi.command.SqlKeyGenerator.$anonfun$getRecordKey$1(SqlKeyGenerator.scala:79)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.hudi.command.SqlKeyGenerator.getRecordKey(SqlKeyGenerator.scala:79)
	at org.apache.hudi.HoodieCreateRecordUtils$.getHoodieKeyAndMaybeLocationFromAvroRecord(HoodieCreateRecordUtils.scala:206)
	at org.apache.hudi.HoodieCreateRecordUtils$.$anonfun$createHoodieRecordRdd$5(HoodieCreateRecordUtils.scala:133)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:224)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:352)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1508)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1418)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1482)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1305)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)