apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] spark task execute too long and can not finish when ObjectSizeCalculator.getObjectSize #11879

Open KnightChess opened 2 months ago

KnightChess commented 2 months ago

Describe the problem you faced

Similar to #10504, but in a different function.

java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:502)
java.lang.UNIXProcess.waitFor(UNIXProcess.java:396)
org.apache.hudi.org.openjdk.jol.vm.sa.ServiceabilityAgentSupport.callAgent(ServiceabilityAgentSupport.java:190)
org.apache.hudi.org.openjdk.jol.vm.sa.ServiceabilityAgentSupport.callAgent(ServiceabilityAgentSupport.java:163)
org.apache.hudi.org.openjdk.jol.vm.sa.ServiceabilityAgentSupport.getUniverseData(ServiceabilityAgentSupport.java:301)
org.apache.hudi.org.openjdk.jol.vm.VM.current(VM.java:77)
org.apache.hudi.org.openjdk.jol.info.GraphWalker.walk(GraphWalker.java:97)
org.apache.hudi.org.openjdk.jol.info.GraphLayout.parseInstance(GraphLayout.java:54)
org.apache.hudi.common.util.ObjectSizeCalculator.getObjectSize(ObjectSizeCalculator.java:57)
org.apache.hudi.common.util.HoodieRecordSizeEstimator.<init>(HoodieRecordSizeEstimator.java:40)
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:107)
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:74)
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:465)
org.apache.hudi.LogFileIterator$.$anonfun$scanLog$1(Iterators.scala:329)
org.apache.hudi.LogFileIterator$$$Lambda$1054/69444513.apply(Unknown Source)
org.apache.spark.sql.hive.HadoopUgiUtils$$anon$1.run(HadoopUgiUtils.scala:54)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Subject.java:422)
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1989)
org.apache.spark.sql.hive.HadoopUgiUtils$.doAsWithHiveSuperUser(HadoopUgiUtils.scala:53)
org.apache.hudi.LogFileIterator$.scanLog(Iterators.scala:261)
org.apache.hudi.LogFileIterator.<init>(Iterators.scala:93)
org.apache.hudi.RecordMergingFileIterator.<init>(Iterators.scala:173)
org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:100)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:369)
org.apache.spark.rdd.RDD.iterator(RDD.scala:333)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:369)
org.apache.spark.rdd.RDD.iterator(RDD.scala:333)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:369)
org.apache.spark.rdd.RDD.iterator(RDD.scala:333)
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
org.apache.spark.executor.Executor$TaskRunner$$Lambda$476/220247377.apply(Unknown Source)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1463)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
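
For readers following the trace: the top frames show the task thread inside UNIXProcess.waitFor, reached from JOL's ServiceabilityAgentSupport.callAgent, which means it is waiting with no deadline for a child process to exit. The sketch below is not JOL or Hudi code. It only illustrates, assuming a Unix `sleep` command is on the PATH, how a bounded `Process.waitFor(timeout, unit)` returns control where an unbounded wait would not. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not taken from JOL or Hudi.
class BoundedWaitSketch {

    /**
     * Starts the command and waits at most timeoutMillis for it to exit.
     * Returns true if the child exited in time, false if it had to be killed
     * (or if the waiting thread was interrupted).
     */
    static boolean finishesWithin(long timeoutMillis, String... command) {
        try {
            // inheritIO() avoids blocking on a full stdout/stderr pipe.
            Process child = new ProcessBuilder(command).inheritIO().start();
            // Bounded wait: returns false instead of blocking forever.
            boolean exited = child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS);
            if (!exited) {
                child.destroyForcibly();
            }
            return exited;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // A child that outlives the deadline is killed rather than waited on.
        System.out.println(finishesWithin(200, "sleep", "5"));
        // A child that exits promptly is reported as finished.
        System.out.println(finishesWithin(5000, "sleep", "0"));
    }
}
```

A plain `child.waitFor()` in place of the bounded call would park the caller in Object.wait until the child exits, which matches the shape of the blocked frames above.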

To Reproduce

Cannot reproduce.

Expected behavior

Environment Description

KnightChess commented 2 months ago

@yihua Hit this again. Out of 6,000 read tasks in total, a single task has been running for 2 hours without finishing. Its thread is blocked here:

java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:502)
java.lang.UNIXProcess.waitFor(UNIXProcess.java:396)
org.apache.hudi.org.openjdk.jol.vm.sa.ServiceabilityAgentSupport.callAgent(ServiceabilityAgentSupport.java:190)
org.apache.hudi.org.openjdk.jol.vm.sa.ServiceabilityAgentSupport.callAgent(ServiceabilityAgentSupport.java:163)
org.apache.hudi.org.openjdk.jol.vm.sa.ServiceabilityAgentSupport.getUniverseData(ServiceabilityAgentSupport.java:301)
org.apache.hudi.org.openjdk.jol.vm.VM.current(VM.java:77)
org.apache.hudi.org.openjdk.jol.info.GraphWalker.walk(GraphWalker.java:97)
org.apache.hudi.org.openjdk.jol.info.GraphLayout.parseInstance(GraphLayout.java:54)
org.apache.hudi.common.util.ObjectSizeCalculator.getObjectSize(ObjectSizeCalculator.java:57)
org.apache.hudi.common.util.HoodieRecordSizeEstimator.<init>(HoodieRecordSizeEstimator.java:40)

danny0405 commented 2 months ago

The JOL code is buggy; we might need to refactor it out.

KnightChess commented 2 months ago

@danny0405 hi, do you have any suggestions?

yihua commented 2 months ago

Hi @KnightChess 🤝 Does this issue come from master or an existing Hudi release?

KnightChess commented 2 months ago

@yihua Hudi 0.13.1. Master still uses JOL in ObjectSizeCalculator.getObjectSize, so I think it may have this issue too.

danny0405 commented 2 months ago

> @danny0405 hi, do you have any suggestions?

Would you mind researching the alternatives we could take?
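
One possible direction, offered only as an illustration and not something proposed or adopted in this thread: run the size estimate under a deadline and fall back to a default size when it does not return. The sketch below is a minimal version of that idea. `GuardedSizeEstimator`, the `estimator` function, and the fallback value are all hypothetical names and parameters, and the estimator passed in stands in for whatever call actually measures the object.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.ToLongFunction;

// Hypothetical helper; not part of Hudi or JOL.
class GuardedSizeEstimator {

    // Daemon threads so a stuck estimate cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "size-estimator");
        t.setDaemon(true);
        return t;
    });

    /**
     * Returns estimator(obj) if it completes within timeoutMillis,
     * otherwise fallbackBytes. The estimator is an assumption supplied
     * by the caller.
     */
    static <T> long estimate(T obj, ToLongFunction<T> estimator,
                             long timeoutMillis, long fallbackBytes) {
        Callable<Long> task = () -> estimator.applyAsLong(obj);
        Future<Long> future = POOL.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker; this only helps if the blocked call is interruptible.
            future.cancel(true);
            return fallbackBytes;
        } catch (ExecutionException e) {
            return fallbackBytes;
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt();
            return fallbackBytes;
        }
    }
}
```

The trade-off is accuracy: a fallback size is only a guess, so anything sized from it would be approximate for that call. Whether that is acceptable here is an open question.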

KnightChess commented 1 month ago

@danny0405 yeah, I'll check it.

waywtdcc commented 1 month ago

How is it going?

danny0405 commented 1 month ago

@waywtdcc Issue https://github.com/apache/hudi/issues/10580 provides a temporary solution.

KnightChess commented 4 days ago

@danny0405 Hit this again and have not found a better way. I plan to roll back HUDI-4687 and stay on JDK 8.

danny0405 commented 3 days ago

Does it still work for JDKs above 1.8? The community plans to drop support for JDK 8, though.

KnightChess commented 2 days ago

@danny0405 As HUDI-4687 describes, it does not work there. I mean we will roll it back in our internal version first.