apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

Offline compaction schedule failing with Error fetching partition paths from metadata table #8984

Open koochiswathiTR opened 1 year ago

koochiswathiTR commented 1 year ago


Hi, I'm trying to schedule Hudi offline compaction.

Below is the spark-submit command:

spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.12:0.11.1,org.apache.spark:spark-avro_2.11:2.4.4 --class org.apache.hudi.utilities.HoodieCompactor /usr/lib/hudi/hudi-utilities-bundle.jar --base-path s3://a206760-novusnorm-s3-ci-use1/novusnorm/ --table-name novusnorm --spark-memory 5g --mode schedule

In our Hudi table, we didn't see any metadata files under the .hoodie folder. Please help here.
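
For reference, one way to confirm this from the command line (just a sketch, assuming the AWS CLI is configured for the bucket):

aws s3 ls s3://a206760-novusnorm-s3-ci-use1/novusnorm/.hoodie/metadata/ --recursive

In our case this returns no objects.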

2023-06-15T10:40:18.976+0000 [ERROR] [offline_compaction_schedule] [org.apache.hudi.utilities.UtilHelpers] [UtilHelpers]: Compact failed
org.apache.hudi.exception.HoodieException: Error fetching partition paths from metadata table
    at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:315)
    at org.apache.hudi.table.action.compact.HoodieCompactor.generateCompactionPlan(HoodieCompactor.java:279)
    at org.apache.hudi.table.action.compact.ScheduleCompactionActionExecutor.scheduleCompaction(ScheduleCompactionActionExecutor.java:123)
    at org.apache.hudi.table.action.compact.ScheduleCompactionActionExecutor.execute(ScheduleCompactionActionExecutor.java:93)
    at org.apache.hudi.table.HoodieSparkMergeOnReadTable.scheduleCompaction(HoodieSparkMergeOnReadTable.java:133)
    at org.apache.hudi.client.BaseHoodieWriteClient.scheduleTableServiceInternal(BaseHoodieWriteClient.java:1348)
    at org.apache.hudi.client.BaseHoodieWriteClient.scheduleTableService(BaseHoodieWriteClient.java:1325)
    at org.apache.hudi.client.BaseHoodieWriteClient.scheduleCompactionAtInstant(BaseHoodieWriteClient.java:1003)
    at org.apache.hudi.client.BaseHoodieWriteClient.scheduleCompaction(BaseHoodieWriteClient.java:994)
    at org.apache.hudi.utilities.HoodieCompactor.doSchedule(HoodieCompactor.java:281)
    at org.apache.hudi.utilities.HoodieCompactor.lambda$compact$0(HoodieCompactor.java:194)


Stacktrace

[HoodieBackedTableMetadata]: Metadata table was not found at path s3://a206760-novusnorm-s3-ci-use1/novusnorm/.hoodie/metadata 2023-06-15T10:40:18.015+0000 [WARN] [offline_compaction_schedule] [org.apache.spark.scheduler.TaskSetManager] [TaskSetManager]: Lost task 0.0 in stage 0.0 (TID 0) (ip-100-66-72-199.3175.aws-int.thomsonreuters.com executor 2): java.io.IOException: unexpected exception type at java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1750) at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1280) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2222) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83) at org.apache.spark.scheduler.Task.run(Task.scala:133) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1474) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: java.lang.reflect.InvocationTargetException at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1274) ... 40 more Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization at org.apache.hudi.metadata.FileSystemBackedTableMetadata.$deserializeLambda$(FileSystemBackedTableMetadata.java:46) ... 50 more

2023-06-15T10:40:18.950+0000 [ERROR] [offline_compaction_schedule] [org.apache.spark.scheduler.TaskSetManager] [TaskSetManager]: Task 0 in stage 0.0 failed 4 times; aborting job 2023-06-15T10:40:18.964+0000 [INFO] [offline_compaction_schedule] [io.javalin.Javalin] [Javalin]: Stopping Javalin ... 2023-06-15T10:40:18.975+0000 [INFO] [offline_compaction_schedule] [io.javalin.Javalin] [Javalin]: Javalin has stopped 2023-06-15T10:40:18.976+0000 [ERROR] [offline_compaction_schedule] [org.apache.hudi.utilities.UtilHelpers] [UtilHelpers]: Compact failed org.apache.hudi.exception.HoodieException: Error fetching partition paths from metadata table at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:315) at org.apache.hudi.table.action.compact.HoodieCompactor.generateCompactionPlan(HoodieCompactor.java:279) at org.apache.hudi.table.action.compact.ScheduleCompactionActionExecutor.scheduleCompaction(ScheduleCompactionActionExecutor.java:123) at org.apache.hudi.table.action.compact.ScheduleCompactionActionExecutor.execute(ScheduleCompactionActionExecutor.java:93) at org.apache.hudi.table.HoodieSparkMergeOnReadTable.scheduleCompaction(HoodieSparkMergeOnReadTable.java:133) at org.apache.hudi.client.BaseHoodieWriteClient.scheduleTableServiceInternal(BaseHoodieWriteClient.java:1348) at org.apache.hudi.client.BaseHoodieWriteClient.scheduleTableService(BaseHoodieWriteClient.java:1325) at org.apache.hudi.client.BaseHoodieWriteClient.scheduleCompactionAtInstant(BaseHoodieWriteClient.java:1003) at org.apache.hudi.client.BaseHoodieWriteClient.scheduleCompaction(BaseHoodieWriteClient.java:994) at org.apache.hudi.utilities.HoodieCompactor.doSchedule(HoodieCompactor.java:281) at org.apache.hudi.utilities.HoodieCompactor.lambda$compact$0(HoodieCompactor.java:194) at org.apache.hudi.utilities.UtilHelpers.retry(UtilHelpers.java:540) at org.apache.hudi.utilities.HoodieCompactor.compact(HoodieCompactor.java:190) at org.apache.hudi.utilities.HoodieCompactor.main(HoodieCompactor.java:176) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1000) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1089) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1098) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (ip-100-66-72-199.3175.aws-int.thomsonreuters.com executor 1): java.io.IOException: unexpected exception type at java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1750) at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1280) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2222) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119) 
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83) at org.apache.spark.scheduler.Task.run(Task.scala:133) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1474) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1274) ... 
40 more Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization at org.apache.hudi.metadata.FileSystemBackedTableMetadata.$deserializeLambda$(FileSystemBackedTableMetadata.java:46) ... 50 more

Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2610) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2559) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2558) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2558) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1200) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1200) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1200) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2798) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2740) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2729) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:978) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2215) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2255) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2280) at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) at org.apache.spark.rdd.RDD.collect(RDD.scala:1029) at org.apache.spark.api.java.JavaRDDLike.collect(JavaRDDLike.scala:362) at org.apache.spark.api.java.JavaRDDLike.collect$(JavaRDDLike.scala:361) at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45) at org.apache.hudi.client.common.HoodieSparkEngineContext.map(HoodieSparkEngineContext.java:103) at org.apache.hudi.metadata.FileSystemBackedTableMetadata.getAllPartitionPaths(FileSystemBackedTableMetadata.java:85) at org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:117) at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:313) ... 
25 more Caused by: java.io.IOException: unexpected exception type at java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1750) at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1280) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2222) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83) at org.apache.spark.scheduler.Task.run(Task.scala:133) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1474) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230) 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1274) ... 40 more Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization at org.apache.hudi.metadata.FileSystemBackedTableMetadata.$deserializeLambda$(FileSystemBackedTableMetadata.java:46) ... 50 more 2023-06-15T10:40:18.989+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.AbstractConnector] [AbstractConnector]: Stopped Spark@4f186450{HTTP/1.1, (http/1.1)}{0.0.0.0:8090} Command exiting with ret '0'

koochiswathiTR commented 1 year ago

Table properties, updated at 2023-04-18T10:44:15.775Z (Tue Apr 18 10:44:15 GMT 2023):

hoodie.table.timeline.timezone=LOCAL
hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
hoodie.table.precombine.field=operationTime
hoodie.table.version=4
hoodie.database.name=
hoodie.datasource.write.hive_style_partitioning=false
hoodie.table.checksum=4079573748
hoodie.partition.metafile.use.base.format=false
hoodie.archivelog.folder=archived
hoodie.table.name=novusnorm
hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
hoodie.populate.meta.fields=true
hoodie.table.type=MERGE_ON_READ
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.table.base.file.format=PARQUET
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.metadata.partitions=
hoodie.timeline.layout.version=1
hoodie.table.recordkey.fields=guid
hoodie.table.partition.fields=collectionName

As you can see, hoodie.table.metadata.partitions is empty.

@nsivabalan @ad1happy2go @soumilshah1995

ad1happy2go commented 1 year ago

@koochiswathiTR It's clearly saying: Metadata table was not found at path s3://a206760-novusnorm-s3-ci-use1/novusnorm/.hoodie/metadata.

Can you let us know how you write to this table? It looks like metadata was not enabled when this table was written. By any chance, was this table written with an older Hudi version?

koochiswathiTR commented 1 year ago

@ad1happy2go We have not enabled metadata. Without enabling metadata, can't we go for offline compaction?

ad1happy2go commented 1 year ago

@koochiswathiTR We can, but when running compaction it is somehow checking the metadata table.

Can you try explicitly disabling metadata while running compaction?
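
Something along these lines (just a sketch based on your earlier command; keep your own jars and packages, the key addition is the --hoodie-conf):

spark-submit --class org.apache.hudi.utilities.HoodieCompactor /usr/lib/hudi/hudi-utilities-bundle.jar --base-path s3://a206760-novusnorm-s3-ci-use1/novusnorm/ --table-name novusnorm --spark-memory 5g --mode schedule --hoodie-conf hoodie.metadata.enable=false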

koochiswathiTR commented 1 year ago

spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.12:0.11.1,org.apache.spark:spark-avro_2.11:2.4.4,org.apache.hudi:hudi-spark3-bundle_2.12:0.11.1 --verbose --driver-memory 1g --executor-memory 1g --class org.apache.hudi.utilities.HoodieCompactor /usr/lib/hudi/hudi-utilities-bundle.jar,/usr/lib/hudi/hudi-spark-bundle.jar --table-name novusnorm --base-path s3://a206760-novusnorm-s3-ci-use1/novusnorm --mode scheduleandexecute --spark-memory 1g --hoodie-conf hoodie.metadata.enable=false --strategy "org.apache.hudi.table.action.compact.strategy.CompactionTriggerStrategy"

When I pass --strategy "org.apache.hudi.table.action.compact.strategy.CompactionTriggerStrategy",
it says ClassNotFound.

When I remove --strategy, it says:

[screenshot]

I want to trigger compaction based on the number of commits. Please help @ad1happy2go @soumilshah1995 @nsivabalan

koochiswathiTR commented 1 year ago

[screenshot]

Does org.apache.hudi.utilities.HoodieCompactor use only org.apache.hudi.table.action.compact.strategy.LogFileSizeBasedCompactionStrategy? I want to run compaction based on the number of commits. Please help @ad1happy2go @soumilshah1995 @nsivabalan
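
For the number-of-commits requirement, the configs I am experimenting with (see the full spark-submit later in this thread; whether the offline compactor honors them is exactly what I am unsure about) are:

--hoodie-conf hoodie.compact.inline.trigger.strategy=NUM_COMMITS --hoodie-conf hoodie.compact.inline.max.delta.commits=5

And if --strategy has to be passed at all, I assume it needs a compaction strategy class that actually exists in the bundle, e.g. org.apache.hudi.table.action.compact.strategy.LogFileSizeBasedCompactionStrategy, rather than the CompactionTriggerStrategy value I used earlier (which would explain the ClassNotFound).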

soumilshah1995 commented 1 year ago

Thanks, we shall take a look at that shortly.

ad1happy2go commented 1 year ago

@koochiswathiTR Can you share the timeline with us, please?
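
For example via hudi-cli (a rough sketch; assuming hudi-cli is available on the cluster, and <base-path> is a placeholder for your table's S3 base path):

hudi-cli
connect --path <base-path>
commits show
compactions show all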

koochiswathiTR commented 1 year ago

@ad1happy2go @soumilshah1995

Compaction was triggered with the below command:

spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.12:0.11.1,org.apache.spark:spark-avro_2.11:2.4.4,org.apache.hudi:hudi-spark3-bundle_2.12:0.11.1 --verbose --driver-memory 2g --executor-memory 2g --class org.apache.hudi.utilities.HoodieCompactor /usr/lib/hudi/hudi-utilities-bundle.jar,/usr/lib/hudi/hudi-spark-bundle.jar --table-name novusdoc --base-path s3://a206760-novusdoc-s3-dev-use1/novusdoc --mode scheduleandexecute --spark-memory 2g --hoodie-conf hoodie.metadata.enable=false --hoodie-conf hoodie.compact.inline.trigger.strategy=NUM_COMMITS --hoodie-conf hoodie.compact.inline.max.delta.commits=100

But the next time I tried to run compaction, I don't see it working. As per my understanding, the compaction should pick the earliest instant time found in the timeline, but that does not seem to be happening. Please help. My Hudi timeline is shown in the screenshots below:

[screenshot]

[screenshot]
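
For completeness, the variant I am considering next (an assumption on my side that the compactor will execute a specific pending compaction when given its instant; <pending_compaction_instant> is a placeholder for an instant taken from the timeline above):

spark-submit --class org.apache.hudi.utilities.HoodieCompactor /usr/lib/hudi/hudi-utilities-bundle.jar --base-path s3://a206760-novusdoc-s3-dev-use1/novusdoc --table-name novusdoc --spark-memory 2g --mode execute --instant-time <pending_compaction_instant> --hoodie-conf hoodie.metadata.enable=false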

koochiswathiTR commented 1 year ago

[screenshot]

@soumilshah1995 @ad1happy2go

koochiswathiTR commented 1 year ago

[hadoop@ip-100-66-69-75 a206760-PowerUser2]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.12:0.11.1,org.apache.spark:spark-avro_2.11:2.4.4,org.apache.hudi:hudi-spark3-bundle_2.12:0.11.1 --verbose --driver-memory 4g --executor-memory 16g --num-executors 8 --driver-cores 10 --executor-cores 10 --class org.apache.hudi.utilities.HoodieCompactor /usr/lib/hudi/hudi-utilities-bundle.jar,/usr/lib/hudi/hudi-spark-bundle.jar --table-name novusdoc --base-path s3://a206760-novusdoc-s3-dev-use1/novusdoc --mode scheduleandexecute --spark-memory 2g --hoodie-conf hoodie.metadata.enable=false --hoodie-conf hoodie.compact.inline.trigger.strategy=NUM_COMMITS --hoodie-conf hoodie.compact.inline.max.delta.commits=5 2023-06-19T10:26:47.109+0000: [GC pause (G1 Evacuation Pause) (young), 0.0037454 secs] [Parallel Time: 1.6 ms, GC Workers: 8] [GC Worker Start (ms): Min: 418.9, Avg: 419.0, Max: 419.0, Diff: 0.1] [Ext Root Scanning (ms): Min: 0.1, Avg: 0.2, Max: 0.4, Diff: 0.3, Sum: 1.8] [Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0] [Processed Buffers: Min: 0, Avg: 0.0, Max: 0, Diff: 0, Sum: 0] [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0] [Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.3, Sum: 0.6] [Object Copy (ms): Min: 0.9, Avg: 1.0, Max: 1.1, Diff: 0.3, Sum: 8.1] [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2] [Termination Attempts: Min: 1, Avg: 6.9, Max: 12, Diff: 11, Sum: 55] [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1] [GC Worker Total (ms): Min: 1.3, Avg: 1.4, Max: 1.4, Diff: 0.1, Sum: 10.9] [GC Worker End (ms): Min: 420.3, Avg: 420.3, Max: 420.3, Diff: 0.0] [Code Root Fixup: 0.0 ms] [Code Root Purge: 0.0 ms] [Clear CT: 0.1 ms] [Other: 2.0 ms] [Choose CSet: 0.0 ms] [Ref Proc: 1.7 ms] [Ref Enq: 0.0 ms] [Redirty Cards: 0.1 ms] [Humongous Register: 0.0 ms] [Humongous Reclaim: 0.0 ms] [Free CSet: 0.0 ms] [Eden: 24576.0K(24576.0K)->0.0B(34816.0K) Survivors: 0.0B->3072.0K Heap: 24576.0K(496.0M)->4071.5K(496.0M)] [Times: user=0.01 sys=0.00, real=0.00 secs] 2023-06-19T10:26:47.455+0000: [GC pause (G1 Evacuation Pause) (young), 0.0053984 secs] [Parallel Time: 2.8 ms, GC Workers: 8] [GC Worker Start (ms): Min: 764.9, Avg: 765.1, Max: 766.4, Diff: 1.5] [Ext Root Scanning (ms): Min: 0.0, Avg: 0.3, Max: 0.9, Diff: 0.9, Sum: 2.4] [Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.1] [Processed Buffers: Min: 0, Avg: 0.1, Max: 1, Diff: 1, Sum: 1] [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0] [Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.6, Diff: 0.6, Sum: 0.7] [Object Copy (ms): Min: 0.9, Avg: 1.9, Max: 2.4, Diff: 1.5, Sum: 15.2] [Termination (ms): Min: 0.0, Avg: 0.2, Max: 0.3, Diff: 0.3, Sum: 1.5] [Termination Attempts: Min: 1, Avg: 15.1, Max: 28, Diff: 27, Sum: 121] [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1] [GC Worker Total (ms): Min: 1.2, Avg: 2.5, Max: 2.7, Diff: 1.5, Sum: 19.9] [GC Worker End (ms): Min: 767.6, Avg: 767.6, Max: 767.6, Diff: 0.0] [Code Root Fixup: 0.0 ms] [Code Root Purge: 0.0 ms] [Clear CT: 0.2 ms] [Other: 2.4 ms] [Choose CSet: 0.0 ms] [Ref Proc: 2.0 ms] [Ref Enq: 0.0 ms] [Redirty Cards: 0.1 ms] [Humongous Register: 0.0 ms] [Humongous Reclaim: 0.0 ms] [Free CSet: 0.0 ms] [Eden: 34816.0K(34816.0K)->0.0B(292.0M) Survivors: 3072.0K->5120.0K Heap: 39486.1K(496.0M)->7351.0K(496.0M)] [Times: user=0.02 sys=0.01, real=0.01 secs] Using properties file: /usr/lib/spark/conf/spark-defaults.conf Adding default 
property: spark.serializer=org.apache.spark.serializer.KryoSerializer Adding default property: spark.yarn.appMasterEnv.bigdataEnv=bigdata_environment:dev,bigdata_project:tacticalnovusingest,bigdata_environment-type:DEVELOPMENT,bigdata_region:us-east-1,bigdata_servicename:tactical-novus-ingest,bigdata_version:dev4856801 Adding default property: spark.sql.warehouse.dir=hdfs:///user/spark/warehouse Adding default property: spark.yarn.dist.files=/etc/hudi/conf/hudi-defaults.conf Adding default property: spark.sql.parquet.fs.optimized.committer.optimization-enabled=true Adding default property: spark.executorEnv.regionShortName=use1 Adding default property: spark.executor.extraJavaOptions=-Dcom.amazonaws.sdk.disableCbor=true -Duser.timezone=GMT -verbose:gc -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:MetaspaceSize=300M Adding default property: spark.history.fs.logDirectory=hdfs:///var/log/spark/apps Adding default property: spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version.emr_internal_use_only.EmrFileSystem=2 Adding default property: spark.hadoop.mapreduce.output.fs.optimized.committer.enabled=true Adding default property: spark.yarn.appMasterEnv.assetId=a206760 Adding default property: spark.sql.autoBroadcastJoinThreshold=104857600 Adding default property: spark.eventLog.enabled=true Adding default property: spark.shuffle.service.enabled=false Adding default property: spark.driver.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native Adding default property: spark.emr.default.executor.memory=18971M Adding default property: spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 Adding default property: spark.kryoserializer.buffer.max=1024m Adding default property: spark.yarn.historyServer.address=ip-100-66-69-75.3175.aws-int.thomsonreuters.com:18080 Adding default property: spark.stage.attempt.ignoreOnDecommissionFetchFailure=true Adding default property: spark.yarn.appMasterEnv.regionFullName=us-east-1 Adding default property: spark.yarn.appMasterEnv.regionShortName=use1 Adding default property: spark.storage.decommission.shuffleBlocks.enabled=true Adding default property: spark.executorEnv.regionFullName=us-east-1 Adding default property: spark.rpc.askTimeout=480 Adding default property: spark.sql.streaming.metricsEnabled=true Adding default property: spark.locality.wait=6s Adding default property: spark.driver.memory=2048M Adding default property: spark.decommission.enabled=true Adding default property: spark.files.fetchFailure.unRegisterOutputOnHost=true Adding default property: spark.executorEnv.assetId=a206760 Adding default property: spark.executor.defaultJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p' -Dfile.encoding=UTF-8 Adding default property: spark.resourceManager.cleanupExpiredHost=true Adding default property: spark.yarn.appMasterEnv.SPARK_PUBLIC_DNS=$(hostname -f) Adding default property: spark.sql.emr.internal.extensions=com.amazonaws.emr.spark.EmrSparkSessionExtensions Adding default property: spark.emr.default.executor.cores=4 Adding default property: spark.driver.extraJavaOptions=-Dcom.amazonaws.sdk.disableCbor=true -Duser.timezone=GMT -verbose:gc -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:MetaspaceSize=300M Adding default property: spark.hadoop.fs.s3.getObject.initialSocketTimeoutMilliseconds=2000 Adding default property: spark.deploy.mode=cluster Adding default property: 
spark.master=yarn Adding default property: spark.sql.parquet.output.committer.class=com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter Adding default property: spark.rpc.message.maxSize=416 Adding default property: spark.driver.defaultJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Dfile.encoding=UTF-8 Adding default property: spark.executorEnv.correlationId=offline_compaction_schedule Adding default property: spark.blacklist.decommissioning.timeout=1h Adding default property: spark.executor.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native Adding default property: fs.s3.maxRetries=1000000 Adding default property: spark.sql.hive.metastore.sharedPrefixes=com.amazonaws.services.dynamodbv2 Adding default property: spark.executor.memory=18971M Adding default property: spark.driver.extraClassPath=/usr/lib/hadoop-lzo/lib/:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/docker/usr/lib/hadoop-lzo/lib/:/docker/usr/lib/hadoop/hadoop-aws.jar:/docker/usr/share/aws/aws-java-sdk/:/docker/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/docker/usr/share/aws/emr/security/conf:/docker/usr/share/aws/emr/security/lib/:/docker/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/docker/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/docker/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/docker/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/usr/lib/aws-sdk-v2/bundle-2.17.282.jar Adding default property: spark.eventLog.dir=hdfs:///var/log/spark/apps Adding default property: spark.executorEnv.bigdataEnv=bigdata_environment:dev,bigdata_project:tacticalnovusingest,bigdata_environment-type:DEVELOPMENT,bigdata_region:us-east-1,bigdata_servicename:tactical-novus-ingest,bigdata_version:dev4856801 Adding default property: spark.dynamicAllocation.enabled=false Adding default property: spark.executor.extraClassPath=/usr/lib/hadoop-lzo/lib/:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/docker/usr/lib/hadoop-lzo/lib/:/docker/usr/lib/hadoop/hadoop-aws.jar:/docker/usr/share/aws/aws-java-sdk/:/docker/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/docker/usr/share/aws/emr/security/conf:/docker/usr/share/aws/emr/security/lib/:/docker/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/docker/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/docker/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/docker/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/usr/lib/aws-sdk-v2/bundle-2.17.282.jar Adding default property: spark.executor.cores=4 Adding default property: spark.history.ui.port=18080 
Adding default property: spark.blacklist.decommissioning.enabled=true Adding default property: spark.yarn.appMasterEnv.correlationId=offline_compaction_schedule Adding default property: spark.decommissioning.timeout.threshold=20 Adding default property: spark.yarn.heterogeneousExecutors.enabled=false Adding default property: spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored.emr_internal_use_only.EmrFileSystem=true Adding default property: spark.hadoop.yarn.timeline-service.enabled=false Adding default property: spark.yarn.executor.memoryOverheadFactor=0.1875 Warning: Ignoring non-Spark config property: fs.s3.maxRetries Parsed arguments: master yarn deployMode null executorMemory 16g executorCores 10 totalExecutorCores null propertiesFile /usr/lib/spark/conf/spark-defaults.conf driverMemory 4g driverCores 10 driverExtraClassPath /usr/lib/hadoop-lzo/lib/:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/docker/usr/lib/hadoop-lzo/lib/:/docker/usr/lib/hadoop/hadoop-aws.jar:/docker/usr/share/aws/aws-java-sdk/:/docker/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/docker/usr/share/aws/emr/security/conf:/docker/usr/share/aws/emr/security/lib/:/docker/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/docker/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/docker/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/docker/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/usr/lib/aws-sdk-v2/bundle-2.17.282.jar driverExtraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native driverExtraJavaOptions -Dcom.amazonaws.sdk.disableCbor=true -Duser.timezone=GMT -verbose:gc -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:MetaspaceSize=300M supervise false queue null numExecutors 8 files null pyFiles null archives null mainClass org.apache.hudi.utilities.HoodieCompactor primaryResource file:/usr/lib/hudi/hudi-utilities-bundle.jar,/usr/lib/hudi/hudi-spark-bundle.jar name org.apache.hudi.utilities.HoodieCompactor childArgs [--table-name novusdoc --base-path s3://a206760-novusdoc-s3-dev-use1/novusdoc --mode scheduleandexecute --spark-memory 2g --hoodie-conf hoodie.metadata.enable=false --hoodie-conf hoodie.compact.inline.trigger.strategy=NUM_COMMITS --hoodie-conf hoodie.compact.inline.max.delta.commits=5] jars null packages org.apache.hudi:hudi-utilities-bundle_2.12:0.11.1,org.apache.spark:spark-avro_2.11:2.4.4,org.apache.hudi:hudi-spark3-bundle_2.12:0.11.1 packagesExclusions null repositories null verbose true

Spark properties used, including those specified through --conf and those from the properties file /usr/lib/spark/conf/spark-defaults.conf: (spark.sql.emr.internal.extensions,com.amazonaws.emr.spark.EmrSparkSessionExtensions) (spark.executor.defaultJavaOptions,-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p' -Dfile.encoding=UTF-8) (spark.blacklist.decommissioning.timeout,1h) (spark.yarn.appMasterEnv.correlationId,offline_compaction_schedule) (spark.yarn.executor.memoryOverheadFactor,0.1875) (spark.executorEnv.correlationId,offline_compaction_schedule) (spark.executorEnv.regionShortName,use1) (spark.blacklist.decommissioning.enabled,true) (spark.executor.extraLibraryPath,/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native) (spark.executorEnv.assetId,a206760) (spark.hadoop.yarn.timeline-service.enabled,false) (spark.driver.memory,4g) (spark.executor.memory,18971M) (spark.executorEnv.bigdataEnv,bigdata_environment:dev,bigdata_project:tacticalnovusingest,bigdata_environment-type:DEVELOPMENT,bigdata_region:us-east-1,bigdata_servicename:tactical-novus-ingest,bigdata_version:dev4856801) (spark.sql.parquet.fs.optimized.committer.optimization-enabled,true) (spark.sql.warehouse.dir,hdfs:///user/spark/warehouse) (spark.driver.extraLibraryPath,/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native) (spark.yarn.historyServer.address,ip-100-66-69-75.3175.aws-int.thomsonreuters.com:18080) (spark.yarn.heterogeneousExecutors.enabled,false) (spark.rpc.message.maxSize,416) (spark.eventLog.enabled,true) (spark.storage.decommission.shuffleBlocks.enabled,true) (spark.yarn.dist.files,/etc/hudi/conf/hudi-defaults.conf) (spark.files.fetchFailure.unRegisterOutputOnHost,true) (spark.history.ui.port,18080) (spark.stage.attempt.ignoreOnDecommissionFetchFailure,true) (spark.hadoop.fs.s3.getObject.initialSocketTimeoutMilliseconds,2000) (spark.yarn.appMasterEnv.SPARK_PUBLIC_DNS,$(hostname -f)) (spark.rpc.askTimeout,480) (spark.sql.streaming.metricsEnabled,true) (spark.driver.defaultJavaOptions,-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Dfile.encoding=UTF-8) (spark.serializer,org.apache.spark.serializer.KryoSerializer) (spark.executor.extraJavaOptions,-Dcom.amazonaws.sdk.disableCbor=true -Duser.timezone=GMT -verbose:gc -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:MetaspaceSize=300M) (spark.resourceManager.cleanupExpiredHost,true) (spark.deploy.mode,cluster) (spark.history.fs.logDirectory,hdfs:///var/log/spark/apps) (spark.shuffle.service.enabled,false) (spark.yarn.appMasterEnv.regionFullName,us-east-1) (spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version,2) (spark.locality.wait,6s) (spark.emr.default.executor.cores,4) (spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version.emr_internal_use_only.EmrFileSystem,2) (spark.driver.extraJavaOptions,-Dcom.amazonaws.sdk.disableCbor=true -Duser.timezone=GMT -verbose:gc -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:MetaspaceSize=300M) (spark.kryoserializer.buffer.max,1024m) (spark.hadoop.mapreduce.output.fs.optimized.committer.enabled,true) (spark.yarn.appMasterEnv.regionShortName,use1) 
(spark.executor.extraClassPath,/usr/lib/hadoop-lzo/lib/:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/docker/usr/lib/hadoop-lzo/lib/:/docker/usr/lib/hadoop/hadoop-aws.jar:/docker/usr/share/aws/aws-java-sdk/:/docker/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/docker/usr/share/aws/emr/security/conf:/docker/usr/share/aws/emr/security/lib/:/docker/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/docker/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/docker/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/docker/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/usr/lib/aws-sdk-v2/bundle-2.17.282.jar) (spark.sql.hive.metastore.sharedPrefixes,com.amazonaws.services.dynamodbv2) (spark.eventLog.dir,hdfs:///var/log/spark/apps) (spark.executorEnv.regionFullName,us-east-1) (spark.master,yarn) (spark.emr.default.executor.memory,18971M) (spark.decommission.enabled,true) (spark.dynamicAllocation.enabled,false) (spark.yarn.appMasterEnv.assetId,a206760) (spark.sql.autoBroadcastJoinThreshold,104857600) (spark.sql.parquet.output.committer.class,com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter) (spark.yarn.appMasterEnv.bigdataEnv,bigdata_environment:dev,bigdata_project:tacticalnovusingest,bigdata_environment-type:DEVELOPMENT,bigdata_region:us-east-1,bigdata_servicename:tactical-novus-ingest,bigdata_version:dev4856801) (spark.executor.cores,4) (spark.decommissioning.timeout.threshold,20) (spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored.emr_internal_use_only.EmrFileSystem,true) (spark.driver.extraClassPath,/usr/lib/hadoop-lzo/lib/:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/docker/usr/lib/hadoop-lzo/lib/:/docker/usr/lib/hadoop/hadoop-aws.jar:/docker/usr/share/aws/aws-java-sdk/:/docker/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/docker/usr/share/aws/emr/security/conf:/docker/usr/share/aws/emr/security/lib/:/docker/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/docker/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/docker/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/docker/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/usr/lib/aws-sdk-v2/bundle-2.17.282.jar)

:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml Ivy Default Cache set to: /home/hadoop/.ivy2/cache The jars for the packages stored in: /home/hadoop/.ivy2/jars org.apache.hudi#hudi-utilities-bundle_2.12 added as a dependency org.apache.spark#spark-avro_2.11 added as a dependency org.apache.hudi#hudi-spark3-bundle_2.12 added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent-1341569f-530d-4afe-a08e-cc9ee2167f5c;1.0 confs: [default] found org.apache.hudi#hudi-utilities-bundle_2.12;0.11.1 in central found org.apache.htrace#htrace-core;3.1.0-incubating in central found org.apache.spark#spark-avro_2.11;2.4.4 in central found org.spark-project.spark#unused;1.0.0 in central found org.apache.hudi#hudi-spark3-bundle_2.12;0.11.1 in central :: resolution report :: resolve 257ms :: artifacts dl 13ms :: modules in use: org.apache.htrace#htrace-core;3.1.0-incubating from central in [default] org.apache.hudi#hudi-spark3-bundle_2.12;0.11.1 from central in [default] org.apache.hudi#hudi-utilities-bundle_2.12;0.11.1 from central in [default] org.apache.spark#spark-avro_2.11;2.4.4 from central in [default] org.spark-project.spark#unused;1.0.0 from central in [default]

    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   5   |   0   |   0   |   0   ||   5   |   0   |
    ---------------------------------------------------------------------

:: retrieving :: org.apache.spark#spark-submit-parent-1341569f-530d-4afe-a08e-cc9ee2167f5c confs: [default] 0 artifacts copied, 5 already retrieved (0kB/12ms) 2023-06-19T10:26:48.356+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.util.ShutdownHookManager] [ShutdownHookManager]: Adding shutdown hook Main class: org.apache.hudi.utilities.HoodieCompactor Arguments: --table-name novusdoc --base-path s3://a206760-novusdoc-s3-dev-use1/novusdoc --mode scheduleandexecute --spark-memory 2g --hoodie-conf hoodie.metadata.enable=false --hoodie-conf hoodie.compact.inline.trigger.strategy=NUM_COMMITS --hoodie-conf hoodie.compact.inline.max.delta.commits=5 Spark config: (spark.serializer,org.apache.spark.serializer.KryoSerializer) (spark.yarn.appMasterEnv.bigdataEnv,bigdata_environment:dev,bigdata_project:tacticalnovusingest,bigdata_environment-type:DEVELOPMENT,bigdata_region:us-east-1,bigdata_servicename:tactical-novus-ingest,bigdata_version:dev4856801) (spark.sql.warehouse.dir,hdfs:///user/spark/warehouse) (spark.yarn.dist.files,file:/etc/hudi/conf.dist/hudi-defaults.conf) (spark.sql.parquet.fs.optimized.committer.optimization-enabled,true) (spark.executorEnv.regionShortName,use1) (spark.executor.extraJavaOptions,-Dcom.amazonaws.sdk.disableCbor=true -Duser.timezone=GMT -verbose:gc -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:MetaspaceSize=300M) (spark.history.fs.logDirectory,hdfs:///var/log/spark/apps) (spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version.emr_internal_use_only.EmrFileSystem,2) (spark.hadoop.mapreduce.output.fs.optimized.committer.enabled,true) (spark.yarn.appMasterEnv.assetId,a206760) (spark.sql.autoBroadcastJoinThreshold,104857600) (spark.eventLog.enabled,true) (spark.shuffle.service.enabled,false) (spark.driver.extraLibraryPath,/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native) (spark.emr.default.executor.memory,18971M) (spark.jars,file:/usr/lib/hudi/hudi-utilities-bundle.jar,file:/usr/lib/hudi/hudi-spark3-bundle_2.12-0.11.0-amzn-0.jar) (spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version,2) (spark.kryoserializer.buffer.max,1024m) (spark.yarn.historyServer.address,ip-100-66-69-75.3175.aws-int.thomsonreuters.com:18080) (spark.stage.attempt.ignoreOnDecommissionFetchFailure,true) (spark.yarn.appMasterEnv.regionFullName,us-east-1) (spark.yarn.appMasterEnv.regionShortName,use1) (spark.app.name,org.apache.hudi.utilities.HoodieCompactor) (spark.storage.decommission.shuffleBlocks.enabled,true) (spark.executorEnv.regionFullName,us-east-1) (spark.rpc.askTimeout,480) (spark.sql.streaming.metricsEnabled,true) (spark.locality.wait,6s) (spark.driver.memory,4g) (spark.executor.instances,8) (spark.decommission.enabled,true) (spark.files.fetchFailure.unRegisterOutputOnHost,true) (spark.submit.pyFiles,) (spark.executorEnv.assetId,a206760) (spark.executor.defaultJavaOptions,-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p' -Dfile.encoding=UTF-8) (spark.resourceManager.cleanupExpiredHost,true) (spark.yarn.appMasterEnv.SPARK_PUBLIC_DNS,$(hostname -f)) (spark.sql.emr.internal.extensions,com.amazonaws.emr.spark.EmrSparkSessionExtensions) (spark.emr.default.executor.cores,4) (spark.driver.extraJavaOptions,-Dcom.amazonaws.sdk.disableCbor=true -Duser.timezone=GMT -verbose:gc -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:MetaspaceSize=300M) (spark.hadoop.fs.s3.getObject.initialSocketTimeoutMilliseconds,2000) 
(spark.submit.deployMode,client) (spark.deploy.mode,cluster) (spark.master,yarn) (spark.sql.parquet.output.committer.class,com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter) (spark.rpc.message.maxSize,416) (spark.driver.defaultJavaOptions,-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Dfile.encoding=UTF-8) (spark.executorEnv.correlationId,offline_compaction_schedule) (spark.blacklist.decommissioning.timeout,1h) (spark.executor.extraLibraryPath,/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native) (spark.sql.hive.metastore.sharedPrefixes,com.amazonaws.services.dynamodbv2) (spark.executor.memory,16g) (spark.driver.extraClassPath,/usr/lib/hadoop-lzo/lib/:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/docker/usr/lib/hadoop-lzo/lib/:/docker/usr/lib/hadoop/hadoop-aws.jar:/docker/usr/share/aws/aws-java-sdk/:/docker/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/docker/usr/share/aws/emr/security/conf:/docker/usr/share/aws/emr/security/lib/:/docker/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/docker/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/docker/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/docker/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/usr/lib/aws-sdk-v2/bundle-2.17.282.jar) (spark.eventLog.dir,hdfs:///var/log/spark/apps) (spark.executorEnv.bigdataEnv,bigdata_environment:dev,bigdata_project:tacticalnovusingest,bigdata_environment-type:DEVELOPMENT,bigdata_region:us-east-1,bigdata_servicename:tactical-novus-ingest,bigdata_version:dev4856801) (spark.dynamicAllocation.enabled,false) (spark.executor.extraClassPath,/usr/lib/hadoop-lzo/lib/:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/docker/usr/lib/hadoop-lzo/lib/:/docker/usr/lib/hadoop/hadoop-aws.jar:/docker/usr/share/aws/aws-java-sdk/:/docker/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/docker/usr/share/aws/emr/security/conf:/docker/usr/share/aws/emr/security/lib/:/docker/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/docker/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/docker/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/docker/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/usr/lib/aws-sdk-v2/bundle-2.17.282.jar) (spark.executor.cores,10) (spark.history.ui.port,18080) 
(spark.repl.local.jars,file:///home/hadoop/.ivy2/jars/org.apache.hudi_hudi-utilities-bundle_2.12-0.11.1.jar,file:///home/hadoop/.ivy2/jars/org.apache.spark_spark-avro_2.11-2.4.4.jar,file:///home/hadoop/.ivy2/jars/org.apache.hudi_hudi-spark3-bundle_2.12-0.11.1.jar,file:///home/hadoop/.ivy2/jars/org.apache.htrace_htrace-core-3.1.0-incubating.jar,file:///home/hadoop/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar) (spark.blacklist.decommissioning.enabled,true) (spark.yarn.appMasterEnv.correlationId,offline_compaction_schedule) (spark.decommissioning.timeout.threshold,20) (spark.yarn.heterogeneousExecutors.enabled,false) (spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored.emr_internal_use_only.EmrFileSystem,true) (spark.yarn.dist.jars,file:///home/hadoop/.ivy2/jars/org.apache.hudi_hudi-utilities-bundle_2.12-0.11.1.jar,file:///home/hadoop/.ivy2/jars/org.apache.spark_spark-avro_2.11-2.4.4.jar,file:///home/hadoop/.ivy2/jars/org.apache.hudi_hudi-spark3-bundle_2.12-0.11.1.jar,file:///home/hadoop/.ivy2/jars/org.apache.htrace_htrace-core-3.1.0-incubating.jar,file:///home/hadoop/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar) (spark.hadoop.yarn.timeline-service.enabled,false) (spark.yarn.executor.memoryOverheadFactor,0.1875) Classpath elements: file:/usr/lib/hudi/hudi-utilities-bundle.jar,/usr/lib/hudi/hudi-spark-bundle.jar file:///home/hadoop/.ivy2/jars/org.apache.hudi_hudi-utilities-bundle_2.12-0.11.1.jar file:///home/hadoop/.ivy2/jars/org.apache.spark_spark-avro_2.11-2.4.4.jar file:///home/hadoop/.ivy2/jars/org.apache.hudi_hudi-spark3-bundle_2.12-0.11.1.jar file:///home/hadoop/.ivy2/jars/org.apache.htrace_htrace-core-3.1.0-incubating.jar file:///home/hadoop/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar

2023-06-19T10:26:48.653+0000 [WARN] [offline_compaction_schedule] [org.apache.spark.util.DependencyUtils] [DependencyUtils]: Local jar /usr/lib/hudi/hudi-utilities-bundle.jar,/usr/lib/hudi/hudi-spark-bundle.jar does not exist, skipping. 2023-06-19T10:26:48.759+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.SparkContext] [SparkContext]: Running Spark version 3.2.1-amzn-0 2023-06-19T10:26:48.783+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.resource.ResourceUtils] [ResourceUtils]: ============================================================== 2023-06-19T10:26:48.783+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.resource.ResourceUtils] [ResourceUtils]: No custom resources configured for spark.driver. 2023-06-19T10:26:48.784+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.resource.ResourceUtils] [ResourceUtils]: ============================================================== 2023-06-19T10:26:48.784+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.SparkContext] [SparkContext]: Submitted application: compactor-novusdoc 2023-06-19T10:26:48.810+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.resource.ResourceProfile] [ResourceProfile]: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 10, script: , vendor: , memory -> name: memory, amount: 2048, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0) 2023-06-19T10:26:48.824+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.resource.ResourceProfile] [ResourceProfile]: Limiting resource is cpus at 10 tasks per executor 2023-06-19T10:26:48.826+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.resource.ResourceProfileManager] [ResourceProfileManager]: Added ResourceProfile id: 0 2023-06-19T10:26:48.884+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.SecurityManager] [SecurityManager]: Changing view acls to: hadoop 2023-06-19T10:26:48.884+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.SecurityManager] [SecurityManager]: Changing modify acls to: hadoop 2023-06-19T10:26:48.884+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.SecurityManager] [SecurityManager]: Changing view acls groups to: 2023-06-19T10:26:48.885+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.SecurityManager] [SecurityManager]: Changing modify acls groups to: 2023-06-19T10:26:48.885+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.SecurityManager] [SecurityManager]: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set() 2023-06-19T10:26:48.918+0000 [INFO] [offline_compaction_schedule] [org.apache.hadoop.conf.Configuration.deprecation] [deprecation]: mapred.output.compression.codec is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.codec 2023-06-19T10:26:48.918+0000 [INFO] [offline_compaction_schedule] [org.apache.hadoop.conf.Configuration.deprecation] [deprecation]: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type 2023-06-19T10:26:48.919+0000 [INFO] [offline_compaction_schedule] [org.apache.hadoop.conf.Configuration.deprecation] [deprecation]: mapred.output.compress is deprecated. 
Instead, use mapreduce.output.fileoutputformat.compress 2023-06-19T10:26:49.159+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.network.server.TransportServer] [TransportServer]: Shuffle server started on port: 35007 2023-06-19T10:26:49.168+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.util.Utils] [Utils]: Successfully started service 'sparkDriver' on port 35007. 2023-06-19T10:26:49.177+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.SparkEnv] [SparkEnv]: Using serializer: class org.apache.spark.serializer.KryoSerializer 2023-06-19T10:26:49.196+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.SparkEnv] [SparkEnv]: Registering MapOutputTracker 2023-06-19T10:26:49.197+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.MapOutputTrackerMasterEndpoint] [MapOutputTrackerMasterEndpoint]: init 2023-06-19T10:26:49.235+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.SparkEnv] [SparkEnv]: Registering BlockManagerMaster 2023-06-19T10:26:49.300+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.SparkEnv] [SparkEnv]: Registering BlockManagerMasterHeartbeat 2023-06-19T10:26:49.400+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.SparkEnv] [SparkEnv]: Registering OutputCommitCoordinator 2023-06-19T10:26:49.404+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.subresultcache.SubResultCacheManager] [SubResultCacheManager]: Sub-result caches config to enable false. 2023-06-19T10:26:49.404+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.subresultcache.SubResultCacheManager] [SubResultCacheManager]: Sub-result caches are disabled. 2023-06-19T10:26:49.423+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.SecurityManager] [SecurityManager]: Created SSL options for ui: SSLOptions{enabled=false, port=None, keyStore=None, keyStorePassword=None, trustStore=None, trustStorePassword=None, protocol=None, enabledAlgorithms=Set()} 2023-06-19T10:26:49.504+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.util.log] [log]: Logging initialized @2813ms to org.sparkproject.jetty.util.log.Slf4jLog 2023-06-19T10:26:49.581+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.Server] [Server]: jetty-9.4.43.v20210629; built: 2021-06-30T11:07:22.254Z; git: 526006ecfa3af7f1a27ef3a288e2bef7ea9dd7e8; jvm 1.8.0_372-b07 2023-06-19T10:26:49.606+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.Server] [Server]: Started @2915ms 2023-06-19T10:26:49.608+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.ui.JettyUtils] [JettyUtils]: Using requestHeaderSize: 8192 2023-06-19T10:26:49.645+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.AbstractConnector] [AbstractConnector]: Started ServerConnector@34dc85a{HTTP/1.1, (http/1.1)}{0.0.0.0:8090} 2023-06-19T10:26:49.646+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.util.Utils] [Utils]: Successfully started service 'SparkUI' on port 8090. 
2023-06-19T10:26:49.671+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@b8a7e43{/jobs,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.674+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@719843e5{/jobs/json,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.675+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@58112bc4{/jobs/job,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.676+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@2f5c1332{/jobs/job/json,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.677+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@7cab1508{/stages,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.678+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@258ee7de{/stages/json,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.679+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@6d171ce0{/stages/stage,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.680+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@6e1d4137{/stages/stage/json,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.681+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@29a4f594{/stages/pool,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.682+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@5327a06e{/stages/pool/json,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.683+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@287f7811{/storage,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.684+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@2b556bb2{/storage/json,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.684+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@17271176{/storage/rdd,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.685+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@2e34384c{/storage/rdd/json,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.686+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@1f52eb6f{/environment,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.687+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@58294867{/environment/json,null,AVAILABLE,@Spark} 
2023-06-19T10:26:49.688+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@6fc3e1a4{/executors,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.689+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@2d5f7182{/executors/json,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.690+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@29ea78b1{/executors/threadDump,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.691+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@7baf6acf{/executors/threadDump/json,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.701+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@7b3315a5{/static,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.702+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@629ae7e{/,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.703+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@de88ac6{/api,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.704+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@42fcc7e6{/jobs/job/kill,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.705+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@5da7cee2{/stages/stage/kill,null,AVAILABLE,@Spark} 2023-06-19T10:26:49.707+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.ui.SparkUI] [SparkUI]: Bound SparkUI to 0.0.0.0, and started at http://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:8090 2023-06-19T10:26:49.729+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.SparkContext] [SparkContext]: Added JAR file:/usr/lib/hudi/hudi-utilities-bundle.jar at spark://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:35007/jars/hudi-utilities-bundle.jar with timestamp 1687170408750 2023-06-19T10:26:49.730+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.SparkContext] [SparkContext]: Added JAR file:/usr/lib/hudi/hudi-spark3-bundle_2.12-0.11.0-amzn-0.jar at spark://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:35007/jars/hudi-spark3-bundle_2.12-0.11.0-amzn-0.jar with timestamp 1687170408750 2023-06-19T10:26:49.849+0000: [GC pause (G1 Evacuation Pause) (young), 0.0244707 secs] [Parallel Time: 11.2 ms, GC Workers: 8] [GC Worker Start (ms): Min: 3159.6, Avg: 3159.7, Max: 3159.7, Diff: 0.1] [Ext Root Scanning (ms): Min: 0.7, Avg: 1.5, Max: 4.4, Diff: 3.7, Sum: 11.7] [Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.2, Diff: 0.2, Sum: 0.3] [Processed Buffers: Min: 0, Avg: 1.0, Max: 2, Diff: 2, Sum: 8] [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.3] [Code Root Scanning (ms): Min: 0.0, Avg: 0.5, Max: 1.3, Diff: 1.3, Sum: 4.3] [Object Copy (ms): Min: 6.6, Avg: 8.9, Max: 9.7, Diff: 3.1, Sum: 71.0] [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.4] [Termination Attempts: Min: 1, 
Avg: 128.1, Max: 158, Diff: 157, Sum: 1025] [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.3] [GC Worker Total (ms): Min: 11.0, Avg: 11.0, Max: 11.1, Diff: 0.1, Sum: 88.3] [GC Worker End (ms): Min: 3170.7, Avg: 3170.7, Max: 3170.7, Diff: 0.0] [Code Root Fixup: 0.1 ms] [Code Root Purge: 0.0 ms] [Clear CT: 0.2 ms] [Other: 13.0 ms] [Choose CSet: 0.0 ms] [Ref Proc: 12.3 ms] [Ref Enq: 0.1 ms] [Redirty Cards: 0.1 ms] [Humongous Register: 0.0 ms] [Humongous Reclaim: 0.0 ms] [Free CSet: 0.3 ms] [Eden: 292.0M(292.0M)->0.0B(262.0M) Survivors: 5120.0K->35840.0K Heap: 299.2M(496.0M)->37864.7K(496.0M)] [Times: user=0.09 sys=0.01, real=0.02 secs] 2023-06-19T10:26:49.974+0000 [INFO] [offline_compaction_schedule] [org.apache.hadoop.yarn.client.RMProxy] [RMProxy]: Connecting to ResourceManager at ip-100-66-69-75.3175.aws-int.thomsonreuters.com/100.66.69.75:8032 2023-06-19T10:26:50.132+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Requesting a new application from cluster with 2 NodeManagers 2023-06-19T10:26:50.432+0000 [INFO] [offline_compaction_schedule] [org.apache.hadoop.conf.Configuration] [Configuration]: resource-types.xml not found 2023-06-19T10:26:50.432+0000 [INFO] [offline_compaction_schedule] [org.apache.hadoop.yarn.util.resource.ResourceUtils] [ResourceUtils]: Unable to find 'resource-types.xml'. 2023-06-19T10:26:50.445+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Verifying our application has not requested more than the maximum memory capability of the cluster (122880 MB per container) 2023-06-19T10:26:50.445+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Will allocate AM container, with 896 MB memory including 384 MB overhead 2023-06-19T10:26:50.445+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Setting up container launch context for our AM 2023-06-19T10:26:50.446+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Setting up the launch environment for our AM container 2023-06-19T10:26:50.452+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Preparing resources for our AM container 2023-06-19T10:26:50.478+0000 [WARN] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. 
2023-06-19T10:26:54.119+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Uploading resource file:/mnt/tmp/spark-94366315-0ad4-4f1a-8051-1c517b83f435/spark_libs4987513252404456461.zip -> hdfs://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:8020/user/hadoop/.sparkStaging/application_1687146322573_0047/spark_libs4987513252404456461.zip 2023-06-19T10:26:54.546+0000: [GC pause (G1 Evacuation Pause) (young), 0.0166820 secs] [Parallel Time: 11.6 ms, GC Workers: 8] [GC Worker Start (ms): Min: 7856.4, Avg: 7856.7, Max: 7857.8, Diff: 1.4] [Ext Root Scanning (ms): Min: 0.0, Avg: 1.1, Max: 4.5, Diff: 4.5, Sum: 8.5] [Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.2, Diff: 0.2, Sum: 0.3] [Processed Buffers: Min: 0, Avg: 0.6, Max: 3, Diff: 3, Sum: 5] [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.3] [Code Root Scanning (ms): Min: 0.0, Avg: 0.7, Max: 1.6, Diff: 1.6, Sum: 5.3] [Object Copy (ms): Min: 7.0, Avg: 9.3, Max: 10.5, Diff: 3.5, Sum: 74.6] [Termination (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.5] [Termination Attempts: Min: 1, Avg: 154.9, Max: 198, Diff: 197, Sum: 1239] [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.3] [GC Worker Total (ms): Min: 10.1, Avg: 11.2, Max: 11.5, Diff: 1.4, Sum: 89.9] [GC Worker End (ms): Min: 7867.9, Avg: 7867.9, Max: 7867.9, Diff: 0.1] [Code Root Fixup: 0.2 ms] [Code Root Purge: 0.0 ms] [Clear CT: 0.2 ms] [Other: 4.7 ms] [Choose CSet: 0.0 ms] [Ref Proc: 4.1 ms] [Ref Enq: 0.0 ms] [Redirty Cards: 0.1 ms] [Humongous Register: 0.0 ms] [Humongous Reclaim: 0.0 ms] [Free CSet: 0.3 ms] [Eden: 262.0M(262.0M)->0.0B(262.0M) Survivors: 35840.0K->35840.0K Heap: 299.0M(496.0M)->37559.0K(496.0M)] [Times: user=0.09 sys=0.01, real=0.02 secs] 2023-06-19T10:26:55.069+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.hudi_hudi-utilities-bundle_2.12-0.11.1.jar -> hdfs://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:8020/user/hadoop/.sparkStaging/application_1687146322573_0047/org.apache.hudi_hudi-utilities-bundle_2.12-0.11.1.jar 2023-06-19T10:26:55.222+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.spark_spark-avro_2.11-2.4.4.jar -> hdfs://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:8020/user/hadoop/.sparkStaging/application_1687146322573_0047/org.apache.spark_spark-avro_2.11-2.4.4.jar 2023-06-19T10:26:55.238+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.hudi_hudi-spark3-bundle_2.12-0.11.1.jar -> hdfs://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:8020/user/hadoop/.sparkStaging/application_1687146322573_0047/org.apache.hudi_hudi-spark3-bundle_2.12-0.11.1.jar 2023-06-19T10:26:55.239+0000: [GC pause (G1 Evacuation Pause) (young), 0.0122827 secs] [Parallel Time: 11.0 ms, GC Workers: 8] [GC Worker Start (ms): Min: 8548.8, Avg: 8548.9, Max: 8548.9, Diff: 0.1] [Ext Root Scanning (ms): Min: 0.3, Avg: 0.8, Max: 3.8, Diff: 3.5, Sum: 6.3] [Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.2, Diff: 0.2, Sum: 0.3] [Processed Buffers: Min: 0, Avg: 0.4, Max: 1, Diff: 1, Sum: 3] [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.4] [Code Root Scanning (ms): Min: 0.0, Avg: 0.6, Max: 1.2, Diff: 1.2, Sum: 4.8] [Object Copy (ms): Min: 7.0, Avg: 9.3, Max: 10.3, Diff: 3.3, Sum: 74.2] [Termination (ms): Min: 
0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.4] [Termination Attempts: Min: 1, Avg: 137.4, Max: 175, Diff: 174, Sum: 1099] [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.4] [GC Worker Total (ms): Min: 10.8, Avg: 10.9, Max: 10.9, Diff: 0.1, Sum: 86.9] [GC Worker End (ms): Min: 8559.7, Avg: 8559.7, Max: 8559.8, Diff: 0.1] [Code Root Fixup: 0.1 ms] [Code Root Purge: 0.0 ms] [Clear CT: 0.2 ms] [Other: 1.0 ms] [Choose CSet: 0.0 ms] [Ref Proc: 0.5 ms] [Ref Enq: 0.0 ms] [Redirty Cards: 0.1 ms] [Humongous Register: 0.0 ms] [Humongous Reclaim: 0.0 ms] [Free CSet: 0.2 ms] [Eden: 262.0M(262.0M)->0.0B(280.0M) Survivors: 35840.0K->17408.0K Heap: 298.7M(496.0M)->19127.0K(496.0M)] [Times: user=0.09 sys=0.00, real=0.01 secs] 2023-06-19T10:26:55.407+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.htrace_htrace-core-3.1.0-incubating.jar -> hdfs://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:8020/user/hadoop/.sparkStaging/application_1687146322573_0047/org.apache.htrace_htrace-core-3.1.0-incubating.jar 2023-06-19T10:26:55.426+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Uploading resource file:/home/hadoop/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar -> hdfs://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:8020/user/hadoop/.sparkStaging/application_1687146322573_0047/org.spark-project.spark_unused-1.0.0.jar 2023-06-19T10:26:55.438+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Uploading resource file:/etc/hudi/conf.dist/hudi-defaults.conf -> hdfs://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:8020/user/hadoop/.sparkStaging/application_1687146322573_0047/hudi-defaults.conf 2023-06-19T10:26:55.858+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Creating an archive with the config files for distribution at /mnt/tmp/spark-94366315-0ad4-4f1a-8051-1c517b83f435/spark_conf7322044392243776097.zip. 
2023-06-19T10:26:55.946+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Uploading resource file:/mnt/tmp/spark-94366315-0ad4-4f1a-8051-1c517b83f435/spark_conf7322044392243776097.zip -> hdfs://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:8020/user/hadoop/.sparkStaging/application_1687146322573_0047/spark_conf.zip 2023-06-19T10:26:56.009+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: =============================================================================== 2023-06-19T10:26:56.009+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: YARN AM launch context: 2023-06-19T10:26:56.010+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: user class: N/A 2023-06-19T10:26:56.010+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: env: 2023-06-19T10:26:56.011+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: regionShortName -> use1 2023-06-19T10:26:56.011+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: CLASSPATH -> /usr/lib/hadoop-lzo/lib/:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/docker/usr/lib/hadoop-lzo/lib/:/docker/usr/lib/hadoop/hadoop-aws.jar:/docker/usr/share/aws/aws-java-sdk/:/docker/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/docker/usr/share/aws/emr/security/conf:/docker/usr/share/aws/emr/security/lib/:/docker/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/docker/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/docker/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/docker/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/usr/lib/aws-sdk-v2/bundle-2.17.282.jar{{PWD}}{{PWD}}/spark_conf{{PWD}}/spark_libs/*{{PWD}}/spark_conf/hadoop_conf 2023-06-19T10:26:56.011+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: correlationId -> offline_compaction_schedule 2023-06-19T10:26:56.011+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: SPARK_YARN_STAGING_DIR -> hdfs://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:8020/user/hadoop/.sparkStaging/application_1687146322573_0047 2023-06-19T10:26:56.011+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: SPARK_USER -> hadoop 2023-06-19T10:26:56.011+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: regionFullName -> us-east-1 2023-06-19T10:26:56.011+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: bigdataEnv -> bigdata_environment:dev,bigdata_project:tacticalnovusingest,bigdata_environment-type:DEVELOPMENT,bigdata_region:us-east-1,bigdata_servicename:tactical-novus-ingest,bigdata_version:dev4856801 2023-06-19T10:26:56.011+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: assetId -> a206760 2023-06-19T10:26:56.011+0000 [DEBUG] [offline_compaction_schedule] 
[org.apache.spark.deploy.yarn.Client] [Client]: SPARK_PUBLIC_DNS -> $(hostname -f) 2023-06-19T10:26:56.012+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: resources: 2023-06-19T10:26:56.063+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: org.apache.hudi_hudi-utilities-bundle_2.12-0.11.1.jar -> resource { scheme: "hdfs" host: "ip-100-66-69-75.3175.aws-int.thomsonreuters.com" port: 8020 file: "/user/hadoop/.sparkStaging/application_1687146322573_0047/org.apache.hudi_hudi-utilities-bundle_2.12-0.11.1.jar" } size: 62863152 timestamp: 1687170415216 type: FILE visibility: PRIVATE 2023-06-19T10:26:56.064+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: org.apache.hudi_hudi-spark3-bundle_2.12-0.11.1.jar -> resource { scheme: "hdfs" host: "ip-100-66-69-75.3175.aws-int.thomsonreuters.com" port: 8020 file: "/user/hadoop/.sparkStaging/application_1687146322573_0047/org.apache.hudi_hudi-spark3-bundle_2.12-0.11.1.jar" } size: 61591563 timestamp: 1687170415401 type: FILE visibility: PRIVATE 2023-06-19T10:26:56.064+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: hudi-defaults.conf -> resource { scheme: "hdfs" host: "ip-100-66-69-75.3175.aws-int.thomsonreuters.com" port: 8020 file: "/user/hadoop/.sparkStaging/application_1687146322573_0047/hudi-defaults.conf" } size: 1410 timestamp: 1687170415845 type: FILE visibility: PRIVATE 2023-06-19T10:26:56.064+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: spark_libs -> resource { scheme: "hdfs" host: "ip-100-66-69-75.3175.aws-int.thomsonreuters.com" port: 8020 file: "/user/hadoop/.sparkStaging/application_1687146322573_0047/spark_libs4987513252404456461.zip" } size: 313860902 timestamp: 1687170415000 type: ARCHIVE visibility: PRIVATE 2023-06-19T10:26:56.064+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: spark_conf -> resource { scheme: "hdfs" host: "ip-100-66-69-75.3175.aws-int.thomsonreuters.com" port: 8020 file: "/user/hadoop/.sparkStaging/application_1687146322573_0047/spark_conf.zip" } size: 304187 timestamp: 1687170415994 type: ARCHIVE visibility: PRIVATE 2023-06-19T10:26:56.065+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: org.apache.spark_spark-avro_2.11-2.4.4.jar -> resource { scheme: "hdfs" host: "ip-100-66-69-75.3175.aws-int.thomsonreuters.com" port: 8020 file: "/user/hadoop/.sparkStaging/application_1687146322573_0047/org.apache.spark_spark-avro_2.11-2.4.4.jar" } size: 187318 timestamp: 1687170415232 type: FILE visibility: PRIVATE 2023-06-19T10:26:56.065+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: org.apache.htrace_htrace-core-3.1.0-incubating.jar -> resource { scheme: "hdfs" host: "ip-100-66-69-75.3175.aws-int.thomsonreuters.com" port: 8020 file: "/user/hadoop/.sparkStaging/application_1687146322573_0047/org.apache.htrace_htrace-core-3.1.0-incubating.jar" } size: 1475955 timestamp: 1687170415420 type: FILE visibility: PRIVATE 2023-06-19T10:26:56.065+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: org.spark-project.spark_unused-1.0.0.jar -> resource { scheme: "hdfs" host: "ip-100-66-69-75.3175.aws-int.thomsonreuters.com" port: 8020 file: "/user/hadoop/.sparkStaging/application_1687146322573_0047/org.spark-project.spark_unused-1.0.0.jar" } size: 2777 timestamp: 
1687170415433 type: FILE visibility: PRIVATE 2023-06-19T10:26:56.065+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: command: 2023-06-19T10:26:56.066+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: {{JAVA_HOME}}/bin/java -server -Xmx512m -Djava.io.tmpdir={{PWD}}/tmp -Dspark.yarn.app.container.log.dir= org.apache.spark.deploy.yarn.ExecutorLauncher --arg 'ip-100-66-69-75.3175.aws-int.thomsonreuters.com:35007' --properties-file {{PWD}}/spark_conf/spark_conf.properties --dist-cache-conf {{PWD}}/spark_conf/spark_dist_cache.properties 1> /stdout 2> /stderr 2023-06-19T10:26:56.066+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: =============================================================================== 2023-06-19T10:26:56.067+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.SecurityManager] [SecurityManager]: Changing view acls to: hadoop 2023-06-19T10:26:56.067+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.SecurityManager] [SecurityManager]: Changing modify acls to: hadoop 2023-06-19T10:26:56.067+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.SecurityManager] [SecurityManager]: Changing view acls groups to: 2023-06-19T10:26:56.067+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.SecurityManager] [SecurityManager]: Changing modify acls groups to: 2023-06-19T10:26:56.067+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.SecurityManager] [SecurityManager]: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set() 2023-06-19T10:26:56.090+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: AM resources: Map() 2023-06-19T10:26:56.091+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: spark.yarn.maxAppAttempts is not set. Cluster's default value will be used. 
2023-06-19T10:26:56.092+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Created resource capability for AM request: <memory:896, max memory:9223372036854775807, vCores:1, max vCores:2147483647> 2023-06-19T10:26:56.093+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Submitting application application_1687146322573_0047 to ResourceManager 2023-06-19T10:26:56.124+0000 [INFO] [offline_compaction_schedule] [org.apache.hadoop.yarn.client.api.impl.YarnClientImpl] [YarnClientImpl]: Submitted application application_1687146322573_0047 2023-06-19T10:26:57.127+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Application report for application_1687146322573_0047 (state: ACCEPTED) 2023-06-19T10:26:57.130+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: client token: N/A diagnostics: AM container is launched, waiting for AM container to Register with RM ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1687170416103 final status: UNDEFINED tracking URL: http://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:20888/proxy/application_1687146322573_0047/ user: hadoop 2023-06-19T10:26:58.131+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Application report for application_1687146322573_0047 (state: ACCEPTED) 2023-06-19T10:26:58.131+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: client token: N/A diagnostics: AM container is launched, waiting for AM container to Register with RM ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1687170416103 final status: UNDEFINED tracking URL: http://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:20888/proxy/application_1687146322573_0047/ user: hadoop 2023-06-19T10:26:59.132+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Application report for application_1687146322573_0047 (state: ACCEPTED) 2023-06-19T10:26:59.133+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: client token: N/A diagnostics: AM container is launched, waiting for AM container to Register with RM ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1687170416103 final status: UNDEFINED tracking URL: http://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:20888/proxy/application_1687146322573_0047/ user: hadoop 2023-06-19T10:27:00.134+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Application report for application_1687146322573_0047 (state: ACCEPTED) 2023-06-19T10:27:00.134+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: client token: N/A diagnostics: AM container is launched, waiting for AM container to Register with RM ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1687170416103 final status: UNDEFINED tracking URL: http://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:20888/proxy/application_1687146322573_0047/ user: hadoop 2023-06-19T10:27:01.135+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Application report for application_1687146322573_0047 (state: ACCEPTED) 2023-06-19T10:27:01.136+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: client token: N/A 
diagnostics: AM container is launched, waiting for AM container to Register with RM ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1687170416103 final status: UNDEFINED tracking URL: http://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:20888/proxy/application_1687146322573_0047/ user: hadoop 2023-06-19T10:27:02.137+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Application report for application_1687146322573_0047 (state: ACCEPTED) 2023-06-19T10:27:02.137+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: client token: N/A diagnostics: AM container is launched, waiting for AM container to Register with RM ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1687170416103 final status: UNDEFINED tracking URL: http://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:20888/proxy/application_1687146322573_0047/ user: hadoop 2023-06-19T10:27:03.139+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Application report for application_1687146322573_0047 (state: ACCEPTED) 2023-06-19T10:27:03.139+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: client token: N/A diagnostics: AM container is launched, waiting for AM container to Register with RM ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1687170416103 final status: UNDEFINED tracking URL: http://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:20888/proxy/application_1687146322573_0047/ user: hadoop 2023-06-19T10:27:04.140+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Application report for application_1687146322573_0047 (state: ACCEPTED) 2023-06-19T10:27:04.140+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: client token: N/A diagnostics: AM container is launched, waiting for AM container to Register with RM ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1687170416103 final status: UNDEFINED tracking URL: http://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:20888/proxy/application_1687146322573_0047/ user: hadoop 2023-06-19T10:27:05.142+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Application report for application_1687146322573_0047 (state: ACCEPTED) 2023-06-19T10:27:05.142+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: client token: N/A diagnostics: AM container is launched, waiting for AM container to Register with RM ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1687170416103 final status: UNDEFINED tracking URL: http://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:20888/proxy/application_1687146322573_0047/ user: hadoop 2023-06-19T10:27:05.989+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.network.server.TransportServer] [TransportServer]: New connection accepted for remote address /100.66.95.167:57800. 
2023-06-19T10:27:06.143+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: Application report for application_1687146322573_0047 (state: RUNNING) 2023-06-19T10:27:06.143+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.deploy.yarn.Client] [Client]: client token: N/A diagnostics: N/A ApplicationMaster host: 100.66.95.167 ApplicationMaster RPC port: -1 queue: default start time: 1687170416103 final status: UNDEFINED tracking URL: http://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:20888/proxy/application_1687146322573_0047/ user: hadoop 2023-06-19T10:27:06.152+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.network.server.TransportServer] [TransportServer]: Shuffle server started on port: 32849 2023-06-19T10:27:06.152+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.util.Utils] [Utils]: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 32849. 2023-06-19T10:27:06.152+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.network.netty.NettyBlockTransferService] [NettyBlockTransferService]: Server created on ip-100-66-69-75.3175.aws-int.thomsonreuters.com:32849 2023-06-19T10:27:06.301+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.ui.ServerInfo] [ServerInfo]: Adding filter to /metrics/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter 2023-06-19T10:27:06.303+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.handler.ContextHandler] [ContextHandler]: Started o.s.j.s.ServletContextHandler@5c134052{/metrics/json,null,AVAILABLE,@Spark} 2023-06-19T10:27:06.323+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.deploy.history.SingleEventLogFileWriter] [SingleEventLogFileWriter]: Logging events to hdfs:/var/log/spark/apps/application_1687146322573_0047.inprogress 2023-06-19T10:27:06.519+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.SparkContext] [SparkContext]: Adding shutdown hook 2023-06-19T10:27:06.553+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.common.table.HoodieTableMetaClient] [HoodieTableMetaClient]: Loading HoodieTableMetaClient from s3://a206760-novusdoc-s3-dev-use1/novusdoc 2023-06-19T10:27:06.772+0000: [GC pause (G1 Evacuation Pause) (young), 0.0194145 secs] [Parallel Time: 13.6 ms, GC Workers: 8] [GC Worker Start (ms): Min: 20082.0, Avg: 20082.1, Max: 20082.2, Diff: 0.1] [Ext Root Scanning (ms): Min: 0.5, Avg: 1.1, Max: 5.2, Diff: 4.7, Sum: 9.0] [Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.2, Diff: 0.2, Sum: 0.3] [Processed Buffers: Min: 0, Avg: 0.4, Max: 1, Diff: 1, Sum: 3] [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.3] [Code Root Scanning (ms): Min: 0.0, Avg: 0.7, Max: 1.5, Diff: 1.5, Sum: 5.3] [Object Copy (ms): Min: 8.3, Avg: 11.5, Max: 12.8, Diff: 4.4, Sum: 91.9] [Termination (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.8] [Termination Attempts: Min: 1, Avg: 243.8, Max: 313, Diff: 312, Sum: 1950] [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.4] [GC Worker Total (ms): Min: 13.4, Avg: 13.5, Max: 13.6, Diff: 0.2, Sum: 108.1] [GC Worker End (ms): Min: 20095.6, Avg: 20095.6, Max: 20095.6, Diff: 0.1] [Code Root Fixup: 0.2 ms] [Code Root Purge: 0.0 ms] [Clear CT: 0.2 ms] [Other: 5.4 ms] [Choose CSet: 0.0 ms] [Ref Proc: 4.8 ms] [Ref Enq: 0.0 ms] [Redirty Cards: 0.1 ms] [Humongous Register: 0.0 ms] [Humongous Reclaim: 0.0 ms] [Free CSet: 0.3 ms] [Eden: 280.0M(280.0M)->0.0B(262.0M) Survivors: 17408.0K->35840.0K Heap: 
298.7M(496.0M)->37559.0K(496.0M)] [Times: user=0.10 sys=0.00, real=0.02 secs] 2023-06-19T10:27:07.272+0000 [INFO] [offline_compaction_schedule] [com.amazon.ws.emr.hadoop.fs.util.ClientConfigurationFactory] [ClientConfigurationFactory]: Set initial getObject socket timeout to 2000 ms. 2023-06-19T10:27:07.546+0000: [GC pause (G1 Evacuation Pause) (young), 0.0190510 secs] [Parallel Time: 15.1 ms, GC Workers: 8] [GC Worker Start (ms): Min: 20856.5, Avg: 20856.5, Max: 20856.6, Diff: 0.1] [Ext Root Scanning (ms): Min: 0.3, Avg: 1.1, Max: 5.7, Diff: 5.4, Sum: 9.1] [Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.2, Diff: 0.2, Sum: 0.3] [Processed Buffers: Min: 0, Avg: 0.4, Max: 1, Diff: 1, Sum: 3] [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.3] [Code Root Scanning (ms): Min: 0.0, Avg: 0.9, Max: 2.1, Diff: 2.1, Sum: 7.6] [Object Copy (ms): Min: 9.2, Avg: 12.7, Max: 14.2, Diff: 5.0, Sum: 101.5] [Termination (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.6] [Termination Attempts: Min: 1, Avg: 191.4, Max: 236, Diff: 235, Sum: 1531] [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.4] [GC Worker Total (ms): Min: 14.9, Avg: 15.0, Max: 15.0, Diff: 0.1, Sum: 119.7] [GC Worker End (ms): Min: 20871.5, Avg: 20871.5, Max: 20871.5, Diff: 0.1] [Code Root Fixup: 0.2 ms] [Code Root Purge: 0.0 ms] [Clear CT: 0.2 ms] [Other: 3.6 ms] [Choose CSet: 0.0 ms] [Ref Proc: 3.0 ms] [Ref Enq: 0.0 ms] [Redirty Cards: 0.1 ms] [Humongous Register: 0.0 ms] [Humongous Reclaim: 0.0 ms] [Free CSet: 0.3 ms] [Eden: 262.0M(262.0M)->0.0B(266.0M) Survivors: 35840.0K->31744.0K Heap: 298.7M(496.0M)->33975.0K(496.0M)] [Times: user=0.12 sys=0.00, real=0.02 secs] 2023-06-19T10:27:08.214+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.common.table.HoodieTableConfig] [HoodieTableConfig]: Loading table properties from s3://a206760-novusdoc-s3-dev-use1/novusdoc/.hoodie/hoodie.properties 2023-06-19T10:27:08.231+0000 [INFO] [offline_compaction_schedule] [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem] [S3NativeFileSystem]: Opening 's3://a206760-novusdoc-s3-dev-use1/novusdoc/.hoodie/hoodie.properties' for reading 2023-06-19T10:27:08.367+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.common.table.HoodieTableMetaClient] [HoodieTableMetaClient]: Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=PARQUET) from s3://a206760-novusdoc-s3-dev-use1/novusdoc 2023-06-19T10:27:08.367+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.common.table.HoodieTableMetaClient] [HoodieTableMetaClient]: Loading Active commit timeline for s3://a206760-novusdoc-s3-dev-use1/novusdoc 2023-06-19T10:27:08.460+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.common.table.timeline.HoodieActiveTimeline] [HoodieActiveTimeline]: Loaded instants upto : Option{val=[20230619102516597deltacommitCOMPLETED]} 2023-06-19T10:27:08.473+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.utilities.HoodieCompactor] [HoodieCompactor]: HoodieCompactorConfig { --base-path s3://a206760-novusdoc-s3-dev-use1/novusdoc, --table-name novusdoc, --instant-time null, --parallelism 200, --schema-file null, --spark-master null, --spark-memory 2g, --retry 0, --schedule false, --mode scheduleandexecute, --strategy org.apache.hudi.table.action.compact.strategy.LogFileSizeBasedCompactionStrategy, --props null, --hoodie-conf [hoodie.metadata.enable=false, hoodie.compact.inline.trigger.strategy=NUM_COMMITS, hoodie.compact.inline.max.delta.commits=5] } 2023-06-19T10:27:08.474+0000 [INFO] 
[offline_compaction_schedule] [org.apache.hudi.utilities.HoodieCompactor] [HoodieCompactor]: Running Mode: [scheduleandexecute] 2023-06-19T10:27:08.474+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.utilities.HoodieCompactor] [HoodieCompactor]: Step 1: Do schedule 2023-06-19T10:27:08.651+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.client.embedded.EmbeddedTimelineService] [EmbeddedTimelineService]: Starting Timeline service !! 2023-06-19T10:27:08.652+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.client.embedded.EmbeddedTimelineService] [EmbeddedTimelineService]: Overriding hostIp to (ip-100-66-69-75.3175.aws-int.thomsonreuters.com) found in spark-conf. It was null 2023-06-19T10:27:08.661+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.common.table.view.FileSystemViewManager] [FileSystemViewManager]: Creating View Manager with storage type :MEMORY 2023-06-19T10:27:08.661+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.common.table.view.FileSystemViewManager] [FileSystemViewManager]: Creating in-memory based Table View 2023-06-19T10:27:08.671+0000 [DEBUG] [offline_compaction_schedule] [org.apache.hudi.org.eclipse.jetty.util.log] [log]: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.apache.hudi.org.eclipse.jetty.util.log) via org.apache.hudi.org.eclipse.jetty.util.log.Slf4jLog 2023-06-19T10:27:08.672+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.org.eclipse.jetty.util.log] [log]: Logging initialized @21982ms to org.apache.hudi.org.eclipse.jetty.util.log.Slf4jLog 2023-06-19T10:27:08.723+0000 [DEBUG] [offline_compaction_schedule] [org.apache.hudi.timeline.service.handlers.MarkerHandler] [MarkerHandler]: MarkerHandler FileSystem: s3 2023-06-19T10:27:08.723+0000 [DEBUG] [offline_compaction_schedule] [org.apache.hudi.timeline.service.handlers.MarkerHandler] [MarkerHandler]: MarkerHandler batching params: batchNumThreads=20 batchIntervalMs=50ms 2023-06-19T10:27:08.767+0000: [GC pause (G1 Evacuation Pause) (young), 0.0285504 secs] [Parallel Time: 21.1 ms, GC Workers: 8] [GC Worker Start (ms): Min: 22077.6, Avg: 22077.7, Max: 22078.6, Diff: 1.0] [Ext Root Scanning (ms): Min: 0.2, Avg: 1.7, Max: 7.8, Diff: 7.6, Sum: 13.3] [Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.2, Diff: 0.2, Sum: 0.3] [Processed Buffers: Min: 0, Avg: 0.4, Max: 1, Diff: 1, Sum: 3] [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.3] [Code Root Scanning (ms): Min: 0.0, Avg: 1.0, Max: 2.2, Diff: 2.2, Sum: 7.9] [Object Copy (ms): Min: 13.1, Avg: 18.0, Max: 19.9, Diff: 6.8, Sum: 143.7] [Termination (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.8] [Termination Attempts: Min: 1, Avg: 229.8, Max: 299, Diff: 298, Sum: 1838] [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.3] [GC Worker Total (ms): Min: 20.0, Avg: 20.8, Max: 21.0, Diff: 1.0, Sum: 166.5] [GC Worker End (ms): Min: 22098.5, Avg: 22098.6, Max: 22098.6, Diff: 0.0] [Code Root Fixup: 0.3 ms] [Code Root Purge: 0.0 ms] [Clear CT: 0.2 ms] [Other: 7.0 ms] [Choose CSet: 0.0 ms] [Ref Proc: 6.3 ms] [Ref Enq: 0.1 ms] [Redirty Cards: 0.1 ms] [Humongous Register: 0.0 ms] [Humongous Reclaim: 0.0 ms] [Free CSet: 0.3 ms] [Eden: 266.0M(266.0M)->0.0B(259.0M) Survivors: 31744.0K->38912.0K Heap: 299.2M(496.0M)->44866.0K(496.0M)] [Times: user=0.16 sys=0.01, real=0.04 secs] 2023-06-19T10:27:08.818+0000 [INFO] [offline_compaction_schedule] [io.javalin.Javalin] [Javalin]:


      / /____ _ _   __ ____ _ / /(_)____
 __  / // __ `/| | / // __ `// // // __ \
/ /_/ // /_/ / | |/ // /_/ // // // / / /
\____/ \__,_/  |___/ \__,_//_//_//_/ /_/

    https://javalin.io/documentation

2023-06-19T10:27:08.819+0000 [INFO] [offline_compaction_schedule] [io.javalin.Javalin] [Javalin]: Starting Javalin ... 2023-06-19T10:27:08.957+0000 [INFO] [offline_compaction_schedule] [io.javalin.Javalin] [Javalin]: Listening on http://localhost:42997/ 2023-06-19T10:27:08.957+0000 [INFO] [offline_compaction_schedule] [io.javalin.Javalin] [Javalin]: Javalin started in 142ms \o/ 2023-06-19T10:27:08.957+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.timeline.service.TimelineService] [TimelineService]: Starting Timeline server on port :42997 2023-06-19T10:27:08.957+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.client.embedded.EmbeddedTimelineService] [EmbeddedTimelineService]: Started embedded timeline server at ip-100-66-69-75.3175.aws-int.thomsonreuters.com:42997 2023-06-19T10:27:08.970+0000 [WARN] [offline_compaction_schedule] [org.apache.hudi.utilities.HoodieCompactor] [HoodieCompactor]: No instant time is provided for scheduling compaction. 2023-06-19T10:27:08.973+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.client.BaseHoodieWriteClient] [BaseHoodieWriteClient]: Scheduling table service COMPACT 2023-06-19T10:27:08.974+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.client.BaseHoodieWriteClient] [BaseHoodieWriteClient]: Scheduling compaction at instant time :20230619102708972 2023-06-19T10:27:08.978+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.common.table.HoodieTableMetaClient] [HoodieTableMetaClient]: Loading HoodieTableMetaClient from s3://a206760-novusdoc-s3-dev-use1/novusdoc 2023-06-19T10:27:08.990+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.common.table.HoodieTableConfig] [HoodieTableConfig]: Loading table properties from s3://a206760-novusdoc-s3-dev-use1/novusdoc/.hoodie/hoodie.properties 2023-06-19T10:27:08.990+0000 [INFO] [offline_compaction_schedule] [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem] [S3NativeFileSystem]: Opening 's3://a206760-novusdoc-s3-dev-use1/novusdoc/.hoodie/hoodie.properties' for reading 2023-06-19T10:27:09.067+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.common.table.HoodieTableMetaClient] [HoodieTableMetaClient]: Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=PARQUET) from s3://a206760-novusdoc-s3-dev-use1/novusdoc 2023-06-19T10:27:09.068+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.common.table.HoodieTableMetaClient] [HoodieTableMetaClient]: Loading Active commit timeline for s3://a206760-novusdoc-s3-dev-use1/novusdoc 2023-06-19T10:27:09.070+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.SecurityManager] [SecurityManager]: user=dr.who aclsEnabled=false viewAcls=hadoop viewAclsGroups= 2023-06-19T10:27:09.113+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.common.table.timeline.HoodieActiveTimeline] [HoodieActiveTimeline]: Loaded instants upto : Option{val=[20230619102516597deltacommitCOMPLETED]} 2023-06-19T10:27:09.121+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.common.table.view.FileSystemViewManager] [FileSystemViewManager]: Creating View Manager with storage type :REMOTE_FIRST 2023-06-19T10:27:09.121+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.common.table.view.FileSystemViewManager] [FileSystemViewManager]: Creating remote first table view 2023-06-19T10:27:09.128+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.table.action.compact.ScheduleCompactionActionExecutor] [ScheduleCompactionActionExecutor]: Checking if compaction needs to be run on 
s3://a206760-novusdoc-s3-dev-use1/novusdoc 2023-06-19T10:27:09.137+0000 [DEBUG] [offline_compaction_schedule] [org.apache.spark.SecurityManager] [SecurityManager]: user=dr.who aclsEnabled=false viewAcls=hadoop viewAclsGroups= 2023-06-19T10:27:09.184+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.client.BaseHoodieClient] [BaseHoodieClient]: Stopping Timeline service !! 2023-06-19T10:27:09.184+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.client.embedded.EmbeddedTimelineService] [EmbeddedTimelineService]: Closing Timeline server 2023-06-19T10:27:09.184+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.timeline.service.TimelineService] [TimelineService]: Closing Timeline Service 2023-06-19T10:27:09.184+0000 [INFO] [offline_compaction_schedule] [io.javalin.Javalin] [Javalin]: Stopping Javalin ... 2023-06-19T10:27:09.195+0000 [INFO] [offline_compaction_schedule] [io.javalin.Javalin] [Javalin]: Javalin has stopped 2023-06-19T10:27:09.195+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.timeline.service.TimelineService] [TimelineService]: Closed Timeline Service 2023-06-19T10:27:09.195+0000 [INFO] [offline_compaction_schedule] [org.apache.hudi.client.embedded.EmbeddedTimelineService] [EmbeddedTimelineService]: Closed Timeline server 2023-06-19T10:27:09.196+0000 [WARN] [offline_compaction_schedule] [org.apache.hudi.utilities.HoodieCompactor] [HoodieCompactor]: Couldn't do schedule 2023-06-19T10:27:09.211+0000 [INFO] [offline_compaction_schedule] [org.sparkproject.jetty.server.AbstractConnector] [AbstractConnector]: Stopped Spark@34dc85a{HTTP/1.1, (http/1.1)}{0.0.0.0:8090} 2023-06-19T10:27:09.238+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.ui.SparkUI] [SparkUI]: Stopped Spark web UI at http://ip-100-66-69-75.3175.aws-int.thomsonreuters.com:8090 2023-06-19T10:27:09.708+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.MapOutputTrackerMasterEndpoint] [MapOutputTrackerMasterEndpoint]: MapOutputTrackerMasterEndpoint stopped! 2023-06-19T10:27:09.749+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.SparkContext] [SparkContext]: Successfully stopped SparkContext 2023-06-19T10:27:09.751+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.util.ShutdownHookManager] [ShutdownHookManager]: Shutdown hook called 2023-06-19T10:27:09.751+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.util.ShutdownHookManager] [ShutdownHookManager]: Deleting directory /mnt/tmp/spark-94366315-0ad4-4f1a-8051-1c517b83f435 2023-06-19T10:27:09.756+0000 [INFO] [offline_compaction_schedule] [org.apache.spark.util.ShutdownHookManager] [ShutdownHookManager]: Deleting directory /mnt/tmp/spark-f72ca80c-54af-4f64-bcaa-176fe9cc27e4 Heap garbage-first heap total 507904K, used 192322K [0x00000006c0000000, 0x00000006c0100f80, 0x00000007c0000000) region size 1024K, 183 young (187392K), 38 survivors (38912K) Metaspace used 102404K, capacity 108290K, committed 108544K, reserved 1144832K class space used 13406K, capacity 14036K, committed 14080K, reserved 1048576K [hadoop@ip-100-66-69-75 a206760-PowerUser2

koochiswathiTR commented 1 year ago

@ad1happy2go @soumilshah1995 I see the compaction triggered only the first time; from the second time onward it says no instant time found. But I don't want to provide the instant time, since it is optional and I don't want to specify it in my case.

koochiswathiTR commented 1 year ago

Looks like the needCompact function is not considering hoodie.compact.inline.trigger.strategy=NUM_COMMITS as the compaction trigger strategy. It returns false, so the compaction is not scheduled. Please help.

image

koochiswathiTR commented 1 year ago

@xushiyan @soumilshah1995 @ad1happy2go @nsivabalan

Any update on this?

ad1happy2go commented 1 year ago

@koochiswathiTR For NUM_COMMITS, below is the code it uses to decide whether a compaction is needed. I see your hoodie.compact.inline.max.delta.commits property is 100; it will schedule a compaction only after 100 delta commits since the last compaction.

// NUM_COMMITS strategy: compaction becomes schedulable once the number of delta
// commits since the last compaction reaches hoodie.compact.inline.max.delta.commits
case NUM_COMMITS:
  compactable = inlineCompactDeltaCommitMax <= latestDeltaCommitInfo.getLeft();
  if (compactable) {
    LOG.info(String.format("The delta commits >= %s, trigger compaction scheduler.", inlineCompactDeltaCommitMax));
  }
  break;

Let me know in case I misunderstood your question.
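As a quick sanity check of that threshold, one can count the completed deltacommit instants sitting in the active timeline. This is only a rough sketch (it counts all deltacommits still in .hoodie, not strictly those since the last compaction, and assumes the aws CLI and the same base path used in this thread):

aws s3 ls s3://a206760-novusdoc-s3-dev-use1/novusdoc/.hoodie/ | grep -c '\.deltacommit$'

If that number is well above hoodie.compact.inline.max.delta.commits and scheduling still logs "Couldn't do schedule", the threshold itself is probably not what is blocking scheduling.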

koochiswathiTR commented 1 year ago

@ad1happy2go We do have more than 200 delta commits, but sometimes we don't see the compaction getting triggered.

I see the compaction went to INFLIGHT. When will it complete? How can we finish this in-progress compaction?

image
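One way to see whether a compaction is actually pending, and which instant it is, is to list the compaction instants in the timeline directly. A minimal sketch, assuming the aws CLI and the same base path:

aws s3 ls s3://a206760-novusdoc-s3-dev-use1/novusdoc/.hoodie/ | grep -E '\.compaction\.(requested|inflight)$'

Any instant that shows up here as requested or inflight can then be targeted explicitly when executing (see the --instant-time sketch further down in this thread).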

koochiswathiTR commented 1 year ago

@ad1happy2go

Compactions fail with

java.lang.IllegalArgumentException: Earliest write inflight instant time must be later than compaction time. Earliest :[==>20230620080309158deltacommitINFLIGHT], Compaction scheduled at 20230620080355689

2023-06-20 08:03:55,711 INFO s3n.S3NativeFileSystem: Opening 's3://a206760-novusdoc-s3-dev-use1/novusdoc/.hoodie/hoodie.properties' for reading 2023-06-20 08:03:55,741 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=PARQUET) from s3://a206760-novusdoc-s3-dev-use1/novusdoc 2023-06-20 08:03:55,741 INFO table.HoodieTableMetaClient: Loading Active commit timeline for s3://a206760-novusdoc-s3-dev-use1/novusdoc

I deleted 20230620080309158.deltacommit.inflight and 20230620080309158.deltacommit.requested and it worked. But I can't do this in production; we get upserts every second through the stream. Please help

spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.12:0.11.1,org.apache.spark:spark-avro_2.11:2.4.4,org.apache.hudi:hudi-spark3-bundle_2.12:0.11.1 --verbose --driver-memory 2g --executor-memory 2g --class org.apache.hudi.utilities.HoodieCompactor /usr/lib/hudi/hudi-utilities-bundle.jar,/usr/lib/hudi/hudi-spark-bundle.jar --table-name novusdoc --base-path s3://a206760-novusdoc-s3-dev-use1/novusdoc --mode scheduleandexecute --spark-memory 2g --hoodie-conf hoodie.metadata.enable=false --hoodie-conf hoodie.compact.inline.trigger.strategy=NUM_COMMITS --hoodie-conf hoodie.compact.inline.max.delta.commits=50
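An alternative to deleting timeline files by hand is to let the ingestion writer own compaction scheduling, so scheduling never races an inflight delta commit from a separate job. This is only a sketch, assuming the stream writes via the Hudi Spark datasource streaming sink; the configs below would go on the writer, not on HoodieCompactor:

hoodie.compact.inline=false
hoodie.datasource.compaction.async.enable=true
hoodie.compact.inline.max.delta.commits=50

Whether async compaction keeps up with an every-second upsert stream needs to be validated separately; the point is only that scheduling from the writer avoids the "earliest write inflight instant" conflict.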

koochiswathiTR commented 1 year ago

The first time, compaction runs. From the second time onward I get this error: 2023-06-20T10:31:04.313+0000 [WARN] [offline_compaction_scheduleTest] [org.apache.hudi.utilities.HoodieCompactor] [HoodieCompactor]: Couldn't do schedule 2023-06-20T10:31:04.323+0000 [INFO] [offline_compaction_scheduleTest] [org.sparkproject.jetty.server.AbstractConnector] [AbstractConnector]: Stopped Spark@99a78d7{HTTP/1.1, (http/1.1)}{0.0.0.0:8090}

image

koochiswathiTR commented 1 year ago

I scheduled and executed compaction, but the execute step failed with an Out Of Memory error and the compaction went to INFLIGHT in the Hudi timeline. How can I complete this INFLIGHT compaction without bringing down the ingestion job? Can we unschedule an INFLIGHT compaction with spark-submit? image

spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.12:0.11.1,org.apache.spark:spark-avro_2.11:2.4.4,org.apache.hudi:hudi-spark3-bundle_2.12:0.11.1 --verbose --driver-memory 6g --executor-memory 6g --class org.apache.hudi.utilities.HoodieCompactor /usr/lib/hudi/hudi-utilities-bundle.jar,/usr/lib/hudi/hudi-spark-bundle.jar --table-name novusdoc --base-path s3://a206760-novusdoc-s3-dev-use1/novusdoc --mode scheduleandexecute --spark-memory 6g --hoodie-conf hoodie.metadata.enable=false --hoodie-conf hoodie.compact.inline.trigger.strategy=TIME_ELAPSED --hoodie-conf hoodie.compact.inline.max.delta.seconds=3600

@ad1happy2go @soumilshah1995 @xushiyan @nsivabalan spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.12:0.11.1,org.apache.spark:spark-avro_2.11:2.4.4,org.apache.hudi:hudi-spark3-bundle_2.12:0.11.1 --verbose --driver-memory 8g --executor-memory 8g --class org.apache.hudi.utilities.HoodieCompactor /usr/lib/hudi/hudi-utilities-bundle.jar,/usr/lib/hudi/hudi-spark-bundle.jar --table-name novusdoc --base-path s3://a206760-novusdoc-s3-dev-use1/novusdoc --mode execute --spark-memory 8g --hoodie-conf hoodie.metadata.enable=false --hoodie-conf hoodie.compact.inline.trigger.strategy=TIME_ELAPSED --hoodie-conf hoodie.compact.inline.max.delta.seconds=3600
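If the goal is just to finish the one compaction that is already pending, the --instant-time flag (visible as "--instant-time null" in the HoodieCompactorConfig log earlier in this thread) can target it directly in execute mode. A sketch, where <pending_compaction_instant> is a placeholder for the pending compaction instant shown in the timeline:

spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.12:0.11.1,org.apache.spark:spark-avro_2.11:2.4.4,org.apache.hudi:hudi-spark3-bundle_2.12:0.11.1 --verbose --driver-memory 8g --executor-memory 8g --class org.apache.hudi.utilities.HoodieCompactor /usr/lib/hudi/hudi-utilities-bundle.jar --table-name novusdoc --base-path s3://a206760-novusdoc-s3-dev-use1/novusdoc --mode execute --instant-time <pending_compaction_instant> --spark-memory 8g --hoodie-conf hoodie.metadata.enable=false

The application jar is passed as a single path here, since the DependencyUtils warning earlier in the log suggests the comma-joined pair of jars is treated as one nonexistent file. If the executor still runs out of memory, raising --executor-memory / --spark-memory for this one-off run is the simplest lever.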

ad1happy2go commented 1 year ago

@koochiswathiTR I don't think there is anything like that which can unschedule the compaction.

itdom commented 1 month ago

@ad1happy2go

Compactions fail with

java.lang.IllegalArgumentException: Earliest write inflight instant time must be later than compaction time. Earliest :[==>20230620080309158deltacommitINFLIGHT], Compaction scheduled at 20230620080355689

2023-06-20 08:03:55,711 INFO s3n.S3NativeFileSystem: Opening 's3://a206760-novusdoc-s3-dev-use1/novusdoc/.hoodie/hoodie.properties' for reading 2023-06-20 08:03:55,741 INFO table.HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=PARQUET) from s3://a206760-novusdoc-s3-dev-use1/novusdoc 2023-06-20 08:03:55,741 INFO table.HoodieTableMetaClient: Loading Active commit timeline for s3://a206760-novusdoc-s3-dev-use1/novusdoc

I deleted 20230620080309158.deltacommit.inflight and 20230620080309158.deltacommit.requested and it worked. But I can't do this in production; we get upserts every second through the stream. Please help

spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.12:0.11.1,org.apache.spark:spark-avro_2.11:2.4.4,org.apache.hudi:hudi-spark3-bundle_2.12:0.11.1 --verbose --driver-memory 2g --executor-memory 2g --class org.apache.hudi.utilities.HoodieCompactor /usr/lib/hudi/hudi-utilities-bundle.jar,/usr/lib/hudi/hudi-spark-bundle.jar --table-name novusdoc --base-path s3://a206760-novusdoc-s3-dev-use1/novusdoc --mode scheduleandexecute --spark-memory 2g --hoodie-conf hoodie.metadata.enable=false --hoodie-conf hoodie.compact.inline.trigger.strategy=NUM_COMMITS --hoodie-conf hoodie.compact.inline.max.delta.commits=50

Has your problem been resolved?