Open szingerpeter opened 1 year ago
@szingerpeter Thanks for reporting the issue. In the Hudi 0.11.0 release, the metadata table is enabled by default. After you deleted s3://<table_path>/.hoodie/metadata, the next write operation doing upserts should reinitialize the metadata table automatically, with all files in the table, not just the ones from the last upsert operation.
To help triage the issue, could you share the write configs for the upsert job, and the active timeline of the metadata table (under s3://<table_path>/.hoodie/metadata/.hoodie)?
Based on the information, it looks like you're using the EMR 6.7.0 release. Could you try the EMR 6.8.0 release, which ships Hudi 0.11.1, and see if you hit the same problem?
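For reference, a minimal sketch of what such an upsert looks like from Spark with the metadata table left enabled (the DataFrame df, the record key/precombine fields, and the table name below are placeholders, not the reporter's actual configs):

// Hypothetical Spark (Scala) upsert; on 0.11.x the next upsert like this should
// re-bootstrap s3://<table_path>/.hoodie/metadata from the file system.
import org.apache.spark.sql.SaveMode

df.write.format("hudi").
  option("hoodie.table.name", "my_table").                   // placeholder table name
  option("hoodie.datasource.write.recordkey.field", "id").   // placeholder key field
  option("hoodie.datasource.write.precombine.field", "ts").  // placeholder precombine field
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.metadata.enable", "true").                  // default in 0.11.0+, shown explicitly
  mode(SaveMode.Append).
  save("s3://<table_path>")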
@yihua, thank you for your quick reply.
Unfortunately, the environment is fixed and upgrading EMR is not possible right now.
I sent the requested files via Slack.
Feel free to close the issue if you got it resolved and are not looking for any more action items (AIs) from us.
I'm still in touch with @yihua.
@szingerpeter @yihua what is the latest state of this issue?
I'm experiencing a similar situation. I upgraded my tables to EMR 6.9 with Hudi 0.12, my pipelines broke, so I downgraded the tables back to EMR 6.5 and Hudi 0.9. After that, even with the metadata table config enabled, I'm not able to see the metadata table on S3. I've tried the following:
sudo /usr/lib/hudi/cli/bin/hudi-cli.sh
connect --path <S3-PATH>
metadata create
The command fails, but it seems to create an empty metadata table. This is the stack trace:
2023-01-10 18:47:36,061 INFO scheduler.DAGScheduler: ResultStage 0 (collect at HoodieSparkEngineContext.java:73) failed in 0.607 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (ip-172-31-2-164.us-west-2.compute.internal executor 1): java.lang.IllegalStateException: unread block data
at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2934)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1704)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:457)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Driver stacktrace:
2023-01-10 18:47:36,064 INFO scheduler.DAGScheduler: Job 0 failed: collect at HoodieSparkEngineContext.java:73, took 0.652691 s
2023-01-10 18:47:36,065 ERROR core.SimpleExecutionStrategy: Command failed java.lang.reflect.UndeclaredThrowableException
2023-01-10 18:47:36,066 WARN JLineShellComponent.exceptions:
java.lang.reflect.UndeclaredThrowableException
at org.springframework.util.ReflectionUtils.rethrowRuntimeException(ReflectionUtils.java:315)
at org.springframework.util.ReflectionUtils.handleInvocationTargetException(ReflectionUtils.java:295)
at org.springframework.util.ReflectionUtils.handleReflectionException(ReflectionUtils.java:279)
at org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:219)
at org.springframework.shell.core.SimpleExecutionStrategy.invoke(SimpleExecutionStrategy.java:68)
at org.springframework.shell.core.SimpleExecutionStrategy.execute(SimpleExecutionStrategy.java:59)
at org.springframework.shell.core.AbstractShell.executeCommand(AbstractShell.java:134)
at org.springframework.shell.core.JLineShell.promptLoop(JLineShell.java:533)
at org.springframework.shell.core.JLineShell.run(JLineShell.java:179)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (ip-172-31-2-164.us-west-2.compute.internal executor 1): java.lang.IllegalStateException: unread block data
at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2934)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1704)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:457)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2470)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2419)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2418)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2418)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1125)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1125)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1125)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2684)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2626)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2615)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:914)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2241)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2262)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2281)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2306)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
at org.apache.spark.api.java.JavaRDDLike.collect(JavaRDDLike.scala:362)
at org.apache.spark.api.java.JavaRDDLike.collect$(JavaRDDLike.scala:361)
at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
at org.apache.hudi.client.common.HoodieSparkEngineContext.map(HoodieSparkEngineContext.java:73)
at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.getPartitionsToFilesMapping(HoodieBackedTableMetadataWriter.java:365)
at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.bootstrapFromFilesystem(HoodieBackedTableMetadataWriter.java:313)
at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.bootstrapIfNeeded(HoodieBackedTableMetadataWriter.java:272)
at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.initialize(SparkHoodieBackedTableMetadataWriter.java:91)
at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.<init>(HoodieBackedTableMetadataWriter.java:114)
at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.<init>(SparkHoodieBackedTableMetadataWriter.java:62)
at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.create(SparkHoodieBackedTableMetadataWriter.java:58)
at org.apache.hudi.cli.commands.MetadataCommand.create(MetadataCommand.java:104)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:216)
... 6 more
Caused by: java.lang.IllegalStateException: unread block data
at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2934)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1704)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:457)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Does anyone have a clue on why this is happening?
The recommended way to delete the metadata table for Hudi versions > 0.11.0 is to disable metadata via the write configs (hoodie.metadata.enable=false) in the next write to Hudi; Hudi will then programmatically take care of deleting the metadata table and removing all references to it (like in hoodie.properties).
After disabling, if you wish to re-instantiate the metadata table, you can, after a few commits, re-enable it via the write configs (hoodie.metadata.enable=true) and Hudi will take care of populating the metadata table from scratch for you.
If you prefer an async way of building the metadata table, so as not to block your regular writers, you can try our async indexer as well. More info can be found here.
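To make the above concrete, here is a rough sketch of the two steps with the Spark datasource writer (only hoodie.metadata.enable comes from the recommendation above; df and the path are placeholders, and the other required write configs are omitted for brevity):

import org.apache.spark.sql.SaveMode

// Step 1 (sketch): next regular write with the metadata table disabled;
// Hudi should then delete s3://<table_path>/.hoodie/metadata and clean up references.
df.write.format("hudi").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.metadata.enable", "false").
  mode(SaveMode.Append).
  save("s3://<table_path>")

// Step 2 (sketch): in a later write, re-enable it; Hudi should rebuild the metadata
// table from the file system, covering all existing files, not just this commit.
df.write.format("hudi").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.metadata.enable", "true").
  mode(SaveMode.Append).
  save("s3://<table_path>")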
Having said all this, I'm curious to know what issues you folks encountered that demanded deleting the metadata table. Could you help by going over the details?
During a downgrade, depending on the versions, Hudi will automatically delete the metadata table. For example, if you downgrade from 0.12.0 to 0.9.0, Hudi will delete the metadata table since the way we populate the metadata table differs across these versions.
One difference you might see between pre-0.11.0 and 0.11.0 onwards is that the metadata table is not enabled by default pre-0.11.0, something to keep in mind if you were relying on default configs to build the metadata table.
Hey @nsivabalan, thanks for your answer.
Having said all this, I'm curious to know what issues you folks encountered that demanded deleting the metadata table. Could you help by going over the details?
In my case it was exactly this:
During a downgrade, depending on the versions, Hudi will automatically delete the metadata table. For example, if you downgrade from 0.12.0 to 0.9.0, Hudi will delete the metadata table since the way we populate the metadata table differs across these versions.
That was my initial assumption; it's nice to confirm that was the case.
One difference you might see between pre-0.11.0 and 0.11.0 onwards is that the metadata table is not enabled by default pre-0.11.0, something to keep in mind if you were relying on default configs to build the metadata table.
I'm not relying on defaults in this case; I have the config hoodie.metadata.enable=true.
My metadata table was deleted in the downgrade, but the writer was kept with the config enabled. After that we kept writing data for more than a week and the table wasn't recreated. Do you think disabling and re-enabling after some days will fix my issue? (I'm already back on 0.9.)
Yes, that should work. On a side note, in general we have fixed some stability issues with the metadata table in 0.11.0, so that's something to be cautious about on older versions. But if it has been working well for you, please continue using the metadata table.
Thank you @nsivabalan, I'm gonna try that! 🙏🏾
Just to let you know @nsivabalan, your recommendation didn't work. I disabled it (hoodie.metadata.enable=false) for 2 days, enabled it after that (hoodie.metadata.enable=true), and the metadata table wasn't re-created.
Sorry for the late reply. @yihua had the same recommendation for me, which resulted in Hudi deleting all the data.
My case was somewhat more specific, as by mistake the partitioning config (which had been defined previously) wasn't defined during the writes.
I shared all the info with @yihua on Slack.
@yihua: do we have a JIRA for this? Is there any fix required? Or, if you feel it's a user misconfiguration and no fixes are required from the Hudi side, we can close out the issue.
Hi,
I'm using Hudi CLI version 1.0, Hudi version 0.11.0, Spark version 3.2.1-amzn-0 and Hive version 3.1.3-amzn-0. After rolling back a table I was facing the issue described in #4747.
Thereafter, following the recommendation in #4747, I manually deleted the metadata folder under s3://<table_path>/.hoodie/metadata, which solved the problem. After upserting into the table, the metadata under s3://<table_path>/.hoodie/metadata gets recreated. However, when querying the data via Spark and Beeline, only the entries that were upserted in the last operation (~40M rows) are returned, and not any previous data (~2B rows). If I delete s3://<table_path>/.hoodie/metadata again, then both Spark and Beeline return all the historical data and the newly inserted data.
I tried using the Hudi CLI's metadata create command, but it fails with:
Is there a way of recreating the metadata table of an existing Hudi table such that it will reference the historical data as well?