apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Bug] java.lang.RuntimeException: Failed to find latest snapshot id #3678

Open skdfeitian opened 2 days ago

skdfeitian commented 2 days ago

Search before asking

Paimon version

0.8

Compute Engine

flink 1.17.1

Minimal reproduce step

```
java.lang.RuntimeException: Failed to find latest snapshot id
    at org.apache.paimon.utils.SnapshotManager.latestSnapshotId(SnapshotManager.java:143)
    at org.apache.paimon.utils.SnapshotManager.latestSnapshot(SnapshotManager.java:131)
    at org.apache.paimon.operation.FileStoreCommitImpl.tryCommit(FileStoreCommitImpl.java:623)
    at org.apache.paimon.operation.FileStoreCommitImpl.commit(FileStoreCommitImpl.java:294)
    at org.apache.paimon.table.sink.TableCommitImpl.commitMultiple(TableCommitImpl.java:192)
    at org.apache.paimon.flink.sink.StoreCommitter.commit(StoreCommitter.java:100)
    at org.apache.paimon.flink.sink.CommitterOperator.commitUpToCheckpoint(CommitterOperator.java:159)
    at org.apache.paimon.flink.sink.CommitterOperator.notifyCheckpointComplete(CommitterOperator.java:152)
    at org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.notifyCheckpointComplete(StreamOperatorWrapper.java:104)
    at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.notifyCheckpointComplete(RegularOperatorChain.java:145)
    at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.notifyCheckpoint(SubtaskCheckpointCoordinatorImpl.java:467)
    at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.notifyCheckpointComplete(SubtaskCheckpointCoordinatorImpl.java:400)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.notifyCheckpointComplete(StreamTask.java:1430)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$notifyCheckpointCompleteAsync$16(StreamTask.java:1371)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$notifyCheckpointOperation$19(StreamTask.java:1410)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
    at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMail(MailboxProcessor.java:398)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:367)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:352)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:229)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788)
    at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
    at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:494)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1737)
    at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1753)
    at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1750)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1765)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1760)
    at org.apache.paimon.fs.hadoop.HadoopFileIO.exists(HadoopFileIO.java:110)
    at org.apache.paimon.utils.SnapshotManager.findLatest(SnapshotManager.java:391)
    at org.apache.paimon.utils.SnapshotManager.latestSnapshotId(SnapshotManager.java:141)
    ... 27 more
```

What doesn't meet your expectations?

The job consumes Kafka data and writes it to HDFS via Flink SQL. Occasionally, on some high-traffic topics, this error occurs. Please provide guidance on how to troubleshoot and resolve the issue.

Anything else?

No response

Are you willing to submit a PR?

xuzifu666 commented 20 minutes ago

@skdfeitian From the error stack, this issue is caused by the cached FileSystem instance being closed (`java.io.IOException: Filesystem closed`). You can try setting `fs.hdfs.impl.disable.cache=true` in the client or HDFS configuration to resolve the problem.
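
A minimal sketch of that workaround, assuming the Hadoop client picks up its configuration from a `core-site.xml` (or `hdfs-site.xml`) on the Flink client/TaskManager classpath; disabling the cache makes each caller get its own `DistributedFileSystem` instance instead of a shared one that another task may close:

```xml
<!-- core-site.xml on the client side: give every FileSystem.get() call
     a fresh DistributedFileSystem instead of the shared cached instance -->
<property>
  <name>fs.hdfs.impl.disable.cache</name>
  <value>true</value>
</property>
```

Note that disabling the cache trades the shared-instance problem for extra connection setup per `FileSystem.get()` call, so it is a workaround rather than a general recommendation.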