
Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Flink: Fix flaky TestIcebergSourceFailover > testBoundedWithSavepoint #10671

Open · nastra opened this issue 3 months ago

nastra commented 3 months ago

Feature Request / Improvement

TestIcebergSourceFailover > testBoundedWithSavepoint FAILED
    java.util.concurrent.ExecutionException: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint was declined (tasks not ready)
        at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
        at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2005)
        at org.apache.iceberg.flink.source.TestIcebergSourceFailover.testBoundedWithSavepoint(TestIcebergSourceFailover.java:154)

        Caused by:
        org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint was declined (tasks not ready)
            at app//org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:550)
            at app//org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2143)
            at app//org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveDeclineMessage(CheckpointCoordinator.java:1100)
            at app//org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$declineCheckpoint$2(ExecutionGraphHandler.java:103)
            at app//org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
            at java.base@11.0.23/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
            at java.base@11.0.23/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
            at java.base@11.0.23/java.lang.Thread.run(Thread.java:829)

            Caused by:
            org.apache.flink.runtime.checkpoint.CheckpointException: org.apache.flink.runtime.checkpoint.CheckpointException: Task name with subtask : Source: IcebergSource -> Map -> IcebergStreamWriter (1/2)#0 Failure reason: Checkpoint was declined (tasks not ready)
                at app//org.apache.flink.runtime.taskmanager.Task.declineCheckpoint(Task.java:1396)
                at app//org.apache.flink.runtime.taskmanager.Task.declineCheckpoint(Task.java:1389)
                at app//org.apache.flink.runtime.taskmanager.Task.triggerCheckpointBarrier(Task.java:1383)
                at app//org.apache.flink.runtime.taskexecutor.TaskExecutor.triggerCheckpoint(TaskExecutor.java:1023)
                at jdk.internal.reflect.GeneratedMethodAccessor244.invoke(Unknown Source)
                at java.base@11.0.23/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.base@11.0.23/java.lang.reflect.Method.invoke(Method.java:566)
                at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRpcInvocation$1(PekkoRpcActor.java:309)
                at app//org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
                at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcInvocation(PekkoRpcActor.java:307)
                at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:222)
                at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:168)
                at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33)
                at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29)
                at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
                at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
                at org.apache.pekko.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:29)
                at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175)
                at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
                at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
                at org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547)
                at org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545)
                at org.apache.pekko.actor.AbstractActor.aroundReceive(AbstractActor.scala:229)
                at org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590)
                at org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557)
                at org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280)
                at org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241)
                at org.apache.pekko.dispatch.Mailbox.exec(Mailbox.scala:253)
                at java.base@11.0.23/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
                at java.base@11.0.23/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
                at java.base@11.0.23/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
                at java.base@11.0.23/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
                at java.base@11.0.23/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)

Query engine

Flink
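
The stack trace suggests a race in the test itself: the savepoint is triggered while one of the `Source: IcebergSource -> Map -> IcebergStreamWriter` subtasks is still deploying, so the task declines the checkpoint barrier and the pending savepoint is aborted ("Checkpoint was declined (tasks not ready)"). Below is a minimal sketch of one possible mitigation, retrying the trigger while that decline persists; the helper class, its parameters, and the `Supplier` wrapping of the trigger call are illustrative assumptions, not code from `TestIcebergSourceFailover`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.function.Supplier;

import org.apache.flink.runtime.checkpoint.CheckpointException;

// Hypothetical test helper, not part of the Iceberg code base.
public class SavepointRetry {

  /**
   * Retries a savepoint trigger that can be declined while tasks are still deploying.
   * The supplier is expected to start a fresh savepoint attempt on every call.
   * Assumes maxAttempts >= 1.
   */
  public static String triggerWithRetry(
      Supplier<CompletableFuture<String>> trigger, int maxAttempts, long backoffMs)
      throws Exception {
    ExecutionException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return trigger.get().get();
      } catch (ExecutionException e) {
        // Only retry the "tasks not ready" style of failure seen in the log above;
        // anything else is rethrown immediately.
        if (e.getCause() instanceof CheckpointException) {
          last = e;
          Thread.sleep(backoffMs);
        } else {
          throw e;
        }
      }
    }
    throw last;
  }
}
```

An alternative with the same intent would be to wait for every subtask of the job to report RUNNING before issuing the savepoint at all; either way, the goal is not to race the task deployment.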

pvary commented 3 months ago

Still trying to reproduce this locally, but after 1500 runs there has been no failure. See:

[Screenshot, 2024-07-26 15:53: local test runs with no failures]
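
A local stress run can also be scripted rather than relaunched by hand, for example with JUnit 5's `@RepeatedTest`. The class and method below are illustrative only and are not how the runs in the screenshot were produced.

```java
import org.junit.jupiter.api.RepeatedTest;

// Hypothetical stress wrapper for hunting the race locally in a single JVM.
class BoundedWithSavepointStressTest {

  @RepeatedTest(1500)
  void boundedWithSavepoint() throws Exception {
    // Invoke the same bounded-read-plus-savepoint scenario exercised by
    // TestIcebergSourceFailover#testBoundedWithSavepoint here.
  }
}
```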
stevenzwu commented 3 months ago

@pvary some of these race condition issues are very difficult to reproduce locally and tend to show up more often in CI builds. That has been my experience as well. My theory is that the CI machines are more heavily loaded because they are shared resources.

This is a really hard issue to track down and may take some time to fix. Would it make sense to disable this particular test for now?

pvary commented 3 months ago

@stevenzwu: Created a pull request to disable the tests: #10802
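
For readers following the thread, a change like that typically amounts to a `@Disabled` annotation that points back at this issue. The snippet below is only a sketch of that pattern, assuming the test class is on JUnit 5; it is not the actual diff in the linked PR.

```java
import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;

class TestIcebergSourceFailover {

  // Temporarily skipped while the checkpoint-decline race is investigated.
  @Disabled("Flaky test: see https://github.com/apache/iceberg/issues/10671")
  @Test
  void testBoundedWithSavepoint() {
    // test body unchanged; only the annotation is added
  }
}
```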