apache / pekko

Build highly concurrent, distributed, and resilient message-driven applications using Java/Scala
https://pekko.apache.org/
Apache License 2.0
1.13k stars 137 forks source link

fix: resolves and addition tests for child preStart supervise #1385

Closed Roiocam closed 4 days ago

Roiocam commented 2 weeks ago

Related with #1384 and resolves #1384, some of code contributed by @sadekmunawar

if PR can reproducer this issue, the ci checks will be unable to successd: https://github.com/apache/pekko/actions/runs/9770536752/job/26971887726?pr=1385

Roiocam commented 2 weeks ago

From the perspective of stack frame, the problem lies in the handling:

https://github.com/apache/pekko/blob/46c557424d19bb2d07adf2fbd55351a8187ff359/actor/src/main/scala/org/apache/pekko/actor/dungeon/FaultHandling.scala#L158-L163


[ERROR] [07/03/2024 09:46:12.113] [SupervisorSpec-pekko.actor.default-dispatcher-5] [pekko://SupervisorSpec/user/$a/$a] changing Recreate into Create after org.apache.pekko.actor.ActorInitializationException: pekko://SupervisorSpec/user/$a/$a: exception during creation, root cause message: [deliberate test failure]
[ERROR] [07/03/2024 09:46:12.114] [SupervisorSpec-pekko.actor.default-dispatcher-6] [pekko://SupervisorSpec/user/$a/$a] assertion failed
java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:264)
    at org.apache.pekko.actor.dungeon.FaultHandling.faultCreate(FaultHandling.scala:160)
    at org.apache.pekko.actor.dungeon.FaultHandling.faultCreate$(FaultHandling.scala:158)
    at org.apache.pekko.actor.ActorCell.faultCreate(ActorCell.scala:420)
    at org.apache.pekko.actor.dungeon.FaultHandling.faultRecreate(FaultHandling.scala:96)
    at org.apache.pekko.actor.dungeon.FaultHandling.faultRecreate$(FaultHandling.scala:92)
    at org.apache.pekko.actor.ActorCell.faultRecreate(ActorCell.scala:420)
    at org.apache.pekko.actor.ActorCell.invokeAll$1(ActorCell.scala:526)
    at org.apache.pekko.actor.ActorCell.systemInvoke(ActorCell.scala:545)
    at org.apache.pekko.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:297)
    at org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:232)
    at org.apache.pekko.dispatch.Mailbox.exec(Mailbox.scala:245)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
He-Pin commented 2 weeks ago

I think we can attach a test case for this, will look into this over the weekend.

Roiocam commented 2 weeks ago

@sadekmunawar Hi, my unit test partly refers to your code in #1383 links.

Would you like to contribute this to the Pekko community?

Roiocam commented 2 weeks ago

Would you like to contribute this to the Pekko community?

@pjfanning Am I doing this correctly, may I ask?

pjfanning commented 2 weeks ago

Would you like to contribute this to the Pekko community?

@pjfanning Am I doing this correctly, may I ask?

The test code was written by https://discuss.lightbend.com/u/nodeninja/summary

@sadekmunawar can you confirm if that is you? We received your iCLA.

sadekmunawar commented 2 weeks ago

The test code was written by https://discuss.lightbend.com/u/nodeninja/summary

@sadekmunawar can you confirm if that is you? We received your iCLA.

@pjfanning yes, that's me. After finding the bug, I wrote that test for a solution I implemented. My solution converts the FatallyFailed into a case class and attaches an Optional ActorRef to it.

I also wrote a scala version of this test in ActorLifCyleSpec. Feel free to use either version of the test

"must not break supervisor strategy due to unhandled exception in preStart" in {
  val id = newUuid.toString
  val gen = new AtomicInteger(0)
  val maxRetryNum = 3

  val childProps = Props(new LifeCycleTestActor(testActor, id, gen) {
    override def preStart(): Unit = {
      report("preStart")
      throw new Exception("test exception")
    }
  }).withDeploy(Deploy.local)

  val supervisorStrategy: SupervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = maxRetryNum, timeout.duration) {
      case _: ActorInitializationException => SupervisorStrategy.Restart
      case _                               => SupervisorStrategy.Escalate
    }
  val supervisor =  system.actorOf(Props(classOf[Supervisor], supervisorStrategy))
  Await.result((supervisor ? childProps).mapTo[ActorRef], timeout.duration)

  (0 to maxRetryNum).foreach(i => {
    expectMsg(("preStart", id, i))
  })
  expectNoMessage()
  system.stop(supervisor)
}
laglangyue commented 1 week ago

lgtm. Although I have read the relevant code, I did not delve into testing and debugging

Roiocam commented 1 week ago

I also wrote a scala version of this test in ActorLifCyleSpec. Feel free to use either version of the test

Thanks, I have updated this PR and verified it works in locally

pjfanning commented 4 days ago

@Roiocam should we backport this to this 1.0.x branch?

Roiocam commented 4 days ago

@Roiocam should we backport this to this 1.0.x branch?

Of course, this is a bug. backport on #1396

sadekmunawar commented 2 days ago

It looks like the checks did not run the tests in SupervisorHierarchySpec.

When I ran the tests locally, two tests in SupervisorHierarchySpec failed.

Can someone check if these two unit tests pass?

pjfanning commented 2 days ago

It looks like the checks did not run the tests in SupervisorHierarchySpec.

When I ran the tests locally, two tests in SupervisorHierarchySpec failed.

  • handle failure in creation when supervision strategy returns Resume and Restart
  • survive being stressed

Can someone check if these two unit tests pass?

Those tests are tagged as long-running tests. Maybe such tests are not run by default.

Did those tests fail before this PR was added?

sadekmunawar commented 2 days ago

Did those tests fail before this PR was added?

No, they passed in my local workspace before this PR was added. Unless there's something wrong with my local setup, this PR is the likely cause of the failures. I ran into issues with these two unit tests when experimenting with my own fix.

Roiocam commented 2 days ago

I will investigate it. :)

Roiocam commented 2 days ago

both survive being stressed and handle failure in creation when supervision strategy returns Resume and Restart has passes in my local by using #1399