EventSourcedBehavior/Durable State exposed to stack overflow when lots of read-only commands are in the stash

apache / pekko

Build highly concurrent, distributed, and resilient message-driven applications using Java/Scala

https://pekko.apache.org/

Apache License 2.0

1.17k stars 139 forks source link

EventSourcedBehavior/Durable State exposed to stack overflow when lots of read-only commands are in the stash #1327

Open JacobF7 opened 3 months ago

JacobF7 commented 3 months ago

Both EventSourcedBehavior and Durable state seem to be exposed to stack overflow exceptions when lots of read-only commands are in the stash. This bug has already been opened as an issue for Akka and a fix has also been recently merged. Are there any plans of fixing this in Pekko too?

pjfanning commented 3 months ago

Legally, we can't take any code that is committed to Akka since its license changed. If someone wants to attempt a clean room fix, that would be great.

Even a reproducible test case that someone else can use to do a clean room fix would be much appreciated.

He-Pin commented 3 months ago

@JacobF7 would you like to prepare a reproducer? I'm not using persistent at work.

pjfanning commented 3 months ago

I've read the https://github.com/akka/akka/issues/29933 but don't want to look at the PR since the license is not compatible with Apache Pekko. It seems like the issue is that 'tryUnstashOne will, after a few steps, call the onCommand method that in turn will call tryUnstashOne'.

So it seems like we need to make tryUnstashOne non-recursive. We may want to allow a small number of recursive calls but after a certain depth, we will need to use something like sending a message to continue the unstashing. There is a good chance that we will need to wait for a response message that tells us the async unstashing has completed.

Alternatively, we could look at turning this into a loop instead of using recursion.

pjfanning commented 3 months ago

@JacobF7 Can I get some background? I reproduced the issue but with a use case that is not real world. Have you hit this issue in the real world?

I have messed about with some changes but in all the honesty, they tend to break stuff.

If someone else wants to try their own solution, please feel free to have a look. Just a reminder, we can't accept any solutions that rely on examining the Akka changes.

One extra potential solution is to ignore the read only events if there are just too many of them. In my testing, it takes > 1000 such events to cause an issue. Is there a good reason not to limit the number of read only events?

He-Pin commented 3 months ago

@JacobF7 Would you like to provide a reproducer? thanks.

JacobF7 commented 3 months ago

@pjfanning - In our scenario (gambling), we have an actor that receives on peak loads 15,000 requests per second (both read and write events). Unfortunately, in this particular scenario we don't have the option of distributing the load because many users will need to read and write to the same resource (actor). Moreover write events need to be handled 1 by 1 in the same order that they are received in the inbox. From our performance tests, it seems that Pekko Persistence (Event Sourcing in particular, but also Durable State) was not the right fit for this amount of load, so instead we opted to use a regular actor without persistence. Persistence was handled by manually snapshotting the state of the actor periodically.

@He-Pin - As @pjfanning mentioned, the issue can be replicated by increasing the load (for instance submitting 2000 read requests).