asynkron / protoactor-kotlin

Ultra-fast distributed cross-platform actor framework
http://proto.actor
Apache License 2.0
221 stars 25 forks source link

Mailbox schedule() called in endless loop without any messages #35

Open james-cobb opened 6 years ago

james-cobb commented 6 years ago

Intermittently (hard to reproduce reliably) we see CPU spinning at 50% or 100% (one or two cores), with profiling showing that schedule() is being called repeatedly, even though no messages are being processed by actors.

I had thought the problem only related to bounded queues. But now think I've seen it with unbounded queues too. Would have to rerun a lot of tests to check this for sure.

Looking at DefaultMailbox - the system message queue has an atomic counter tracking whether there are messages to be polled.

On a hunch I added a similar atomic counter to check if there are messages in the user queue as well, and used the two atomic counters as the check if schedule() should be called. So replacing the isNotEmpty checks here:

https://github.com/AsynkronIT/protoactor-kotlin/blob/ca94fb9a2f18e5738c0d01d8f9749caee29293b1/proto-mailbox/src/main/kotlin/actor/proto/mailbox/DefaultMailbox.kt#L71

with this check:

if (sysCount.get() > 0 || (!suspended && userCount.get() > 0)) {

This seems to have done the trick - we no longer see CPU spinning.

If the above count check fails, I'm then doing the original (isNotEmpty) check, and if that check is true I'm logging. This shows there may be async access issues. Having checked that at least one of the queues isNotEmpty, the logging then shows that both of the queues are indeed empty. Can there be multiple threads removing from the queues? I thought consumer was only single threaded?

@rogeralsing What was the reason for adding the atomic sysCount in DefaultMailbox in the 'queues' commit? Was it to solve a similar issue that I'm seeing here? The need for the atomic counter might suggest maybe there are multiple threads consuming the queues?

rogeralsing commented 6 years ago

Very late answer here, but IIRC, it was a perf optimization. reading sysCount from an atomic was simply faster than asking the queue if there were any messages in it. As we are in a pure single consumer scenario, this value should be correct