asynkron / protoactor-dotnet

Proto Actor - Ultra fast distributed actors for Go, C# and Java/Kotlin
http://proto.actor
Apache License 2.0

Timeout in GossipReenterAfterSend #2130

Open AqlaSolutions opened 3 months ago

AqlaSolutions commented 3 months ago

Hi, sometimes we see a flood of error messages like this in the log (2-3 per second): Timeout in GossipReenterAfterSend. It may continue for hours, with no other messages in the log preceding it. After an hour of this we also see these messages:

TimeoutException: Request didn't receive any Response within the expected time.
   at Proto.Future.FutureProcess.GetTask(CancellationToken cancellationToken)
   at Proto.SenderContextExtensions.RequestAsync[T](ISenderContext self, PID target, Object message, CancellationToken cancellationToken)
   at Proto.Cluster.Gossip.Gossiper.GetStateEntry(String key)
   at Proto.Cluster.Gossip.Gossiper.BlockGracefullyLeft()
   at Proto.Cluster.Gossip.Gossiper.GossipLoop()
MessageTemplate: "Gossip loop failed"

These continue for hours, possibly until a restart.

We can't reproduce it locally, but it regularly happens in Kubernetes on our staging and production servers. Is there a way to debug this? Any help would be appreciated.

rogeralsing commented 3 months ago

Hi, we recently added a link from the documentation to this article: https://home.robusta.dev/blog/stop-using-cpu-limits

Kubernetes is prone to throttling the CPU in these kinds of systems, which results in timeouts. (The same applies to Orleans or GetEventstore, or anything realtime-ish.)

Could you give that a try and see if it fixes the problem in your case?

AqlaSolutions commented 3 months ago

We may try that, but according to our monitoring there is no high CPU activity at the time of the issue.

rogeralsing commented 3 months ago

Could you also give this a try?

actorSystemConfig = actorSystemConfig with { SharedFutures = false };

And pass that config in when you create the actor system.
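For reference, here is a minimal sketch of how that could look when the system is constructed (this assumes the config is created via ActorSystemConfig.Setup(); adapt it to however you build yours, and chain your usual remote/cluster configuration on top):

using Proto;

// sketch only: disable shared futures on the actor system config
var config = ActorSystemConfig.Setup() with { SharedFutures = false };
var system = new ActorSystem(config);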

The exception you posted above is from the gossip loop; it seems to time out just trying to get the gossip state, which indicates that the gossip actor is deadlocked for some reason. Maybe there is some unknown bug in the shared futures, which are enabled by default.

That specific exception does not look like it could be Kubernetes-related, tbh. I've started investigating this on my side as well.

AqlaSolutions commented 3 months ago

We already use SharedFutures = false. I'm the one who reported the issue with SharedFutures. :)

rogeralsing commented 3 months ago

Ah right. Can you see if you get any of these log messages for the gossip actor:

Actor {Self} deadline {Deadline}, exceeded on message

e.g. search for "$gossip" in the logs

It would be great to know whether the system detects that the gossip actor is timing out on messages.
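If I remember correctly, that deadline message is produced by the developer receive-logging feature, which has to be enabled on the actor system config. The exact API below is from memory, so treat it as an assumption and check it against your Proto.Actor version:

using System;
using Proto;

// assumed API: log a warning whenever an actor takes longer than the given
// deadline to process a single message (this is what produces the
// "Actor {Self} deadline {Deadline}, exceeded on message ..." entries)
var config = ActorSystemConfig.Setup()
    .WithDeveloperReceiveLogging(TimeSpan.FromSeconds(1));
var system = new ActorSystem(config);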

rogeralsing commented 3 months ago

Or any of these

System {Id} - ThreadPool is running hot, ThreadPool latency {ThreadPoolLatency}
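For context, that warning means the .NET ThreadPool took unusually long to start queued work, i.e. the pool is starved or saturated. A rough way to observe the same thing yourself, independent of Proto.Actor's internal monitor, is to time how long a queued work item waits before it actually runs:

using System;
using System.Diagnostics;
using System.Threading.Tasks;

// rough illustration (not Proto.Actor's internal code): measure how long it takes
// the ThreadPool to start executing a freshly queued work item
var sw = Stopwatch.StartNew();
await Task.Run(() => Console.WriteLine($"ThreadPool scheduling latency: {sw.Elapsed}"));
// on a starved pool this latency climbs towards values like the ~1 second reported later in the thread
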
rogeralsing commented 3 months ago

Or this one

GossipActor Failed {MessageType}

AqlaSolutions commented 3 months ago

I've found only this one and only once:

System {Id} - ThreadPool is running hot, ThreadPool latency 00:00:01.0000184

benbenwilde commented 2 months ago

@AqlaSolutions I believe https://github.com/asynkron/protoactor-dotnet/pull/2133 may fix the underlying issue that can cause a lot of Timeout in GossipReenterAfterSend and Gossip loop failed errors. It's available in the latest 1.6.1-alpha.0.25, hopefully you see an improvement.

AqlaSolutions commented 2 months ago

Great news!