AqlaSolutions opened this issue 3 months ago
Hi, we recently added a link from the documentation to this article: https://home.robusta.dev/blog/stop-using-cpu-limits
Kubernetes is prone to throttle the CPU in this kind of system, which results in timeouts. (The same applies to Orleans or GetEventstore as well, or anything realtime-ish.)
Could you give that a try and see if this fixes the problems in your case?
We may try, but according to our monitoring there is no high CPU activity at the time of the issue.
Could you also give this a try?
actorSystemConfig = actorSystemConfig with { SharedFutures = false };
And pass the resulting config in when creating the actor system.
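For reference, a minimal sketch of how that could be wired up, assuming the config is built with ActorSystemConfig.Setup() and handed straight to the ActorSystem constructor (adapt to however you construct yours):

```csharp
using Proto;

// Sketch only: build the actor system config with shared futures disabled
// and pass the result to the ActorSystem constructor.
var actorSystemConfig = ActorSystemConfig.Setup() with { SharedFutures = false };
var system = new ActorSystem(actorSystemConfig);
```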
The exception you linked above is from the gossip loop, and it seems to be timing out when just trying to get the gossip state, which indicates that the gossip actor is deadlocked for some reason. Maybe there is some unknown bug in the shared futures, which are enabled by default.
That specific exception does not look like it could be Kubernetes-related, tbh. I've started investigating this on my side as well.
We already use SharedFutures = false. (I am the one who reported the issue with SharedFutures.)
Ah right. Can you see if you get any of these log messages for the gossip actor?
Actor {Self} deadline {Deadline}, exceeded on message
e.g. search for "$gossip" in the logs.
It would be great to know whether the system detects that the gossip actor is timing out on messages.
Or any of these
System {Id} - ThreadPool is running hot, ThreadPool latency {ThreadPoolLatency}
Or this one
GossipActor Failed {MessageType}
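If none of those show up at all, it may simply be that Proto's internal loggers aren't wired to your logging sink. A minimal sketch, assuming Microsoft.Extensions.Logging with the console provider (provider and log level are just for illustration):

```csharp
using Microsoft.Extensions.Logging;
using Proto;

// Sketch only: route Proto.Actor's internal log output (gossip actor,
// thread pool stats, deadline warnings) to the console so the messages
// above are not silently dropped. Call this before creating the ActorSystem.
var loggerFactory = LoggerFactory.Create(builder =>
    builder.AddConsole().SetMinimumLevel(LogLevel.Information));
Log.SetLoggerFactory(loggerFactory);
```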
I've found only this one and only once:
System {Id} - ThreadPool is running hot, ThreadPool latency 00:00:01.0000184
@AqlaSolutions I believe https://github.com/asynkron/protoactor-dotnet/pull/2133 may fix the underlying issue that can cause a lot of Timeout in GossipReenterAfterSend and Gossip loop failed errors. It's available in the latest 1.6.1-alpha.0.25; hopefully you see an improvement.
Great news!
Hi, sometimes we see a spam of error messages like this in the log (2-3 per second):
Timeout in GossipReenterAfterSend
It may continue for hours, and there are no other messages in the log preceding it. After an hour of this we also start seeing additional messages, which continue for hours, possibly until a restart.
We can't reproduce it locally, but it regularly happens in Kubernetes on our staging and prod servers. Is there a way to debug this? Any help?