Add "debug" setting to Akka.Remote and Akka.Cluster config

Aaronontheweb commented 9 years ago

Had a great suggestion from a training attendee yesterday: add a "debug = on" setting to both Akka.Remote and Akka.Cluster that disables their failure detectors, or gives them an indefinitely long "missed heartbeat" window.

The idea is to stop disassociations from happening if you need to set a breakpoint while debugging an Akka.Remote or Akka.Cluster application. Right now, with the default settings, hitting a breakpoint will trigger a heartbeat failure, which makes it somewhat frustrating to resume debugging.

The way I'd go about implementing this is creating a "debug" configuration for all of the failure detector settings inside the built-in HOCON confs for both modules. And if you wanted to be able to debug your application without disassociations you could just add the following to your app.config:

akka.remote.watch-failure-detector = debug-failure-detector
akka.remote.transport-failure-detector = debug-failure-detector

Running in production with these settings is obviously a terrible idea, but I think we can trust our users not to hang themselves with the rope we give them.
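
Until a preset like debug-failure-detector exists, a similar effect can probably be approximated by widening the heartbeat window on the two existing Akka.Remote failure detector sections. This is a minimal sketch, not the proposed feature: it assumes the acceptable-heartbeat-pause key is available under both sections in your Akka.NET version, and the 10-minute value is an arbitrary choice for illustration.

```hocon
# Debug-only overrides (sketch). Do not ship this to production:
# a node that has actually died would go unnoticed for the whole window.
akka.remote {
  transport-failure-detector {
    # How long missed transport heartbeats are tolerated before
    # the association is considered failed. 10 m is an assumed value.
    acceptable-heartbeat-pause = 10 m
  }
  watch-failure-detector {
    # Same idea for remote death watch (Context.Watch across nodes).
    acceptable-heartbeat-pause = 10 m
  }
}
```

The overrides most likely need to be present on every node, not only the one being debugged, because it is the peers of the paused process that stop receiving its heartbeats.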

Thoughts on this?

kekekeks commented 9 years ago

I think it should be automagically enabled if System.Diagnostics.Debugger.IsAttached is true

Aaronontheweb commented 9 years ago

@kekekeks yeah, I had that thought too. We could automatically inject that setting inside the RemoteActorRefProvider when it populates RemotingSettings

Aaronontheweb commented 9 years ago

@kekekeks would you be interested in submitting a PR for this feature? Otherwise I can mark it as "up for grabs"

Aaronontheweb commented 9 years ago

Renewed interest in this - would make debugging Akka.Cluster and Akka.Remote much less frustrating

maxim-s commented 8 years ago

I have something like that for debugging, I can add these settings

0x53A commented 7 years ago

This feature would be great to have - either automatically with Debugger.IsAttached or an explicit config change

0x53A commented 7 years ago

> I have something like that for debugging, I can add these settings

Hi @maxim-s, do you have any workarounds for easing debugging an Akka.Remote system?

Basically we want to be able to debug-break any node in the system for up to ~10 minutes, without any adverse effects.

(I realize that without any adverse effects is not possible, but we can live with the fact that actually dead nodes will only be detected as dead after 10 minutes. This will only be active in DEBUG configuration, anyway.)

In general the whole actor system is low-traffic. We are seeing issues, I think, if the sender is debug-broken while a message is in flight.

The log on the receiving end is

[ERROR][22.05.2017 12:33:43][Thread 0011][[akka://LRIEGER-10-neg-nemetschek-de-34316/system/transports/akkaprotocolmanager.tcp.0/akkaProtocol-tcp%3A%2F%2FLRIEGER-10-neg-nemetschek-de-34316%40%5B%3A%3Affff%3A192.168.175.71%5D%3A54904-4#509171535]] No response from remote. Handshake timed out or transport failure detector triggered.
Cause: Unknown
[ERROR][22.05.2017 12:33:43][Thread 0011][[akka://LRIEGER-10-neg-nemetschek-de-34316/system/transports/akkaprotocolmanager.tcp.0/akkaProtocol-tcp%3A%2F%2FLRIEGER-10-neg-nemetschek-de-11244%40lrieger-10.neg.nemetschek.de%3A54650-6#1783575276]] No response from remote. Handshake timed out or transport failure detector triggered.
Cause: Unknown
[WARNING][22.05.2017 12:33:43][Thread 0009][[akka://LRIEGER-10-neg-nemetschek-de-34316/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FLRIEGER-10-neg-nemetschek-de-11244%40lrieger-10.neg.nemetschek.de%3A54650-6#1659599239]] Association with remote system akka.tcp://LRIEGER-10-neg-nemetschek-de-11244@lrieger-10.neg.nemetschek.de:54650 has failed; address is now gated for 5000 ms. Reason is: [Akka.Remote.EndpointDisassociatedException: Disassociated
   at Akka.Remote.EndpointWriter.PublishAndThrow(Exception reason, LogLevel level, Boolean needToThrow)
   at Akka.Actor.ReceiveActor.ExecutePartialMessageHandler(Object message, PartialAction`1 partialAction)
   at Akka.Actor.ActorCell.<>c__DisplayClass112_0.<Akka.Actor.IUntypedActorContext.Become>b__0(Object m)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.AutoReceiveMessage(Envelope envelope)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Akka.Actor.ActorCell.HandleFailed(Failed f)
   at Akka.Actor.ActorCell.SysMsgInvokeAll(EarliestFirstSystemMessageList messages, Int32 currentState)]

This message is then lost and never automatically re-transmitted; if it was an Ask, it will time out.

Is there an easy way to increase all timeouts? I have already increased the Ask timeout to 10 minutes, which solves another class of problems, but has the unfortunate side effect that if a message is lost, as in this case, the sender is practically deadlocked.

My general question is: What configuration settings do you use for debugging?

My specific question is: Which timeouts do I need to increase, so that an in-flight message won't be lost if the sender is stopped for a few minutes and then resumes execution?

ddobric commented 5 years ago

Any suggestions here? I have the same issue.

chipdice commented 5 years ago

I'd be very interested to know how others are doing this as well. The debugging experience when working in a cluster could definitely be improved.

ismaelhamed commented 5 years ago

Even though you may have partial success with configuring the failure detector different during debugging, in general a classical debugger is just not applicable to a distributed application, you cannot really “stop the world”. Debugging means tracing and logging in these cases, for the core parts we actually use println but you can rely upon Actors doing what they should --Roland Kuhn

IMO this piece becomes self evident over time.

Ralf1108 commented 5 years ago

Automatic injection of the debug flag via "System.Diagnostics.Debugger.IsAttached" would not cover cases when you attach to an akka process after it started. So an explicit option would be desirable.

oofpez commented 5 years ago

What is the best workaround for now? Would love to make my cluster debugging experience less painful somehow.
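
For Akka.Cluster specifically, one workaround that may help is to relax the cluster failure detector in a debug-only config. This is a sketch under assumptions rather than a recommended setup: it relies on the acceptable-heartbeat-pause key under akka.cluster.failure-detector, and the 10-minute window is an illustrative value, not something taken from this thread's maintainers.

```hocon
# Debug-only cluster override (sketch). With this in place, peers wait
# much longer before marking a silent node as unreachable, so a process
# paused at a breakpoint is less likely to be flagged.
akka.cluster {
  failure-detector {
    # Assumed value; pick whatever covers your longest breakpoint pause.
    acceptable-heartbeat-pause = 10 m
  }
}
```

The trade-off is the one already noted in this thread: a node that has really crashed will also take that long to be detected, so this belongs only in a DEBUG configuration.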

bournes commented 3 years ago

I also want to know！！！！！

markusschaber commented 3 years ago

> Automatic injection of the debug flag via "System.Diagnostics.Debugger.IsAttached" would not cover cases when you attach to an akka process after it started.

The flag could be re-checked periodically. Maybe only on debug builds, not release builds, or depending on a configuration setting or environment variable.

A debug-activated process should gossip this to the other nodes, so they ignore timeouts on the heartbeats. Also, calls like Ask() could internally override the timeout when the target is on a debugged host.

However, when the other nodes lose the TCP connection to the process (which indicates it has been killed in the debugger), normal processing should continue.

akkadotnet / akka.net

Add "debug" setting to Akka.Remote and Akka.Cluster config #1165