Open Aaronontheweb opened 9 years ago
I think it should be automagically enabled if System.Diagnostics.Debugger.IsAttached is true
@kekekeks yeah, I had that thought too. We could automatically inject that setting inside the RemoteActorRefProvider when it populates RemotingSettings
@kekekeks would you be interested in submitting a PR for this feature? Otherwise I can mark it as "up for grabs"
Renewed interest in this - would make debugging Akka.Cluster and Akka.Remote much less frustrating
I have something like that for debugging, I can add these settings
This feature would be great to have - either automatically with Debugger.IsAttached or an explicit config change
Hi @maxim-s, do you have any workarounds for easing debugging of an Akka.Remote system?
Basically we want to be able to debug-break any node in the system for up to ~10 minutes, without any adverse effects.
(I realize that without any adverse effects is not possible, but we can live with the fact that actually dead nodes will only be detected as dead after 10 minutes. This will only be active in DEBUG configuration, anyway.)
In general the whole actor system is low-traffic. We are seeing issues, I think, if the sender is debug-broken while a message is in flight.
The log on the receiving end is
[ERROR][22.05.2017 12:33:43][Thread 0011][[akka://LRIEGER-10-neg-nemetschek-de-34316/system/transports/akkaprotocolmanager.tcp.0/akkaProtocol-tcp%3A%2F%2FLRIEGER-10-neg-nemetschek-de-34316%40%5B%3A%3Affff%3A192.168.175.71%5D%3A54904-4#509171535]] No response from remote. Handshake timed out or transport failure detector triggered.
Cause: Unknown
[ERROR][22.05.2017 12:33:43][Thread 0011][[akka://LRIEGER-10-neg-nemetschek-de-34316/system/transports/akkaprotocolmanager.tcp.0/akkaProtocol-tcp%3A%2F%2FLRIEGER-10-neg-nemetschek-de-11244%40lrieger-10.neg.nemetschek.de%3A54650-6#1783575276]] No response from remote. Handshake timed out or transport failure detector triggered.
Cause: Unknown
[WARNING][22.05.2017 12:33:43][Thread 0009][[akka://LRIEGER-10-neg-nemetschek-de-34316/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FLRIEGER-10-neg-nemetschek-de-11244%40lrieger-10.neg.nemetschek.de%3A54650-6#1659599239]] Association with remote system akka.tcp://LRIEGER-10-neg-nemetschek-de-11244@lrieger-10.neg.nemetschek.de:54650 has failed; address is now gated for 5000 ms. Reason is: [Akka.Remote.EndpointDisassociatedException: Disassociated
at Akka.Remote.EndpointWriter.PublishAndThrow(Exception reason, LogLevel level, Boolean needToThrow)
at Akka.Actor.ReceiveActor.ExecutePartialMessageHandler(Object message, PartialAction`1 partialAction)
at Akka.Actor.ActorCell.<>c__DisplayClass112_0.<Akka.Actor.IUntypedActorContext.Become>b__0(Object m)
at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
at Akka.Actor.ActorCell.ReceiveMessage(Object message)
at Akka.Actor.ActorCell.AutoReceiveMessage(Envelope envelope)
at Akka.Actor.ActorCell.Invoke(Envelope envelope)
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Akka.Actor.ActorCell.HandleFailed(Failed f)
at Akka.Actor.ActorCell.SysMsgInvokeAll(EarliestFirstSystemMessageList messages, Int32 currentState)]
This message is then lost and never automatically re-transmitted; if it was an Ask, it will time out.
Is there an easy way to increase all timeouts? I have already increased the Ask timeout to 10 Minutes, which solves another class of problems, but has the unfortunate side-effect that if a message is lost, as in this case, the sender is practically deadlocked.
My general question is: What configuration settings do you use for debugging?
My specific question is: Which timeouts do I need to increase, so that an in-flight message won't be lost if the sender is stopped for a few minutes and then resumes execution?
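One possible starting point, offered as a sketch rather than a verified fix: Akka.Remote configures two failure detectors in HOCON, and raising acceptable-heartbeat-pause on both gives a paused node more time before its peers treat it as unreachable. The 600s value is an assumption chosen to match the ~10 minute window described above, and nothing here guarantees that a message in flight survives the pause.

```hocon
# Debug-only overrides (assumed values); do not ship these to production.
akka.remote {
  # Detector for the transport-level association between two systems.
  transport-failure-detector {
    acceptable-heartbeat-pause = 600s
  }
  # Detector behind remote death watch (Context.Watch on remote actors).
  watch-failure-detector {
    acceptable-heartbeat-pause = 600s
  }
}
```

Keeping these overrides in a separate fragment that is only loaded in DEBUG builds limits the risk of dead nodes going unnoticed elsewhere.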
Any suggestions here? I have the same issue.
I'd be very interested to know how others are doing this as well. The debugging experience when working in a cluster could definitely be improved
Even though you may have partial success with configuring the failure detector different during debugging, in general a classical debugger is just not applicable to a distributed application, you cannot really “stop the world”. Debugging means tracing and logging in these cases, for the core parts we actually use println but you can rely upon Actors doing what they should --Roland Kuhn
IMO this piece becomes self-evident over time.
Automatic injection of the debug flag via "System.Diagnostics.Debugger.IsAttached" would not cover cases when you attach to an akka process after it started. So an explicit option would be desirable.
What is the best workaround for now? Would love to make my cluster debugging experience less painful somehow.
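For the cluster side, a comparable sketch under the same caveats: Akka.Cluster has its own heartbeat failure detector, separate from the ones in Akka.Remote, and its acceptable-heartbeat-pause can be raised so that a node stopped at a breakpoint is not marked unreachable as quickly. The 600s value is an assumption, not a recommended setting, and whether it is sufficient for a given cluster is untested here.

```hocon
# Debug-only override (assumed value); do not ship this to production.
akka.cluster {
  failure-detector {
    # How long heartbeats from a node may be missed before the node
    # is considered unreachable by its peers.
    acceptable-heartbeat-pause = 600s
  }
}
```

The trade-off is the one already noted in this thread: a node that has actually died will also go undetected for the same period.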
I also want to know!!!!!
Automatic injection of the debug flag via "System.Diagnostics.Debugger.IsAttached" would not cover cases when you attach to an akka process after it started.
The flag could be re-checked periodically. Maybe only on debug builds, not release builds, or depending on a configuration setting or environment variable.
A debug-activated process should gossip this to the other nodes, so they ignore timeouts on the heartbeats. Also, calls like Ask() could internally override the timeout when the target is on a debugged host.
However, when the other nodes lose the TCP connection to the process (which indicates it has been killed in the debugger), normal processing should continue.
Had a great suggestion from a training attendee yesterday that we add a "debug = on" setting inside both Akka.Remote and Akka.Cluster that disables their failure detectors, or gives them an indefinitely long "missed heartbeat" window.
The idea is to stop disassociations from happening if you need to set a breakpoint while debugging an Akka.Remote or Akka.Cluster application, because right now with the default settings that will trigger a heartbeat failure and make it somewhat frustrating to resume debugging.
The way I'd go about implementing this is creating a "debug" configuration for all of the failure detector settings inside the built-in HOCON confs for both modules. And if you wanted to be able to debug your application without disassociations you could just add the following to your app.config:
Running in production with these settings is obviously a terrible idea, but I think we can trust our users not to hang themselves with the rope we give them.
Thoughts on this?