Sometimes actors aren't disposed on error

ForNeVeR commented 8 years ago

Today I got that message repeated for every conference across the log file:

10:17:15.261 [CodingteamSystem-akka.actor.default-dispatcher-23734] ERROR akka.actor.OneForOneStrategy - actor name [mikelsv@conference.jabber.ru] is not unique!
akka.actor.InvalidActorNameException: actor name [mikelsv@conference.jabber.ru] is not unique!
        at akka.actor.dungeon.ChildrenContainer$TerminatingChildrenContainer.reserve(ChildrenContainer.scala:192) ~[horta-hell.jar:0.17.0]
        at akka.actor.dungeon.Children$class.reserveChild(Children.scala:77) ~[horta-hell.jar:0.17.0]
        at akka.actor.ActorCell.reserveChild(ActorCell.scala:369) [horta-hell.jar:0.17.0]
        at akka.actor.dungeon.Children$class.makeChild(Children.scala:202) ~[horta-hell.jar:0.17.0]
        at akka.actor.dungeon.Children$class.actorOf(Children.scala:38) ~[horta-hell.jar:0.17.0]
        at akka.actor.ActorCell.actorOf(ActorCell.scala:369) [horta-hell.jar:0.17.0]
        at ru.org.codingteam.horta.protocol.jabber.JabberProtocol$$anonfun$receive$1.applyOrElse(JabberProtocol.scala:58) ~[horta-hell.jar:0.17.0]
        at akka.actor.Actor$class.aroundReceive(Actor.scala:465) ~[horta-hell.jar:0.17.0]
        at ru.org.codingteam.horta.protocol.jabber.JabberProtocol.aroundReceive(JabberProtocol.scala:19) ~[horta-hell.jar:0.17.0]
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) [horta-hell.jar:0.17.0]
        at akka.actor.ActorCell.invoke(ActorCell.scala:487) [horta-hell.jar:0.17.0]
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) [horta-hell.jar:0.17.0]
        at akka.dispatch.Mailbox.run(Mailbox.scala:220) [horta-hell.jar:0.17.0]
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) [horta-hell.jar:0.17.0]
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [horta-hell.jar:0.17.0]
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [horta-hell.jar:0.17.0]
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [horta-hell.jar:0.17.0]
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [horta-hell.jar:0.17.0]

ForNeVeR commented 8 years ago

By the way, the initial error itself was linked with some weirdest shit about CTOR and my Windows TCP stack. I wasn't been able to connect to XMPP either. Fixed by rebooting the machine.

rexim commented 8 years ago

The real questions is why does the log features @mikelsv? Did he hack your machines?

ForNeVeR commented 8 years ago

Nope, I am actually the owner of mikelsv@conference.jabber.ru room :)

It's a special room to test Horta there.

rexim commented 8 years ago

Ah, forgot about this one. Very well then. :)

ForNeVeR commented 8 years ago

I wasn't able to reproduce this problem throwing exceptions from random lines of JabberProtocol class. Have no idea what to do now.

rexim commented 8 years ago

I suggest to close this incident, but keep an eye on anything similar in the future. Also we may establish some sorta practice. When a new incident is reported we review all of the previous ones (opened and closed) to find if that happened before and connect them accordingly. That will help us to gather more information over time and eventually fix the root cause.

ForNeVeR commented 8 years ago

Agreed.

ForNeVeR commented 8 years ago

Red alert! We're having this again right now! Logs are saved to investigate; I'll try to reproduce.

ForNeVeR commented 8 years ago

According to my current investigation, we have a race condition between those lines: https://github.com/codingteam/horta-hell/blob/094d0a4db1919ea64311d6386441b610a0669523/src/main/scala/ru/org/codingteam/horta/protocol/jabber/JabberProtocol.scala#L67-L85

Sometimes the new room join (L58-59) starts before the old actor has been stopped (L85). Note that context.stop is asynchronous and have no guarantees to stop the actor immediately. It seems that sometimes it takes more than rejoinInterval to stop the actor, so JabberProtocol tries to start the new actor before the old actor has been completely stopped.

It seemed to me that we could just ignore old actors and remove the second argument from context.actorOf, but then I have the new problem with java.net.SocketTimeoutException: Read timed out. It seems that the server currently hates horta. We need to implement adaptive reconnecting timeout for both server reconnecting and room reconnecting.

ForNeVeR commented 8 years ago

So, well, probably we're overusing smack and have some threading issues with it. Let's put a pair of crutches to make it work for now and reduce the number of the rooms for the production horta instance. My first priority task for the weekend will be reimplementation of JabberProtocol in a more library-independent (and thread-safe if the library needs that) manner. Later we'll maybe migrate to some another library.

codingteam / horta-hell

Sometimes actors aren't disposed on error #407