Closed ForNeVeR closed 8 years ago
By the way, the initial error itself was linked with some weirdest shit about CTOR and my Windows TCP stack. I wasn't been able to connect to XMPP either. Fixed by rebooting the machine.
The real questions is why does the log features @mikelsv? Did he hack your machines?
Nope, I am actually the owner of mikelsv@conference.jabber.ru room :)
It's a special room to test Horta there.
Ah, forgot about this one. Very well then. :)
I wasn't able to reproduce this problem throwing exceptions from random lines of JabberProtocol
class. Have no idea what to do now.
I suggest to close this incident, but keep an eye on anything similar in the future. Also we may establish some sorta practice. When a new incident is reported we review all of the previous ones (opened and closed) to find if that happened before and connect them accordingly. That will help us to gather more information over time and eventually fix the root cause.
Agreed.
Red alert! We're having this again right now! Logs are saved to investigate; I'll try to reproduce.
According to my current investigation, we have a race condition between those lines: https://github.com/codingteam/horta-hell/blob/094d0a4db1919ea64311d6386441b610a0669523/src/main/scala/ru/org/codingteam/horta/protocol/jabber/JabberProtocol.scala#L67-L85
Sometimes the new room join (L58-59) starts before the old actor has been stopped (L85). Note that context.stop
is asynchronous and have no guarantees to stop the actor immediately. It seems that sometimes it takes more than rejoinInterval
to stop the actor, so JabberProtocol
tries to start the new actor before the old actor has been completely stopped.
It seemed to me that we could just ignore old actors and remove the second argument from context.actorOf
, but then I have the new problem with java.net.SocketTimeoutException: Read timed out
. It seems that the server currently hates horta. We need to implement adaptive reconnecting timeout for both server reconnecting and room reconnecting.
So, well, probably we're overusing smack and have some threading issues with it. Let's put a pair of crutches to make it work for now and reduce the number of the rooms for the production horta instance. My first priority task for the weekend will be reimplementation of JabberProtocol
in a more library-independent (and thread-safe if the library needs that) manner. Later we'll maybe migrate to some another library.
Today I got that message repeated for every conference across the log file: