ArchipelProject / Archipel

XMPP Based Orchestrator
http://archipelproject.org
GNU Affero General Public License v3.0

agents don't always reconnect after XMPP server is restarted #977

Open · Nowaker opened this issue 10 years ago

Nowaker commented 10 years ago

I shut down ejabberd for a few minutes, then started it with a wrong config, then started it again with a good config. When I logged in via the Client, both my hypervisors were offline. They didn't even attempt to reconnect: nothing appears in the log except the vmcasts feed and some stats refreshes. I had to restart both Archipel agents to get them back.

CyrilPeponnet commented 10 years ago

Works for me. I left ejabberd down for several hours, and when I started it back up, the agents got registered again.

CyrilPeponnet commented 10 years ago

Can you provide some logs? Also, run the agent in debug mode (runarchipel -n) with xmpppy debug enabled and try to reproduce the issue.

For the record: by default, TNArchipelEntity tries to reconnect every 5 seconds, with no maximum number of attempts. The only way to exit the loop is when the user account has been removed from the server.
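The reconnect policy described above can be sketched as follows. This is a minimal, hypothetical illustration, not Archipel's actual code: `reconnect_forever`, `UserRemoved`, and the injectable `connect`/`sleep` callables are assumptions made so the behavior can be exercised without an XMPP server.

```python
import time
from typing import Callable


class UserRemoved(Exception):
    """Hypothetical: stands in for the server reporting that the account is gone."""


def reconnect_forever(connect: Callable[[], None],
                      delay: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `connect` until it succeeds, waiting `delay` seconds between attempts.

    There is no maximum number of attempts. Returns True once connected and
    False if the account was removed, the only condition that ends the loop.
    """
    while True:
        try:
            connect()
            return True
        except UserRemoved:
            # Account no longer exists on the server: give up for good.
            return False
        except Exception:
            # Any other failure (server down, network error): wait and retry.
            sleep(delay)


if __name__ == "__main__":
    attempts = []

    def flaky_connect():
        # Simulated server that is down for the first two attempts.
        attempts.append(1)
        if len(attempts) < 3:
            raise ConnectionError("server down")

    waits = []
    print(reconnect_forever(flaky_connect, sleep=waits.append))  # True
    print(len(attempts), waits)  # 3 [5.0, 5.0]
```

Under this policy, a server that is merely down never ends the loop; only the "account removed" case does, which is exactly the exit path discussed in this issue.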

Nowaker commented 10 years ago

> The only way to exit the loop is when the user account has been removed from the server.

So that's what actually happened.

> Then started it with a wrong config.

The wrong config pointed to a different Mnesia storage location, so there were no users. After I restarted ejabberd with a valid config, no Archipel agent tried to connect again (though they should have).

CyrilPeponnet commented 10 years ago

In fact, ArchipelEntity describes both hypervisors and VMs. When you remove a VM, its user account is unregistered from the server, and then the loop stops (and so does the thread).

Nowaker commented 10 years ago

But this shouldn't happen to the main XMPP connection, should it? I mean the hypervisor's connection.
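The distinction raised here could be expressed as a policy in which a missing account ends the loop only for VM entities, while a hypervisor keeps retrying. This is a hypothetical sketch, not current Archipel behavior: `EntityKind`, `LoopStatus`, and `next_status` are illustrative names, and only the meaning of the statuses (keep running, restart, stop) mirrors the agent's loop.

```python
from enum import Enum, auto


class EntityKind(Enum):
    HYPERVISOR = auto()
    VM = auto()


class LoopStatus(Enum):
    ON = auto()
    RESTART = auto()
    OFF = auto()


def next_status(kind: EntityKind, user_removed: bool) -> LoopStatus:
    """Decide what the loop should do after losing the connection.

    Hypothetical policy: a VM whose account is gone stops (its removal is
    assumed to be intentional), while a hypervisor treats a missing account
    as possibly transient and keeps retrying.
    """
    if user_removed and kind is EntityKind.VM:
        return LoopStatus.OFF
    return LoopStatus.RESTART


if __name__ == "__main__":
    print(next_status(EntityKind.VM, user_removed=True))          # LoopStatus.OFF
    print(next_status(EntityKind.HYPERVISOR, user_removed=True))  # LoopStatus.RESTART
```

With such a split, the scenario reported above (ejabberd started against an empty Mnesia store, then restarted with the right one) would leave hypervisors retrying until their accounts reappear, at the cost of retrying forever if a hypervisor account really was deleted.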

CyrilPeponnet commented 10 years ago

I don't know; it could be a safety measure. Anyway, I need logs to see exactly what's happening (I don't have time to reproduce the scenario right now).

Nowaker commented 10 years ago

These thousands of lines of logs won't tell you more than what you've already said:

> The only way to exit the loop is when the user account has been removed from the server.

Tell me if you really want these logs.

CyrilPeponnet commented 10 years ago

Just to confirm the point, here is what we have in the code:

    # Relies on module-level imports: sys, time, traceback.
    def loop(self):
        """
        This is the main loop of the client.
        """
        # Keep running until something sets the status to OFF.
        while not self.loop_status == ARCHIPEL_XMPP_LOOP_OFF:
            try:
                # The entity (e.g. a deleted VM) asked to unregister its
                # account: do it and leave the loop immediately.
                if self.loop_status == ARCHIPEL_XMPP_LOOP_REMOVE_USER:
                    self.process_inband_unregistration()
                    return
                # Normal operation: process incoming stanzas (3 s timeout).
                if self.loop_status == ARCHIPEL_XMPP_LOOP_ON:
                    if self.xmppclient.isConnected():
                        if hasattr(self, "on_xmpp_loop_tick"):
                            self.on_xmpp_loop_tick()
                        self.xmppclient.Process(3)
                # Reconnection requested: drop any stale connection, reconnect.
                elif self.loop_status == ARCHIPEL_XMPP_LOOP_RESTART:
                    if self.xmppclient.isConnected():
                        self.xmppclient.disconnect()
                    time.sleep(1.0)
                    self.connect()

            except Exception as ex:
                # "User removed" is the only exception that stops the loop.
                if str(ex).upper().find('USER REMOVED') > -1:
                    self.log.info("LOOP EXCEPTION: Account has been removed from server.")
                    self.loop_status = ARCHIPEL_XMPP_LOOP_OFF
                else:
                    # Any other error: log it, wait 5 s, then request a restart.
                    if str(ex).upper().find('SYSTEM-SHUTDOWN') > -1:
                        self.log.warning("LOOP EXCEPTION: The XMPP server has been shut down. Waiting 5 second for reconnection")
                    else:
                        self.log.error("LOOP EXCEPTION : Disconnected from server. Trying to reconnect in 5 seconds.")
                        t, v, tr = sys.exc_info()
                        self.log.error("TRACEBACK: %s" % "\n".join(traceback.format_exception(t, v, tr)))
                    self.loop_status = ARCHIPEL_XMPP_LOOP_RESTART
                    time.sleep(5.0)

        # Loop finished (status OFF): close the connection if still open.
        if self.xmppclient.isConnected():
            self.xmppclient.disconnect()
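The `except` branch in the loop decides the loop's fate purely by matching substrings of the exception message. The following standalone sketch isolates that decision so it can be checked without a server. `classify_loop_exception`, `LOOP_OFF`, and `LOOP_RESTART` are hypothetical names, but the matching mirrors the quoted code: only a message containing `USER REMOVED` (case-insensitive) stops the loop, and everything else, including `SYSTEM-SHUTDOWN`, sets the status to restart.

```python
from typing import Tuple

# Hypothetical stand-ins for ARCHIPEL_XMPP_LOOP_OFF / ARCHIPEL_XMPP_LOOP_RESTART.
LOOP_OFF = "off"
LOOP_RESTART = "restart"


def classify_loop_exception(message: str) -> Tuple[str, str]:
    """Return (next loop status, log level) for an exception message.

    Mirrors the substring checks in the loop's `except` branch.
    """
    upper = message.upper()
    if "USER REMOVED" in upper:
        # The only exception that ends the loop for good.
        return LOOP_OFF, "info"
    if "SYSTEM-SHUTDOWN" in upper:
        # Server shut down cleanly: warn, then retry.
        return LOOP_RESTART, "warning"
    # Any other error: log it with a traceback, then retry.
    return LOOP_RESTART, "error"


if __name__ == "__main__":
    for msg in ("User removed", "system-shutdown", "Connection reset by peer"):
        print(msg, "->", classify_loop_exception(msg))
```

This makes the behavior in this issue easy to see: once any attempt fails with a message containing `USER REMOVED`, the status becomes OFF and the `while` loop exits, so no further reconnection is attempted even if the account later reappears on the server.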