ConPaaS-team / conpaas

ConPaaS: integrated runtime environment for elastic cloud applications
http://www.conpaas.eu
BSD 3-Clause "New" or "Revised" License

xtreemfs service has inconsistent information about the list of nodes #64

Open FrancoCaffarraAndEsterDiBello opened 9 years ago

FrancoCaffarraAndEsterDiBello commented 9 years ago

Good morning ConPaaS team. While analysing the log, we noticed that the xtreemfs service has a problem with its list of nodes. The scenario is as follows:

  1. we try to start up the service (agent 967 is created)
  2. the startup fails and agent node 967 is killed
  3. we retry the startup (agent 968 is created)
  4. the startup fails and agent node 968 is killed

but the manager still refers to 967, the machine that was killed shortly before. Here are the relevant lines of the log:

2014-09-17 13:26:13,337 DEBUG conpaas.core.controller attach_volume(node=conpaas.core.manager.node, volume=, device=vda)
2014-09-17 13:26:13,449 DEBUG conpaas.core.controller [delete_nodes]: killing iaas967
2014-09-17 13:26:13,449 DEBUG ReservationTimer RTIMER removed node iaas967, updated list []
2014-09-17 13:26:13,449 DEBUG ReservationTimer RTIMER Stopping timer for []
2014-09-17 13:26:13,449 DEBUG conpaas.core.clouds.base kill_instance(node=ServiceNode(id=iaas967, ip=172.16.116.19))
2014-09-17 13:26:13,454 ERROR conpaas.core.manager do_startup: Failed to request a new node
Traceback (most recent call last):
....

2014-09-17 13:28:15,139 DEBUG conpaas.core.manager _start_dir([ServiceNode(id=iaas967, ip=172.16.116.19), ServiceNode(id=iaas968, ip=172.16.116.20)])
2014-09-17 13:28:15,139 DEBUG conpaas.core.manager iaas967 already has a uuid (dir) -> 2d9d436e-3e6e-11e4-9df1-0200ac107412
2014-09-17 13:28:27,140 DEBUG conpaas.core.controller [delete_nodes]: killing iaas968
2014-09-17 13:28:27,140 DEBUG ReservationTimer RTIMER removed node iaas968, updated list []
2014-09-17 13:28:27,141 DEBUG ReservationTimer RTIMER Stopping timer for []
2014-09-17 13:28:27,141 DEBUG conpaas.core.clouds.base kill_instance(node=ServiceNode(id=iaas968, ip=172.16.116.20))
2014-09-17 13:28:27,169 ERROR conpaas.core.manager do_startup: Failed to request a new node
Traceback (most recent call last):
  File "/root/ConPaaS/src/conpaas/services/xtreemfs/manager/manager.py", line 289, in _do_startup
    self._start_dir(self.dirNodes)
  File "/root/ConPaaS/src/conpaas/services/xtreemfs/manager/manager.py", line 153, in _start_dir
    client.createDIR(node.ip, 5555, dir_uuid)
  File "/root/ConPaaS/src/conpaas/services/xtreemfs/agent/client.py", line 41, in createDIR
    return _check(https.client.jsonrpc_post(host, port, '/', method, params=params))
  File "/root/ConPaaS/src/conpaas/core/https/client.py", line 407, in jsonrpc_post
    h.endheaders()
  File "/usr/lib/python2.6/httplib.py", line 908, in endheaders
    self._send_output()
  File "/usr/lib/python2.6/httplib.py", line 780, in _send_output
    self.send(msg)
  File "/usr/lib/python2.6/httplib.py", line 739, in send
    self.connect()
  File "/root/ConPaaS/src/conpaas/core/https/client.py", line 94, in connect
    self.sock.connect((self.host, self.port))
  File "<string>", line 1, in connect
error: [Errno 113] No route to host

noma commented 9 years ago

@FrancoCaffarraAndEsterDiBello Please attach the full log including the omitted traceback and the lines before the first line you posted (if there are any).

From what I see, it looks like a create_node() call in the core should have thrown an exception indicating that node creation failed, but did not. As a result, the code after the create_node() call is executed: it adds the just-deleted node to the manager's data structures and tries to start services on a non-existent VM. This finally fails when the manager tries to communicate with the VM.
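
The suspected failure mode can be illustrated with a minimal, self-contained sketch. It is not the ConPaaS code: `NodeCreationError`, `create_node_silent`, `create_node_strict`, and `do_startup` are hypothetical names made up for this example, and the node is reduced to a plain string. The sketch only shows why a create call that kills the VM without raising could leave a stale entry in the manager's node list, and how raising avoids it.

```python
class NodeCreationError(Exception):
    """Hypothetical exception: the VM could not be brought up."""


def create_node_silent(killed):
    """Assumed buggy behaviour: the VM is killed after a failure,
    but the node is still returned as if creation had succeeded."""
    node = "iaas967"
    killed.add(node)   # VM is gone ...
    return node        # ... yet the caller is not told


def create_node_strict(killed):
    """Assumed fixed behaviour: the failure is reported to the caller."""
    node = "iaas967"
    killed.add(node)
    raise NodeCreationError("failed to create %s" % node)


def do_startup(create_node, dir_nodes, killed):
    """Stand-in for the manager's startup: the node is recorded
    only if create_node returns without raising."""
    try:
        node = create_node(killed)
    except NodeCreationError:
        return False        # nothing was added to dir_nodes
    dir_nodes.append(node)  # reached even for a dead node in the silent case
    return True
```

Calling `do_startup(create_node_silent, nodes, killed)` leaves `iaas967` in `nodes` although it is also in `killed`, which matches the stale entry seen in the second `_start_dir` log line. With `create_node_strict`, the list stays empty.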