cea-hpc / shine

Lustre administration tool
GNU General Public License v2.0

Stack trace when StartTarget action fails #190

Closed degremont closed 7 years ago

degremont commented 9 years ago

On shine start, if the start action fails, shine crashes and the following stack trace is displayed:

# shine start -f fsiscsi -t mdt -L
Status of fsiscsi-MDT0000 (/dev/disk/by-id/scsi-1IET_00030001) ...
Status of fsiscsi-MDT0000 (/dev/disk/by-id/scsi-1IET_00030001) succeeded
Load module 'lustre' ...
Load module 'lustre': 'lustre' is already loaded
Load module 'ldiskfs' ...
Load module 'ldiskfs': 'ldiskfs' is already loaded
Start of fsiscsi-MDT0000 (/dev/disk/by-id/scsi-1IET_00030001) ...
Start of fsiscsi-MDT0000 (/dev/disk/by-id/scsi-1IET_00030001) failed
>> mount.lustre: mount /dev/sdb at /mnt/fsiscsi/mdt/0 failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
Traceback (most recent call last):
  File "/usr/sbin/shine", line 34, in <module>
    sys.exit(Controller().run_command())
  File "/usr/lib/python2.6/site-packages/Shine/Controller.py", line 254, in run_command
    rc = command.filter_rc(command.execute())
  File "/usr/lib/python2.6/site-packages/Shine/Commands/Base/FSLiveCommand.py", line 110, in execute
    result = max(result, self.execute_fs(fs, fs_conf, eh, vlevel))
  File "/usr/lib/python2.6/site-packages/Shine/Commands/Start.py", line 91, in execute_fs
    tunings=Tune.get_tuning(fs_conf, fs.components))
  File "/usr/lib/python2.6/site-packages/Shine/Lustre/FileSystem.py", line 531, in start
    self._run_actions()
  File "/usr/lib/python2.6/site-packages/Shine/Lustre/FileSystem.py", line 260, in _run_actions
    task_self().resume()
  File "/usr/lib/python2.6/site-packages/ClusterShell/Task.py", line 809, in resume
    self._resume()
  File "/usr/lib/python2.6/site-packages/ClusterShell/Task.py", line 772, in _resume
    self._run(self.timeout)
  File "/usr/lib/python2.6/site-packages/ClusterShell/Task.py", line 411, in _run
    self._engine.run(timeout)
  File "/usr/lib/python2.6/site-packages/ClusterShell/Engine/Engine.py", line 677, in run
    self.runloop(timeout)
  File "/usr/lib/python2.6/site-packages/ClusterShell/Engine/EPoll.py", line 178, in runloop
    self.remove_stream(client, stream)
  File "/usr/lib/python2.6/site-packages/ClusterShell/Engine/Engine.py", line 471, in remove_stream
    self.remove(client)
  File "/usr/lib/python2.6/site-packages/ClusterShell/Engine/Engine.py", line 454, in remove
    self._remove(client, abort, did_timeout)
  File "/usr/lib/python2.6/site-packages/ClusterShell/Engine/Engine.py", line 442, in _remove
    client._close(abort=abort, timeout=did_timeout)
  File "/usr/lib/python2.6/site-packages/ClusterShell/Worker/Popen.py", line 104, in _close
    self.worker.eh.ev_close(self.worker)
  File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 425, in ev_close
    self.set_status(ACT_ERROR)
  File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 156, in set_status
    action.launch()
  File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 218, in launch
    self._launch()
  File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 272, in _launch
    if not self._graph_ok(self._members):
  File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 188, in _graph_ok
    self.set_status(ACT_ERROR)
  File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 156, in set_status
    action.launch()
  File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 218, in launch
    self._launch()
  File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 272, in _launch
    if not self._graph_ok(self._members):
  File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 177, in _graph_ok
    dep.launch()
  File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 214, in launch
    if not self._graph_ok(self.deps):
  File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 188, in _graph_ok
    self.set_status(ACT_ERROR)
  File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Tune.py", line 109, in set_status
    self._server.action_event(self, 'failed', result)
  File "/usr/lib/python2.6/site-packages/Shine/Lustre/Server.py", line 184, in action_event
    self._del_action(act.NAME)
  File "/usr/lib/python2.6/site-packages/Shine/Lustre/Server.py", line 167, in _del_action
    self._running_actions.remove(act)
ValueError: list.remove(x): x not in list

This stack trace comes from the event-handling mechanism brought into the Tune action by the fix for ticket #50 (commit f1d5adaadf4faaa41f3f813d865dbc0951120f35). Method set_status() calls self._server.action_event('failed'), which in turn calls Server._del_action(). The problem is that Server.action_event('start') was never called beforehand, so Server._running_actions is empty. Hence the exception.

A quick fix could be to test for the action's presence before removing it, or to enclose the removal in a try..except block. However, I suspect some dirty behaviour in Tune.set_status() (indeed, it is the only class calling action_event() inside set_status()), but I can't figure out what the correct behaviour here would be.
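The defensive variant suggested above could look like the sketch below. This is a simplified model, not Shine's actual Server class; only the `_running_actions` and `_del_action` names come from the trace, the rest is illustrative:

```python
class Server(object):
    """Minimal model of the Server action bookkeeping seen in the trace."""

    def __init__(self):
        self._running_actions = []

    def _del_action(self, act):
        # Defensive removal: a 'failed' event may arrive without a
        # prior 'start' event, so the list may not contain the action.
        if act in self._running_actions:
            self._running_actions.remove(act)

srv = Server()
srv._del_action('start')  # no ValueError even though the list is empty
```

This papers over the symptom rather than the cause, which is why the comment above questions whether Tune.set_status() itself is at fault.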

Regards.

Reported by: theryf

degremont commented 9 years ago

I made a mistake in the ticket title; it should be: "Stack trace on shine start when StartTarget action fails"

Could you please change it for me, as it seems I don't have sufficient privileges to do it myself.

Thanks

Original comment by: theryf

degremont commented 9 years ago

Original comment by: degremont

degremont commented 9 years ago

The start event should have been sent by Tune._launch(). I'm trying to understand why this is not the case in your stack trace.

Reproducing this stack trace with -vv could be interesting.

Original comment by: degremont

degremont commented 9 years ago

The -vv option does not give more details apart from the command run by shine. The stack trace is exactly the same, and provides no hints about why Tune._launch() is not called before the call to self._run_actions() inside FileSystem.start(). It is quite hard to follow the recursive call flow of _graph_ok()/launch()/_launch() triggered by the first action.launch() in FileSystem.start().
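For readers following along, the recursive launch/_graph_ok pattern described above can be modelled roughly as below. This is a deliberate simplification using the method names from the trace, not Shine's actual Action code:

```python
class Action(object):
    """Toy model of dependency-driven action launching."""

    def __init__(self, name, deps=()):
        self.name = name
        self.deps = list(deps)
        self.launched = False

    def _graph_ok(self, actions):
        # Launch any not-yet-launched dependency; report whether
        # all of them were already launched.
        ok = True
        for dep in actions:
            if not dep.launched:
                dep.launch()
                ok = False
        return ok

    def launch(self):
        # An action proceeds only once its dependencies have launched,
        # so launch() recurses through the graph via _graph_ok().
        if not self._graph_ok(self.deps):
            return
        self.launched = True

tune = Action('tune')
start = Action('start', deps=[tune])
start.launch()  # first pass only launches the 'tune' dependency
start.launch()  # second pass launches 'start' itself
```

The real code interleaves this recursion with ClusterShell event callbacks, which is what makes the flow hard to trace by hand.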

Original comment by: theryf

degremont commented 9 years ago

> The -vv option does not give more details except the command run by shine.

It was exactly what I was interested in.

Original comment by: degremont

degremont commented 9 years ago

Here is the fix. I will be working on a more complete version of this patch.

diff --git a/lib/Shine/Lustre/Actions/Tune.py b/lib/Shine/Lustre/Actions/Tune.py
index f52e5cc..14ab1f3 100644
--- a/lib/Shine/Lustre/Actions/Tune.py
+++ b/lib/Shine/Lustre/Actions/Tune.py
@@ -97,16 +97,17 @@ class Tune(ActionGroup):

         If this is a final state, raise corresponding events.
         """
-        if status == ACT_OK:
-            self._server.action_event(self, 'done')
-        elif status == ACT_ERROR:
-            # Build an error string
-            errors = []
-            for act in self:
-                if act.status() == ACT_ERROR:
-                    errors.append("'%s' failed" % act._command)
-            result = ErrorResult("\n".join(errors))
-            self._server.action_event(self, 'failed', result)
+        if self._init:
+            if status == ACT_OK:
+                self._server.action_event(self, 'done')
+            elif status == ACT_ERROR:
+                # Build an error string
+                errors = []
+                for act in self:
+                    if act.status() == ACT_ERROR:
+                        errors.append("'%s' failed" % act._command)
+                result = ErrorResult("\n".join(errors))
+                self._server.action_event(self, 'failed', result)

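The effect of the `self._init` guard can be modelled as below. This is a simplified sketch, not the actual Shine classes; it assumes `_init` is set when Tune._launch() emits the `'start'` event, which is what pairs the later `'done'`/`'failed'` event with the list removal:

```python
ACT_OK, ACT_ERROR = 'ok', 'error'

class Server(object):
    def __init__(self):
        self._running_actions = []

    def action_event(self, act, ev):
        if ev == 'start':
            self._running_actions.append(act)
        else:
            # 'done'/'failed' must pair with a prior 'start', otherwise
            # this remove() raises ValueError, as in the reported trace.
            self._running_actions.remove(act)

class Tune(object):
    def __init__(self, server):
        self._server = server
        self._init = False

    def _launch(self):
        self._init = True
        self._server.action_event(self, 'start')

    def set_status(self, status):
        # The guard: only emit a final event if 'start' was emitted.
        if self._init:
            if status == ACT_OK:
                self._server.action_event(self, 'done')
            elif status == ACT_ERROR:
                self._server.action_event(self, 'failed')

srv = Server()
tune = Tune(srv)
tune.set_status(ACT_ERROR)  # guard skips the unmatched 'failed' event

tune._launch()
tune.set_status(ACT_OK)     # normal path: 'start' then 'done' pair up
```

Without the guard, set_status(ACT_ERROR) on a never-launched Tune reproduces the `list.remove(x): x not in list` crash.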
Original comment by: degremont

degremont commented 9 years ago

Original comment by: degremont

degremont commented 9 years ago

Fixed in [a136d1]

Original comment by: degremont