ORNL-TechInt / DDNTool_v2

Automated monitoring of DDN SFA hardware

DDNTool crashes during controller reboot #3

Open treydock opened 7 years ago

treydock commented 7 years ago

We rebooted our DDN controllers, and DDNTool crashed. The logs show the process restarting, but after that DDNTool logged nothing further and metrics were no longer collected.

Apr 18 09:48:56 metrics DDNTool_SFAClient_ddn-scratch1b: - INFO - Waking up
Apr 18 09:48:56 metrics DDNTool: - ERROR - Process ddn-scratch1b caught APIException exception.#012Traceback (most recent call last):#012  File "/usr/bin/DDNTool.py", line 155, in one_controller#012    client.run()#012  File "/usr/lib/python2.7/site-packages/DDNToolSupport/SFAClientUtils/SFAClient.py", line 260, in run#012    self._fast_poll_tasks()#012  File "/usr/lib/python2.7/site-packages/DDNToolSupport/SFAClientUtils/SFAClient.py", line 308, in _fast_poll_tasks#012    vd_stats = SFAVirtualDiskStatistics.getAll()#012  File "/usr/lib/python2.7/site-packages/ddn/sfa/core.py", line 1095, in wrapper#012    raise APIException(ex)#012APIException: 1000: SFA/MI connection error
Apr 18 09:48:56 metrics DDNTool: - INFO - Process ddn-scratch1b is exiting.
Apr 18 09:49:26 metrics DDNTool: - ERROR - Process DDNTool_ddn-scratch1b has crashed!  Restarting!
Apr 18 09:49:26 metrics DDNTool: - INFO - Starting background process for ddn-scratch1b
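A note for reading the log above: rsyslog, in its default configuration, writes control characters as `#` followed by three octal digits, so each `#012` stands for a newline. A minimal sketch for turning such a line back into a readable traceback follows; the helper name `decode_syslog_escapes` is made up for illustration, and the pattern assumes the message contains no literal `#` followed by three octal digits.

```python
import re

# rsyslog writes control characters as '#' plus three octal digits,
# so a newline (octal 012) shows up as "#012".
_ESCAPE = re.compile(r"#([0-3][0-7]{2})")


def decode_syslog_escapes(line: str) -> str:
    """Replace each #ooo octal escape with the character it stands for."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), line)


# Shortened excerpt of the ERROR line from the log above.
sample = (
    'caught APIException exception.#012Traceback (most recent call last):'
    '#012  File "/usr/bin/DDNTool.py", line 155, in one_controller'
    '#012    client.run()'
)
decoded = decode_syslog_escapes(sample)
print(decoded)
```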

rgmiller commented 7 years ago

Apr 18 09:48:56 metrics DDNTool: - ERROR - Process ddn-scratch1b caught APIException exception.
Traceback (most recent call last):
  File "/usr/bin/DDNTool.py", line 155, in one_controller
    client.run()
  File "/usr/lib/python2.7/site-packages/DDNToolSupport/SFAClientUtils/SFAClient.py", line 260, in run
    self._fast_poll_tasks()
  File "/usr/lib/python2.7/site-packages/DDNToolSupport/SFAClientUtils/SFAClient.py", line 308, in _fast_poll_tasks
    vd_stats = SFAVirtualDiskStatistics.getAll()
  File "/usr/lib/python2.7/site-packages/ddn/sfa/core.py", line 1095, in wrapper
    raise APIException(ex)
APIException: 1000: SFA/MI connection error

Judging by the traceback, you're getting an exception when you call SFAVirtualDiskStatistics.getAll(). That's coming from the SFA library itself. I'm not sure what exception code 1000 means, but it looks like you're not able to contact the DDN hardware.

What firmware version are you running on your controllers? And is it something you just installed?

treydock commented 7 years ago

3.1.0.1 is the current version. This issue occurred during a scheduled reboot of our controllers. A restart of DDNTool resolved the issue.

rgmiller commented 7 years ago

That's interesting. DDNTool is supposed to be able to automatically reconnect after a DDN controller is rebooted, but I haven't explicitly tested that in a while. Possibly something changed in the 3.x firmware. I'll look into it.
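To illustrate the reconnect behavior described above, here is a minimal sketch of a polling loop that reconnects with exponential backoff. This is not DDNTool's actual code: `ConnectionLost`, `poll_with_reconnect`, and `FakeController` are hypothetical names, and the real tool would catch the SFA library's own exception and call its own connect and poll routines instead.

```python
import time


class ConnectionLost(Exception):
    """Hypothetical stand-in for the library's connection error."""


def poll_with_reconnect(connect, poll, cycles, base_delay=1.0,
                        max_delay=60.0, sleep=time.sleep):
    """Collect `cycles` poll results, reconnecting with backoff on failure."""
    results = []
    delay = base_delay
    connected = False
    while len(results) < cycles:
        try:
            if not connected:
                connect()
                connected = True
            results.append(poll())
            delay = base_delay  # a successful poll resets the backoff
        except ConnectionLost:
            connected = False
            sleep(delay)
            delay = min(delay * 2, max_delay)  # wait longer each time, capped
    return results


class FakeController:
    """Simulated controller: the 2nd poll fails, then two connects are refused."""

    def __init__(self):
        self.refusals_left = 0
        self.polls = 0

    def connect(self):
        if self.refusals_left > 0:
            self.refusals_left -= 1
            raise ConnectionLost("controller still rebooting")

    def poll(self):
        self.polls += 1
        if self.polls == 2:
            self.refusals_left = 2  # simulate the reboot starting
            raise ConnectionLost("connection dropped")
        return self.polls


fake = FakeController()
waits = []  # record the requested delays instead of actually sleeping
results = poll_with_reconnect(fake.connect, fake.poll, cycles=3,
                              sleep=waits.append)
print(results, waits)
```

Passing `sleep` as a parameter keeps the demo instant and makes the backoff schedule easy to inspect. A loop like this only helps if the connect call eventually returns or raises.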

rgmiller commented 7 years ago

Just a quick update: We tested this scenario with firmware version 3.0.1.5 and it looks like a bug on DDN's side. After rebooting the DDN, the APIConnect() function just hangs and never actually connects. We're going to upgrade our test system to the latest firmware and see if this still happens. If so, I'll file a bug report with DDN.
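Until the hang is fixed, one possible workaround on the tool's side is to put a deadline on the connect call, so that a hang turns into an error the caller can act on. The sketch below is only an illustration under stated assumptions: `call_with_timeout`, `hanging_connect`, and `quick_connect` are hypothetical names, and the real APIConnect() is not called here. A hung thread cannot be killed in Python, so it is abandoned and only goes away when the process exits.

```python
import threading
import time


def call_with_timeout(func, timeout, *args, **kwargs):
    """Run func in a daemon thread; raise TimeoutError if it overruns."""
    outcome = {}

    def runner():
        try:
            outcome["value"] = func(*args, **kwargs)
        except Exception as exc:  # hand the exception back to the caller
            outcome["error"] = exc

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        # The thread is still blocked; leave it behind and report the hang.
        raise TimeoutError(f"call did not return within {timeout} s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


def hanging_connect():
    """Hypothetical stand-in for a connect call that blocks (here: 5 s)."""
    time.sleep(5)


def quick_connect():
    """Hypothetical stand-in for a connect call that succeeds at once."""
    return "connected"


try:
    call_with_timeout(hanging_connect, timeout=0.2)
    timed_out = False
except TimeoutError:
    timed_out = True

status = call_with_timeout(quick_connect, timeout=0.2)
print(timed_out, status)
```

On a timeout, the per-controller process could log the failure and exit so that it gets restarted. Whether a fresh connection attempt would then succeed is not established in this thread.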