YSmetana / raid_arcconf_zabbix_lld

Zabbix LLD and monitoring script for Adaptec RAID controllers
GNU General Public License v3.0

SMART #2

Closed idokaplan closed 6 years ago

idokaplan commented 7 years ago

Hi,

Very nice template! Is there any chance you could add support for reading the disk error output, i.e. arcconf getlogs 1 device tabular?

Thanks! Ido

YSmetana commented 7 years ago

Sure. Will check it out.

How would you propose to use it? What values should trigger an alarm? Do you have any output example?

idokaplan commented 7 years ago

I would like to monitor medium errors, because disks that report them are expected to fail.

I would like to get an alarm if mediumErrors is greater than 0.

For example - DeviceID 32 has 1 medium error. DeviceID 16 has 1 medium error.

c:\Program Files\Adaptec\maxView Storage Manager>arcconf getlogs 1 device tabular
Controllers found: 1

Controller log
  Controller ID.................................... 0
  Type............................................. 0
  Time............................................. 1486214512
  version ........................................ 3
  tableFull ...................................... false

  driveErrorEntry
      smartError ..................................... false
      vendorID ....................................... SEAGATE
      serialNumber ................................... XXXXX
      wwn ............................................ XXXXX
      deviceID ....................................... 32
      productID ...................................... XXXX
      numParityErrors ................................ 0
      linkFailures ................................... 0
      hwErrors ....................................... 0
      abortedCmds .................................... 0
      mediumErrors ................................... 1
      smartWarning ................................... 0

  driveErrorEntry
      smartError ..................................... false
      vendorID ....................................... SEAGATE
      serialNumber ................................... XXXXX
      wwn ............................................ XXXXX
      deviceID ....................................... 16
      productID ...................................... XXXXX
      numParityErrors ................................ 0
      linkFailures ................................... 0
      hwErrors ....................................... 0
      abortedCmds .................................... 1
      mediumErrors ................................... 1
      smartWarning ................................... 0

Thanks! Ido


YSmetana commented 7 years ago

Sorry. Had no time today. Will check it a bit later. :(

idokaplan commented 7 years ago

Hi,

Did you have a chance to check it?

Thanks! Ido

idokaplan commented 7 years ago

Yuriy? :(

YSmetana commented 7 years ago

Started working on it. Sorry for the delay.

idokaplan commented 7 years ago

Thank you for the follow up. Can I be rude and ask when it will be ready? :)


YSmetana commented 7 years ago

I don't know how to implement it. :) I need your advice, though.

It is a text log, and it can consist of many entries. Let's say Zabbix reads 5 entries. What should it do with them? Parse each entry into separate parameters (abortedCmds, mediumErrors, smartWarning)? But those are parameters of a log entry, not of a device. OK, we can assign each log entry to the corresponding physical device. But you can have 2 entries with "mediumErrors=1" and one entry with "mediumErrors=0" for the same device. Does that mean you currently have a problem? Should you clear the log after the problem is reported? How should it be stored in Zabbix?
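One possible way to handle this, as a rough sketch only: parse each driveErrorEntry block into a dictionary and keep the highest value of every counter per deviceID, so a device counts as having a problem if any of its entries reports one. The names parse_drive_entries and worst_per_device are hypothetical, and the field layout is assumed from the sample output above; none of this has been tested against a real controller.

```python
import re
from collections import defaultdict

# Counter fields, as they appear in the sample "getlogs ... tabular" output.
COUNTERS = ("numParityErrors", "linkFailures", "hwErrors",
            "abortedCmds", "mediumErrors", "smartWarning")

# Matches lines such as "mediumErrors ................................... 1".
FIELD_RE = re.compile(r"^\s*(\w+)\s*\.{2,}\s*(.*?)\s*$")


def parse_drive_entries(text):
    """Split the log text into one dict of raw strings per driveErrorEntry."""
    entries = []
    current = None
    for line in text.splitlines():
        if line.strip() == "driveErrorEntry":
            current = {}
            entries.append(current)
            continue
        match = FIELD_RE.match(line)
        # Fields before the first driveErrorEntry (the log header) are ignored.
        if match and current is not None:
            current[match.group(1)] = match.group(2)
    return entries


def worst_per_device(entries):
    """Return {deviceID: {counter: highest value seen in any entry}}."""
    worst = defaultdict(lambda: dict.fromkeys(COUNTERS, 0))
    for entry in entries:
        device = entry.get("deviceID")
        if device is None:
            continue
        for name in COUNTERS:
            try:
                value = int(entry.get(name, 0))
            except ValueError:
                continue  # skip values that are not plain integers
            worst[device][name] = max(worst[device][name], value)
    return dict(worst)
```

Taking the maximum is only one choice; summing the counters or using the most recent entry would be equally easy to swap in.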

Do you have any ideas? ;)

idokaplan commented 7 years ago

Yes, I clear the log manually after I have replaced the defective disk (not after the problem is reported), so there should not be 2 entries for the same device.

If we want to proceed with the same concept that you used, we could do this, for example:

  raid_arcconf_zabbix_lld.py smart -1 lld
  {"data": [{"{#OBJ_TYPE}": "smart", "{#OBJ_ID}": 1}, {"{#OBJ_TYPE}": "smart", "{#OBJ_ID}": 4}, {"{#OBJ_TYPE}": "smart", "{#OBJ_ID}": 14}, {"{#OBJ_TYPE}": "smart", "{#OBJ_ID}": 0}]}

OBJ_ID is the device ID.

raid.arcconf[smart,{#OBJ_ID},mediumErrors]
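As a minimal sketch of how this could look in Python, assuming the log entries have already been parsed into dictionaries of strings keyed by the field names from the arcconf output: the functions lld_discovery and item_value are hypothetical and are not part of the existing script.

```python
import json


def lld_discovery(entries, obj_type="smart"):
    """Build the Zabbix LLD JSON with one object per distinct deviceID."""
    device_ids = sorted({int(e["deviceID"]) for e in entries if "deviceID" in e})
    data = [{"{#OBJ_TYPE}": obj_type, "{#OBJ_ID}": d} for d in device_ids]
    return json.dumps({"data": data})


def item_value(entries, device_id, field):
    """Value for an item key like raid.arcconf[smart,<device_id>,<field>].

    If a device has several log entries, the highest value is returned;
    a device with no matching entry yields 0.
    """
    values = [int(e[field]) for e in entries
              if e.get("deviceID") == str(device_id) and field in e]
    return max(values) if values else 0


if __name__ == "__main__":
    # Illustrative entries only, shaped like the sample output in this thread.
    sample = [
        {"deviceID": "32", "mediumErrors": "1"},
        {"deviceID": "16", "mediumErrors": "1", "abortedCmds": "1"},
    ]
    print(lld_discovery(sample))
    print(item_value(sample, 32, "mediumErrors"))
```

Here {#OBJ_ID} carries the deviceID, so devices without any log entry are simply not discovered.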

What do you think?

idokaplan commented 7 years ago

Yuriy? :(

YSmetana commented 7 years ago

OBJ_ID should be the ID of an object of type OBJ_TYPE, i.e. OBJ_ID 5 is SMART object #5, not the SMART data of device #5. Never mind, we can deal with that.

But every SMART event will create a new Zabbix item. We could have hundreds of them. SMART events do not have any IDs, only a sequential order, and the order of the events can easily change...

What if we just create one Zabbix item (call it Events) that consists of all events (plain text) from all devices (we can't get a particular device's events), plus certain item-markers like Events-smartError, Events-mediumErrors, etc.? If any of the markers has an error value (from any device), we trigger a notification. I will try to put the device vendor/serial in the notification to make the faulty drive easier to find.
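Roughly, the markers could be computed like this. It is only a sketch under assumptions: the entries are taken to be already parsed into dictionaries of strings, the list of marker fields is my guess at which ones matter, and event_markers and is_error are hypothetical names.

```python
# Fields that get an "Events-<field>" marker; this selection is an assumption.
MARKERS = ("smartError", "mediumErrors", "hwErrors", "abortedCmds", "smartWarning")


def is_error(value):
    """True for "true" or for a counter greater than zero."""
    v = value.strip().lower()
    if v in ("true", "false"):
        return v == "true"
    try:
        return int(v) > 0
    except ValueError:
        return False


def event_markers(entries):
    """Return (markers, faulty): one 0/1 flag per field, plus drive labels.

    A marker is 1 if any entry, from any device, has an error value for it.
    """
    markers = {"Events-" + name: 0 for name in MARKERS}
    faulty = set()
    for entry in entries:
        for name in MARKERS:
            if is_error(entry.get(name, "0")):
                markers["Events-" + name] = 1
                faulty.add("{} {} (deviceID {})".format(
                    entry.get("vendorID", "?"),
                    entry.get("serialNumber", "?"),
                    entry.get("deviceID", "?")))
    return markers, sorted(faulty)
```

A trigger would then fire when any Events-* item is 1, and the faulty list could go into the notification text.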

What do you think?

idokaplan commented 7 years ago

There should not be hundreds of events; there will be only a few, and only temporarily (until the disk is replaced and the log is cleared).

I'm sorry, but I don't know what item-markers are. Can you please explain?

idokaplan commented 7 years ago

Yuriy? :(

YSmetana commented 6 years ago

Hello, I am very sorry for being silent. Unfortunately I no longer have access to such RAID controllers, so I cannot test new features. If you have any proposals, please change the code and open a pull request. Thank you!