Cosium / zabbix_zfs-on-linux

zabbix template and user parameters to monitor zfs on linux
MIT License

Does not alert on cksum errors #6

Closed killmasta93 closed 4 years ago

killmasta93 commented 4 years ago

Hi, the Zabbix template does not raise an alert when a pool has checksum (CKSUM) errors:

root@prometheus26:~# zpool status
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: none requested
config:

NAME        STATE     READ WRITE CKSUM
rpool       ONLINE       0     0     0
  mirror-0  ONLINE       0     0     0
    sda3    ONLINE       0     0     0
    sdb3    ONLINE       0     0     0
  mirror-1  ONLINE       0     0     0
    sdc     ONLINE       0     0     0
    sdd     ONLINE       0     0    33
AceSlash commented 4 years ago

You are right, this is not currently supported. Usually you will be alerted to disk errors by other metrics outside of ZFS.

Nevertheless, this could be a good improvement.

killmasta93 commented 4 years ago

Thanks for the reply. Is an update planned for this? Thank you.

AceSlash commented 4 years ago

@killmasta93 : I cannot give you a specific date. Currently the template doesn't handle discovery of the vdevs. I took a quick look and didn't see any way to get the list of vdevs other than parsing the output of "zpool status", which is not that easy and seems a little brittle.

I'll get back to you when I have some time to look further.
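To illustrate what parsing "zpool status" involves, here is a minimal Python sketch. It is not the code that ended up in the template; the function name `parse_vdevs` and the sample text are assumptions made for illustration. It also assumes the counters are printed as plain integers and that every row has all five columns, which is part of why this approach is brittle.

```python
import re

# One row of the config table: name, state, then three integer counters.
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")

def parse_vdevs(status_text):
    """Return one dict per row of the NAME/STATE/READ/WRITE/CKSUM table."""
    vdevs = []
    in_table = False
    for line in status_text.splitlines():
        if line.split()[:2] == ["NAME", "STATE"]:
            in_table = True          # header found, rows follow
            continue
        if not in_table or line.strip() == "":
            continue
        m = ROW.match(line)
        if m is None:
            in_table = False         # anything else ends the table
            continue
        name, state, read, write, cksum = m.groups()
        vdevs.append({"name": name, "state": state, "read": int(read),
                      "write": int(write), "cksum": int(cksum)})
    return vdevs

# Sample table copied from the report above.
SAMPLE = """\
NAME        STATE     READ WRITE CKSUM
rpool       ONLINE       0     0     0
  mirror-0  ONLINE       0     0     0
    sda3    ONLINE       0     0     0
    sdb3    ONLINE       0     0     0
  mirror-1  ONLINE       0     0     0
    sdc     ONLINE       0     0     0
    sdd     ONLINE       0     0    33
"""

for row in parse_vdevs(SAMPLE):
    print(row)
```

Note that the pool itself and the mirror-N groups come back as rows too, so a real implementation would have to decide whether to keep or filter them.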

AceSlash commented 4 years ago

I have started the implementation. I now get the list of all vdevs with their state and their read, write and checksum error counters.
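For readers unfamiliar with Zabbix low-level discovery, a list of vdevs is typically handed to Zabbix as JSON in which each entry carries one or more `{#MACRO}` keys. The Python sketch below shows the general shape only; the macro name `{#VDEV}` and the function `lld_payload` are hypothetical and not necessarily what the template uses.

```python
import json

def lld_payload(vdev_names):
    """Build a low-level discovery JSON document from a list of vdev names.

    The {#VDEV} macro name is an assumption chosen for this example.
    """
    return json.dumps({"data": [{"{#VDEV}": name} for name in vdev_names]})

print(lld_payload(["sdc", "sdd"]))
```

Item and trigger prototypes in a template can then reference the macro to create one set of items per discovered vdev.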

For the alerting, I think that I'll raise an alert when any counter is > 0, but only once per vdev. I don't think there is any value in raising 2 or 3 alerts if a vdev has more than 1 counter > 0.

For example, in your case it will raise an alert saying "vdev /dev/sdd has 33 errors". If you had 5 write errors and 33 checksum errors, it would instead say "vdev /dev/sdd has 38 errors". I want to avoid 2 alerts for the same vdev like "vdev /dev/sdd has 33 checksum errors" and "vdev /dev/sdd has 5 write errors".
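The rule described here can be sketched in a few lines of Python. This only illustrates the logic; the real alert would be a Zabbix trigger, and the function name `vdev_alert` is hypothetical.

```python
def vdev_alert(name, read, write, cksum):
    """Return a single alert message for a vdev, or None if all counters are 0."""
    total = read + write + cksum
    if total == 0:
        return None
    # One message per vdev, whichever counters are non-zero.
    return f"vdev {name} has {total} errors"

print(vdev_alert("/dev/sdd", 0, 0, 33))  # vdev /dev/sdd has 33 errors
print(vdev_alert("/dev/sdd", 0, 5, 33))  # vdev /dev/sdd has 38 errors
```

Summing the counters loses the breakdown by error type, but it guarantees at most one alert per vdev, which is the stated goal.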

killmasta93 commented 4 years ago

Thanks for the reply. Should I update the script now to see the alert? I still have not cleared the cksum error on my pool.

AceSlash commented 4 years ago

@killmasta93 : not yet, I'm still testing it and it's not public yet. It should be done by the end of the week if everything goes well. I'll tell you when it's done.

killmasta93 commented 4 years ago

Thank you again. If I can help in any way, let me know :)

AceSlash commented 4 years ago

@killmasta93 testing is done and the new userparameters and template have been deployed on my infrastructure. I actually found out that I had an error on one disk with it!

It was a good idea ;-)

killmasta93 commented 4 years ago

Thank you so much, I'm glad the idea helped. Quick question: to update, do I need to download the template.xml and replace it?

killmasta93 commented 4 years ago

Edit: just updated it and got the alert. Thank you very much.

AceSlash commented 4 years ago

You're welcome.