Closed tvtue closed 3 years ago
I don't see how this can happen on a system that is in use, since zfs.arcstats[hits] and zfs.arcstats[misses] cannot both be 0 at the same time.
Did this happen on a completely unused system where the arc is not used at all?
It occurs a few moments after applying the template to a host. At that point the items zfs.arcstats[misses] and zfs.arcstats[hits] are both 0 (zero), so the formula divides by zero.
100*(last(zfs.arcstats[hits])/(last(zfs.arcstats[hits])+last(zfs.arcstats[misses])))
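To make the failure mode concrete, here is a minimal Python sketch of the same arithmetic. It is not how Zabbix evaluates calculated items internally; the function name hit_ratio is made up for illustration, and the two arguments stand in for last(zfs.arcstats[hits]) and last(zfs.arcstats[misses]).

```python
def hit_ratio(hits: int, misses: int) -> float:
    """Same arithmetic as the calculated item: 100 * hits / (hits + misses)."""
    return 100 * (hits / (hits + misses))


# Normal case: at least one of the counters is non-zero.
print(hit_ratio(1, 1))

# Both counters are 0 right after the template is applied:
# the denominator is 0 and the evaluation fails.
try:
    hit_ratio(0, 0)
except ZeroDivisionError as exc:
    print("division by zero:", exc)
```

With both inputs at 0 the expression has no defined value, which matches the "Cannot evaluate expression: division by zero" state reported for the item.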
I've watched the item become supported as soon as zfs.arcstats[hits] gets a value.
I must partly take back my last comment. I am still seeing the item unsupported. I looked into this again and first noticed that SELinux may have been a problem: I saw denials for zabbix-agent doing its thing, so I changed some SELinux rules to give it access. Still, the two items zfs.arcstats[hits] and zfs.arcstats[misses] are zero in the Zabbix frontend (latest data). So I tried to get them manually with zabbix_get -s ... -k ..., which works. As they are active agent items, I also raised the log level of the Zabbix agent to see problems, if any. This is what I am seeing for "misses":
16846:20200609:101858.685 EXECUTE_STR() command:'awk '/^misses/ {printf $3;}' /proc/spl/kstat/zfs/arcstats' len:5 cmd_result:'49282'
16846:20200609:101858.685 for key [zfs.arcstats[misses]] received value [49282]
16846:20200609:101858.685 In process_value() key:'myhost:zfs.arcstats[misses]' lastlogsize:null value:'49282'
16846:20200609:101858.685 In send_buffer() host:'my_zabbix_server_ip' port:10051 entries:14/100
16846:20200609:101858.685 send_buffer() now:1591690738 lastsent:1591690737 now-lastsent:1 BufferSend:5; will not send now
16846:20200609:101858.685 End of send_buffer():SUCCEED
16846:20200609:101858.685 buffer: new element 14
16846:20200609:101858.685 End of process_value():SUCCEED
16846:20200609:101858.685 In need_meta_update() key:zfs.arcstats[misses]
16846:20200609:101858.685 End of need_meta_update():FAIL
16846:20200609:101858.685 In send_buffer() host:'my_zabbix_server_ip' port:10051 entries:15/100
16846:20200609:101858.685 send_buffer() now:1591690738 lastsent:1591690737 now-lastsent:1 BufferSend:5; will not send now
16846:20200609:101858.685 End of send_buffer():SUCCEED
I am not sure what "End of need_meta_update():FAIL" means, but I would assume that it is not relevant to this problem, is it?
Anyway, I don't know why this happens and how I can debug this further.
Do you have an idea or a tip for me?
Did you use sudo to run the zabbix-agent commands? You can also give the zabbix user a shell to test as the zabbix user. You should have the same result as the agent this way.
sudo -u zabbix zabbix_agentd -t zfs.arcstats[miss]
Hi AceSlash, thank you for your reply. Here is the output from the sudo command test.
[root@ub31 ~]# sudo -u zabbix zabbix_agentd -t zfs.arcstats[miss]
zfs.arcstats[miss] [t|7637468]
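For anyone who wants to cross-check the value outside of Zabbix, here is a rough Python sketch of what the awk command from the agent log does: find the line that starts with the given name and take the third whitespace-separated field. The SAMPLE text is made up for illustration and only assumes the three-column name/type/data layout; on a real host you would read /proc/spl/kstat/zfs/arcstats instead. The helper name arcstat is hypothetical.

```python
from typing import Optional

# Illustrative sample only, not real output from any host.
SAMPLE = """\
hits                            4    123456
misses                          4    49282
"""


def arcstat(text: str, prefix: str) -> Optional[int]:
    """Return the third field of the first line starting with `prefix`.

    awk's /^prefix/ is a prefix match too, but awk would print the third
    field of every matching line; this sketch stops at the first one.
    """
    for line in text.splitlines():
        if line.startswith(prefix):
            fields = line.split()
            if len(fields) >= 3:
                return int(fields[2])
    return None


print(arcstat(SAMPLE, "misses"))
print(arcstat(SAMPLE, "miss"))
```

Because the match is on a prefix, "miss" selects the same line as "misses" in this sample, which is presumably why the test with zfs.arcstats[miss] returns a sensible value.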
Okay, the result is correct. I'm not sure how to debug from here... A quick fix might be to add 1 to the formula so that (last(zfs.arcstats[hits])+last(zfs.arcstats[misses])) would never be 0, even on an unused system.
This is definitely an edge case, but changing the formula to this would prevent any division by 0:
100*(last(zfs.arcstats[hits])/(last(zfs.arcstats[hits])+last(zfs.arcstats[misses])+1))
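A quick Python sketch of this variant, just to show the trade-off (the function name hit_ratio_plus_one is made up; the arguments stand in for the last values of the two items). The denominator can no longer be 0, but the result is slightly lower than the true ratio, which is only noticeable while the counters are still very small.

```python
def hit_ratio_plus_one(hits: int, misses: int) -> float:
    """100 * hits / (hits + misses + 1): the denominator is always >= 1."""
    return 100 * (hits / (hits + misses + 1))


print(hit_ratio_plus_one(0, 0))  # 0.0, no division by zero
print(hit_ratio_plus_one(1, 0))  # 50.0, although the true ratio is 100
```

On a system with millions of hits and misses the extra 1 in the denominator is negligible.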
Thank you for your fix. I applied the new formula and the item stayed supported since then. So no division by zero any more. Thank you.
A better way to avoid it and still get correct data is:
100*(last(zfs.arcstats[hits])/(last(zfs.arcstats[hits])+count(zfs.arcstats[hits],#1,0)+last(zfs.arcstats[misses])+count(zfs.arcstats[misses],#1,0)))
It's an approved solution from the Zabbix team :)
@sharewax: smart! I had to look at the count documentation, but for anyone wondering what this does: count will return 1 if the last value is 0, else it will return 0.
As a result, when zfs.arcstats[hits] is 0, we will have 1, and the same for zfs.arcstats[misses]. Actually, we don't need both; just one is enough to avoid the division by 0.
This makes the formula do the same thing but is shorter:
100*(last(zfs.arcstats[hits])/(last(zfs.arcstats[hits])+count(zfs.arcstats[hits],#1,0)+last(zfs.arcstats[misses])))
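Here is a small Python sketch of that shorter formula, assuming count(item,#1,0) behaves as described above (1 if the last value is 0, otherwise 0). The helper names are made up for illustration and this is not Zabbix code.

```python
def count_last_is_zero(last_value: int) -> int:
    """Stand-in for count(item,#1,0): 1 if the last value is 0, else 0."""
    return 1 if last_value == 0 else 0


def hit_ratio_guarded(hits: int, misses: int) -> float:
    """100 * hits / (hits + count(hits,#1,0) + misses)."""
    denominator = hits + count_last_is_zero(hits) + misses
    return 100 * (hits / denominator)


print(hit_ratio_guarded(0, 0))  # 0.0, denominator is 0 + 1 + 0
print(hit_ratio_guarded(0, 5))  # 0.0, numerator is 0 anyway
print(hit_ratio_guarded(3, 1))  # 75.0, same as the unguarded formula
```

The guard only adds 1 when hits is 0, and in that case the numerator is 0 as well, so the result is unaffected. Whenever hits is non-zero the denominator is untouched, so unlike the +1 version the ratio is exact.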
I'll make the change to master.
Hi, the calculated item "ZFS ARC Cache Hit Ratio" with the key zfs.arcstats_hit_ratio has become unsupported on one of my monitored hosts. The reason is given as "Cannot evaluate expression: division by zero."
It is calculated with this formula: 100*(last(zfs.arcstats[hits])/(last(zfs.arcstats[hits])+last(zfs.arcstats[misses])))
Would it be worth making this a little more sophisticated so that the divisor can never be zero?