Linuxfabrik / monitoring-plugins

220+ check plugins for Icinga and other Nagios-compatible monitoring applications. Each plugin is a standalone command line tool (written in Python) that provides a specific type of check.
https://linuxfabrik.ch
The Unlicense

disk-usage: ZFS zpool isn't supported #715

Closed · Cyberes closed this issue 1 year ago

Cyberes commented 1 year ago

Bug description

disk-usage doesn't support ZFS zpools:

Mountpoint ! Type ! Size     ! Used     ! Avail    ! Use%  
-----------+------+----------+----------+----------+-------
/          ! ext4 ! 100.9GiB ! 37.1GiB  ! 59.3GiB  ! 38.5% 
/local-ssd ! zfs  ! 325.3GiB ! 128.0KiB ! 325.3GiB ! 0.0%  
/local-zfs ! zfs  ! 1.6TiB   ! 128.0KiB ! 1.6TiB   ! 0.0%

Compare this to zpool list:

root@example# zpool list local-ssd
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
local-ssd   928G   360G   568G        -         -    46%    38%  1.00x    ONLINE  -

Steps to reproduce - Plugin call

disk-usage

Steps to reproduce - Data

Just run disk-usage on a system with a zpool.

Environment

Linux eloo 6.2.16-3-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-3 (2023-06-17T05:58Z) x86_64 GNU/Linux

Plugin Version

disk-usage: v2023071203 by Linuxfabrik GmbH, Zurich/Switzerland

Python version

Python 3.11.2

List of Python modules

No response

Additional Information

This should be a pretty easy fix. I'll leave this example code here in case you'd like to implement something similar:

import subprocess

import psutil
from psutil._common import sdiskusage  # the namedtuple psutil.disk_usage() returns


def is_zfs(mountpoint):
    # Ask `df -T` for the filesystem type of the given mountpoint.
    result = subprocess.run(['df', '-T', mountpoint], stdout=subprocess.PIPE)
    lines = result.stdout.decode('utf-8').split('\n')
    if len(lines) > 1:
        filesystem = lines[1].split()[1]
        return filesystem == 'zfs'
    return False


def get_zfs_usage(pool_name):
    # `zpool list -H` prints the requested columns tab-separated, without a header.
    result = subprocess.run(
        ['zpool', 'list', '-H', '-o', 'size,alloc,free', pool_name],
        stdout=subprocess.PIPE,
    )
    size, used, free = result.stdout.decode('utf-8').strip().split('\t')

    # Convert the human-readable zpool values to bytes
    size = convert_to_bytes(size)
    used = convert_to_bytes(used)
    free = convert_to_bytes(free)

    # Calculate percent used
    percent = (used / size) * 100

    return sdiskusage(total=size, used=used, free=free, percent=percent)


def convert_to_bytes(size):
    # zpool list prints sizes like '928G', '1.6T' or '1014M'.
    size = size.lower()
    if 't' in size:
        return int(float(size.replace('t', '')) * 1024 ** 4)
    if 'g' in size:
        return int(float(size.replace('g', '')) * 1024 ** 3)
    if 'm' in size:
        return int(float(size.replace('m', '')) * 1024 ** 2)
    if 'k' in size:
        return int(float(size.replace('k', '')) * 1024)
    return int(size)


# Hooked into the plugin's existing loop over the partitions:
for part in parts:
    ...
    if is_zfs(part.mountpoint):
        usage = get_zfs_usage(part.mountpoint.strip('/'))
    else:
        usage = psutil.disk_usage(part.mountpoint)
    ...
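
One possible simplification (just a suggestion, and the helper name is made up by me): zpool list also understands -p, which prints exact byte values, so the manual unit conversion above could be dropped. A minimal sketch:

import subprocess

def get_zpool_usage_bytes(pool_name):
    # -H: scripted mode (tab-separated, no header); -p: parsable exact byte values.
    result = subprocess.run(
        ['zpool', 'list', '-Hp', '-o', 'size,alloc,free', pool_name],
        stdout=subprocess.PIPE,
    )
    size, alloc, free = (int(v) for v in result.stdout.decode('utf-8').strip().split('\t'))
    return size, alloc, free
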
markuslf commented 1 year ago

ZFS is supported. I did some tests, created a raidz2 pool and added data.

My pool at the beginning:

# zpool list -v
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
data        2.75G   514K  2.75G        -         -     1%     0%  1.00x    ONLINE  -
  raidz2-0  2.75G   514K  2.75G        -         -     1%  0.01%      -    ONLINE
    vdb     1.99G      -      -        -         -      -      -      -    ONLINE
    vdc     2.99G      -      -        -         -      -      -      -    ONLINE
    vdd     1014M      -      -        -         -      -      -      -    ONLINE

What Linux says - note the difference:

# df -hT
Filesystem                 Type      Size  Used Avail Use% Mounted on
data                       zfs       807M  128K  807M   1% /data

The check-plugin says the same:

# /usr/lib64/nagios/plugins/disk-usage 
Mountpoint ! Type ! Size      ! Used     ! Avail    ! Use%  
-----------+------+-----------+----------+----------+-------
/data      ! zfs  ! 806.9MiB  ! 128.0KiB ! 806.8MiB ! 0.0%  

According to df, adding 600M of data to the pool should result in roughly 75% usage:

# dd if=/dev/urandom of=/data/test bs=10M count=60 && sync

Let's check:

# df -hT
Filesystem                 Type      Size  Used Avail Use% Mounted on
data                       zfs       807M  599M  209M  75% /data

Plugin:

# /usr/lib64/nagios/plugins/disk-usage 
Mountpoint ! Type ! Size      ! Used     ! Avail    ! Use%  
-----------+------+-----------+----------+----------+-------
/data      ! zfs  ! 806.9MiB  ! 598.1MiB ! 208.8MiB ! 74.1% 

So perfectly aligned. ZFS says:

# zpool list
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
data        2.75G  1.76G  1014M        -         -     0%    63%  1.00x    ONLINE  -
  raidz2-0  2.75G  1.76G  1014M        -         -     0%  64.0%      -    ONLINE

According to the plugin, adding an additional 250M of data should fail, whereas zpool list suggests that it should work. Let's see:

# dd if=/dev/urandom of=/data/test2 bs=10M count=25 && sync
dd: error writing '/data/test2': No space left on device
# df -hT
Filesystem                 Type      Size  Used Avail Use% Mounted on
data                       zfs       807M  807M  512K 100% /data
# /usr/lib64/nagios/plugins/disk-usage 
Mountpoint ! Type ! Size      ! Used     ! Avail    ! Use%             
-----------+------+-----------+----------+----------+------------------
/data      ! zfs  ! 806.9MiB  ! 806.4MiB ! 512.0KiB ! 99.9% [CRITICAL] 
# zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
data  2.75G  2.37G   387M        -         -    12%    86%  1.00x    ONLINE  -

So don't rely on zpool list.

man zpoolprops says about free: "The amount of free space available in the pool. By contrast, the zfs(8) available property describes how much new data can be written to ZFS filesystems/volumes. The zpool free property is not generally useful for this purpose, and can be substantially more than the zfs available space. This discrepancy is due to several factors, including raidz parity; zfs reservation, quota, refreservation, and refquota properties; and space set aside by spa_slopshift (see zfs(4) for more information)."
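
If someone wants the numbers the man page is talking about in a script, here is a minimal sketch (the dataset name below is just an example, this is not plugin code) that reads the zfs used/available properties directly:

import subprocess

def get_zfs_dataset_usage(dataset):
    # `zfs get -H -p` prints one exact byte value per line; `-o value` drops the other columns.
    result = subprocess.run(
        ['zfs', 'get', '-Hp', '-o', 'value', 'used,available', dataset],
        stdout=subprocess.PIPE,
    )
    used, available = (int(v) for v in result.stdout.decode('utf-8').split())
    return used, available

# e.g. get_zfs_dataset_usage('data') for the pool above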

Cyberes commented 1 year ago

Huh, thanks for the thorough investigation. This is a zpool managed by Proxmox, so I assume it's doing something on its end that doesn't carry over to df (maybe subvols).
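
For anyone else hitting this on Proxmox: the VM disks typically live in child datasets/zvols below the pool, and their space does not show up under the pool's root mountpoint in df. A quick way to see where the space actually went (just an illustration, pool name taken from the output above):

import subprocess

# Recursively list the pool's datasets and zvols with their space usage.
# -r: recurse into children; -H/-p: script-friendly output with exact byte values.
result = subprocess.run(
    ['zfs', 'list', '-r', '-Hp', '-o', 'name,used,available', 'local-ssd'],
    stdout=subprocess.PIPE,
)
print(result.stdout.decode('utf-8'))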