Thanks for opening this issue.
What does your system config look like? At a minimum:
cat /etc/rc.conf
zpool status
(after reboot - redact as necessary)
nomad_enable="True"
nomad_user="root"
nomad_debug="YES"
nomad_env="PATH=/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/sbin:/bin"
nomad_dir="/var/tmp/nomad"
nomad_args="-config=/opt/hashicorp/nomad-agent.hcl"
zpool status shows both pools online.
My fix for the issue is:
#!/bin/sh
# After hard crash of a nomad node, remaining pots can't be pruned since their datasets are not mounted.
# This mounts the datasets and prunes pots.
zfs list -rH -o name zdata/pot/jails | xargs -L 1 zfs mount \
    && logger -t pot_cleanup "mounting all pot datasets" \
    || logger -t pot_cleanup "failed to mount all pot datasets"
pot prune && logger -t pot_cleanup "pruning pots" \
    || logger -t pot_cleanup "could not prune all pots"
This is supported by an rc script that runs it once before nomad, as sketched below.
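A minimal sketch of what such an rc script could look like, assuming the cleanup script above is installed as /usr/local/sbin/pot_cleanup.sh and that the nomad rc script provides "nomad" (both are assumptions, not details from the original setup):

#!/bin/sh
#
# PROVIDE: pot_cleanup
# REQUIRE: zfs
# BEFORE: nomad

. /etc/rc.subr

name="pot_cleanup"
rcvar="pot_cleanup_enable"
start_cmd="${name}_start"
stop_cmd=":"

pot_cleanup_start()
{
    # Hypothetical install path of the cleanup script shown above.
    /usr/local/sbin/pot_cleanup.sh
}

load_rc_config $name
: ${pot_cleanup_enable:="NO"}
run_rc_command "$1"

Enabled with sysrc pot_cleanup_enable=YES, it runs once at boot, after zfs and before nomad.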
@einsiedlerkrebs any reason you didn’t enable zfs in rc.conf? This can, e.g., be done using the service command:
# service zfs enable
It would take care of mounting zfs file systems on boot.
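For reference, service zfs enable just sets the corresponding rc variable, so the following should be equivalent on a stock FreeBSD rc framework:

# Both forms persist zfs_enable="YES" in /etc/rc.conf.
sysrc zfs_enable="YES"

With that set, the zfs rc script mounts all ZFS file systems at boot, so the pot datasets come back automatically.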
Yes indeed, this solved the issue. Thanks.
I am experiencing the issue that pots are not cleaned up after a hard reboot of a nomad node, and therefore the jobs are failing.
When the system is up again, the ZFS datasets are not mounted in place, so the configuration file of a pot cannot be found. This leads to a failing prune command and therefore to the inability to run "prepare" in nomad. For this reason the node is not reboot-safe.
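A quick way to confirm this state after the reboot is to check the mounted property (a sketch; the dataset root zdata/pot/jails is taken from the fix script above and may differ on other systems):

# Shows "no" in the VALUE column for datasets that did not come back up.
zfs get -r -o name,value mounted zdata/pot/jails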
To reproduce:
1. On a single-server nomad node with running services (via pot), run the reboot command.
2. After the system is up, observe that the desired services are not up.
3. Get the pot datasets with zfs list.
4. Mount each "service"-related dataset and its descendants.
5. Run pot prune (steps 3-5 are shown as a command sketch after this list).
6. Trigger a fresh service start on the nomad node (either by setting count to 0 and back to 1, or by removing the database).
7. Now the service should be working again.
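Steps 3-5 condensed into commands (a sketch only; zdata/pot/jails is the dataset root from the fix script above, adjust to your pool layout):

# Step 3: list the pot datasets.
zfs list -r zdata/pot/jails
# Step 4: mount every dataset under the pot root, one at a time.
zfs list -rH -o name zdata/pot/jails | xargs -L 1 zfs mount
# Step 5: with the configuration files reachable again, pruning works.
pot prune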