Thanks for opening this issue.
What does your system config look like? At a minimum:
cat /etc/rc.conf
zpool status
(after reboot - redact as necessary)
nomad_enable="True"
nomad_user="root"
nomad_debug="YES"
nomad_env="PATH=/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/sbin:/bin"
nomad_dir="/var/tmp/nomad"
nomad_args="-config=/opt/hashicorp/nomad-agent.hcl"
zpool status shows both pools online.
My fix for the issue is:
#!/bin/sh
# After hard crash of a nomad node, remaining pots can't be pruned since their datasets are not mounted.
# This mounts the datasets and prunes pots.
zfs list -rH -o name zdata/pot/jails | xargs -L 1 zfs mount \
    && logger -t pot_cleanup "mounting all pot datasets" \
    || logger -t pot_cleanup "failed to mount all pot datasets"
pot prune && logger -t pot_cleanup "pruning pots" \
    || logger -t pot_cleanup "could not prune all pots"
This is supported by an rc script that runs it once before nomad, as sketched below.
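A minimal sketch of what such an rc script could look like, assuming the cleanup script above is installed as /usr/local/sbin/pot_cleanup.sh and that the nomad rc script provides "nomad" (both are assumptions, not details from the original setup):

#!/bin/sh
#
# PROVIDE: pot_cleanup
# REQUIRE: zfs
# BEFORE: nomad

. /etc/rc.subr

name="pot_cleanup"
rcvar="pot_cleanup_enable"
start_cmd="${name}_start"
stop_cmd=":"

pot_cleanup_start()
{
    # Hypothetical install path of the cleanup script shown above.
    /usr/local/sbin/pot_cleanup.sh
}

load_rc_config $name
: ${pot_cleanup_enable:="NO"}
run_rc_command "$1"

Enabled with sysrc pot_cleanup_enable=YES, it runs once at boot, after zfs and before nomad.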
@einsiedlerkrebs any reason you didn’t enable zfs in rc.conf? This can, e.g., be done using the service command:
# service zfs enable
It would take care of mounting zfs file systems on boot.
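For reference, service zfs enable just sets the corresponding rc variable, so the following should be equivalent on a stock FreeBSD rc framework:

# Both forms persist zfs_enable="YES" in /etc/rc.conf.
sysrc zfs_enable="YES"

With that set, the zfs rc script mounts all ZFS file systems at boot, so the pot datasets come back automatically.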
Yes indeed, this solved the issue. Thanks.
I am experiencing the issue that pots are not cleaned up after a hard reboot of a nomad node, and therefore the jobs are failing.
When the system is up again, the ZFS datasets are not mounted in place, so the configuration file of a pot cannot be found. This leads to a failing prune command and therefore to the inability to run "prepare" in nomad. For this reason the node is not reboot-safe.
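A quick way to confirm this state after the reboot is to check the mounted property (a sketch; the dataset root zdata/pot/jails is taken from the fix script above and may differ on other systems):

# Shows "no" in the VALUE column for datasets that did not come back up.
zfs get -r -o name,value mounted zdata/pot/jails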
To reproduce:
1. On a single-server nomad node with running services (via pot), run the reboot command.
2. After the system is up, observe that the desired services are not up.
3. Get the pot datasets with zfs list.
4. Mount each "service"-related dataset and its descendants.
5. Run pot prune (steps 3-5 are shown as a command sketch after this list).
6. Trigger a fresh service start on the nomad node (either by setting count to 0 and back to 1, or by removing the database).
7. Now the service should be working again.
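Steps 3-5 condensed into commands (a sketch only; zdata/pot/jails is the dataset root from the fix script above, adjust to your pool layout):

# Step 3: list the pot datasets.
zfs list -r zdata/pot/jails
# Step 4: mount every dataset under the pot root, one at a time.
zfs list -rH -o name zdata/pot/jails | xargs -L 1 zfs mount
# Step 5: with the configuration files reachable again, pruning works.
pot prune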