It seems like this depends on the order in which nomad-pot-driver issues certain commands:
Example of a command sequence that left mounts behind (newest entry first):
2023-12-21T11:38:25+00:00 10.20.20.231 pot[42497]: pot-destroy -p myservice_fdf1f644_ad0150ca-d40b-f752-b564-8fe4d86c657e myservice -F
2023-12-21T11:38:25+00:00 10.20.20.231 pot[42476]: pot-set-status -p myservice_fdf1f644_ad0150ca-d40b-f752-b564-8fe4d86c657e -s stopped
2023-12-21T11:38:24+00:00 10.20.20.231 pot[40184]: pot-destroy -p myservice_fdf1f644_ad0150ca-d40b-f752-b564-8fe4d86c657e -F
2023-12-21T11:38:24+00:00 10.20.20.231 pot[40017]: pot-set-status -p myservice_fdf1f644_ad0150ca-d40b-f752-b564-8fe4d86c657e -s stopping
2023-12-21T11:38:18+00:00 10.20.20.231 pot[39032]: pot-set-status -p myservice_fdf1f644_ad0150ca-d40b-f752-b564-8fe4d86c657e -s stopping
2023-12-21T11:38:18+00:00 10.20.20.231 pot[38992]: pot-stop myservice_fdf1f644_ad0150ca-d40b-f752-b564-8fe4d86c657e myservice
Two things are of interest here: the pot is stopped and destroyed twice, and the second stopping call comes about 5 s after the first, which looks like a nomad timeout. So part of the solution might be inside nomad, but it also feels like there's a lack of locking involved, since stop and destroy can apparently be called multiple times in parallel.
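If one wanted to guard operator-side scripts against such races, FreeBSD's lockf(1) can serialize the calls per pot. This is only a sketch of the idea, not what nomad-pot-driver actually does; the lock file path is illustrative:

pname="myservice_fdf1f644_ad0150ca-d40b-f752-b564-8fe4d86c657e"
lock="/var/run/pot-${pname}.lock"
# hold an exclusive per-pot lock (kept across calls, wait up to 30 s) around stop/destroy
lockf -k -t 30 "$lock" pot stop "$pname"
lockf -k -t 30 "$lock" pot destroy -p "$pname" -F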
It seems like this was caused by the prometheus node-exporter running on the host; excluding pot file systems from its filesystem collector solved the issue.
For reference:
service node_exporter enable
sysrc node_exporter_user=nodeexport
sysrc node_exporter_group=nodeexport
sysrc node_exporter_listen_address="127.0.0.1:9100"
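# exporter flags: skip /dev and /opt mount points (pot lives under /opt/pot by
# default), devfs/nullfs file systems, and epair-style jail interfaces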
echo '--log.level=warn
--collector.filesystem.mount-points-exclude=^/(dev|opt)($|/)
--collector.filesystem.fs-types-exclude=^(devfs|nullfs)$
--collector.netdev.device-exclude=^(p4|epair)' \
>/usr/local/etc/node_exporter_args
sysrc node_exporter_args="@/usr/local/etc/node_exporter_args"
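To check that the excludes are effective, the exporter's own metrics can be queried; this assumes the default pot prefix /opt/pot:

service node_exporter restart
# count file system metrics reported for pot mount points; should be 0 once excluded
fetch -qo - http://127.0.0.1:9100/metrics | grep -c 'mountpoint="/opt/pot'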
Describe the bug: When using pot with nomad, the nullfs mounts of nomad's special task directories (mounted into the pot) stay behind.
To Reproduce: Run a basic nomad pot example (like nginx) and migrate it a couple of times (start/stop etc.).
After a while, leftover mounts accumulate even though only one container is running.
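For cycling the job, a loop along these lines can be used; the job file and job name are placeholders for whatever example you run:

# run and purge the job repeatedly, then count nullfs mounts left on the host
for i in 1 2 3 4 5; do
    nomad job run nginx.nomad
    sleep 30
    nomad job stop -purge nginx
    sleep 10
done
mount -t nullfs | wc -l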
Expected behavior: No leftover mounts.
Additional context: My suspicion is that the umounts fail when the jail stops (maybe because some process is still using the mountpoint); later the ZFS file system is purged anyway. Manually unmounting these mounts works fine.
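As a stop-gap, leftover mounts can also be cleaned up by hand. A sketch, assuming the default pot prefix /opt/pot and that only nullfs mounts are affected; review the list before unmounting anything:

# unmount nullfs mounts whose mount point sits under the pot prefix
mount -p -t nullfs | awk '$2 ~ "^/opt/pot/" { print $2 }' |
while read -r mp; do
    umount "$mp" || echo "still busy: $mp"
done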