ClusterLabs / resource-agents

Combined repository of OCF agents from the RHCS and Linux-HA projects
GNU General Public License v2.0

docker: Fix RA behavior when Docker service isn't running #1916

Closed ryan-ronnander closed 7 months ago

ryan-ronnander commented 7 months ago

After running into failed monitor actions with the Docker RA on Ubuntu 22.04, I found that reverting some of the behavior of commit b7ae1bf resolves the issues.

Currently, the Docker resource agent works well if you do not manage docker.service within Pacemaker. On Ubuntu 22.04 with systemd, you can set the Docker service to enabled so that Docker starts before Pacemaker starts (this ordering is the key part).

However, managing docker.service within Pacemaker causes issues during the initial monitoring probe. It also completely breaks the expected behavior of stop-all-resources=true, because the Docker resource agent's monitor operations always fail if the Docker service isn't running. Clean startup and shutdown are also affected, and failed actions must be cleared before resources will start.

As far as "live-restore" goes (referenced in commit b7ae1bf), shouldn't the approach be more along the lines of "don't do that" when managing individual containers with Pacemaker and the Docker resource agent? If the Docker service is unavailable or having issues, the agent should assume the containers are unavailable or not running and let Pacemaker attempt to restart the Docker service or migrate the containers to another node.

I haven't tested it, but after briefly looking at the Podman resource agent, it appears the Podman RA will simply return $OCF_NOT_RUNNING if the Podman service isn't running. This PR should bring the Docker and Podman resource agents into alignment regarding this behavior.
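The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR or from either resource agent: the function names `daemon_reachable` and `monitor_sketch` and the `DOCKER_CMD` override are hypothetical, and the OCF return codes are given their standard values as defaults.

```shell
#!/bin/sh
# Hypothetical sketch only; not taken from the docker or podman RA.

# Standard OCF return codes (normally provided by ocf-shellfuncs).
: "${OCF_SUCCESS:=0}"
: "${OCF_NOT_RUNNING:=7}"

# Command used to talk to the daemon; the override is an assumption
# added here so the sketch can be exercised without Docker installed.
: "${DOCKER_CMD:=docker}"

# Returns 0 if the Docker daemon answers, non-zero otherwise.
daemon_reachable() {
    "$DOCKER_CMD" version >/dev/null 2>&1
}

# Monitor sketch: an unreachable daemon is reported as "not running"
# rather than as a hard error, so a probe on a stopped node is clean.
monitor_sketch() {
    container="$1"

    if ! daemon_reachable; then
        return "$OCF_NOT_RUNNING"
    fi

    running=$("$DOCKER_CMD" inspect --format '{{.State.Running}}' "$container" 2>/dev/null)
    if [ "$running" = "true" ]; then
        return "$OCF_SUCCESS"
    fi
    return "$OCF_NOT_RUNNING"
}
```

With this shape, a probe run while the Docker service is stopped reports the container as stopped, which is what Pacemaker expects from a resource it has not started yet.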

If I should create an issue to accompany this PR, just let me know and I'll gladly create one.

knet-jenkins[bot] commented 7 months ago

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-1916/1/input

oalbrigt commented 7 months ago

You should disable docker.service and let Pacemaker handle it.
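One way this setup might look with crmsh is sketched below. The resource names (`p_docker`, `p_container`), the constraint names, the monitor intervals, and the `nginx:latest` image are all hypothetical placeholders, not values from this thread.

```shell
#!/bin/sh
# Hypothetical sketch: hand docker.service over to Pacemaker.
# All resource and constraint names below are illustrative assumptions.

configure_docker_under_pacemaker() {
    # Stop systemd from starting Docker at boot; Pacemaker starts it instead.
    systemctl disable docker.service

    # Docker daemon as a systemd-class cluster resource.
    crm configure primitive p_docker systemd:docker \
        op monitor interval=30s

    # A container managed by the docker resource agent (placeholder image).
    crm configure primitive p_container ocf:heartbeat:docker \
        params image=nginx:latest \
        op monitor interval=30s

    # Start the daemon before the container and keep both on the same node.
    crm configure order o_docker_first Mandatory: p_docker p_container
    crm configure colocation c_container_with_docker inf: p_container p_docker
}

# Usage (on a cluster node, as root):
#   configure_docker_under_pacemaker
```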

ryan-ronnander commented 7 months ago

> You should disable docker.service and let Pacemaker handle it.

Agreed. However, when doing so, the Docker resource agent fails its initial monitoring probes as described above. This can be reproduced by running crm configure property stop-all-resources=true followed by crm configure property stop-all-resources=false. Clearing failed actions is always required afterwards.
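The reproduction steps can be written out as a small script. This is a sketch under assumptions: the function name and the resource name passed to it are hypothetical, and the `crm_resource --wait` calls are an addition so each property change settles before the next step.

```shell
#!/bin/sh
# Hypothetical reproduction helper; the resource name is supplied by the caller.

reproduce_probe_failure() {
    rsc="$1"   # name of a primitive managed by the docker RA (assumption)

    # Stop everything, then wait for the cluster to settle.
    crm configure property stop-all-resources=true
    crm_resource --wait

    # Allow resources to start again; this is where the probes fail.
    crm configure property stop-all-resources=false
    crm_resource --wait

    # One-shot status; failed actions, if any, should be listed here.
    crm_mon -1

    # Clear the failed actions so the resource can start.
    crm resource cleanup "$rsc"
}

# Usage (on a cluster node):
#   reproduce_probe_failure my-container
```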

oalbrigt commented 7 months ago

Thanks.