IDR / deployment

Deployment infrastructure for the Image Data Resource
https://idr.openmicroscopy.org/about/deployment.html
BSD 2-Clause "Simplified" License

Search engine: docker deployment issues #415

Open sbesson opened 10 months ago

sbesson commented 10 months ago

Possibly affects the IDR monitoring stack as well

Initially reported by @dominikl in the context of a pilot VM: https://github.com/IDR/deployment/blob/0ec6d8d4bb7af1e1df3ab8b67835df7f10da436e/ansible/idr-docker.yml#L7-L8 currently fails with

RUNNING HANDLER [ome.docker : restart docker] *****************************************************************************************************************************************************************
fatal: [test120-searchengine]: FAILED! => {"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python"}, "changed": false, "msg": "Unable to restart service docker: Job for docker.service failed because the control process exited with error code. See \"systemctl status docker.service\" and \"journalctl -xe\" for details.\n"}

Looking at the logs

Feb 02 13:42:21 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:21.663010221Z" level=info msg="Starting up"
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:22.450727005Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:22.451427414Z" level=info msg="Loading containers: start."
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:22.529980377Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP 
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:22.530549152Z" level=error msg="Failed to set bridge MTU docker0 via netlink" error="invalid argument"
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: time="2024-02-02T13:42:22.532190944Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
Feb 02 13:42:22 test120-searchengine.novalocal dockerd[26622]: failed to start daemon: Error initializing network controller: error creating default "bridge" network: invalid argument
Feb 02 13:42:22 test120-searchengine.novalocal systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Feb 02 13:42:22 test120-searchengine.novalocal systemd[1]: Failed to start Docker Application Container Engine.
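
Since the daemon fails right after "Failed to set bridge MTU docker0 via netlink", it can help to see which MTU dockerd is being asked to apply. The following is a minimal sketch, not part of the playbook: it assumes the value comes from the `mtu` key of /etc/docker/daemon.json and that the host is Linux with sysfs mounted. The function names are hypothetical.

```python
import json
from pathlib import Path


def configured_mtu(daemon_json_text):
    """Return the integer "mtu" set in daemon.json content, or None if unset."""
    # An empty or missing file is treated as "no settings".
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    mtu = config.get("mtu")
    return int(mtu) if mtu is not None else None


def nic_mtu(interface, sysfs_root="/sys/class/net"):
    """Read a network interface's current MTU from sysfs (Linux only)."""
    return int(Path(sysfs_root, interface, "mtu").read_text().strip())
```

For example, `configured_mtu(Path("/etc/docker/daemon.json").read_text())` can be compared with `nic_mtu("eth0")`, where `eth0` is only a placeholder for the host's actual interface name.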

Removing /etc/docker/daemon.json, or simply disabling the MTU setting (docker_use_ipv4_nic_mtu: false), is enough for the Docker service to restart. However, docker ps then fails with

[sbesson@test120-searchengine ~]$ sudo docker ps
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

The version of Docker is

[sbesson@test120-searchengine ~]$ docker -v
Docker version 25.0.2, build 29cf629

while on a recent successful environment, it is

[sbesson@prod120-searchengine ~]$ docker -v
Docker version 24.0.7, build afdd53b

sbesson commented 10 months ago

Forcing the Docker version to 24.0.7

diff --git a/ansible/idr-docker.yml b/ansible/idr-docker.yml
index 2a53643..e87fc6a 100644
--- a/ansible/idr-docker.yml
+++ b/ansible/idr-docker.yml
@@ -6,7 +6,7 @@
   roles:
     - role: ome.docker
       docker_use_ipv4_nic_mtu: True
-
+      docker_version: 24.0.7
   tasks:
   - name: install docker-python
     become: yes

seems to be sufficient to make progress with the playbook. So I suspect some upstream changes are incompatible with the way we deploy Docker using ome.docker.

sbesson commented 10 months ago

https://github.com/moby/moby/issues/47308 looks related and is expected to be resolved with Docker 25.0.3 (or the migration to Rocky Linux 9)

jburel commented 9 months ago

When testing devspace using the testing RHEL 9 VM, I had to edit the dockerd file. What is currently in it is

ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

and it is expecting

ExecStart=/usr/bin/dockerd -H unix:///var/run/docker.sock --containerd=/run/containerd/containerd.sock

Note that I did not have the issue on the physical RHEL 9 machine.

jburel commented 9 months ago

Downgrading to a 24.x version might also solve the problem I have when running devspace (omero-server takes a long time to start). I am currently running

docker --version
Docker version 25.0.2, build 29cf629

sbesson commented 9 months ago

I was able to spin up test120 on Friday by downgrading Docker to the last 24.x version. Pushed https://github.com/IDR/deployment/commit/825c70b6b778e76d9efbee0aea492d967c0f3753 accordingly so that we unblock the creation of production & pilot environments. Once Docker 25.0.3 is released or we migrate to Rocky Linux 9, we can evaluate dropping the version pinning.
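
When revisiting the pin, a quick check of the installed version against what was observed in this thread may be useful. This is a rough sketch only. The "suspect" range is an assumption based solely on the reports above: 25.0.2 failed, 24.0.7 worked, and a fix is expected in 25.0.3. Whether 25.0.0 and 25.0.1 are affected was not tested here, and the function names are hypothetical.

```python
import re

# Assumption from this thread only: releases from 25.0.0 up to, but not
# including, the expected fix in 25.0.3 are treated as suspect.
FIRST_SUSPECT = (25, 0, 0)
EXPECTED_FIX = (25, 0, 3)


def parse_docker_version(output):
    """Extract (major, minor, patch) from `docker -v` style output."""
    match = re.search(r"Docker version (\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"unrecognised version string: {output!r}")
    return tuple(int(part) for part in match.groups())


def in_suspect_range(output):
    """True if the reported version falls in the assumed suspect range."""
    return FIRST_SUSPECT <= parse_docker_version(output) < EXPECTED_FIX
```

The input is the text printed by `docker -v`, for example the "Docker version 25.0.2, build 29cf629" line reported above.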