containers / crun

A fast and lightweight fully featured OCI runtime and C library for running containers
GNU General Public License v2.0

Error: OCI runtime error: openat2 `etc/localtime`: Resource temporarily unavailable #725

Closed. mbaldessari closed this issue 3 years ago.

mbaldessari commented 3 years ago

Hi Giuseppe,

I've been trying to port OSP to CentOS Stream 9 and noticed a sporadic issue when crun is the container runtime.

Namely, at random, when I spawn a bunch of containers during the deployment, sometimes one of them will fail with the following error:

Aug 30 05:42:09 undercloud-0.localdomain ansible-tripleo_container_manage[14274]: [WARNING] ERROR: Can't run container container-puppet-crond stderr: Error: OCI runtime error: openat2 `etc/localtime`: Resource temporarily unavailable

The spawned container looks like the following:

Aug 30 05:42:09 undercloud-0.localdomain ansible-tripleo_container_manage[14274]: PODMAN-CONTAINER-DEBUG: podman run --name container-puppet-crond --conmon-pidfile /var/run/container-puppet-crond.pid --detach=False --entrypoint /var/lib/container-puppet/container-puppet.sh --env STEP=6 --env NET_HOST=true --env DEBUG=true --env HOSTNAME=undercloud-0 --env NO_ARCHIVE= --env PUPPET_TAGS=file,file_line,concat,augeas,cron --env NAME=crond --env STEP_CONFIG=include\n::tripleo::packages\ninclude tripleo::profile::base::logging::logrotate' --label config_id=tripleo_puppet_step1 --label container_name=container-puppet-crond --label managed_by=tripleo_ansible --label config_data={'security_opt': ['label=disable'], 'user': 0, 'detach': False, 'entrypoint': '/var/lib/container-puppet/container-puppet.sh', 'environment': {'STEP': 6, 'NET_HOST': 'true', 'DEBUG': 'true', 'HOSTNAME': 'undercloud-0', 'NO_ARCHIVE': '', 'PUPPET_TAGS': 'file,file_line,concat,augeas,cron', 'NAME': 'crond', 'STEP_CONFIG': 'include ::tripleo::packages\ninclude tripleo::profile::base::logging::logrotate'}, 'net': ['host'], 'image': 'quay.io/tripleomaster/openstack-cron:current-tripleo', 'volumes': ['/dev/log:/dev/log:rw', '/etc/hosts:/etc/hosts:ro', '/etc/localtime:/etc/localtime:ro', '/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '/etc/puppet:/tmp/puppet-etc:ro', '/usr/share/openstack-puppet/modules:/usr/share/openstack-puppet/modules:ro', '/var/lib/config-data:/var/lib/config-data:rw', '/var/lib/container-puppet/container-puppet.sh:/var/lib/container-puppet/container-puppet.sh:ro', '/var/lib/container-puppet/puppetlabs/facter.conf:/etc/puppetlabs/facter/facter.conf:ro', '/var/lib/container-puppet/puppetlabs:/opt/puppetlabs:ro']} --log-driver k8s-file --log-opt path=/var/log/containers/stdouts/container-puppet-crond.log --network host --security-opt label=disable --user 0 --volume /dev/log:/dev/log:rw --volume /etc/hosts:/etc/hosts:ro --volume /etc/localtime:/etc/localtime:ro --volume /etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume /etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume /etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume /etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume /etc/puppet:/tmp/puppet-etc:ro --volume /usr/share/openstack-puppet/modules:/usr/share/openstack-puppet/modules:ro --volume /var/lib/config-data:/var/lib/config-data:rw --volume /var/lib/container-puppet/container-puppet.sh:/var/lib/container-puppet/container-puppet.sh:ro --volume /var/lib/container-puppet/puppetlabs/facter.conf:/etc/puppetlabs/facter/facter.conf:ro --volume /var/lib/container-puppet/puppetlabs:/opt/puppetlabs:ro quay.io/tripleomaster/openstack-cron:current-tripleo

Note that this will likely happen once to a random container (i.e. it's not always the same one). The crun version I use is crun-0.21-4.module_el9+99+07a5c500.x86_64:

[root@undercloud-0 ~]# crun --version
crun version 0.21
commit: c4c3cdf2ce408ed44a9e027c618473e6485c635b
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL

If I switch the runtime to runc, I never hit this issue, so I am assuming this is crun-specific. I also patched my deployment to run all podman commands with --syslog --log-level debug, but so far I have been unable to reproduce it with that extra debug level, which I suspect implies that the extra logging changes the timing and avoids the race altogether.

In the next few days I'll try to come up with a more self-contained reproducer.
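
For illustration only (none of this is from the original report): a minimal standalone sketch of the suspected race would resolve a path whose symlink target contains a ".." component with openat2 and RESOLVE_IN_ROOT, while a second thread renames a directory in a tight loop. The layout, paths and file names below are made up, and this is not a verified reproducer for the failure above.

/* Hypothetical sketch of the suspected race, not taken from the report above:
 * resolve rootfs/etc/localtime (a symlink whose target contains "..") with
 * openat2(RESOLVE_IN_ROOT) while another thread keeps renaming a directory.
 * Build: gcc -O2 -pthread openat2-eagain.c -o openat2-eagain */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/openat2.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_openat2
# define SYS_openat2 437   /* assumed syscall number if the headers are too old */
#endif

static long
sys_openat2 (int dirfd, const char *path, struct open_how *how)
{
  /* glibc has no openat2() wrapper, so go through syscall(2) directly.  */
  return syscall (SYS_openat2, dirfd, path, how, sizeof (*how));
}

static void *
renamer (void *arg)
{
  (void) arg;
  for (;;)
    {
      /* Renames elsewhere on the system are enough to disturb the kernel's
         in-flight ".." escape check for a scoped lookup.  */
      rename ("/tmp/openat2-race/dir-a", "/tmp/openat2-race/dir-b");
      rename ("/tmp/openat2-race/dir-b", "/tmp/openat2-race/dir-a");
    }
  return NULL;
}

int
main (void)
{
  /* Fake rootfs (hypothetical layout):
       rootfs/etc/localtime -> ../zoneinfo/UTC
       rootfs/zoneinfo/UTC                        */
  if (system ("rm -rf /tmp/openat2-race && "
              "mkdir -p /tmp/openat2-race/rootfs/etc /tmp/openat2-race/rootfs/zoneinfo /tmp/openat2-race/dir-a && "
              "touch /tmp/openat2-race/rootfs/zoneinfo/UTC && "
              "ln -s ../zoneinfo/UTC /tmp/openat2-race/rootfs/etc/localtime") != 0)
    return 1;

  int rootfd = open ("/tmp/openat2-race/rootfs", O_PATH | O_CLOEXEC);
  if (rootfd < 0)
    {
      perror ("open rootfs");
      return 1;
    }

  pthread_t t;
  pthread_create (&t, NULL, renamer, NULL);

  struct open_how how = { .flags = O_RDONLY | O_CLOEXEC, .resolve = RESOLVE_IN_ROOT };
  for (unsigned long i = 0;; i++)
    {
      long fd = sys_openat2 (rootfd, "etc/localtime", &how);
      if (fd >= 0)
        close ((int) fd);
      else if (errno == EAGAIN)
        {
          printf ("got EAGAIN after %lu successful lookups\n", i);
          return 0;
        }
      else
        {
          perror ("openat2");
          return 1;
        }
    }
}

If this is indeed the mechanism, any rename activity on the host during container start could occasionally trip the check, which would also fit the observation that extra logging (different timing) makes the failure disappear.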

If you have any tips/thoughts in the meantime, do let me know ;)

giuseppe commented 3 years ago

Thanks for opening the issue. It looks like openat2 can fail with EAGAIN:

       EAGAIN how.resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH,
              and the kernel could not ensure that a ".." component didn't
              escape (due to a race condition or potential attack). The caller
              may choose to retry the openat2() call.

I think we need something like:

$ git diff
diff --git a/src/libcrun/utils.c b/src/libcrun/utils.c
index e5db288..94d99d4 100644
--- a/src/libcrun/utils.c
+++ b/src/libcrun/utils.c
@@ -348,7 +348,7 @@ safe_openat (int dirfd, const char *rootfs, size_t rootfs_len, const char *path,

   if (openat2_supported)
     {
-      ret = syscall_openat2 (dirfd, path, flags, mode, RESOLVE_IN_ROOT);
+      ret = TEMP_FAILURE_RETRY (syscall_openat2 (dirfd, path, flags, mode, RESOLVE_IN_ROOT));
       if (ret < 0)
         {
           if (errno == ENOSYS)
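
One detail worth noting: glibc's TEMP_FAILURE_RETRY only retries while errno is EINTR, so a retry aimed specifically at this EAGAIN needs its own loop. A minimal sketch of what that could look like, reusing the syscall_openat2 helper visible in the hunk above (its signature is assumed here) and a purely hypothetical retry cap; illustrative only, not taken from the PR linked below:

#include <errno.h>
#include <stdint.h>

/* Defined earlier in src/libcrun/utils.c; the parameter types are assumed
   from the call shown in the hunk above.  */
static int syscall_openat2 (int dirfd, const char *path, uint64_t flags,
                            uint64_t mode, uint64_t resolve);

/* Hypothetical cap so a persistent race cannot make us spin forever.  */
#define OPENAT2_EAGAIN_RETRIES 16

static int
openat2_retry_on_eagain (int dirfd, const char *path, uint64_t flags,
                         uint64_t mode, uint64_t resolve)
{
  int attempt, ret = -1;

  for (attempt = 0; attempt < OPENAT2_EAGAIN_RETRIES; attempt++)
    {
      ret = syscall_openat2 (dirfd, path, flags, mode, resolve);
      if (ret >= 0 || errno != EAGAIN)
        return ret;
      /* EAGAIN: the kernel could not prove that a ".." component stayed
         inside the root because something was renamed or mounted
         concurrently; per the man page the caller may simply retry.  */
    }

  /* Still racing after every attempt: give up and report EAGAIN.  */
  return ret;
}

Bounding the number of attempts keeps the other case the man page mentions (a deliberate attack rather than a transient race) from turning into an unbounded spin.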
giuseppe commented 3 years ago

opened a PR: https://github.com/containers/crun/pull/726