NVIDIA / enroot

A simple yet powerful tool to turn traditional container/OS images into unprivileged sandboxes.

Copy /etc/localtime symlink and target instead of mounting #184

Closed: lukeyeager closed this 7 months ago

lukeyeager commented 8 months ago

With Ubuntu 24.04, attempting to install the tzdata package inside an enroot container fails with the following error:

mv: cannot move '/etc/localtime.dpkg-new' to '/etc/localtime': Device or resource busy

The package's postinst script skips setting the timezone only if /etc/localtime already exists as a symlink. So, in order to set the timezone inside the container to match the host OS, we must copy both the symlink and the symlink's target into the container rootfs.

See https://git.launchpad.net/~ubuntu-core-dev/ubuntu/+source/tzdata/tree/debian/tzdata.postinst?h=debian/2024a-1ubuntu1
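
In hook form, the idea is roughly this (a minimal sketch, not the final hook; it assumes ENROOT_ROOTFS points at the container rootfs, as enroot hooks receive, and an absolute symlink target, as on Ubuntu):

#!/bin/sh
set -eu

if [ -L /etc/localtime ]; then
    # e.g. /usr/share/zoneinfo/US/Central; assumed absolute, as on Ubuntu
    target=$(readlink /etc/localtime)
    mkdir -p "$(dirname "${ENROOT_ROOTFS}${target}")"
    # copy the zone data the symlink ultimately points at...
    cp --dereference /etc/localtime "${ENROOT_ROOTFS}${target}"
    # ...and then the symlink itself, without dereferencing it
    cp --no-dereference /etc/localtime "${ENROOT_ROOTFS}/etc/localtime"
fi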

lukeyeager commented 7 months ago

Tested on Ubuntu 20.04, 22.04, and 24.04, CentOS 7 and 8, and Alpine 3.19.

Ubuntu has the annoying habit of canonicalizing the symlink when installing tzdata. When the container starts up, the symlink will be whatever was on your host (e.g. /usr/share/zoneinfo/US/Central), but then tzdata's postinst script (linked above) will rewrite that to /usr/share/zoneinfo/America/Chicago. And if you restart the container, this hook will set it back to US/Central. That feels a little janky, but it seems to work.
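
Concretely, the cycle looks like this (zone names taken from the example above):

readlink /etc/localtime        # on the host: /usr/share/zoneinfo/US/Central
enroot start ...               # the hook copies that symlink verbatim into the rootfs
apt-get install tzdata         # postinst rewrites it to /usr/share/zoneinfo/America/Chicago
enroot start ...               # the hook overwrites it back to US/Central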

krono commented 5 days ago

Hi, sorry to re-touch this issue, but it seems that this hook might introduce a race condition when used with pyxis.

A user re-uses the same container on different nodes simultaneously. Previously, the mount was somewhat fine, because it happened locally on each machine.

However, copying into the container has the nasty side effect that, since the container is on a shared filesystem, another node might just have done the same…

j-hellenberg commented 5 days ago

To add some more context here:

I'm running an sbatch job using a common pre-setup container, like so:

#!/bin/bash
#SBATCH --container-name ubuntu-2310
#SBATCH --array=0-100%10
# (more #SBATCH directives)

python myscript.py

so Slurm will schedule executions on multiple nodes in parallel.

The specific error I'm experiencing is

slurmstepd-cXXX: error: pyxis: container start failed with error code: 1
slurmstepd-cXXX: error: pyxis: printing enroot log file:
slurmstepd-cXXX: error: pyxis:     cp: cannot create symbolic link '/<network_share>/.local/share/enroot/pyxis_2310/etc/localtime': File exists
slurmstepd-cXXX: error: pyxis:     [ERROR] /etc/enroot/hooks.d/10-localtime.sh exited with return code 1
slurmstepd-cXXX: error: pyxis: couldn't start container
slurmstepd-cXXX: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd-cXXX: error: Failed to invoke spank plugin stack

However, this error appears seemingly at random, in about 1/3 of executions. Our guess is that this happens because the lines of 10-localtime.sh are executed multiple times in parallel and interleaved; depending on the order, some runs succeed and some don't.
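
If that guess is right, the failure should reproduce with plain cp, no pyxis involved: copying a symlink onto a path where one already exists fails with "File exists" (file names below are made up for the demo):

mkdir -p /tmp/localtime-race && cd /tmp/localtime-race
ln -s /usr/share/zoneinfo/UTC host-link
ln -s /usr/share/zoneinfo/UTC rootfs-link    # pretend another node already ran the hook
cp --no-dereference host-link rootfs-link
# cp: cannot create symbolic link 'rootfs-link': File exists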

lukeyeager commented 5 days ago

Whoops! If you're able to reliably reproduce it, can you try simply adding --force to the cp commands in the hook?
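
I.e. something like this (a sketch; the exact cp invocations in the hook may differ):

-cp --no-dereference /etc/localtime "${ENROOT_ROOTFS}/etc/localtime"
+cp --no-dereference --force /etc/localtime "${ENROOT_ROOTFS}/etc/localtime"

With --force, cp unlinks an existing destination and retries instead of failing, so whichever node runs the hook last simply wins.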

lukeyeager commented 5 days ago

I'll let Jon comment on whether we recommend running the same container rootfs across multiple machines (I believe not), but that cp --force thing might unblock you for now?

j-hellenberg commented 4 days ago

> Whoops! If you're able to reliably reproduce it, can you try simply adding --force to the cp commands in the hook?

We just tried your suggestion on one of our machines, and that machine was then the only one no longer experiencing errors, so we think that should do the trick :+1:

> I'll let Jon comment on whether we recommend running the same container rootfs across multiple machines (I believe not), but that cp --force thing might unblock you for now?

I agree that this approach looks a bit suspicious, and I would not do it this way in any kind of serious production usage. In this case, we are talking about research work, though, and I like the convenience of having a manual change I make in a container immediately affect all future job executions.

In any case, I believe adding the --force flag should increase the resilience of the system without negative side effects here, so doing so probably makes sense even though it is only of relevance in this (maybe not fully supported) edge case? I can open a small PR for that if you want.

3XX0 commented 3 days ago

Yeah, we don't usually recommend this since we don't test it. Having said that, we do flock the rootfs (see https://github.com/NVIDIA/enroot/blob/2bd51434bbc427a1d8463e5682c89bac4b8fda51/src/runtime.sh#L243), so maybe your shared filesystem doesn't support it or is not properly configured.
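
One way to check is to grab the same lock from two nodes at once (any path on the shared filesystem will do; this uses util-linux flock(1), and the file name is just an example):

# run simultaneously on two nodes; if flock works across nodes,
# the second invocation blocks until the first one finishes
flock --verbose /<network_share>/flock-test -c 'echo "lock held on $(hostname)"; sleep 30'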

krono commented 3 days ago

It's GPFS, and it does support locking. I'll run a test and report back.

krono commented 1 day ago

I fail to understand what's happening here. I'm opening a new issue.