cloudfoundry / bpm-release

isolated bosh jobs
Apache License 2.0

Rootless garden does not work with BPM 1.1.1 #128

Closed danail-branekov closed 3 years ago

danail-branekov commented 5 years ago

Hi BPM, Garden here.

TL;DR Rootless garden bosh-lite CF deployment fails with cf-deployment 11.0.0 and rootless containers enabled. The failure is caused by the garden job failing to start. We think that this is a regression in BPM 1.1.1, hence this bug report. Note that the failure only occurs on a new deployment, when the diego cell is recreated during the deployment (e.g. because of a stemcell bump), or when the cell is recreated after a successful deployment.

Long version: The garden job fails to start when running this line in its pre-start script; the error is:

+ grootfs --config /var/vcap/jobs/garden/config/grootfs_config.yml init-store --uid-mapping 0:4294967294:1 --uid-mapping 1:65536:4294901758 --gid-mapping 0:4294967294:1 --gid-mapping 1:65536:4294901758
exit status 32: mount: /var/vcap/data/grootfs/store/unprivileged.backing-store: failed to setup loop device: Operation not permitted
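
For context, the failing `init-store` step essentially creates a large backing file, formats it as XFS, and loop-mounts it onto the store path. A simplified shell equivalent (our reading of the grootfs behaviour; it ignores the ID-mapping and quota setup, and the size is illustrative) is:

```sh
# illustrative sketch only; grootfs does this in Go and also applies the uid/gid mappings
truncate -s 48G /var/vcap/data/grootfs/store/unprivileged.backing-store
mkfs.xfs /var/vcap/data/grootfs/store/unprivileged.backing-store
mount -o loop,pquota,noatime -t xfs \
  /var/vcap/data/grootfs/store/unprivileged.backing-store \
  /var/vcap/data/grootfs/store/unprivileged
```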

Looking at the grootfs code, this error happens when mounting the store backing file onto the store path via a loopback mount. If we execute and strace the mount command ourselves inside the Garden BPM container on the diego cell, we get the following:

root@5e45bb71-2313-4066-bd79-f816228463eb:/var/vcap/jobs/garden# strace mount -o loop,pquota,noatime -t xfs /var/vcap/data/grootfs/store/unprivileged.backing-store /var/vcap/data/grootfs/store/unprivileged
....
stat("/dev/loop4", {st_mode=S_IFBLK|0660, st_rdev=makedev(7, 4), ...}) = 0
stat("/dev/loop4", {st_mode=S_IFBLK|0660, st_rdev=makedev(7, 4), ...}) = 0
open("/sys/dev/block/7:4", O_RDONLY|O_CLOEXEC) = 3
openat(3, "loop/offset", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
lstat("/var", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/var/vcap", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/var/vcap/data", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/var/vcap/data/grootfs", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/var/vcap/data/grootfs/store", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/var/vcap/data/grootfs/store/unprivileged.backing-store", {st_mode=S_IFREG|0600, st_size=51398832128, ...}) = 0
open("/var/vcap/data/grootfs/store/unprivileged.backing-store", O_RDWR|O_CLOEXEC) = 4
open("/dev/loop4", O_RDWR|O_CLOEXEC)    = -1 EPERM (Operation not permitted)
.....

The error that we get without stracing is

root@5e45bb71-2313-4066-bd79-f816228463eb:/var/vcap/jobs/garden# mount -o loop,pquota,noatime -t xfs /var/vcap/data/grootfs/store/unprivileged.backing-store /var/vcap/data/grootfs/store/unprivileged
mount: /var/vcap/data/grootfs/store/unprivileged.backing-store: failed to setup loop device: Operation not permitted

It is also worth noting that catting the loop device fails:

root@5e45bb71-2313-4066-bd79-f816228463eb:/var/vcap/jobs/garden# cat /dev/loop4
cat: /dev/loop4: Operation not permitted

It seems that the root user in the Garden BPM container is not permitted to access loop devices, which is quite odd given their permissions:

root@5e45bb71-2313-4066-bd79-f816228463eb:/var/vcap/jobs/garden# ls -la /dev | grep loop | head -n 5
brw-rw----  1 root root 7,   0 Aug 20 13:08 loop0
brw-rw----  1 root root 7,   1 Aug 20 13:08 loop1
brw-rw----  1 root root 7,  10 Aug 20 13:08 loop10
brw-rw----  1 root root 7, 100 Aug 20 13:08 loop100
brw-rw----  1 root root 7, 101 Aug 20 13:08 loop101
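
One possible explanation for an EPERM despite these permissive file modes would be the devices cgroup of the BPM container not whitelisting block major 7 (loop devices); this is only a guess on our side, not something we have confirmed. From the host, the whitelist could be inspected with something like:

```sh
# hypothetical check; the exact cgroup path depends on how bpm names its runc container
cat /sys/fs/cgroup/devices/bpm/garden/devices.list
# an entry such as "b 7:* rwm" would be needed for loop devices to be usable
```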

While trying to figure out what is going on, we came across a couple of lxd issues (issue1, issue2) which say that creating loop devices in unprivileged containers via mknod succeeds, but the devices cannot be used. As a matter of fact, the garden job does create 256 loop devices in its start ctl script (see the sketch below), but it is very unlikely that those lxd issues are related because a) creating loop devices this way has worked for a long time and b) the Garden BPM container is a privileged one.
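
For reference, the loop device creation in the ctl script amounts to roughly the following (a simplified sketch, not the verbatim script):

```sh
# create 256 loop device nodes if they do not already exist (illustrative only)
for i in $(seq 0 255); do
  [ -b "/dev/loop${i}" ] || mknod -m 0660 "/dev/loop${i}" b 7 "${i}"
done
```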

Lacking other leads, we started with the assumption that there is a Garden regression and began bisecting the Garden git history. It turned out that the first commit reproducing the issue is ba9d17abf01165fbdcdcae460d17a427b3cfd2ca, which makes additional bpm volumes shared and writable. Nevertheless, holding this commit responsible for the failure does not make sense because it is quite old (from the beginning of June, i.e. garden-runc 1.19.3 and cf-deployment 9.4.0) and rootless worked fine back then.

Therefore we approached it from the other direction: we tried deploying cf-deployment 11.0.0 with BPM downgraded to 1.1.0, and the deployment succeeded. From our point of view this means that there is a regression in shared/writable volumes in BPM 1.1.1. Looking at the BPM commit history, we thought that the only commit that has anything to do with shared volumes is 2ab67a73b60a40788abc2b8c1455356e65704271. Indeed, deploying BPM with the previous commit checked out did not reproduce the issue.
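
For reference, the kind of volume declaration involved looks roughly like this in a job's bpm.yml (an illustrative snippet based on our understanding of the bpm volume options; paths and the executable are examples, not the actual garden config):

```yaml
# illustrative bpm.yml fragment; not the real garden job spec
processes:
- name: garden
  executable: /var/vcap/packages/guardian/bin/gdn
  additional_volumes:
  - path: /var/vcap/data/rep/shared/garden
    writable: true
    shared: true
```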

To make things even weirder, we tried to come up with a minimal bosh-lite setup with only rootless garden and bpm, where we mimic the additional bpm volumes described in cf-deployment, but we could not reproduce the problem. It seems that there is something CF-specific that makes the issue manifest itself.

cf-gitbot commented 5 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/168000336

The labels on this github issue will be updated when the story is started.

xoebus commented 5 years ago

Maybe this happens if there are multiple bosh-lite deployed jobs running Garden? That would explain why it only happens in a cf deployment.

danail-branekov commented 5 years ago

Hey @xoebus

Our deployment is indeed a bosh-lite one, but it only has a single diego-cell, i.e. there is a single garden job. For the sake of the experiment we tried a local bosh-lite deployment with 2 VMs, each of them running the garden job rootlessly in a BPM container. The deployment was successful and we verified that shared mounts work as expected. Therefore it seems that the number of Garden jobs is irrelevant.

danail-branekov commented 5 years ago

Trying to be helpful, we kept digging for clues on what the issue actually is by removing jobs from a diego-cell until the symptom changed. Once we got rid of rep, the issue was gone.

Some background context: on a CF diego-cell, rep and garden share a directory (/var/vcap/data/rep/shared/garden) where rep puts the cell certificates and garden bind mounts them into application containers. Before BPM had shared volume support, we used to declare /var/vcap/data/rep/shared/garden as an additional BPM volume, and BOTH garden and rep took care to create it and make it a shared bind mount (rep, garden). Both jobs do the same thing because a) their start order is not guaranteed and b) the directory must exist for both jobs to be able to start.

After it became clear that introducing shared mount volumes in BPM causes the current issue, we thought that rep and garden should not be doing the mount themselves; instead they should delegate this to BPM (since /var/vcap/data/rep/shared/garden is a mounted shared volume). So we got rid of the shared mount logic and let rep just create the directory in its bpm-pre-start, leaving BPM to do the rest (make it a shared mount and mount it into the garden BPM container). Unfortunately this experiment was not successful; we still kept getting the same error when mounting the groot store.
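
For clarity, the "make it a shared bind mount" step that both jobs used to perform in their pre-start scripts amounts to roughly the following (a simplified sketch of that logic, not the verbatim scripts):

```sh
# create the shared directory and turn it into a shared bind mount (illustrative)
dir=/var/vcap/data/rep/shared/garden
mkdir -p "${dir}"
if ! mountpoint -q "${dir}"; then
  mount --bind "${dir}" "${dir}"
fi
mount --make-shared "${dir}"
```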

In the end we managed to come up with a minimal bosh release which mimics the rep and garden job interaction described above. Note that these dummy jobs have nothing to do with either garden or diego code; their names are just illustrative. The release has two jobs: a dummy rep job that creates the shared directory in its pre-start, and a dummy garden job that declares that directory as a shared BPM volume and loop-mounts an XFS backing file in its start script.

Deploying this release (via scripts/deploy-lite-rootless.sh) reproduces the issue. What we see is that:

  1. The `garden` job starts first and initially fails, we believe because the shared mounted BPM volume directory does not exist yet (`rep` has not started yet). Here is the related error in bpm.log:
    
    {"timestamp":"2019-09-04T12:41:24.275482451Z","level":"info","source":"bpm","message":"bpm.start.acquiring-lifecycle-lock.starting","data":{"job":"garden","process":"garden","session":"1.1"}}
    {"timestamp":"2019-09-04T12:41:24.275625449Z","level":"info","source":"bpm","message":"bpm.start.acquiring-lifecycle-lock.complete","data":{"job":"garden","process":"garden","session":"1.1"}}
    {"timestamp":"2019-09-04T12:41:24.275641448Z","level":"info","source":"bpm","message":"bpm.start.starting","data":{"job":"garden","process":"garden","session":"1"}}
    {"timestamp":"2019-09-04T12:41:24.284547020Z","level":"info","source":"bpm","message":"bpm.start.start-process.starting","data":{"job":"garden","process":"garden","session":"1.2"}}
    {"timestamp":"2019-09-04T12:41:24.284653783Z","level":"info","source":"bpm","message":"bpm.start.start-process.creating-job-prerequisites","data":{"job":"garden","process":"garden","session":"1.2"}}
    {"timestamp":"2019-09-04T12:41:24.285122996Z","level":"info","source":"bpm","message":"bpm.start.start-process.complete","data":{"job":"garden","process":"garden","session":"1.2"}}
    {"timestamp":"2019-09-04T12:41:24.285152091Z","level":"error","source":"bpm","message":"bpm.start.failed-to-start","data":{"error":"failed to create system files: no such file or directory","job":"garden","process":"garden","session":"1"}}
    {"timestamp":"2019-09-04T12:41:24.285164780Z","level":"info","source":"bpm","message":"bpm.start.complete","data":{"job":"garden","process":"garden","session":"1"}}

2. After `garden` fails to start, `rep` starts and creates the directory.
3. Now the `garden` job is able to start; here is the bpm log:

{"timestamp":"2019-09-04T12:41:55.831363962Z","level":"info","source":"bpm","message":"bpm.start.acquiring-lifecycle-lock.starting","data":{"job":"garden","process":"garden","session":"1.1"}} {"timestamp":"2019-09-04T12:41:55.831463893Z","level":"info","source":"bpm","message":"bpm.start.acquiring-lifecycle-lock.complete","data":{"job":"garden","process":"garden","session":"1.1"}} {"timestamp":"2019-09-04T12:41:55.831478942Z","level":"info","source":"bpm","message":"bpm.start.starting","data":{"job":"garden","process":"garden","session":"1"}} {"timestamp":"2019-09-04T12:41:55.837603349Z","level":"info","source":"bpm","message":"bpm.start.start-process.starting","data":{"job":"garden","process":"garden","session":"1.2"}} {"timestamp":"2019-09-04T12:41:55.837698012Z","level":"info","source":"bpm","message":"bpm.start.start-process.creating-job-prerequisites","data":{"job":"garden","process":"garden","session":"1.2"}} {"timestamp":"2019-09-04T12:41:55.838227829Z","level":"info","source":"bpm","message":"bpm.start.start-process.building-spec","data":{"job":"garden","process":"garden","session":"1.2"}} {"timestamp":"2019-09-04T12:41:55.838438853Z","level":"info","source":"bpm","message":"bpm.start.start-process.creating-bundle","data":{"job":"garden","process":"garden","session":"1.2"}} {"timestamp":"2019-09-04T12:41:55.842325219Z","level":"info","source":"bpm","message":"bpm.start.start-process.running-container","data":{"job":"garden","process":"garden","session":"1.2"}} {"timestamp":"2019-09-04T12:41:55.919018004Z","level":"info","source":"bpm","message":"bpm.start.start-process.complete","data":{"job":"garden","process":"garden","session":"1.2"}} {"timestamp":"2019-09-04T12:41:55.919089961Z","level":"info","source":"bpm","message":"bpm.start.complete","data":{"job":"garden","process":"garden","session":"1"}} {"timestamp":"2019-09-04T12:41:55.919104939Z","level":"info","source":"bpm","message":"bpm.start.releasing-lifecycle-lock.starting","data":{"job":"garden","process":"garden","session":"1.3"}} {"timestamp":"2019-09-04T12:41:55.919116759Z","level":"info","source":"bpm","message":"bpm.start.releasing-lifecycle-lock.complete","data":{"job":"garden","process":"garden","session":"1.3"}}

4. However, `garden`'s start script fails when executing the [`mount` command](https://github.com/masters-of-cats/dummy-release/blob/963f33c7e76a4fecb2aee8295ee3b9c5fc56dba0/jobs/garden/templates/bin/garden_start#L22), and the error is identical to the one that we see in the CF deployment:

    dummy/5ca1c1d6-7868-4670-bdb1-8e21a4333634:~# cat /var/vcap/sys/log/garden/garden.stderr.log
    [WARN tini (7)] Tini is not running as PID 1 and isn't registered as a child subreaper. Zombie processes will not be re-parented to Tini, so zombie reaping won't work. To fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1.
    mount: /var/vcap/data/store/groot.store: failed to setup loop device: Operation not permitted



By changing the mount command in the start script to `strace mount ...` you can see that the error comes from opening /dev/loop[X], which seems to be the root of the problem.
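
If one wants to reproduce the failure without grootfs or the start script at all, attaching any file to a loop device from inside the garden BPM container should hit the same EPERM (a hypothetical minimal check, using a throwaway file instead of the real backing store):

```sh
# minimal reproduction sketch; the /tmp path and size are arbitrary
truncate -s 100M /tmp/loop-test.img
losetup --find --show /tmp/loop-test.img   # expected to fail with "Operation not permitted" in the broken setup
```
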
xoebus commented 5 years ago

Thank you for digging into this!

I'm sorry, I don't have time to look into this further right now 😞. Would you be open to submitting a PR to fix this?

danail-branekov commented 5 years ago

Unfortunately I have no clue what the problem is; I assume it needs quite some investigation. Alas, rootless garden is currently being deprioritised, so I am not sure whether we will be able to work on this soon :(

bosh-admin-bot commented 3 years ago

This issue was marked as Stale because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days it will automatically be closed. To prevent this from happening, remove the Stale label or comment below.

bosh-admin-bot commented 3 years ago

This issue was closed because it has been labeled Stale for 7 days without subsequent activity. Feel free to re-open this issue at any time by commenting below.