Closed brian-peloton closed 7 years ago
Interesting! I'll try to reproduce this and see if I can come up with a solution somehow, but I probably won't have time for it this week (and I'm on vacation next week). :(
We have also noticed reliability issues with the default kernel of Ubuntu 14.04 LTS, which I think is 3.13. It is probably not the same issue, as we could never reproduce it on demand (it seemed that the system somehow got stuck in a state where sandboxing would fail from then on, and only a reboot would make it work again). But with the newer 4.x kernel available from the official Ubuntu repo, I never saw these or other issues with the sandbox.
I'm seeing the same issue on this Docker image: https://hub.docker.com/r/gerritforge/gerrit-ci-slave-bazel. I'm using openSUSE 42.
To reproduce:
$ docker run -ti --entrypoint=/bin/bash gerritforge/gerrit-ci-slave-bazel
$ su - jenkins
$ git clone --recursive https://gerrit.googlesource.com/gerrit
$ bazel build gerrit
INFO: Found 1 target...
ERROR: /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982/external/jsonevent_layout/jar/BUILD:2:1: Extracting interface @jsonevent_layout//jar:jar failed: linux-sandbox failed: error executing command /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982/execroot/gerrit/_bin/linux-sandbox ... (remaining 5 argument(s) skipped).
src/main/tools/linux-sandbox-pid1.cc:393: "mount": Operation not permitted
Target //:gerrit failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 23.554s, Critical Path: 0.88s
Upgrading to Bazel 0.4.0 didn't help either. Here is the log with the sandbox debug option enabled: [1].
Environment:
$ bazel info
bazel-bin: /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982/execroot/gerrit/bazel-out/local-fastbuild/bin
bazel-genfiles: /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982/execroot/gerrit/bazel-out/local-fastbuild/genfiles
bazel-testlogs: /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982/execroot/gerrit/bazel-out/local-fastbuild/testlogs
command_log: /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982/command.log
committed-heap-size: 990MB
execution_root: /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982/execroot/gerrit
gc-count: 9
gc-time: 259ms
install_base: /home/jenkins/.cache/bazel/_bazel_jenkins/install/0cc4b236e213b245b1e75e931bb2c011
max-heap-size: 7398MB
message_log: /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982/message.log
output_base: /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982
output_path: /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982/execroot/gerrit/bazel-out
package_path: %workspace%
release: release 0.4.0
server_pid: 1270
used-heap-size: 606MB
workspace: /home/jenkins/projects/gerrit
jenkins@68fab8fdcf00:~/projects/gerrit$ uname -a
Linux 68fab8fdcf00 4.1.34-33-default #1 SMP PREEMPT Thu Oct 20 08:03:29 UTC 2016 (fe18aba) x86_64 x86_64 x86_64 GNU/Linux
I'll try to reproduce & fix this, but currently I have no idea what could cause the mounting of /proc to fail. :(
We were able to fix the problem by starting Docker vm with some options.
I ran into the same problem, and it looks like a kernel compatibility issue. apt-get dist-upgrade (on Ubuntu 14.04) fixed the problem.
3.16.0-77-generic #99~14.04.1-Ubuntu failed
4.4.0-53-generic #74~14.04.1-Ubuntu works
I'm pretty sure it's a kernel version-related issue too.
@davido: What options made it work? Also, what kernel are you using?
It was --privileged: [1].
Kernel here is:
$ uname -a
Linux linux-ucwl.site 4.1.34-33-default #1 SMP PREEMPT Thu Oct 20 08:03:29 UTC 2016 (fe18aba) x86_64 x86_64 x86_64 GNU/Linux
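For anyone trying this: the option in question is the --privileged flag of docker run. A minimal sketch of starting the reproduction container from earlier in the thread with it. The function name run_bazel_container and the DOCKER override variable are made up for this example and are not part of Docker or Bazel. Keep in mind that --privileged lifts most of the container's isolation, so treat it as a stopgap.

```shell
# Hypothetical helper (not part of Docker or Bazel): start an interactive
# shell in the given image with --privileged, the option that made the
# sandbox work in the report above.
# DOCKER is an assumed override for dry runs, e.g. DOCKER=echo.
run_bazel_container() {
  image="$1"
  shift
  "${DOCKER:-docker}" run --privileged -ti --entrypoint=/bin/bash "$image" "$@"
}
```

Usage would be something like run_bazel_container gerritforge/gerrit-ci-slave-bazel; with DOCKER=echo set, it only prints the arguments instead of starting a container.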
Seeing the same error in an Arch Linux LXC container running on Proxmox (Debian Jessie kernel)
$ uname -a
Linux machinelearning 4.4.35-2-pve #1 SMP Mon Jan 9 10:21:44 CET 2017 x86_64 GNU/Linux
--- Build logs
Build successful! Binary is here: /pkg/makepkg/bazel/src/output/bazel
Extracting Bazel installation...
......
INFO: Found 1 target...
ERROR: /pkg/makepkg/bazel/src/src/main/native/BUILD:1:1: Executing genrule //src/main/native:copy_link_jni_md_header failed: linux-sandbox failed: error executing command /home/ml/.cache/bazel/_bazel_ml/6ae2aecfa6ff1003adffee270b604ad9/execroot/src/_bin/linux-sandbox ... (remaining 5 argument(s) skipped).
src/main/tools/linux-sandbox-pid1.cc:88: "mount": Permission denied
Target //scripts:bazel-complete.bash failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 2.972s, Critical Path: 0.14s
Edit: Proxmox ISOs are here: https://www.proxmox.com/en/downloads
Also seeing this error under CircleCI's docker containers:
Within the CircleCI container:
(venv-3.4.3) ubuntu@box260:~/code$ uname -a
Linux box260.localdomain 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Console output:
(venv-3.4.3) ubuntu@box260:~/code$ bazel test test/... --verbose_failures --sandbox_debug
INFO: Found 3 test targets...
ERROR: /home/ubuntu/.cache/bazel/_bazel_ubuntu/185255daeeca84642f8709521495e24f/external/org_jooq_jool/jar/BUILD:2:1: Extracting interface @org_jooq_jool//jar:jar failed: linux-sandbox failed: error executing command
  (cd /home/ubuntu/.cache/bazel/_bazel_ubuntu/185255daeeca84642f8709521495e24f/bazel-sandbox/60d55d3c-50a2-4bb9-a03e-8fb9ffa83e6b-1/execroot/code && \
  exec env - \
    PATH=/home/ubuntu/.yarn/bin:/opt/circleci/nodejs/v6.5.0/bin:/opt/google-cloud-sdk/bin:/home/ubuntu/virtualenvs/venv-3.4.3/bin:/opt/ghc/8.0.1/bin:/opt/cabal/1.24/bin:/opt/alex/3.1.7/bin:/opt/happy/1.19.5/bin:/home/ubuntu/.composer/vendor/bin:/opt/circleci/.phpenv/shims:/opt/circleci/.phpenv/bin:/opt/circleci/.rvm/gems/ruby-2.2.6/bin:/opt/circleci/.rvm/gems/ruby-2.2.6@global/bin:/opt/circleci/.rvm/rubies/ruby-2.2.6/bin:/home/ubuntu/.go_workspace/bin:/usr/local/go/bin:/opt/circleci/nodejs/v6.5.0/bin:/opt/circleci/.pyenv/shims:/opt/circleci/.pyenv/bin:/usr/local/android-sdk-linux/platform-tools:/usr/local/android-sdk-linux/tools:/usr/local/apache-maven/bin:/home/ubuntu/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/gradle-1.10/bin:/opt/circleci/.rvm/bin:/opt/circleci/.rvm/bin \
  /home/ubuntu/.cache/bazel/_bazel_ubuntu/185255daeeca84642f8709521495e24f/execroot/code/_bin/linux-sandbox @/home/ubuntu/.cache/bazel/_bazel_ubuntu/185255daeeca84642f8709521495e24f/bazel-sandbox/60d55d3c-50a2-4bb9-a03e-8fb9ffa83e6b-1/linux-sandbox.params -- external/bazel_tools/tools/jdk/ijar/ijar external/org_jooq_jool/jar/jool-0.9.12.jar bazel-out/local-fastbuild/genfiles/external/org_jooq_jool/jar/_ijar/jar/external/org_jooq_jool/jar/jool-0.9.12-ijar.jar).
src/main/tools/linux-sandbox.cc:183: linux-sandbox-pid1 has PID 45135
src/main/tools/linux-sandbox-pid1.cc:88: "mount": Permission denied
src/main/tools/linux-sandbox.cc:223: child exited normally with exitcode 1
ERROR: /home/ubuntu/code/BUILD:1:1 Extracting interface @org_jooq_jool//jar:jar failed: linux-sandbox failed: error executing command
  (cd /home/ubuntu/.cache/bazel/_bazel_ubuntu/185255daeeca84642f8709521495e24f/bazel-sandbox/60d55d3c-50a2-4bb9-a03e-8fb9ffa83e6b-1/execroot/code && \
  exec env - \
    PATH=/home/ubuntu/.yarn/bin:/opt/circleci/nodejs/v6.5.0/bin:/opt/google-cloud-sdk/bin:/home/ubuntu/virtualenvs/venv-3.4.3/bin:/opt/ghc/8.0.1/bin:/opt/cabal/1.24/bin:/opt/alex/3.1.7/bin:/opt/happy/1.19.5/bin:/home/ubuntu/.composer/vendor/bin:/opt/circleci/.phpenv/shims:/opt/circleci/.phpenv/bin:/opt/circleci/.rvm/gems/ruby-2.2.6/bin:/opt/circleci/.rvm/gems/ruby-2.2.6@global/bin:/opt/circleci/.rvm/rubies/ruby-2.2.6/bin:/home/ubuntu/.go_workspace/bin:/usr/local/go/bin:/opt/circleci/nodejs/v6.5.0/bin:/opt/circleci/.pyenv/shims:/opt/circleci/.pyenv/bin:/usr/local/android-sdk-linux/platform-tools:/usr/local/android-sdk-linux/tools:/usr/local/apache-maven/bin:/home/ubuntu/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/gradle-1.10/bin:/opt/circleci/.rvm/bin:/opt/circleci/.rvm/bin \
  /home/ubuntu/.cache/bazel/_bazel_ubuntu/185255daeeca84642f8709521495e24f/execroot/code/_bin/linux-sandbox @/home/ubuntu/.cache/bazel/_bazel_ubuntu/185255daeeca84642f8709521495e24f/bazel-sandbox/60d55d3c-50a2-4bb9-a03e-8fb9ffa83e6b-1/linux-sandbox.params -- external/bazel_tools/tools/jdk/ijar/ijar external/org_jooq_jool/jar/jool-0.9.12.jar bazel-out/local-fastbuild/genfiles/external/org_jooq_jool/jar/_ijar/jar/external/org_jooq_jool/jar/jool-0.9.12-ijar.jar).
INFO: Elapsed time: 7.713s, Critical Path: 2.15s
Executed 0 out of 3 tests: 1 fails to build and 2 were skipped.
Small update. I tried deactivating AppArmor for my LXC container with
lxc.aa_profile = unconfined
I still get the Operation not permitted issue while building Bazel.
I managed to build Bazel itself in an LXC container by disabling sandboxing altogether, adding --strategy=Genrule=standalone --spawn_strategy=standalone to the bazel build line.
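Spelled out as a full command, that might look like the sketch below. The wrapper name bazel_build_unsandboxed and the BAZEL override variable are invented for illustration; only the two strategy flags come from the comment above. With the sandbox off, actions can read files they never declared as inputs, so missing dependencies may go unnoticed.

```shell
# Hypothetical wrapper: run "bazel build" with the two strategy flags quoted
# above, which disable sandboxing for genrules and for other spawned actions.
# BAZEL is an assumed override for dry runs, e.g. BAZEL=echo.
bazel_build_unsandboxed() {
  "${BAZEL:-bazel}" build \
    --strategy=Genrule=standalone \
    --spawn_strategy=standalone \
    "$@"
}
```

For example, bazel_build_unsandboxed //:gerrit would build the target from the earlier report without the sandbox.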
On my Debian system I had to modify Bazel as follows:
@@ -402,8 +404,9 @@ static void MakeFilesystemMostlyReadOnly() {
 static void MountProc() {
   // Mount a new proc on top of the old one, because the old one still refers to
   // our parent PID namespace.
but I don't know what the general implications of this are, nor how to check that it is not breaking anything. I'd appreciate it if somebody familiar with sandboxing would take this and check; otherwise I'll try a PR over the weekend.
I went to write up a patch doing that, and it turns out it doesn't actually work... You still end up with the wrong PIDs on /proc.
Turns out the root cause isn't the kernel version; it's actually what you have mounted in /proc. In my case, it's /proc/xen. projectatomic/bubblewrap#134 and opencontainers/runc#252 both reference the same issue.
However, you can work around it by unmounting /proc/xen in a privileged mount namespace first:
brian[16259] dev-builder ~
$ sudo unshare --mount --propagation private
root[875] dev-builder /home2/brian
# umount /proc/xen
root[876] dev-builder /home2/brian
# su brian
brian[16264] dev-builder ~
$ unshare --fork --pid --mount --map-root-user
root[16264] dev-builder ~
# mount -t proc proc /proc
That workaround does require privileges, but you could in theory do it before spawning the login shell or something. I think I'm going to just unmount /proc/xen system-wide because it's for compatibility and it looks like my systems don't have anything, but there are options.
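The last two commands of that session also work as a quick check for whether a machine is affected: if an unprivileged user cannot mount a fresh proc inside new user, PID, and mount namespaces, linux-sandbox will presumably hit the same error. A sketch that wraps them in a function; the names can_mount_fresh_proc and proc_mount_status are invented here, and it assumes the util-linux unshare used above is installed.

```shell
# Hypothetical probe built from the unshare/mount commands in the session
# above. Succeeds if an unprivileged user can mount a fresh proc inside new
# user, PID, and mount namespaces. The mount happens inside the new mount
# namespace, so the caller's own /proc is left alone.
can_mount_fresh_proc() {
  unshare --fork --pid --mount --map-root-user \
    mount -t proc proc /proc >/dev/null 2>&1
}

# Print "ok" or "blocked". "blocked" also covers a missing unshare binary
# or disabled user namespaces, so treat it as a hint, not a diagnosis.
proc_mount_status() {
  if can_mount_fresh_proc; then
    echo ok
  else
    echo blocked
  fi
}
```

Running proc_mount_status before and after unmounting /proc/xen should show whether the workaround took effect on a given machine.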
Given that it looks like this is a kernel/system issue and not really a Bazel issue, and c2d773e made it fail gracefully, I'm going to close this now. I'll send out the test case I wrote to catch /proc being wrong with @tsuri's idea to make it more obvious that it doesn't work if anybody else tries it in the future.
Has anyone applied the workaround successfully? Say I start with
https://hub.docker.com/r/insready/bazel/
docker run -it --rm insready/bazel
I haven't been able to fix the /proc mountpoint so that bazel sandboxing works.
(It would be extra cool if the Bazel team maintained a docker image so it would be easy to run bazel builds on CI like Circle)
Yes, see my comment from "Dec 14, 2016": the workaround is to pass the --privileged option to the docker command.
I don't think that works in CI environments where you don't run the container yourself. See https://discuss.circleci.com/t/option-to-run-docker-with-privileged-on-circle-2-0/12377
@alexeagle The container builder team has gcr.io/cloud-builders/bazel.
It's built from: https://github.com/GoogleCloudPlatform/cloud-builders/
Sorry to comment on this stale thread, but we hit the same issue of linux-sandbox being unavailable when running Bazel inside a Docker container. In our case the root of the problem is the NVIDIA runtime, though.
Problem: because the NVIDIA runtime mounts a nested proc, running Bazel within a Docker container fails with
src/main/tools/linux-sandbox-pid1.cc:441: "mount": Operation not permitted
We see that there's a nested proc mount
unshare --mount --map-root-user --pid --fork
# mount | grep proc
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /proc/driver/nvidia type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=555,inode64)
proc on /proc/driver/nvidia/gpus/0000:b3:00.0 type proc (ro,nosuid,nodev,noexec,relatime)
While I know this is an NVIDIA problem and limited to local execution, it would be nice to be able to use linux-sandbox within a Docker container with access to the NVIDIA runtime.
Proposal: applying the recursive bind option from @tsuri fixes this issue: https://github.com/bazelbuild/bazel/pull/18069. Could someone review this small patch? 😅 It would save us the complexity of maintaining our own patch.
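To see whether a container has this kind of nested mount before starting a build, the mount table can be filtered for anything mounted below /proc. A sketch, assuming a Linux-style mounts file (device, mount point, filesystem type, options, ...); nested_proc_mounts is a name made up for this example. A hit is only a candidate cause, since some mounts under /proc may be harmless.

```shell
# Hypothetical helper: print every mount point below /proc together with its
# filesystem type. Reads /proc/self/mounts unless another file in the same
# format is passed as the first argument.
nested_proc_mounts() {
  awk '$2 ~ /^\/proc\// { print $2 " (" $3 ")" }' "${1:-/proc/self/mounts}"
}
```

On a setup like the one shown above, the two entries under /proc/driver/nvidia are what this should list; an empty result means nothing is mounted below /proc.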
When trying to build anything with the new sandbox and Debian Jessie's amd64 default 3.16.0-4 kernel, it fails with
src/main/tools/linux-sandbox-pid1.cc:393: "mount": Operation not permitted
@philsc and I have previously looked for ways to make /proc show the right PIDs in a PID namespace on that kernel without root permission and have not come up with anything.
I don't have any good answers in the way of solutions. asan definitely does not do well with a broken /proc (that's what @philsc and I were working on previously, although we ran into other, more fundamental issues and gave up), and from what I've seen of Java it won't either. However, having a PID namespace is really nice for preventing runaway processes (I periodically have to use pgrep and manually kill runaway test processes with the old sandbox).
These commands show the same issue with that kernel:
Those same commands succeed with 4.3.0-0 kernel from jessie-backports, so I'm pretty sure Bazel's sandbox will too (haven't checked though):
/cc @philwo