bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

src/main/tools/linux-sandbox-pid1.cc:393: "mount": Operation not permitted #1972

Closed: brian-peloton closed this issue 7 years ago

brian-peloton commented 8 years ago

When trying to build anything with the new sandbox and Debian Jessie's amd64 default 3.16.0-4 kernel, it fails with src/main/tools/linux-sandbox-pid1.cc:393: "mount": Operation not permitted. @philsc and I have previously looked for ways to make /proc show the right PIDs in a PID namespace on that kernel without root permission and not come up with anything.

I don't have any good answers in the way of solutions. asan definitely does not do well with a broken /proc (that's what @philsc and I were working on previously, although we ran into other, more fundamental issues and gave up), and from what I've seen of Java, it won't either. However, having a PID namespace is really nice for preventing runaway processes (with the old sandbox, I periodically have to use pgrep and manually kill runaway test processes).

These commands show the same issue with that kernel:

brian[907] dev-builder ~:
$ unshare --mount --map-root-user --pid --fork
root[857] dev-builder ~:
# mount -t proc proc /proc
mount: permission denied
root[857] dev-builder ~:

Those same commands succeed with the 4.3.0-0 kernel from jessie-backports, so I'm pretty sure Bazel's sandbox will work there too (I haven't checked, though):

brian[17107] brian-debian ~:
$ unshare --mount --map-root-user --pid --fork
root[501] brian-debian ~:
# mount -t proc proc /proc
root[501] brian-debian ~:

/cc @philwo

philwo commented 8 years ago

Interesting! I'll try to reproduce this and see if I can come up with a solution somehow, but I probably won't have time for it this week (and I'm on vacation next week). :(

We have noticed reliability issues with the default kernel of Ubuntu 14.04 LTS (3.13, I think) as well. It is probably not the same issue, as we could never reproduce it on demand (it seemed like the system somehow got stuck in a state where sandboxing would fail from then on, and only a reboot would make it work again). But with the newer 4.x kernel available from the official Ubuntu repo, I never saw these or other issues with the sandbox.

davido commented 8 years ago

I'm seeing the same issue on this Docker image: https://hub.docker.com/r/gerritforge/gerrit-ci-slave-bazel. I'm using openSUSE 42.

To reproduce:

$ docker run -ti --entrypoint=/bin/bash gerritforge/gerrit-ci-slave-bazel
$ su - jenkins
$ git clone --recursive https://gerrit.googlesource.com/gerrit
$ bazel build gerrit
INFO: Found 1 target...
ERROR: /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982/external/jsonevent_layout/jar/BUILD:2:1: Extracting interface @jsonevent_layout//jar:jar failed: linux-sandbox failed: error executing command /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982/execroot/gerrit/_bin/linux-sandbox ... (remaining 5 argument(s) skipped).
src/main/tools/linux-sandbox-pid1.cc:393: "mount": Operation not permitted
Target //:gerrit failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 23.554s, Critical Path: 0.88s

davido commented 8 years ago

Upgrading to Bazel 0.4.0 didn't help either. Here is the log with the sandbox debug option enabled: [1].

Environment:

$ bazel info       
bazel-bin: /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982/execroot/gerrit/bazel-out/local-fastbuild/bin
bazel-genfiles: /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982/execroot/gerrit/bazel-out/local-fastbuild/genfiles
bazel-testlogs: /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982/execroot/gerrit/bazel-out/local-fastbuild/testlogs
command_log: /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982/command.log
committed-heap-size: 990MB
execution_root: /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982/execroot/gerrit
gc-count: 9
gc-time: 259ms
install_base: /home/jenkins/.cache/bazel/_bazel_jenkins/install/0cc4b236e213b245b1e75e931bb2c011
max-heap-size: 7398MB
message_log: /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982/message.log
output_base: /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982
output_path: /home/jenkins/.cache/bazel/_bazel_jenkins/4e8644684552b40c50dc624b79e09982/execroot/gerrit/bazel-out
package_path: %workspace%
release: release 0.4.0
server_pid: 1270
used-heap-size: 606MB
workspace: /home/jenkins/projects/gerrit

jenkins@68fab8fdcf00:~/projects/gerrit$ uname -a
Linux 68fab8fdcf00 4.1.34-33-default #1 SMP PREEMPT Thu Oct 20 08:03:29 UTC 2016 (fe18aba) x86_64 x86_64 x86_64 GNU/Linux

philwo commented 7 years ago

I'll try to reproduce & fix this, but currently I have no idea what could cause the mounting of /proc to fail. :(

davido commented 7 years ago

We were able to fix the problem by starting the Docker VM with some options.

faithseed commented 7 years ago

I ran into the same problem, and it looks like a kernel compatibility issue. apt-get dist-upgrade (on Ubuntu 14.04) fixed the problem:

3.16.0-77-generic #99~14.04.1-Ubuntu failed
4.4.0-53-generic #74~14.04.1-Ubuntu works

brian-peloton commented 7 years ago

I'm pretty sure it's a kernel version-related issue too.

@davido: What options made it work? Also, what kernel are you using?

davido commented 7 years ago

It was --privileged: [1].

Kernel here is:

$ uname -a
Linux linux-ucwl.site 4.1.34-33-default #1 SMP PREEMPT Thu Oct 20 08:03:29 UTC 2016 (fe18aba) x86_64 x86_64 x86_64 GNU/Linux

mratsim commented 7 years ago

Seeing the same error in an Arch Linux LXC container running on Proxmox (Debian Jessie kernel):

$ uname -a
Linux machinelearning 4.4.35-2-pve #1 SMP Mon Jan 9 10:21:44 CET 2017 x86_64 GNU/Linux

Build logs:

Build successful! Binary is here: /pkg/makepkg/bazel/src/output/bazel
Extracting Bazel installation... ......
INFO: Found 1 target...
ERROR: /pkg/makepkg/bazel/src/src/main/native/BUILD:1:1: Executing genrule //src/main/native:copy_link_jni_md_header failed: linux-sandbox failed: error executing command /home/ml/.cache/bazel/_bazel_ml/6ae2aecfa6ff1003adffee270b604ad9/execroot/src/_bin/linux-sandbox ... (remaining 5 argument(s) skipped).
src/main/tools/linux-sandbox-pid1.cc:88: "mount": Permission denied
Target //scripts:bazel-complete.bash failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 2.972s, Critical Path: 0.14s

Edit: Proxmox ISOs are here: https://www.proxmox.com/en/downloads

nornagon commented 7 years ago

Also seeing this error under CircleCI's docker containers:

Within the CircleCI container:

(venv-3.4.3) ubuntu@box260:~/code$ uname -a
Linux box260.localdomain 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Console output:

(venv-3.4.3) ubuntu@box260:~/code$ bazel test test/... --verbose_failures --sandbox_debug
INFO: Found 3 test targets...
ERROR: /home/ubuntu/.cache/bazel/_bazel_ubuntu/185255daeeca84642f8709521495e24f/external/org_jooq_jool/jar/BUILD:2:1: Extracting interface @org_jooq_jool//jar:jar failed: linux-sandbox failed: error executing command
  (cd /home/ubuntu/.cache/bazel/_bazel_ubuntu/185255daeeca84642f8709521495e24f/bazel-sandbox/60d55d3c-50a2-4bb9-a03e-8fb9ffa83e6b-1/execroot/code && \
  exec env - \
    PATH=/home/ubuntu/.yarn/bin:/opt/circleci/nodejs/v6.5.0/bin:/opt/google-cloud-sdk/bin:/home/ubuntu/virtualenvs/venv-3.4.3/bin:/opt/ghc/8.0.1/bin:/opt/cabal/1.24/bin:/opt/alex/3.1.7/bin:/opt/happy/1.19.5/bin:/home/ubuntu/.composer/vendor/bin:/opt/circleci/.phpenv/shims:/opt/circleci/.phpenv/bin:/opt/circleci/.rvm/gems/ruby-2.2.6/bin:/opt/circleci/.rvm/gems/ruby-2.2.6@global/bin:/opt/circleci/.rvm/rubies/ruby-2.2.6/bin:/home/ubuntu/.go_workspace/bin:/usr/local/go/bin:/opt/circleci/nodejs/v6.5.0/bin:/opt/circleci/.pyenv/shims:/opt/circleci/.pyenv/bin:/usr/local/android-sdk-linux/platform-tools:/usr/local/android-sdk-linux/tools:/usr/local/apache-maven/bin:/home/ubuntu/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/gradle-1.10/bin:/opt/circleci/.rvm/bin:/opt/circleci/.rvm/bin \
  /home/ubuntu/.cache/bazel/_bazel_ubuntu/185255daeeca84642f8709521495e24f/execroot/code/_bin/linux-sandbox @/home/ubuntu/.cache/bazel/_bazel_ubuntu/185255daeeca84642f8709521495e24f/bazel-sandbox/60d55d3c-50a2-4bb9-a03e-8fb9ffa83e6b-1/linux-sandbox.params -- external/bazel_tools/tools/jdk/ijar/ijar external/org_jooq_jool/jar/jool-0.9.12.jar bazel-out/local-fastbuild/genfiles/external/org_jooq_jool/jar/_ijar/jar/external/org_jooq_jool/jar/jool-0.9.12-ijar.jar).
src/main/tools/linux-sandbox.cc:183: linux-sandbox-pid1 has PID 45135
src/main/tools/linux-sandbox-pid1.cc:88: "mount": Permission denied
src/main/tools/linux-sandbox.cc:223: child exited normally with exitcode 1
ERROR: /home/ubuntu/code/BUILD:1:1 Extracting interface @org_jooq_jool//jar:jar failed: linux-sandbox failed: error executing command
  (cd /home/ubuntu/.cache/bazel/_bazel_ubuntu/185255daeeca84642f8709521495e24f/bazel-sandbox/60d55d3c-50a2-4bb9-a03e-8fb9ffa83e6b-1/execroot/code && \
  exec env - \
    PATH=/home/ubuntu/.yarn/bin:/opt/circleci/nodejs/v6.5.0/bin:/opt/google-cloud-sdk/bin:/home/ubuntu/virtualenvs/venv-3.4.3/bin:/opt/ghc/8.0.1/bin:/opt/cabal/1.24/bin:/opt/alex/3.1.7/bin:/opt/happy/1.19.5/bin:/home/ubuntu/.composer/vendor/bin:/opt/circleci/.phpenv/shims:/opt/circleci/.phpenv/bin:/opt/circleci/.rvm/gems/ruby-2.2.6/bin:/opt/circleci/.rvm/gems/ruby-2.2.6@global/bin:/opt/circleci/.rvm/rubies/ruby-2.2.6/bin:/home/ubuntu/.go_workspace/bin:/usr/local/go/bin:/opt/circleci/nodejs/v6.5.0/bin:/opt/circleci/.pyenv/shims:/opt/circleci/.pyenv/bin:/usr/local/android-sdk-linux/platform-tools:/usr/local/android-sdk-linux/tools:/usr/local/apache-maven/bin:/home/ubuntu/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/gradle-1.10/bin:/opt/circleci/.rvm/bin:/opt/circleci/.rvm/bin \
  /home/ubuntu/.cache/bazel/_bazel_ubuntu/185255daeeca84642f8709521495e24f/execroot/code/_bin/linux-sandbox @/home/ubuntu/.cache/bazel/_bazel_ubuntu/185255daeeca84642f8709521495e24f/bazel-sandbox/60d55d3c-50a2-4bb9-a03e-8fb9ffa83e6b-1/linux-sandbox.params -- external/bazel_tools/tools/jdk/ijar/ijar external/org_jooq_jool/jar/jool-0.9.12.jar bazel-out/local-fastbuild/genfiles/external/org_jooq_jool/jar/_ijar/jar/external/org_jooq_jool/jar/jool-0.9.12-ijar.jar).
INFO: Elapsed time: 7.713s, Critical Path: 2.15s

Executed 0 out of 3 tests: 1 fails to build and 2 were skipped.

mratsim commented 7 years ago

Small update. I tried deactivating AppArmor for my LXC container with lxc.aa_profile = unconfined

I still get the Operation not permitted issue while building Bazel

mratsim commented 7 years ago

I managed to build Bazel itself in an LXC container by disabling sandboxing altogether, adding --strategy=Genrule=standalone --spawn_strategy=standalone to the bazel build command line.
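A minimal sketch of applying this workaround only on hosts that need it, assuming util-linux unshare is installed: proc_mount_works repeats the unprivileged unshare and mount -t proc sequence from the start of this issue, and sandbox_fallback_flags prints the standalone strategy flags only when that probe fails. Both function names are hypothetical, not part of Bazel, and a passing probe does not guarantee that every linux-sandbox step will succeed.

```shell
# Hypothetical helpers, not part of Bazel.

# Succeeds if an unprivileged user + mount + PID namespace can mount a fresh
# proc on /proc (the same unshare/mount reproduction used in this issue).
# unshare --mount makes mounts private by default, so the host is unaffected.
proc_mount_works() {
  unshare --mount --map-root-user --pid --fork \
    sh -c 'mount -t proc proc /proc' >/dev/null 2>&1
}

# Prints nothing when the probe succeeds; otherwise prints the flags that
# fall back to unsandboxed (standalone) execution.
sandbox_fallback_flags() {
  if proc_mount_works; then
    return 0
  fi
  printf '%s\n' "--spawn_strategy=standalone --strategy=Genrule=standalone"
}
```

Hypothetical usage, leaving the substitution unquoted so the flags split into separate arguments: bazel build $(sandbox_fallback_flags) //...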

tsuri commented 7 years ago

On my Debian system I had to modify Bazel as follows:

@@ -402,8 +404,9 @@ static void MakeFilesystemMostlyReadOnly() {
 static void MountProc() {
   // Mount a new proc on top of the old one, because the old one still refers to
   // our parent PID namespace.

but I don't know what the general implications of this are, nor how to check that it isn't breaking anything. I'd appreciate it if somebody familiar with sandboxing could take this and check it; otherwise I'll try a PR over the weekend.

brian-peloton commented 7 years ago

I went to write up a patch doing that, and it turns out it doesn't actually work... You still end up with the wrong PIDs in /proc.

Turns out the root cause isn't the kernel version; it's actually what you have mounted in /proc. In my case, it's /proc/xen. projectatomic/bubblewrap#134 and opencontainers/runc#252 both reference the same issue.

However, you can work around it by unmounting /proc/xen in a privileged mount namespace first:

brian[16259] dev-builder ~
$ sudo unshare --mount --propagation private
root[875] dev-builder /home2/brian
# umount /proc/xen
root[876] dev-builder /home2/brian
# su brian
brian[16264] dev-builder ~
$ unshare --fork --pid --mount --map-root-user
root[16264] dev-builder ~
# mount -t proc proc /proc

That workaround does require privileges, but you could in theory do it before spawning the login shell or something. I think I'm going to just unmount /proc/xen system-wide because it's for compatibility and it looks like my systems don't have anything, but there are options.
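Since the failure comes from what is mounted inside /proc (/proc/xen here), one way to spot such mounts on another host is to scan /proc/self/mountinfo. A minimal diagnostic sketch, assuming the standard mountinfo layout (mount point in field 5, filesystem type right after the lone "-" separator); list_proc_submounts is a hypothetical helper, and an empty result does not prove that the kernel will allow the proc mount.

```shell
# Hypothetical helper: print "<mount point> <fs type>" for every mount
# nested below /proc. Reads /proc/self/mountinfo unless a file is given.
list_proc_submounts() {
  awk '{
    # Optional fields start at field 7 and end with a lone "-";
    # the filesystem type is the field right after it.
    fstype = ""
    for (i = 7; i <= NF; i++) {
      if ($i == "-") { fstype = $(i + 1); break }
    }
    # Field 5 is the mount point; /proc itself is not printed.
    if ($5 ~ /^\/proc\//) print $5, fstype
  }' "${1:-/proc/self/mountinfo}"
}
```

On a host like the one above it would be expected to list /proc/xen; where nothing is mounted below /proc it prints nothing.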

Given that it looks like this is a kernel/system issue and not really a Bazel issue, and c2d773e made it fail gracefully, I'm going to close this now. I'll send out the test case I wrote to catch /proc being wrong with @tsuri's idea to make it more obvious that it doesn't work if anybody else tries it in the future.

alexeagle commented 7 years ago

Has anyone applied the workaround successfully? Say I start with https://hub.docker.com/r/insready/bazel/ and run:

docker run -it --rm insready/bazel

I haven't been able to fix the /proc mountpoint so that Bazel sandboxing works.

(It would be extra cool if the Bazel team maintained a docker image so it would be easy to run bazel builds on CI like Circle)

davido commented 7 years ago

Yes, see my comment from "Dec 14, 2016": the workaround is to pass the --privileged option to the docker command.

alexeagle commented 7 years ago

I don't think that works in CI environments where you don't run the container yourself. See https://discuss.circleci.com/t/option-to-run-docker-with-privileged-on-circle-2-0/12377

mattmoor commented 7 years ago

@alexeagle The container builder team has gcr.io/cloud-builders/bazel.

mattmoor commented 7 years ago

It's built from: https://github.com/GoogleCloudPlatform/cloud-builders/

Ryang20718 commented 1 year ago

Sorry to comment on this stale thread, but we hit the same issue of linux-sandbox being unavailable when running Bazel inside a Docker container. The root of the problem stems from NVIDIA, though.

Problem: because the NVIDIA runtime mounts proc, we hit the following when running Bazel within a Docker container:

src/main/tools/linux-sandbox-pid1.cc:441: "mount": Operation not permitted

We see that there's a nested proc mount:

unshare --mount --map-root-user --pid --fork
# mount | grep proc
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /proc/driver/nvidia type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=555,inode64)
proc on /proc/driver/nvidia/gpus/0000:b3:00.0 type proc (ro,nosuid,nodev,noexec,relatime)

While I know this is an NVIDIA problem and limited to local execution, it would be nice to be able to use linux-sandbox within a Docker container with access to the NVIDIA runtime.

Proposal: applying the recursive bind option from @tsuri fixes this issue: https://github.com/bazelbuild/bazel/pull/18069. Could someone review this small patch? 😅 It would save us the complexity of maintaining our own patch.