NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Docker image build fails with unprivileged user namespace isolation #85

Open lahwaacz opened 1 year ago

lahwaacz commented 1 year ago

I have Docker configured with unprivileged user namespace isolation, using 65536 UIDs and GIDs starting from 864165. I tried to build the image according to the README, but it failed with these errors:

107.0 + tar xzf protobuf.tar.gz -C /root/.build/protobuf_src --strip-components=1
107.0 tar: six.BUILD: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/add_person.py: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/WORKSPACE: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/list_people.dart: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/add_person.cc: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/AddPerson.java: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/list_people.cc: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/addressbook.proto: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/third_party/zlib.BUILD: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/third_party: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/list_people_test.go: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/list_people.py: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/list_people.go: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/ListPeople.java: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/add_person.dart: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/add_person.go: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/add_person_test.go: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/Makefile: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/CMakeLists.txt: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/README.md: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/BUILD: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples/pubspec.yaml: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: examples: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.0 tar: LICENSE: Cannot change ownership to uid 231664, gid 89939: Invalid argument
...
107.1 tar: config.sub: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.1 tar: config.h.in: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.1 tar: aclocal.m4: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.1 tar: protobuf-lite.pc.in: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.1 tar: config.guess: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.1 tar: util/python/BUILD: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.1 tar: util/python: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.1 tar: util: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.1 tar: README.md: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.1 tar: BUILD: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.1 tar: compile: Cannot change ownership to uid 231664, gid 89939: Invalid argument
107.1 tar: Exiting with failure status due to previous errors
107.1 xargs: bash: exited with status 255; aborting
------
Dockerfile:25
--------------------
  23 |     RUN set -ex; find /root/.build/scripts_host -iname '*.sh' -exec chmod a+x {} \;
  24 |     WORKDIR /root/.build/scripts_host
  25 | >>> RUN bash -c 'set -ex -o pipefail; find . -iregex "^\.\/[0-9]+_.*" | sort | xargs -n1 -I {} bash -c "{} || exit 255"'
  26 |
  27 |     COPY --chown=root:root scripts /root/.build/scripts/
--------------------
ERROR: failed to solve: process "/bin/sh -c bash -c 'set -ex -o pipefail; find . -iregex \"^\\.\\/[0-9]+_.*\" | sort | xargs -n1 -I {} bash -c \"{} || exit 255\"'" did not complete successfully: exit code: 124

nikkon-dev commented 1 year ago

@lahwaacz,

Could you try the solution mentioned here? https://github.com/containers/buildah/issues/1702#issuecomment-508143700

lahwaacz commented 1 year ago

@nikkon-dev Thanks for the advice. I did not try passing --no-same-owner to tar, but extending the UID/GID mapping to provide 262144 IDs (4 * 2^16) for docker containers allowed the build to pass.
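
The arithmetic behind this can be checked with a small sketch. It assumes the mapping covers in-container IDs 0 through size-1, as the description of the setup suggests, and that the IDs stored in the protobuf tarball are the ones printed in the tar errors (uid 231664, gid 89939). The helper name `id_in_map` is hypothetical and not part of Docker or DCGM.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when an in-container ID falls inside a
# user-namespace mapping that covers IDs 0 .. size-1 (assumption).
id_in_map() {
    local id="$1" size="$2"
    [ "$id" -ge 0 ] && [ "$id" -lt "$size" ]
}

# IDs recorded in the tarball, copied from the tar error messages.
tar_uid=231664
tar_gid=89939

# Compare the original mapping size with the extended one (4 * 2^16).
for size in 65536 262144; do
    for id in "$tar_uid" "$tar_gid"; do
        if id_in_map "$id" "$size"; then
            echo "size=$size id=$id mapped"
        else
            echo "size=$size id=$id unmapped"
        fi
    done
done
```

With 65536 IDs both values are out of range, which would explain why `chown` inside the container returned "Invalid argument"; with 262144 IDs both fit. The untried alternative, adding `--no-same-owner` to the `tar xzf` call, should avoid the problem from the other side, since tar then leaves extracted files owned by the extracting user instead of restoring the archived owner.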

lahwaacz commented 1 year ago

But now I have a problem when I try to use the image to actually build DCGM. The build script gives this error:

+ ./build.sh --arch amd64 --sa-mode 0 --release --
fatal: --local can only be used inside a git repository
Installing git-lfs locally
Error: failed to call git rev-parse --git-dir: exit status 128 : fatal: detected dubious ownership in repository at '/workspaces/DCGM'
To add an exception for this directory, call:

        git config --global --add safe.directory /workspaces/DCGM

Not in a git repository.

I see the script runs a docker container with -u "$(id -u)":"$(id -g)" and maps the cloned git repository as a volume to /workspaces/DCGM in the container, which does not work correctly with user namespacing...
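
A likely explanation, stated as an assumption rather than a confirmed diagnosis: with user namespace remapping, the host owner of the bind-mounted repository has no mapping inside the container, so the directory appears to be owned by a different user than the one running git, and git refuses to use it. The sketch below only illustrates the workaround that git itself prints in the error message. The helper `owner_check` and the `REPO_DIR` variable are hypothetical names, and `stat -c` assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "match" when the directory's owner UID equals
# the current UID, and "mismatch" otherwise. Uses GNU stat (assumption).
owner_check() {
    local dir="$1"
    local owner
    owner="$(stat -c '%u' "$dir")"
    if [ "$owner" = "$(id -u)" ]; then
        echo match
    else
        echo mismatch
    fi
}

# REPO_DIR is a hypothetical override; the default is the path from the error.
repo="${REPO_DIR:-/workspaces/DCGM}"

# Only register the exception when the directory exists and its owner differs
# from the current user, which is the condition git complains about.
if [ -d "$repo" ] && [ "$(owner_check "$repo")" = "mismatch" ]; then
    git config --global --add safe.directory "$repo"
fi
```

This would have to run inside the container before `build.sh` invokes git. It silences the ownership check but does not change who owns the files, so writes to the mounted volume may still fail under the remapped IDs.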