containers / buildah

A tool that facilitates building OCI images.
https://buildah.io
Apache License 2.0

Use of CDI does not consume labeled devices during build #5556

Open kenmoini opened 8 months ago

kenmoini commented 8 months ago

Issue Description

When using NVIDIA GPUs with Podman via the Container Device Interface (CDI), podman build fails to use labeled devices, while podman run works as intended.

However, when the direct device path is used instead, podman build works as expected.

Steps to reproduce the issue

  1. Install NVIDIA Drivers
  2. Install Podman
  3. Install NVIDIA Container Toolkit:
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

sudo dnf install -y nvidia-container-toolkit
  4. Configure NVIDIA CTK for use with CDI: nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
  5. Test CDI integration for podman run, which works: podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi -L
  6. Start a podman build with the same device label, which fails:
# Get a test containerfile
curl -O https://raw.githubusercontent.com/kenmoini/smart-drone-patterns/main/apps/darknet/Containerfile.ubnt22

# Build a container with the device label which fails
podman build --device nvidia.com/gpu=all --security-opt=label=disable -t darknet -f Containerfile.ubnt22 .
# - Output
Error: creating build executor: getting info of source device nvidia.com/gpu=all: stat nvidia.com/gpu=all: no such file or directory

# Build a container with the direct device path which works
podman build --device /dev/nvidia0 -t darknet -f Containerfile.ubnt22 --security-opt=label=disable .
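
The error above shows the --device value being handed to stat as if it were a filesystem path. As a rough illustration only (this is not buildah's actual code), the sketch below separates CDI-style names of the form vendor/class=name from ordinary device paths. The function name is_cdi_name and the simplified pattern are assumptions made for this example; the CDI specification defines the exact character rules.

```shell
#!/bin/sh
# Hypothetical helper: succeed if $1 looks like a CDI device name
# (vendor/class=name), fail otherwise. The pattern is a simplification.
is_cdi_name() {
    case "$1" in
        /*) return 1 ;;   # absolute paths such as /dev/nvidia0 are plain devices
    esac
    printf '%s\n' "$1" | grep -Eq \
        '^[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9_-]*=[A-Za-z0-9][A-Za-z0-9_.:-]*$'
}

# Classify the two --device values used in this report.
for dev in nvidia.com/gpu=all /dev/nvidia0; do
    if is_cdi_name "$dev"; then
        echo "$dev: CDI device name, resolve through the CDI spec"
    else
        echo "$dev: device path, stat it directly"
    fi
done
```

A build tool that made this distinction before calling stat would route nvidia.com/gpu=all to CDI resolution instead of failing with "no such file or directory".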

Describe the results you received

Using the CDI device label fails:

podman build --device nvidia.com/gpu=all --security-opt=label=disable -t darknet -f Containerfile.ubnt22 .

Error: creating build executor: getting info of source device nvidia.com/gpu=all: stat nvidia.com/gpu=all: no such file or directory

Describe the results you expected

The container build should start when given the device label. At the moment the build only works with the direct device path, and that does not seem to load all of the associated paths defined in the generated CDI configuration.
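
To see what a single --device /dev/nvidia0 leaves out, it can help to list the device nodes named in the generated spec. The sketch below is a rough text match, not a YAML parser. The function name cdi_device_paths is hypothetical, and it assumes the spec lists device nodes as "path: /dev/..." entries.

```shell
#!/bin/sh
# Hypothetical helper: print the unique /dev paths mentioned in a CDI spec file.
# Rough text match (assumes "path: /dev/..." entries), not a real YAML parser.
cdi_device_paths() {
    grep -Eo 'path:[[:space:]]*/dev/[^[:space:]]+' "$1" \
        | sed -E 's/^path:[[:space:]]*//' \
        | sort -u
}

# Usage against the spec generated in the reproduction steps, if it is present.
spec=/etc/cdi/nvidia.yaml
if [ -r "$spec" ]; then
    cdi_device_paths "$spec"
fi
```

Any path printed here beyond /dev/nvidia0 is one that the direct device path workaround does not pass to the build. Mounts and hooks in the spec are not covered by this match at all.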

podman info output

host:
  arch: arm64
  buildahVersion: 1.31.3
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.8-1.el9.aarch64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.8, commit: f0f506932ce1dc9fc7f1adb457a73d0a00207272'
  cpuUtilization:
    idlePercent: 99.98
    systemPercent: 0.01
    userPercent: 0.01
  cpus: 32
  databaseBackend: boltdb
  distribution:
    distribution: '"rhel"'
    version: "9.3"
  eventLogger: journald
  freeLocks: 2048
  hostname: avalon.kemo.labs
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.14.0-362.18.1.el9_3.aarch64
  linkmode: dynamic
  logDriver: journald
  memFree: 121339949056
  memTotal: 133915746304
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.7.0-1.el9.aarch64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.7.0
    package: netavark-1.7.0-2.el9_3.aarch64
    path: /usr/libexec/podman/netavark
    version: netavark 1.7.0
  ociRuntime:
    name: crun
    package: crun-1.8.7-1.el9.aarch64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.7
      commit: 53a9996ce82d1ee818349bdcc64797a1fa0433c4
      rundir: /run/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  pasta:
    executable: /bin/pasta
    package: passt-0^20230818.g0af928e-4.el9.aarch64
    version: |
      pasta 0^20230818.g0af928e-4.el9.aarch64
      Copyright Red Hat
      GNU Affero GPL version 3 or later <https://www.gnu.org/licenses/agpl-3.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /bin/slirp4netns
    package: slirp4netns-1.2.1-1.el9.aarch64
    version: |-
      slirp4netns version 1.2.1
      commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.2
  swapFree: 4294963200
  swapTotal: 4294963200
  uptime: 105h 12m 27.00s (Approximately 4.38 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.access.redhat.com
  - registry.redhat.io
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 1993421922304
  graphRootUsed: 28735803392
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 4
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.6.1
  Built: 1705652546
  BuiltTime: Fri Jan 19 03:22:26 2024
  GitCommit: ""
  GoVersion: go1.20.12
  Os: linux
  OsArch: linux/arm64
  Version: 4.6.1

Podman in a container

No

Privileged Or Rootless

Privileged

Upstream Latest Release

No

Additional environment details

Running on RHEL 9.3 on an Ampere Altra system; the same error occurs on an x86 system.

Additional information

Looks like this also affects buildah: see https://github.com/containers/buildah/issues/5432 and https://github.com/containers/buildah/pull/5443

github-actions[bot] commented 7 months ago

A friendly reminder that this issue had no activity for 30 days.

oglok commented 6 months ago

Same here!! We need to access GPUs for some builds, not only when running the container.

rhatdan commented 6 months ago

@nalind PTAL

nalind commented 6 months ago

This should work as of 1.36, which includes #5443 and #5494.
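
The podman info output earlier in this thread reports buildahVersion 1.31.3, which predates 1.36. A minimal sketch for comparing a bundled buildah version against 1.36 follows. The helper name version_ge is hypothetical, and it assumes a sort that supports -V (version sort), as in GNU coreutils.

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 >= version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# 1.31.3 is the buildahVersion from the podman info output in this issue.
current=1.31.3
if version_ge "$current" 1.36.0; then
    echo "buildah $current should include the CDI build fix"
else
    echo "buildah $current predates 1.36; CDI names in --device are not expected to work"
fi
```

On a live system the value for current should be readable with podman info --format '{{.Host.BuildahVersion}}' rather than hard-coded.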