containers / buildah

A tool that facilitates building OCI images.
https://buildah.io
Apache License 2.0

Failed to discover NVIDIA GPU in the running container started by buildah (vfs + chroot) #5227

Open enihcam opened 10 months ago

enihcam commented 10 months ago

Description: Failed to discover the NVIDIA GPU in a running container started by buildah (vfs + chroot).

Steps to reproduce the issue:

  1. start a GPU container that does NOT support Docker-in-Docker (for security reasons)
  2. install buildah
  3. configure the storage driver (export STORAGE_DRIVER=vfs) and isolation (export BUILDAH_ISOLATION=chroot)
  4. build a PyTorch+CUDA image with buildah and run it with buildah (see the command sketch below this list)
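
A minimal sketch of steps 3 and 4, assuming a local Containerfile; the image tag below is a placeholder, not from the original report:

export STORAGE_DRIVER=vfs
export BUILDAH_ISOLATION=chroot
buildah bud -t localhost/pytorch-cuda-test -f Containerfile .
ctr=$(buildah from localhost/pytorch-cuda-test)
buildah run "$ctr" -- python3 -c 'import torch; print(torch.cuda.is_available())'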

Describe the results you received: [screenshot attached; the GPU is not detected]

Describe the results you expected: PyTorch finds the GPU and runs the code successfully.

Output of rpm -q buildah or apt list buildah:

# rpm -q buildah
buildah-1.30.0-1.tl4.x86_64

Output of buildah version:

# buildah version
Version:         1.30.0
Go Version:      go1.19
Image Spec:      1.0.2-dev
Runtime Spec:    1.1.0-rc.1
CNI Spec:        1.0.0
libcni Version:  v1.1.2
image Version:   5.25.0
Git Commit:
Built:           Fri Jul 14 19:36:27 2023
OS/Arch:         linux/amd64
BuildPlatform:   linux/amd64

Output of cat /etc/*release:

# cat /etc/*release
NAME="TencentOS Server"
VERSION="4.0"
ID="tencentos"
ID_LIKE="tencentos"
VERSION_ID="4.0"
PLATFORM_ID="platform:tl4.0"
PRETTY_NAME="TencentOS Server 4.0"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:tencentos:tencentos:4.0"
HOME_URL="https://cloud.tencent.com/product/ts"
BUG_REPORT_URL="https://cloud.tencent.com/product/ts"
TencentOS Server 4.0

Output of uname -a:

# uname -a
Linux root-pvkf3ma0a 5.4.119-19.0009.28 #1 SMP Thu May 18 10:37:10 CST 2023 x86_64 GNU/Linux

Output of cat /etc/containers/storage.conf:

# cat /etc/containers/storage.conf
[storage]
driver = "vfs"
runroot = "/data/containers/storage"
graphroot = "/data/containers/storage"
rootless_storage_path = "/data/containers/storage"

[storage.options.vfs]
ignore_chown_errors = "true"

rhatdan commented 10 months ago

Isn't the GPU a device? Say /dev/gpu?

Could you try

ctr=$(buildah from --device /dev/gpu ...)
buildah run $ctr ...
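
For an NVIDIA card there is no /dev/gpu; the driver's character devices are typically /dev/nvidia0, /dev/nvidiactl and /dev/nvidia-uvm. A hedged variant of the suggestion (the image name is only a placeholder):

ctr=$(buildah from --device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-uvm for.example.com/gpu_image_for_test)
buildah run "$ctr" -- nvidia-smi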

github-actions[bot] commented 9 months ago

A friendly reminder that this issue had no activity for 30 days.

enihcam commented 3 months ago

Isn't the GPU a device? Say /dev/gpu?

Could you try

ctr=$(buildah from --device /dev/gpu ...)
buildah run $ctr ...

Sorry for the late reply. I tried the following:

ctr=$(buildah --device /dev/nvidia0 from for.example.com/gpu_image_for_test)
buildah run $ctr /bin/bash

Then nvidia-smi gave me no output at all.

By the way, this container is run inside another container in vfs+chroot mode.
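
(As a quick sanity check, not something tried in this thread: list the device nodes inside the build container to see whether --device actually mapped them; this assumes the image has a shell.)

buildah run "$ctr" -- sh -c 'ls -l /dev/nvidia*'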

rhatdan commented 3 months ago

Could you try

buildah --device=nvidia.com/gpu=all from ...

enihcam commented 3 months ago

Could you try

buildah --device=nvidia.com/gpu=all from ...

stat nvidia.com/gpu=all: no such file or directory

rhatdan commented 3 months ago

What version of buildah are you using?

enihcam commented 3 months ago

What version of buildah are you using?

~ # buildah version
Version:         1.33.7
Go Version:      go1.21.9 (Red Hat 1.21.9-1.module+el8.8.0+632+2dde9914)
Image Spec:      1.1.0-rc.5
Runtime Spec:    1.1.0
CNI Spec:        1.0.0
libcni Version:  v1.1.2
image Version:   5.29.2
Git Commit:
Built:           Tue Jun 18 11:12:42 2024
OS/Arch:         linux/amd64
BuildPlatform:   linux/amd64
~ # env | grep BUILDAH
BUILDAH_FORMAT=docker
BUILDAH_ISOLATION=chroot
~ # env | grep STORAGE
STORAGE_DRIVER=vfs

rhatdan commented 3 months ago

Any chance you can update the version?

$ buildah -v
buildah version 1.36.0 (image-spec 1.1.0, runtime-spec 1.2.0)
tmp $ buildah version
Version:         1.36.0
Go Version:      go1.22.3
Image Spec:      1.1.0
Runtime Spec:    1.2.0
CNI Spec:        1.0.0
libcni Version:  
image Version:   5.31.0
Git Commit:      
Built:           Mon May 27 09:11:54 2024
OS/Arch:         linux/amd64
BuildPlatform:   linux/amd64

rhatdan commented 3 months ago

$ git show 7658d9ed7e02ec5cf90cc397f78a5755599b0a32
commit 7658d9ed7e02ec5cf90cc397f78a5755599b0a32
Author: Daniel J Walsh <dwalsh@redhat.com>
Date:   Mon Mar 25 11:55:50 2024 -0400

    Support nvidia.com/gpus as devices

    Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>

diff --git a/pkg/parse/parse_unix.go b/pkg/parse/parse_unix.go
index ff8ce854e..d3f3dc14c 100644
--- a/pkg/parse/parse_unix.go
+++ b/pkg/parse/parse_unix.go
@@ -7,6 +7,7 @@ import (
        "fmt"
        "os"
        "path/filepath"
+       "strings"

        "github.com/containers/buildah/define"
        "github.com/opencontainers/runc/libcontainer/devices"
@@ -18,6 +19,12 @@ func DeviceFromPath(device string) (define.ContainerDevices, error) {
        if err != nil {
                return nil, err
        }
+       if strings.HasPrefix(src, "nvidia.com") {
+               device := define.BuildahDevice{Source: src, Destination: dst}
+               devs = append(devs, device)
+               return devs, nil
+       }
+
        srcInfo, err := os.Stat(src)
        if err != nil {
                return nil, fmt.Errorf("getting info of source device %s: %w", src, err)

rhatdan commented 3 months ago

Yes 1.36 has the patch.

enihcam commented 3 months ago

Yes 1.36 has the patch.

https://github.com/containers/buildah/blob/release-1.36/pkg/parse/parse_unix.go

It seems like the patch is missing. Could you confirm? Thanks.

forwardmeasure commented 2 months ago

Hello all, any update here? I don't see the patch mentioned above in parse_unix.go.

enihcam commented 2 months ago

@rhatdan your input is needed.

nalind commented 2 months ago

Does the container have access to the necessary CDI configuration in its /etc/cdi directory, either volume-mounted from the host where nvidia-ctk cdi generate was run to generate it, or via some other mechanism?
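
For reference, a hedged sketch of that setup; the spec path is the conventional one and the outer-container runtime is only an example, not a detail from this issue:

# on the GPU host, with nvidia-container-toolkit installed
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# when launching the outer container, bind-mount the spec directory into it
podman run -v /etc/cdi:/etc/cdi:ro ...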

enihcam commented 1 month ago

Any workaround before the PR is merged?

nalind commented 1 month ago

I think the current expectation is that, if the data in /etc/cdi is provided to the container, we won't need this PR, since the CDI logic in 1.36 (and 1.37) already gets a crack at device specifications.

Does the container have access to the necessary CDI configuration in its /etc/cdi directory, either volume-mounted from the host where nvidia-ctk cdi generate was run to generate it, or via some other mechanism?
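
Assuming /etc/cdi is visible inside the container and buildah is 1.36 or later, the CDI name from the earlier suggestion should then resolve; a sketch reusing the image name from above:

ctr=$(buildah from --device nvidia.com/gpu=all for.example.com/gpu_image_for_test)
buildah run "$ctr" -- nvidia-smi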