lengrongfu opened this issue 2 weeks ago
@elezar Can you help me look into this issue?
Could you try to update your workload to use the following container instead: `nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1`

Also, is the nvidia runtime configured as your default runtime, or are you using a runtime class? If it is the latter, you would also need to specify a runtime class in your workload.
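For reference, if the nvidia runtime were not the default, selecting it via a runtime class would look roughly like the sketch below. This is illustrative only; it assumes a handler named nvidia, matching the `runtimes.nvidia` entry in the containerd config shown further down.

```yaml
# Sketch: selecting the nvidia runtime explicitly via a RuntimeClass.
# Assumes containerd defines runtimes.nvidia (as in the config below).
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # must match the containerd runtime name
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  runtimeClassName: nvidia   # opt this pod into the nvidia runtime
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
      resources:
        limits:
          nvidia.com/gpu: 1
```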
I used the `nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1` image to deploy the workload, and I still get this error:

```
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA-capable device(s) is/are busy or unavailable)!
```
The nvidia runtime is configured as the default:
$ cat /etc/containerd/config.toml

```toml
disabled_plugins = []
imports = []
oom_score = 0
plugin_dir = ""
required_plugins = []
root = "/var/lib/containerd"
state = "/run/containerd"
temp = ""
version = 2
[cgroup]
path = ""
[debug]
address = ""
format = ""
gid = 0
level = ""
uid = 0
[grpc]
address = "/run/containerd/containerd.sock"
gid = 0
max_recv_message_size = 16777216
max_send_message_size = 16777216
tcp_address = ""
tcp_tls_ca = ""
tcp_tls_cert = ""
tcp_tls_key = ""
uid = 0
[metrics]
address = ""
grpc_histogram = false
[plugins]
[plugins."io.containerd.gc.v1.scheduler"]
deletion_threshold = 0
mutation_threshold = 100
pause_threshold = 0.02
schedule_delay = "0s"
startup_delay = "100ms"
[plugins."io.containerd.grpc.v1.cri"]
cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
device_ownership_from_security_context = false
disable_apparmor = false
disable_cgroup = false
disable_hugetlb_controller = true
disable_proc_mount = false
disable_tcp_service = true
enable_cdi = false
enable_selinux = false
enable_tls_streaming = false
enable_unprivileged_icmp = false
enable_unprivileged_ports = false
ignore_image_defined_volumes = false
max_concurrent_downloads = 3
max_container_log_line_size = 16384
netns_mounts_under_state_dir = false
restrict_oom_score_adj = false
sandbox_image = "easzlab.io.local:5000/easzlab/pause:3.9"
selinux_category_range = 1024
stats_collect_period = 10
stream_idle_timeout = "4h0m0s"
stream_server_address = "127.0.0.1"
stream_server_port = "0"
systemd_cgroup = false
tolerate_missing_hugetlb_controller = true
unset_seccomp_profile = ""
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
conf_template = "/etc/cni/net.d/10-default.conf"
max_conf_num = 1
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
disable_snapshot_annotations = true
discard_unpacked_layers = false
ignore_rdt_not_enabled_errors = false
no_pivot = false
snapshotter = "overlayfs"
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = ""
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
CriuImagePath = ""
CriuPath = ""
CriuWorkPath = ""
IoGid = 0
IoUid = 0
NoNewKeyring = false
NoPivotRoot = false
Root = ""
ShimCgroup = ""
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-cdi]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-cdi.options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi"
CriuImagePath = ""
CriuPath = ""
CriuWorkPath = ""
IoGid = 0
IoUid = 0
NoNewKeyring = false
NoPivotRoot = false
Root = ""
ShimCgroup = ""
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-legacy]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-legacy.options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy"
CriuImagePath = ""
CriuPath = ""
CriuWorkPath = ""
IoGid = 0
IoUid = 0
NoNewKeyring = false
NoPivotRoot = false
Root = ""
ShimCgroup = ""
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
BinaryName = ""
CriuImagePath = ""
CriuPath = ""
CriuWorkPath = ""
IoGid = 0
IoUid = 0
NoNewKeyring = false
NoPivotRoot = false
Root = ""
ShimCgroup = ""
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = ""
[plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime.options]
[plugins."io.containerd.grpc.v1.cri".image_decryption]
key_model = "node"
[plugins."io.containerd.grpc.v1.cri".registry]
[plugins."io.containerd.grpc.v1.cri".registry.auths]
[plugins."io.containerd.grpc.v1.cri".registry.configs]
[plugins."io.containerd.grpc.v1.cri".registry.configs."10.6.194.8"]
[plugins."io.containerd.grpc.v1.cri".registry.configs."10.6.194.8".tls]
insecure_skip_verify = true
[plugins."io.containerd.grpc.v1.cri".registry.configs."easzlab.io.local:5000"]
[plugins."io.containerd.grpc.v1.cri".registry.configs."easzlab.io.local:5000".tls]
insecure_skip_verify = true
[plugins."io.containerd.grpc.v1.cri".registry.configs."harbor.easzlab.io.local:8443"]
[plugins."io.containerd.grpc.v1.cri".registry.configs."harbor.easzlab.io.local:8443".tls]
insecure_skip_verify = true
[plugins."io.containerd.grpc.v1.cri".registry.headers]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["https://docker.nju.edu.cn/", "https://kuamavit.mirror.aliyuncs.com"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."easzlab.io.local:5000"]
endpoint = ["http://easzlab.io.local:5000"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."gcr.io"]
endpoint = ["https://gcr.nju.edu.cn"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."ghcr.io"]
endpoint = ["https://ghcr.nju.edu.cn"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."harbor.easzlab.io.local:8443"]
endpoint = ["https://harbor.easzlab.io.local:8443"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."k8s.gcr.io"]
endpoint = ["https://gcr.nju.edu.cn/google-containers/"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."nvcr.io"]
endpoint = ["https://ngc.nju.edu.cn"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."quay.io"]
endpoint = ["https://quay.nju.edu.cn"]
[plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
tls_cert_file = ""
tls_key_file = ""
[plugins."io.containerd.internal.v1.opt"]
path = "/opt/containerd"
[plugins."io.containerd.internal.v1.restart"]
interval = "10s"
[plugins."io.containerd.metadata.v1.bolt"]
content_sharing_policy = "shared"
[plugins."io.containerd.monitor.v1.cgroups"]
no_prometheus = false
[plugins."io.containerd.nri.v1.nri"]
disable = false
disable_connections = false
plugin_config_path = "/etc/nri/conf.d"
plugin_path = "/opt/nri/plugins1"
plugin_registration_timeout = "5s"
plugin_request_timeout = "2s"
socket_path = "/var/run/nri/nri.sock"
[plugins."io.containerd.runtime.v1.linux"]
no_shim = false
runtime = "runc"
runtime_root = ""
shim = "containerd-shim"
shim_debug = false
[plugins."io.containerd.service.v1.diff-service"]
default = ["walking"]
[plugins."io.containerd.snapshotter.v1.aufs"]
root_path = ""
[plugins."io.containerd.snapshotter.v1.btrfs"]
root_path = ""
[plugins."io.containerd.snapshotter.v1.devmapper"]
async_remove = false
base_image_size = ""
pool_name = ""
root_path = ""
[plugins."io.containerd.snapshotter.v1.native"]
root_path = ""
[plugins."io.containerd.snapshotter.v1.overlayfs"]
root_path = ""
[plugins."io.containerd.snapshotter.v1.zfs"]
root_path = ""
[proxy_plugins]
[stream_processors]
[stream_processors."io.containerd.ocicrypt.decoder.v1.tar"]
accepts = ["application/vnd.oci.image.layer.v1.tar+encrypted"]
args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
path = "ctd-decoder"
returns = "application/vnd.oci.image.layer.v1.tar"
[stream_processors."io.containerd.ocicrypt.decoder.v1.tar.gzip"]
accepts = ["application/vnd.oci.image.layer.v1.tar+gzip+encrypted"]
args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
path = "ctd-decoder"
returns = "application/vnd.oci.image.layer.v1.tar+gzip"
[timeouts]
"io.containerd.timeout.shim.cleanup" = "5s"
"io.containerd.timeout.shim.load" = "5s"
"io.containerd.timeout.shim.shutdown" = "3s"
"io.containerd.timeout.task.state" = "2s"
[ttrpc]
address = ""
gid = 0
uid = 0
```
I found that it is possible to run the mps program directly on the host, but in the container it will prompt that device(s) is/are busy or unavailable
> I found that it is possible to run the mps program directly on the host, but in the container it will prompt that device(s) is/are busy or unavailable
Could you provide more information on how you achieved this? Note that one of the key communication mechanisms between the MPS processes is the /dev/shm that we create for the containerized daemon. How are you injecting this into the container?
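For context, a hand-rolled injection would look roughly like the sketch below. All host paths here are assumptions for illustration; the device plugin is expected to wire this up automatically, and `CUDA_MPS_PIPE_DIRECTORY` is the standard CUDA variable pointing clients at the daemon's pipe directory.

```yaml
# Illustrative sketch only; host paths are assumptions, not the plugin's
# actual, automatically-managed mounts.
apiVersion: v1
kind: Pod
metadata:
  name: mps-client
spec:
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
      env:
        - name: CUDA_MPS_PIPE_DIRECTORY   # standard CUDA MPS client setting
          value: /mps/pipe
      volumeMounts:
        - name: mps-shm
          mountPath: /dev/shm             # share the daemon's shm with the client
        - name: mps-pipe
          mountPath: /mps/pipe
  volumes:
    - name: mps-shm
      hostPath:
        path: /run/nvidia/mps/shm         # assumed host location of the daemon's shm
    - name: mps-pipe
      hostPath:
        path: /run/nvidia/mps/pipe        # assumed host location of the pipe dir
```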
First, thanks for the quick answer.

These are the steps I used to run MPS in a container:

1. Use gpu-operator to install the driver and toolkit.
2. Use k8s-device-plugin to deploy the MPS control daemon and device plugin:

```sh
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --version=0.15.0-rc.2 \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set config.name=nvidia-plugin-configs \
  --set gfd.enabled=true
```

3. Deploy the workload pod:
```sh
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
```
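The nvidia-plugin-configs content itself is not shown in this thread. For a plugin installed this way, an MPS sharing config typically looks something like the following sketch, where `replicas: 10` would match the active-thread percentage of 10 discussed further down:

```yaml
# Illustrative sketch of an MPS sharing config for the device plugin;
# the reporter's actual ConfigMap content is not shown in this thread.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-plugin-configs
  namespace: nvidia-device-plugin
data:
  config: |
    version: v1
    sharing:
      mps:
        resources:
          - name: nvidia.com/gpu
            replicas: 10   # 100% / 10 replicas -> 10% active threads per client
```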
Regarding your tip that the MPS processes communicate via /dev/shm: what do I need to do about this?
@elezar Do you need any more information?
Sorry for the delay, @lengrongfu. Since you're using the GPU Operator to install the other components of the NVIDIA Container Stack, can you confirm that it isn't managing the device plugin? Which pods are running in the GPU Operator namespace?
Also, to rule out any issues in the rc.2, could you deploy the v0.15.0 version of the device plugin that was released last week?

It would also be good to confirm that the workload container can properly access the MPS control daemon with the correct settings. Here, I would recommend updating the command to `sleep 9999` and then exec into the container and run:

```sh
echo get_default_active_thread_percentage | mps-control-daemon
```

This should give 10 in your case.
Thank you for your reply.

> Can you confirm that it isn't managing the device plugin? Which pods are running in the GPU Operator namespace?

Confirmed: it is not managing the device plugin.

> Also, to rule out any issues in the rc.2, could you deploy the v0.15.0 version of the device plugin that was released last week.

I deployed the v0.15.0 version, and the issue still exists.

When I run `echo get_default_active_thread_percentage | mps-control-daemon` and watch the mps-control-daemon pod log, I see a `User did not send valid credentials` message. Will this have any impact?

Running the `nvidia-smi` command in the nvidia-driver-daemonset pod shows that the `nvidia-cuda-mps-server` process is using the GPU device, and the GPU compute mode is `Exclusive_Process`.
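For reference, the compute mode can be checked from the driver container with standard nvidia-smi flags; MPS requires `EXCLUSIVE_PROCESS`, which the control daemon is expected to set itself:

```sh
# Query the current compute mode (prints e.g. "Exclusive_Process"):
nvidia-smi --query-gpu=compute_mode --format=csv,noheader

# Setting it manually, if ever needed, would be:
#   nvidia-smi -c EXCLUSIVE_PROCESS
```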
Could you run:

```sh
echo get_default_active_thread_percentage | nvidia-cuda-mps-control
```

in a workload container? For example, the following one:
```sh
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
      command: ["bash", "-c"]
      args: ["nvidia-smi -L; sleep 9999"]
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
```
```console
$ echo get_default_active_thread_percentage | mps-control-daemon
mps-control-daemon: command not found
```
Sorry, it should be `echo get_default_active_thread_percentage | nvidia-cuda-mps-control`. A typo from my side.
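Spelled out, the full invocation against the pod from the manifest above would be something like:

```sh
# Assumes the gpu-pod from the earlier manifest is still sleeping.
kubectl exec -it gpu-pod -- bash -c \
  "echo get_default_active_thread_percentage | nvidia-cuda-mps-control"
```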
It returns `10.0`.
Just as a sanity check, could you confirm that running `nvidia-smi` produces the same output as in the driver container?
Looking through the configs again: since the GPU Operator is being used to configure the toolkit and the driver, I would expect the `nvidiaDriverRoot` for the device plugin to be set to `/run/nvidia/driver` and not:

```
"nvidiaDriverRoot": "/",
```

as is shown in your config.

Could you update the device plugin deployment with `--set nvidiaDriverRoot=/run/nvidia/driver`?
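Based on the install command earlier in the thread, the updated invocation would look something like:

```sh
# Sketch: the earlier helm install plus the suggested driver-root override.
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --version=0.15.0 \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set config.name=nvidia-plugin-configs \
  --set gfd.enabled=true \
  --set nvidiaDriverRoot=/run/nvidia/driver
```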
I used helm to update the `nvidiaDriverRoot` field and added a volume to the device-plugin pod. The pod then starts successfully, but the gpu-pod still fails.
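For reference, the extra volume would be roughly of this shape (an illustrative strategic-merge patch; the DaemonSet and container names are assumptions):

```yaml
# Illustrative only: mount the host's /run/nvidia/driver into the plugin pod.
# Apply with something like:
#   kubectl -n nvidia-device-plugin patch ds <plugin-daemonset> --patch-file patch.yaml
spec:
  template:
    spec:
      containers:
        - name: nvidia-device-plugin-ctr   # assumed container name
          volumeMounts:
            - name: driver-root
              mountPath: /run/nvidia/driver
      volumes:
        - name: driver-root
          hostPath:
            path: /run/nvidia/driver
```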
Maybe it has something to do with the Tesla P40.
I'm running into the same issue on a GTX1070 w/ the same driver version as you. I wonder if a driver update would help.
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
**Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case [here](https://enterprise-support.nvidia.com/s/create-case).**
1. Quick Debug Information
2. Issue or feature description
Briefly explain the issue in terms of expected behavior and current behavior.
I use helm to deploy k8s-device-plugin and configure MPS, but a deployed workload fails with an error. The mps-control-daemon pod is running.
3. Information to attach (optional if deemed irrelevant)
I use gpu-operator to install the GPU driver (helm chart version v23.9.1), and the driver and toolkit install successfully. I then use the following helm command to install k8s-device-plugin.

The nvidia-plugin-configs config content is:

The command to deploy the workload pod is:

The pod status is then Error, and the error log is:

device-plugin pod log:

mps-control-daemon pod log:

GPU info: