4paradigm / k8s-vgpu-scheduler

The OpenAIOS vGPU device plugin for Kubernetes originated from the OpenAIOS project to virtualize GPU device memory, allowing applications to access a larger memory space than the physical capacity. It is designed to make extended device memory easy to use for AI workloads.
Apache License 2.0

Allocated 2 vGPUs but only 1 is visible #17

Closed: xwhuang0923 closed this issue 2 years ago

xwhuang0923 commented 2 years ago

On a machine with 8 A100 cards, each card is split into 5 vGPUs (--device-split-count=5). I created a pod requesting 2 vGPUs, but nvidia-smi inside the container only shows one vGPU, even though two GPU device nodes are visible under /dev. The k8s-vgpu-plugin version is v0.9.0.18.
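For reference, a minimal sketch of the kind of pod that reproduces this setup, assuming the plugin advertises the split devices under the nvidia.com/gpu resource name; the pod name and image are placeholders, not taken from the issue:

```sh
# Create a test pod that requests 2 vGPUs (with --device-split-count=5,
# each A100 is advertised as 5 schedulable devices).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-test
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:11.2.2-base-ubuntu20.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 2   # request 2 vGPUs
EOF

# Check how many devices the container actually sees.
kubectl exec vgpu-test -- nvidia-smi -L
```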

archlitchi commented 2 years ago

Could you share the output of env from inside the container?

xwhuang0923 commented 2 years ago

@archlitchi NV_LIBCUBLAS_VERSION=11.4.1.1043-1 NVIDIA_VISIBLE_DEVICES=GPU-0f373af5-38b2-7805-80b7-ae361c172a9f,GPU-a1122828-8f86-3460-77d3-8ef8d841a997 KUBERNETES_SERVICE_PORT_HTTPS=443 NV_NVML_DEV_VERSION=11.2.152-1 KUBERNETES_SERVICE_PORT=443 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.8.4-1+cuda11.2 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.8.4-1 HOSTNAME=mnist-horovod-0 SVC_RESOURCE_MASTER_SERVICE_PORT=22 NVIDIA_DEVICE_MAP=0:GPU-0f373af5-38b2-7805-80b7-ae361c172a9f NVIDIA_REQUIRE_CUDA=cuda>=11.2 brand=tesla,driver>=440,driver<441 NV_LIBCUBLAS_DEV_PACKAGE=libcublas-dev-11-2=11.4.1.1043-1 NV_NVTX_VERSION=11.2.152-1 NV_ML_REPO_ENABLED=1 NV_CUDA_CUDART_DEV_VERSION=11.2.152-1 NV_LIBCUSPARSE_VERSION=11.4.1.1152-1 NV_LIBNPP_VERSION=11.3.2.152-1 NCCL_VERSION=2.8.4-1 CUDA_DEVICE_SM_LIMIT=20 PWD=/root LOGNAME=root NVIDIA_DRIVER_CAPABILITIES=compute,utility USESECRETS=true NV_LIBNPP_PACKAGE=libnpp-11-2=11.3.2.152-1 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev NV_LIBCUBLAS_DEV_VERSION=11.4.1.1043-1 NV_LIBCUBLAS_DEV_PACKAGE_NAME=libcublas-dev-11-2 MOTD_SHOWN=pam NV_CUDA_CUDART_VERSION=11.2.152-1 CUDA_VERSION=11.2.2 NV_LIBCUBLAS_PACKAGE=libcublas-11-2=11.4.1.1043-1 SVC_RESOURCE_MASTER_SERVICE_PORT_JUPYTER=8888 SVC_RESOURCE_MASTER_SERVICE_PORT_SSH=22 SSH_CONNECTION=174.30.0.188 48508 174.30.0.188 22 NV_LIBNPP_DEV_PACKAGE=libnpp-dev-11-2=11.3.2.152-1 NV_LIBCUBLAS_PACKAGE_NAME=libcublas-11-2 NV_LIBNPP_DEV_VERSION=11.3.2.152-1 SVC_RESOURCE_MASTER_PORT=tcp://10.68.82.141:22 SVC_RESOURCE_MASTER_PORT_8888_TCP=tcp://10.68.82.141:8888 TERM=xterm NV_ML_REPO_URL=https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64 NV_LIBCUSPARSE_DEV_VERSION=11.4.1.1152-1 CUDA_DEVICE_MEMORY_LIMIT_0=8107m CUDA_DEVICE_MEMORY_LIMIT_1=8107m SVC_RESOURCE_MASTER_PORT_22_TCP_PORT=22 USER=root CUDA_DEVICE_MEMORY_SHARED_CACHE=/tmp/13c9afee-7dc6-4f6d-9e46-b0229b877990.cache LIBRARY_PATH=/usr/local/cuda/lib64/stubs SHLVL=2 SVC_RESOURCE_MASTER_PORT_8888_TCP_PROTO=tcp NV_CUDA_LIB_VERSION=11.2.2-1 NVARCH=x86_64 KUBERNETES_PORT_443_TCP_PROTO=tcp KUBERNETES_PORT_443_TCP_ADDR=10.68.0.1 SVC_RESOURCE_MASTER_PORT_22_TCP=tcp://10.68.82.141:22 NV_CUDA_COMPAT_PACKAGE=cuda-compat-11-2 NV_LIBNCCL_PACKAGE=libnccl2=2.8.4-1+cuda11.2 LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64 SSH_CLIENT=174.30.0.188 48508 22 SVC_RESOURCE_MASTER_PORT_8888_TCP_PORT=8888 KUBERNETES_SERVICE_HOST=10.68.0.1 KUBERNETES_PORT=tcp://10.68.0.1:443 KUBERNETES_PORT_443_TCP_PORT=443 PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin NV_LIBNCCL_PACKAGE_NAME=libnccl2 NV_LIBNCCL_PACKAGE_VERSION=2.8.4-1
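For readability, the variables that matter in the dump above can be pulled out directly (the pod name is taken from HOSTNAME in the dump; adjust as needed). Note that NVIDIA_VISIBLE_DEVICES lists two GPU UUIDs, while NVIDIA_DEVICE_MAP only carries an entry for index 0:

```sh
# Filter the vGPU-related variables out of the container environment.
kubectl exec mnist-horovod-0 -- env | grep -E 'NVIDIA_VISIBLE_DEVICES|NVIDIA_DEVICE_MAP|CUDA_DEVICE_MEMORY_LIMIT|CUDA_DEVICE_SM_LIMIT'
```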

archlitchi commented 2 years ago

Could you first try upgrading to 0.9.0.19? If it still doesn't work, add me on WeChat: xuanzong4493

archlitchi commented 2 years ago

@xwhuang0923 How did it go? Does it work after the upgrade?

xwhuang0923 commented 2 years ago

@archlitchi Hi, I found that the problem was the NVIDIA_DEVICE_MAP environment variable. Our container modifies environment variables at startup, so the entry for the second vGPU in NVIDIA_DEVICE_MAP was never set successfully.
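For context, a minimal sketch (not part of the plugin) of how a custom entrypoint could preserve the plugin-injected variables when an image's startup script rewrites the environment; the save/restore approach and variable handling here are hypothetical:

```sh
#!/bin/sh
# Hypothetical entrypoint sketch: if the image's startup logic rebuilds the
# environment (e.g. re-sourcing a profile or re-exporting a fixed variable set),
# the plugin-injected NVIDIA_DEVICE_MAP / CUDA_DEVICE_MEMORY_LIMIT_* entries can
# be lost, producing the "only one vGPU visible" symptom described above.
saved_map="$NVIDIA_DEVICE_MAP"
saved_visible="$NVIDIA_VISIBLE_DEVICES"

# ... image-specific startup logic that may reset the environment goes here ...

export NVIDIA_DEVICE_MAP="$saved_map"
export NVIDIA_VISIBLE_DEVICES="$saved_visible"
exec "$@"
```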