I have a system that is running Windows 11 (fully updated), WSL2 Ubuntu 22.04.2, Cuda 12.2 installed via NVIDIA instructions. I have 4 RTX3090s that are exposed in WSL. I have installed Docker Desktop 4.22 and set WSL as the engine. When running 2 different docker contaners that has GPU support I get either a "out of memory error" or "Error: only 0 Devices available, 1 requested. Exiting."
NVIDIA Release 23.07 (build 63867923)
PyTorch Version 2.1.0a0+b5021ba
Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2023 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
ERROR: The NVIDIA Driver is present, but CUDA failed to initialize. GPU functionality will not be available.
[[ Out of memory (error 2) ]]
root@95ab4a1ff094:/workspace#
2. docker run --gpus device=GPU-acb454dd-c800-18af-ce94-27e4f8ece22a nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -benchmark -gpu
Run "nbody -benchmark [-numbodies=]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies= (number of bodies (>= 1) to run in simulation)
-device= (where d=0,1,2.... for the CUDA device to use)
-numdevices= (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy= (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Error: only 0 Devices available, 1 requested. Exiting.
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 3 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 3090"
CUDA Driver Version / Runtime Version 12.2 / 12.2
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 24576 MBytes (25769279488 bytes)
(082) Multiprocessors, (128) CUDA Cores/MP: 10496 CUDA Cores
GPU Max Clock rate: 1800 MHz (1.80 GHz)
Memory Clock rate: 9751 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "NVIDIA GeForce RTX 3090"
CUDA Driver Version / Runtime Version 12.2 / 12.2
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 24576 MBytes (25769279488 bytes)
(082) Multiprocessors, (128) CUDA Cores/MP: 10496 CUDA Cores
GPU Max Clock rate: 1800 MHz (1.80 GHz)
Memory Clock rate: 9751 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 65 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 2: "NVIDIA GeForce RTX 3090"
CUDA Driver Version / Runtime Version 12.2 / 12.2
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 24576 MBytes (25769279488 bytes)
(082) Multiprocessors, (128) CUDA Cores/MP: 10496 CUDA Cores
GPU Max Clock rate: 1800 MHz (1.80 GHz)
Memory Clock rate: 9751 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 97 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Peer access from NVIDIA GeForce RTX 3090 (GPU0) -> NVIDIA GeForce RTX 3090 (GPU1) : No
Peer access from NVIDIA GeForce RTX 3090 (GPU0) -> NVIDIA GeForce RTX 3090 (GPU2) : No
Peer access from NVIDIA GeForce RTX 3090 (GPU1) -> NVIDIA GeForce RTX 3090 (GPU0) : No
Peer access from NVIDIA GeForce RTX 3090 (GPU1) -> NVIDIA GeForce RTX 3090 (GPU2) : No
Peer access from NVIDIA GeForce RTX 3090 (GPU2) -> NVIDIA GeForce RTX 3090 (GPU0) : No
Peer access from NVIDIA GeForce RTX 3090 (GPU2) -> NVIDIA GeForce RTX 3090 (GPU1) : No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 12.2, NumDevs = 3
Result = PASS
### Expected behavior
Expected behavior is that the GPUs are available for usage within the 2 respective containers.
### docker version
```bash
Client:
Cloud integration: v1.0.35-desktop+001
Version: 24.0.5
API version: 1.43
Go version: go1.20.6
Git commit: ced0996
Built: Fri Jul 21 20:36:24 2023
OS/Arch: windows/amd64
Context: default
Server: Docker Desktop 4.22.1 (118664)
Engine:
Version: 24.0.5
API version: 1.43 (minimum version 1.12)
Go version: go1.20.6
Git commit: a61e2b4
Built: Fri Jul 21 20:35:45 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.21
GitCommit: 3dce8eb055cbb6872793272b4f20ed16117344f8
runc:
Version: 1.1.7
GitCommit: v1.1.7-0-g860f061
docker-init:
Version: 0.19.0
GitCommit: de40ad0
docker info
Client:
Version: 24.0.5
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.11.2-desktop.1
Path: C:\Program Files\Docker\cli-plugins\docker-buildx.exe
compose: Docker Compose (Docker Inc.)
Version: v2.20.2-desktop.1
Path: C:\Program Files\Docker\cli-plugins\docker-compose.exe
dev: Docker Dev Environments (Docker Inc.)
Version: v0.1.0
Path: C:\Program Files\Docker\cli-plugins\docker-dev.exe
extension: Manages Docker extensions (Docker Inc.)
Version: v0.2.20
Path: C:\Program Files\Docker\cli-plugins\docker-extension.exe
init: Creates Docker-related starter files for your project (Docker Inc.)
Version: v0.1.0-beta.6
Path: C:\Program Files\Docker\cli-plugins\docker-init.exe
sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc.)
Version: 0.6.0
Path: C:\Program Files\Docker\cli-plugins\docker-sbom.exe
scan: Docker Scan (Docker Inc.)
Version: v0.26.0
Path: C:\Program Files\Docker\cli-plugins\docker-scan.exe
scout: Command line tool for Docker Scout (Docker Inc.)
Version: 0.20.0
Path: C:\Program Files\Docker\cli-plugins\docker-scout.exe
Server:
Containers: 19
Running: 1
Paused: 0
Stopped: 18
Images: 2
Server Version: 24.0.5
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 3dce8eb055cbb6872793272b4f20ed16117344f8
runc version: v1.1.7-0-g860f061
init version: de40ad0
Security Options:
seccomp
Profile: unconfined
Kernel Version: 5.15.90.1-microsoft-standard-WSL2
Operating System: Docker Desktop
OSType: linux
Architecture: x86_64
CPUs: 64
Total Memory: 377.7GiB
Name: docker-desktop
ID: e76ea70c-fad4-425c-8df1-29786ac0e18e
Docker Root Dir: /var/lib/docker
Debug Mode: false
HTTP Proxy: http.docker.internal:3128
HTTPS Proxy: http.docker.internal:3128
No Proxy: hubproxy.docker.internal
Experimental: false
Insecure Registries:
hubproxy.docker.internal:5555
127.0.0.0/8
Live Restore Enabled: false
WARNING: No blkio throttle.read_bps_device support
WARNING: No blkio throttle.write_bps_device support
WARNING: No blkio throttle.read_iops_device support
WARNING: No blkio throttle.write_iops_device support
WARNING: daemon is not using the default seccomp profile
Description
I have a system that is running Windows 11 (fully updated), WSL2 Ubuntu 22.04.2, Cuda 12.2 installed via NVIDIA instructions. I have 4 RTX3090s that are exposed in WSL. I have installed Docker Desktop 4.22 and set WSL as the engine. When running 2 different docker contaners that has GPU support I get either a "out of memory error" or "Error: only 0 Devices available, 1 requested. Exiting."
Reproduce
============= == PyTorch ==
NVIDIA Release 23.07 (build 63867923) PyTorch Version 2.1.0a0+b5021ba
Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2023 Facebook Inc. Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert) Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu) Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu) Copyright (c) 2011-2013 NYU (Clement Farabet) Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston) Copyright (c) 2006 Idiap Research Institute (Samy Bengio) Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz) Copyright (c) 2015 Google Inc. Copyright (c) 2015 Yangqing Jia Copyright (c) 2013-2016 The Caffe contributors All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
ERROR: The NVIDIA Driver is present, but CUDA failed to initialize. GPU functionality will not be available. [[ Out of memory (error 2) ]]
root@95ab4a1ff094:/workspace#
Run "nbody -benchmark [-numbodies=]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies= (number of bodies (>= 1) to run in simulation)
-device= (where d=0,1,2.... for the CUDA device to use)
-numdevices= (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy= (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Error: only 0 Devices available, 1 requested. Exiting.
theskaz@KevenPC:~/cuda-samples/Samples/1_Utilities/deviceQuery$ nvidia-smi --query-gpu=uuid --format=csv uuid GPU-52dc20f4-4d06-1338-85c9-c493c5dcec02 GPU-873ec493-2856-0e4e-b7c4-47f859451051 GPU-48b827a9-e51b-75c5-1149-389a3d45d09c GPU-acb454dd-c800-18af-ce94-27e4f8ece22a
theskaz@KevenPC:~/cuda-samples/Samples/1_Utilities/deviceQuery$ CUDA_VISIBLE_DEVICES=0,2,3 ./deviceQuery ./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 3 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 3090" CUDA Driver Version / Runtime Version 12.2 / 12.2 CUDA Capability Major/Minor version number: 8.6 Total amount of global memory: 24576 MBytes (25769279488 bytes) (082) Multiprocessors, (128) CUDA Cores/MP: 10496 CUDA Cores GPU Max Clock rate: 1800 MHz (1.80 GHz) Memory Clock rate: 9751 Mhz Memory Bus Width: 384-bit L2 Cache Size: 6291456 bytes Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total shared memory per multiprocessor: 102400 bytes Total number of registers available per block: 65536 Warp size: 32 Maximum number of threads per multiprocessor: 1536 Maximum number of threads per block: 1024 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Maximum memory pitch: 2147483647 bytes Texture alignment: 512 bytes Concurrent copy and kernel execution: Yes with 2 copy engine(s) Run time limit on kernels: Yes Integrated GPU sharing Host Memory: No Support host page-locked memory mapping: Yes Alignment requirement for Surfaces: Yes Device has ECC support: Disabled Device supports Unified Addressing (UVA): Yes Device supports Managed Memory: Yes Device supports Compute Preemption: Yes Supports Cooperative Kernel Launch: Yes Supports MultiDevice Co-op Kernel Launch: No Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0 Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "NVIDIA GeForce RTX 3090" CUDA Driver Version / Runtime Version 12.2 / 12.2 CUDA Capability Major/Minor version number: 8.6 Total amount of global memory: 24576 MBytes (25769279488 bytes) (082) Multiprocessors, (128) CUDA Cores/MP: 10496 CUDA Cores GPU Max Clock rate: 1800 MHz (1.80 GHz) Memory Clock rate: 9751 Mhz Memory Bus Width: 384-bit L2 Cache Size: 6291456 bytes Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total shared memory per multiprocessor: 102400 bytes Total number of registers available per block: 65536 Warp size: 32 Maximum number of threads per multiprocessor: 1536 Maximum number of threads per block: 1024 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Maximum memory pitch: 2147483647 bytes Texture alignment: 512 bytes Concurrent copy and kernel execution: Yes with 2 copy engine(s) Run time limit on kernels: Yes Integrated GPU sharing Host Memory: No Support host page-locked memory mapping: Yes Alignment requirement for Surfaces: Yes Device has ECC support: Disabled Device supports Unified Addressing (UVA): Yes Device supports Managed Memory: Yes Device supports Compute Preemption: Yes Supports Cooperative Kernel Launch: Yes Supports MultiDevice Co-op Kernel Launch: No Device PCI Domain ID / Bus ID / location ID: 0 / 65 / 0 Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 2: "NVIDIA GeForce RTX 3090" CUDA Driver Version / Runtime Version 12.2 / 12.2 CUDA Capability Major/Minor version number: 8.6 Total amount of global memory: 24576 MBytes (25769279488 bytes) (082) Multiprocessors, (128) CUDA Cores/MP: 10496 CUDA Cores GPU Max Clock rate: 1800 MHz (1.80 GHz) Memory Clock rate: 9751 Mhz Memory Bus Width: 384-bit L2 Cache Size: 6291456 bytes Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total shared memory per multiprocessor: 102400 bytes Total number of registers available per block: 65536 Warp size: 32 Maximum number of threads per multiprocessor: 1536 Maximum number of threads per block: 1024 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Maximum memory pitch: 2147483647 bytes Texture alignment: 512 bytes Concurrent copy and kernel execution: Yes with 2 copy engine(s) Run time limit on kernels: Yes Integrated GPU sharing Host Memory: No Support host page-locked memory mapping: Yes Alignment requirement for Surfaces: Yes Device has ECC support: Disabled Device supports Unified Addressing (UVA): Yes Device supports Managed Memory: Yes Device supports Compute Preemption: Yes Supports Cooperative Kernel Launch: Yes Supports MultiDevice Co-op Kernel Launch: No Device PCI Domain ID / Bus ID / location ID: 0 / 97 / 0 Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 12.2, NumDevs = 3 Result = PASS
docker info
Diagnostics ID
CC53A74D-0891-4068-868B-0FD1A097A965/20230825153431
Additional Info