How are you starting your docker container? It depends a lot on the exact versions of docker, nvidia-docker, docker-compose, and the CUDA drivers; it seems like they change the way devices are discovered and passed through to the container every other release.
Thanks for your response. I am using the command
docker run --gpus all --rm -it -p 9099:9099 cmusatyalab/openrtist:stable
It gives the error message. I have installed the CUDA drivers and the cuDNN library on Ubuntu 20.04. I also tried the installation from source for GPU, but it has the same issue: CUDNN_STATUS_EXECUTION_FAILED.
Please guide me on how I can solve this issue. Thanks. :heart:
Can you paste the output of nvidia-smi both on the host and from within the container? You can get a shell inside the container by simply running docker run --gpus all --rm -it --entrypoint /bin/bash cmusatyalab/openrtist:stable
Thanks for the response. I have attached the output from nvidia-smi from the host and from the container.
Tue May 9 13:20:42 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 69C P5 18W / N/A | 603MiB / 5946MiB | 21% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1800 G /usr/lib/xorg/Xorg 45MiB |
| 0 N/A N/A 3518 G /usr/lib/xorg/Xorg 223MiB |
| 0 N/A N/A 3900 G /usr/bin/gnome-shell 64MiB |
| 0 N/A N/A 5471 G ...383287598542382230,131072 52MiB |
| 0 N/A N/A 9892 G gnome-control-center 2MiB |
| 0 N/A N/A 11364 G ...RendererForSitePerProcess 10MiB |
| 0 N/A N/A 23662 G gzserver 99MiB |
| 0 N/A N/A 23685 G gzclient 92MiB |
+-----------------------------------------------------------------------------+
Tue May 9 11:48:11 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 65C P8 16W / N/A | 432MiB / 5946MiB | 15% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
It seems like CUDA was only partially installed. Typically the CUDA installation also includes all the cuDNN libraries. How did you install CUDA initially? You may have to uninstall (with the runfile or apt purge nvidia cuda) and then reinstall. I usually use this page to select my OS/arch and select the local deb option.
This matrix may help us home in on the issue. Depending on your OS there are particular minimum kernels that are supported (which you can find with uname -r). It does appear that your NVIDIA driver version is supported by CUDA 11.4, as the minimum is 450.80.02 for Linux.
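As an additional sanity check from inside the container (a minimal sketch, assuming the image's Python environment has torch on its path), something like the following prints whether PyTorch itself can see the GPU and which CUDA/cuDNN build it was compiled against, which can differ from what the driver reports:

import torch

print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)  # CUDA runtime torch was compiled against
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
print("cuDNN available:", torch.backends.cudnn.is_available())
print("cuDNN version:", torch.backends.cudnn.version())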
I tried to install CUDA 12.1 and it was installed correctly, but it still gives the same error as before. Adding the nvidia-smi logs for the host and the Docker container:
Tue May 9 16:07:28 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 L... On | 00000000:01:00.0 Off | N/A |
| N/A 56C P8 10W / 55W| 593MiB / 6144MiB | 6% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1951 G /usr/lib/xorg/Xorg 45MiB |
| 0 N/A N/A 3471 G /usr/lib/xorg/Xorg 196MiB |
| 0 N/A N/A 3864 G /usr/bin/gnome-shell 92MiB |
| 0 N/A N/A 7973 G ...76114573,9971327053354293095,262144 186MiB |
| 0 N/A N/A 22960 G ...,WinRetrieveSuggestionsOnlyOnDemand 61MiB |
+---------------------------------------------------------------------------------------+
Tue May 9 14:09:56 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 L... On | 00000000:01:00.0 Off | N/A |
| N/A 54C P8 9W / 55W| 553MiB / 6144MiB | 15% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
I also tried testing the installation from source and it gives the same error message. Can you please help me solve this issue?
Thanks :)
When do you receive this error? Is it when the container first launches? Is it after you connect a client and try to infer an image? Does it work if you use docker run --rm -it -p 9099:9099 cmusatyalab/openrtist:stable to run with CPU only? There is an asyncio error prior to the cuDNN one in your stack trace, so I am wondering if the cuDNN message is an effect rather than a cause. If the container launches and loads the model (I believe it prints a message that it has finished initialization), then the model is getting loaded onto the GPU.
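To isolate cuDNN specifically, here is a rough standalone repro you could run inside the GPU container (a sketch only; the layer sizes merely mirror the spirit of the first conv in transformer_net.py). If this also raises CUDNN_STATUS_EXECUTION_FAILED, the problem is in the CUDA/cuDNN stack rather than in OpenRTiST itself:

import torch
import torch.nn as nn

device = torch.device("cuda")

# A plain matmul exercises cuBLAS, while a convolution exercises cuDNN;
# comparing the two helps pin the failure on cuDNN specifically.
a = torch.randn(256, 256, device=device)
b = torch.randn(256, 256, device=device)
print("matmul ok:", tuple((a @ b).shape))

conv = nn.Conv2d(3, 32, kernel_size=9, stride=1).to(device)
x = torch.randn(1, 3, 240, 320, device=device)
with torch.no_grad():
    y = conv(x)
print("conv ok:", tuple(y.shape))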
Yes, the error comes when the container first launches, after about 1 minute.
No, I am waiting for the container to initialize the model and print the message to connect the client.
Yes, it works with docker run --rm -it -p 9099:9099 cmusatyalab/openrtist:stable with CPU only.
I didn't see any initialization-finished message because of the error: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
We never determined which kernel you are using. What does uname -r reveal? Does that kernel meet the minimum required in the matrix I posted earlier?
Yes, the output of uname -r is:
5.15.0-69-generic
How did you obtain the container image? Did you pull it from Docker hub or build it locally? It looks like the stable and latest images on Docker hub could be out of date, so I wonder if it would succeed if you build it locally (docker build -t cmusatyalab/openrtist:dev . from the openrtist root directory) and then launch it using the dev tag.
Output of nvidia-smi:
$ nvidia-smi
Wed May 10 14:24:56 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 L... On | 00000000:01:00.0 Off | N/A |
| N/A 56C P8 10W / 55W| 301MiB / 6144MiB | 10% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1953 G /usr/lib/xorg/Xorg 45MiB |
| 0 N/A N/A 3645 G /usr/lib/xorg/Xorg 93MiB |
| 0 N/A N/A 4272 G /usr/bin/gnome-shell 52MiB |
| 0 N/A N/A 16537 G ...937287336,443561402869045485,262144 99MiB |
+---------------------------------------------------------------------------------------+
Output of nvcc:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
Thanks, I will try this.
I built the container locally and it gives an error related to PIL:
INFO:__main__:Detected GPU / CUDA support
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/gabriel_server/local_engine.py", line 67, in _run_engine
engine = engine_factory()
File "./main.py", line 118, in engine_setup
adapter = create_adapter(args.openvino, args.cpu_only, args.torch, args.myriad)
File "./main.py", line 42, in create_adapter
from torch_adapter import TorchAdapter
File "/openrtist/server/torch_adapter.py", line 35, in <module>
from torchvision import transforms
File "/usr/local/lib/python3.7/dist-packages/torchvision/__init__.py", line 4, in <module>
from torchvision import datasets
File "/usr/local/lib/python3.7/dist-packages/torchvision/datasets/__init__.py", line 9, in <module>
from .fakedata import FakeData
File "/usr/local/lib/python3.7/dist-packages/torchvision/datasets/fakedata.py", line 3, in <module>
from .. import transforms
File "/usr/local/lib/python3.7/dist-packages/torchvision/transforms/__init__.py", line 1, in <module>
from .transforms import *
File "/usr/local/lib/python3.7/dist-packages/torchvision/transforms/transforms.py", line 17, in <module>
from . import functional as F
File "/usr/local/lib/python3.7/dist-packages/torchvision/transforms/functional.py", line 5, in <module>
from PIL import Image, ImageOps, ImageEnhance, PILLOW_VERSION
ImportError: cannot import name 'PILLOW_VERSION' from 'PIL' (/usr/local/lib/python3.7/dist-packages/PIL/__init__.py)
Let me build it locally and see if I can reproduce. Once we get the local version building I can push it to Docker hub since that seems to be dated. We used to have CI that did this, but it appears to have stopped working.
Thanks a lot
I was able to reproduce locally. It looks like PILLOW_VERSION was removed in PIL 7. torchvision > 0.5.0 apparently has a fix for this (by using __version__ instead of PILLOW_VERSION), but rather than change the torchvision version, I simply added pillow<7 to server/requirements.txt. If you add this and then rebuild the image, it should work. Alternatively you could wait for me to push the change to GitHub and the new image to Docker hub, but I may not get around to doing that today.
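For reference, the breakage boils down to this import pattern; a minimal illustration (the fallback mirrors what newer torchvision releases do, per the note above):

# PILLOW_VERSION was removed in Pillow 7; newer code falls back to __version__.
try:
    from PIL import PILLOW_VERSION as pillow_version
except ImportError:
    from PIL import __version__ as pillow_version

print("Pillow reports:", pillow_version)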
Thanks a lot, I will try to make these changes and test.
diff --git a/server/requirements.txt b/server/requirements.txt
index cda7429..daed7dd 100644
--- a/server/requirements.txt
+++ b/server/requirements.txt
@@ -1,5 +1,6 @@
gabriel-server==2.1.1
opencv-python>=3, <5
+pillow<7
torchvision>=0.3, <0.5
py-cpuinfo
#required for MS Face Cognitive Service
Thanks, but unfortunately I am still getting the same issue, so I believe the issue is related to the GPU driver. I will try testing it on Ubuntu 18.04 with the GPU and update if it works.
Thanks for your support.
Log from the dev Docker container built locally:
$docker run --gpus all --rm -it -p 9099:9099 cmusatyalab/openrtist:dev
[setupvars.sh] OpenVINO environment initialized
==============NVSMI LOG==============
Timestamp : Wed May 10 21:01:25 2023
Driver Version : 530.30.02
CUDA Version : 12.1
Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : NVIDIA GeForce RTX 3060 Laptop GPU
Product Brand : GeForce
Product Architecture : Ampere
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-5e96b00a-ab98-d2de-f742-144327a66a53
Minor Number : 0
VBIOS Version : 94.06.1E.00.28
MultiGPU Board : No
Board ID : 0x100
Board Part Number : N/A
GPU Part Number : 2520-775-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G001.0000.03.03
OEM Object : 2.0
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x252010DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x16F21043
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Device Current : 1
Device Max : 4
Host Max : 4
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 2000 KB/s
Rx Throughput : 1000 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 6144 MiB
Reserved : 206 MiB
Used : 169 MiB
Free : 5768 MiB
BAR1 Memory Usage
Total : 8192 MiB
Used : 5 MiB
Free : 8187 MiB
Compute Mode : Default
Utilization
Gpu : 16 %
Memory : 4 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 30 C
GPU Shutdown Temp : 105 C
GPU Slowdown Temp : 102 C
GPU Max Operating Temp : 87 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 11.03 W
Power Limit : 80.00 W
Default Power Limit : 80.00 W
Enforced Power Limit : 55.00 W
Min Power Limit : 1.00 W
Max Power Limit : 95.00 W
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 7001 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 662.500 mV
Fabric
State : N/A
Status : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 1914
Type : G
Name :
Used GPU Memory : 45 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 3460
Type : G
Name :
Used GPU Memory : 62 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 3843
Type : G
Name :
Used GPU Memory : 55 MiB
INFO:__main__:Detected GPU / CUDA support
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/gabriel_server/local_engine.py", line 67, in _run_engine
engine = engine_factory()
File "./main.py", line 118, in engine_setup
adapter = create_adapter(args.openvino, args.cpu_only, args.torch, args.myriad)
File "./main.py", line 44, in create_adapter
return TorchAdapter(False, DEFAULT_STYLE)
File "/openrtist/server/torch_adapter.py", line 73, in __init__
_ = self.inference(preprocessed)
File "/openrtist/server/torch_adapter.py", line 87, in inference
output = self.style_model(preprocessed)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/openrtist/server/transformer_net.py", line 63, in forward
y = self.relu(self.in1(self.conv1(X)))
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/openrtist/server/transformer_net.py", line 87, in forward
out = self.conv2d(out)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 345, in forward
return self.conv2d_forward(input, self.weight)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Hi @sachinkum0009,
I also tested it on my laptop this morning, running Ubuntu 22.04. The NVIDIA drivers had already been installed as part of the OS installation. I installed docker and the nvidia-container-toolkit, added my user to the docker group so I could execute containers without root, and then just pulled/ran the stable container with docker run --gpus all --rm -it -p 9099:9099 cmusatyalab/openrtist:stable. My recommendation would be to start from scratch if possible. If that machine isn't used for other purposes, start with a clean install. If that is not possible, try removing all NVIDIA drivers (using either apt purge or the runfile, depending on how you installed them).
Thanks for your support. I tried the installation from scratch using the Anaconda version of PyTorch and it is working fine. I believe the issue was related to a CUDA version mismatch.
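For anyone hitting this later, a quick way to check the mismatch theory (a small sketch, assuming the relevant Python environment is active) is to compare the CUDA runtime PyTorch was built against with the CUDA version shown in the nvidia-smi header; the conda build of PyTorch bundles its own CUDA/cuDNN runtime, so only the host driver has to be new enough:

import torch

# Compare this against the "CUDA Version" shown in the nvidia-smi header.
print("torch:", torch.__version__)
print("bundled CUDA runtime:", torch.version.cuda)
print("bundled cuDNN:", torch.backends.cudnn.version())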
Hi, greetings.
I am facing an issue with the OpenRTiST server running in Docker with GPU. Can you please help with this issue? Below is the error log.