How are you starting your docker container? It depends a lot on the exact versions of docker, nvidia-docker, docker-compose, and the CUDA drivers; it seems like they change the way devices are discovered and passed through to the container every other release.
Thanks for your response. I am using the command
docker run --gpus all --rm -it -p 9099:9099 cmusatyalab/openrtist:stable
It gives the error message. I have installed the CUDA drivers and the cuDNN library on Ubuntu 20.04. I also tried the installation from source for GPU, but it has the same issue: CUDNN_STATUS_EXECUTION_FAILED.
Please guide me on how I can solve this issue. Thanks. :heart:
Can you paste the output of nvidia-smi both on the host and from within the container? You can get a shell inside the container by simply running docker run --gpus all --rm -it --entrypoint /bin/bash cmusatyalab/openrtist:stable
Thanks for the response. I have attached the output from nvidia-smi from the host and from the container.
Tue May 9 13:20:42 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 69C P5 18W / N/A | 603MiB / 5946MiB | 21% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1800 G /usr/lib/xorg/Xorg 45MiB |
| 0 N/A N/A 3518 G /usr/lib/xorg/Xorg 223MiB |
| 0 N/A N/A 3900 G /usr/bin/gnome-shell 64MiB |
| 0 N/A N/A 5471 G ...383287598542382230,131072 52MiB |
| 0 N/A N/A 9892 G gnome-control-center 2MiB |
| 0 N/A N/A 11364 G ...RendererForSitePerProcess 10MiB |
| 0 N/A N/A 23662 G gzserver 99MiB |
| 0 N/A N/A 23685 G gzclient 92MiB |
+-----------------------------------------------------------------------------+
Tue May 9 11:48:11 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 65C P8 16W / N/A | 432MiB / 5946MiB | 15% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
It seems like CUDA was only partially installed. Typically the CUDA installation also includes all the cuDNN libraries. How did you install CUDA initially? You may have to uninstall (with the runfile or apt purge nvidia cuda) and then reinstall. I usually use this page to select my OS/arch and select the local deb option.
This matrix may help us home in on the issue. Depending on your OS there are particular minimum kernels that are supported (which you can find with uname -r). It does appear that your NVIDIA driver version is supported by CUDA 11.4, as the minimum is 450.80.02 for Linux.
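As an additional sanity check from inside the container (a minimal sketch, assuming the image's Python environment has torch on its path), something like the following prints whether PyTorch itself can see the GPU and which CUDA/cuDNN build it was compiled against, which can differ from what the driver reports:

import torch

print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)  # CUDA runtime torch was compiled against
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
print("cuDNN available:", torch.backends.cudnn.is_available())
print("cuDNN version:", torch.backends.cudnn.version())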
I tried to install CUDA 12.1 and it was installed correctly, but it still gives the same error as before. Adding the nvidia-smi logs for the host and the Docker container:
Tue May 9 16:07:28 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 L... On | 00000000:01:00.0 Off | N/A |
| N/A 56C P8 10W / 55W| 593MiB / 6144MiB | 6% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1951 G /usr/lib/xorg/Xorg 45MiB |
| 0 N/A N/A 3471 G /usr/lib/xorg/Xorg 196MiB |
| 0 N/A N/A 3864 G /usr/bin/gnome-shell 92MiB |
| 0 N/A N/A 7973 G ...76114573,9971327053354293095,262144 186MiB |
| 0 N/A N/A 22960 G ...,WinRetrieveSuggestionsOnlyOnDemand 61MiB |
+---------------------------------------------------------------------------------------+
Tue May 9 14:09:56 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 L... On | 00000000:01:00.0 Off | N/A |
| N/A 54C P8 9W / 55W| 553MiB / 6144MiB | 15% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
I also tried testing the installation from source and it gives the same error message. Can you please help me solve this issue?
Thanks :)
When do you receive this error? Is it when the container first launches? Is it after you connect a client and try to infer an image? Does it work if you use docker run --rm -it -p 9099:9099 cmusatyalab/openrtist:stable to run with CPU only? There is an asyncio error prior to the cuDNN one in your stack trace, so I am wondering if the cuDNN message is an effect rather than a cause. If the container launches and loads the model (I believe it prints a message that it has finished initialization), then the model is getting loaded onto the GPU.
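To isolate cuDNN specifically, here is a rough standalone repro you could run inside the GPU container (a sketch only; the layer sizes merely mirror the spirit of the first conv in transformer_net.py). If this also raises CUDNN_STATUS_EXECUTION_FAILED, the problem is in the CUDA/cuDNN stack rather than in OpenRTiST itself:

import torch
import torch.nn as nn

device = torch.device("cuda")

# A plain matmul exercises cuBLAS, while a convolution exercises cuDNN;
# comparing the two helps pin the failure on cuDNN specifically.
a = torch.randn(256, 256, device=device)
b = torch.randn(256, 256, device=device)
print("matmul ok:", tuple((a @ b).shape))

conv = nn.Conv2d(3, 32, kernel_size=9, stride=1).to(device)
x = torch.randn(1, 3, 240, 320, device=device)
with torch.no_grad():
    y = conv(x)
print("conv ok:", tuple(y.shape))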
Yes, the error comes when the container first launches, after about 1 minute.
No, I am waiting for the container to initialize the model and print the message to connect the client.
Yes, it works with docker run --rm -it -p 9099:9099 cmusatyalab/openrtist:stable with CPU only.
I didn't see any initialization-finished message because of the error: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
We never determined which kernel you are using. What does uname -r reveal? Does that kernel meet the minimum required in the matrix I posted earlier?
Yes, the output of uname -r is:
5.15.0-69-generic
How did you obtain the container image? Did you pull it from Docker hub or build it locally? It looks like the stable and latest images on Docker hub could be out of date, so I wonder if it would succeed if you build it locally (docker build -t cmusatyalab/openrtist:dev . from the openrtist root directory) and then launch it using the dev tag.
Output of nvidia-smi:
$ nvidia-smi
Wed May 10 14:24:56 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 L... On | 00000000:01:00.0 Off | N/A |
| N/A 56C P8 10W / 55W| 301MiB / 6144MiB | 10% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1953 G /usr/lib/xorg/Xorg 45MiB |
| 0 N/A N/A 3645 G /usr/lib/xorg/Xorg 93MiB |
| 0 N/A N/A 4272 G /usr/bin/gnome-shell 52MiB |
| 0 N/A N/A 16537 G ...937287336,443561402869045485,262144 99MiB |
+---------------------------------------------------------------------------------------+
Output of nvcc:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
Thanks, I will try this.
I built the container locally and it gives an error related to PIL:
INFO:__main__:Detected GPU / CUDA support
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/gabriel_server/local_engine.py", line 67, in _run_engine
engine = engine_factory()
File "./main.py", line 118, in engine_setup
adapter = create_adapter(args.openvino, args.cpu_only, args.torch, args.myriad)
File "./main.py", line 42, in create_adapter
from torch_adapter import TorchAdapter
File "/openrtist/server/torch_adapter.py", line 35, in <module>
from torchvision import transforms
File "/usr/local/lib/python3.7/dist-packages/torchvision/__init__.py", line 4, in <module>
from torchvision import datasets
File "/usr/local/lib/python3.7/dist-packages/torchvision/datasets/__init__.py", line 9, in <module>
from .fakedata import FakeData
File "/usr/local/lib/python3.7/dist-packages/torchvision/datasets/fakedata.py", line 3, in <module>
from .. import transforms
File "/usr/local/lib/python3.7/dist-packages/torchvision/transforms/__init__.py", line 1, in <module>
from .transforms import *
File "/usr/local/lib/python3.7/dist-packages/torchvision/transforms/transforms.py", line 17, in <module>
from . import functional as F
File "/usr/local/lib/python3.7/dist-packages/torchvision/transforms/functional.py", line 5, in <module>
from PIL import Image, ImageOps, ImageEnhance, PILLOW_VERSION
ImportError: cannot import name 'PILLOW_VERSION' from 'PIL' (/usr/local/lib/python3.7/dist-packages/PIL/__init__.py)
Let me build it locally and see if I can reproduce. Once we get the local version building I can push it to Docker hub since that seems to be dated. We used to have CI that did this, but it appears to have stopped working.
Thanks a lot
I was able to reproduce locally. It looks like PILLOW_VERSION was removed in PIL 7. torchvision > 0.5.0 apparently has a fix for this (by using __version__ instead of PILLOW_VERSION), but rather than change the torchvision version, I simply added pillow<7 to server/requirements.txt. If you add this and then rebuild the image, it should work. Alternatively you could wait for me to push the change to GitHub and the new image to Docker hub, but I may not get around to doing that today.
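For reference, the breakage boils down to this import pattern; a minimal illustration (the fallback mirrors what newer torchvision releases do, per the note above):

# PILLOW_VERSION was removed in Pillow 7; newer code falls back to __version__.
try:
    from PIL import PILLOW_VERSION as pillow_version
except ImportError:
    from PIL import __version__ as pillow_version

print("Pillow reports:", pillow_version)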
Thanks a lot, I will try to make these changes and test.
diff --git a/server/requirements.txt b/server/requirements.txt
index cda7429..daed7dd 100644
--- a/server/requirements.txt
+++ b/server/requirements.txt
@@ -1,5 +1,6 @@
gabriel-server==2.1.1
opencv-python>=3, <5
+pillow<7
torchvision>=0.3, <0.5
py-cpuinfo
#required for MS Face Cognitive Service
Thanks, but unfortunately I am still getting the same issue, so I believe the issue is related to the GPU driver. I will try testing it on Ubuntu 18.04 with the GPU and update if it works.
Thanks for your support.
Log from the dev Docker container built locally:
$docker run --gpus all --rm -it -p 9099:9099 cmusatyalab/openrtist:dev
[setupvars.sh] OpenVINO environment initialized
==============NVSMI LOG==============
Timestamp : Wed May 10 21:01:25 2023
Driver Version : 530.30.02
CUDA Version : 12.1
Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : NVIDIA GeForce RTX 3060 Laptop GPU
Product Brand : GeForce
Product Architecture : Ampere
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-5e96b00a-ab98-d2de-f742-144327a66a53
Minor Number : 0
VBIOS Version : 94.06.1E.00.28
MultiGPU Board : No
Board ID : 0x100
Board Part Number : N/A
GPU Part Number : 2520-775-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G001.0000.03.03
OEM Object : 2.0
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x252010DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x16F21043
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Device Current : 1
Device Max : 4
Host Max : 4
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 2000 KB/s
Rx Throughput : 1000 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 6144 MiB
Reserved : 206 MiB
Used : 169 MiB
Free : 5768 MiB
BAR1 Memory Usage
Total : 8192 MiB
Used : 5 MiB
Free : 8187 MiB
Compute Mode : Default
Utilization
Gpu : 16 %
Memory : 4 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 30 C
GPU Shutdown Temp : 105 C
GPU Slowdown Temp : 102 C
GPU Max Operating Temp : 87 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 11.03 W
Power Limit : 80.00 W
Default Power Limit : 80.00 W
Enforced Power Limit : 55.00 W
Min Power Limit : 1.00 W
Max Power Limit : 95.00 W
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 7001 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 662.500 mV
Fabric
State : N/A
Status : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 1914
Type : G
Name :
Used GPU Memory : 45 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 3460
Type : G
Name :
Used GPU Memory : 62 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 3843
Type : G
Name :
Used GPU Memory : 55 MiB
INFO:__main__:Detected GPU / CUDA support
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/gabriel_server/local_engine.py", line 67, in _run_engine
engine = engine_factory()
File "./main.py", line 118, in engine_setup
adapter = create_adapter(args.openvino, args.cpu_only, args.torch, args.myriad)
File "./main.py", line 44, in create_adapter
return TorchAdapter(False, DEFAULT_STYLE)
File "/openrtist/server/torch_adapter.py", line 73, in __init__
_ = self.inference(preprocessed)
File "/openrtist/server/torch_adapter.py", line 87, in inference
output = self.style_model(preprocessed)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/openrtist/server/transformer_net.py", line 63, in forward
y = self.relu(self.in1(self.conv1(X)))
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/openrtist/server/transformer_net.py", line 87, in forward
out = self.conv2d(out)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 345, in forward
return self.conv2d_forward(input, self.weight)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Hi @sachinkum0009,
I also tested it on my laptop this morning, running Ubuntu 22.04. The NVIDIA drivers had already been installed as part of the OS installation. I installed docker and the nvidia-container-toolkit, added my user to the docker group so I could execute containers without root, and then just pulled/ran the stable container with docker run --gpus all --rm -it -p 9099:9099 cmusatyalab/openrtist:stable. My recommendation would be to start from scratch if possible. If that machine isn't used for other purposes, start with a clean install. If that is not possible, try removing all NVIDIA drivers (using either apt purge or the runfile, depending on how you installed them).
Thanks for your support. I tried the installation from scratch using the Anaconda version of PyTorch and it is working fine. I believe the issue was related to a CUDA version mismatch.
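For anyone hitting this later, a quick way to check the mismatch theory (a small sketch, assuming the relevant Python environment is active) is to compare the CUDA runtime PyTorch was built against with the CUDA version shown in the nvidia-smi header; the conda build of PyTorch bundles its own CUDA/cuDNN runtime, so only the host driver has to be new enough:

import torch

# Compare this against the "CUDA Version" shown in the nvidia-smi header.
print("torch:", torch.__version__)
print("bundled CUDA runtime:", torch.version.cuda)
print("bundled cuDNN:", torch.backends.cudnn.version())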
Hi, greetings.
I am facing an issue with the OpenRTiST server running in Docker with GPU. Can you please help with this issue? Below is the error log.