CarperAI / DRLX

Diffusion Reinforcement Learning Library
MIT License

CUDA Error on second epoch #28

Open nbardy opened 1 year ago

nbardy commented 1 year ago

Seeing an unknown CUDA error on the second epoch. Will try to debug more tomorrow.

Traceback (most recent call last):
  File "/home/paperspace/git/DRLX/train_aesthetics.py", line 12, in <module>
    trainer.train(pipe, Aesthetics())
  File "/home/paperspace/git/DRLX/src/drlx/trainer/ddpo_trainer.py", line 313, in train
    if self.config.train.total_samples is not None:
  File "/home/paperspace/git/DRLX/src/drlx/trainer/ddpo_trainer.py", line 313, in <listcomp>
    if self.config.train.total_samples is not None:
  File "/home/paperspace/.pyenv/versions/3.9.17/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/paperspace/git/DRLX/src/drlx/denoisers/ldm_unet.py", line 125, in postprocess
    images = images.detach().cpu().permute(0,2,3,1).numpy()
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
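
The error message suggests `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous so the traceback points at the call that actually failed rather than a later one such as `.cpu()`. A minimal sketch of rerunning the script with that variable set follows. The helper names `blocking_cuda_env` and `rerun_with_blocking` are hypothetical and not part of DRLX; the script name is taken from the traceback above.

```python
import os
import subprocess
import sys


def blocking_cuda_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled.

    Hypothetical helper for illustration; it does not modify the input mapping.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_with_blocking(script="train_aesthetics.py"):
    """Run the training script in a child process with the flag set.

    The variable has to be in place before CUDA initializes, so setting it
    on a fresh child process is the simplest way to guarantee that.
    """
    return subprocess.run(
        [sys.executable, script],
        env=blocking_cuda_env(),
        check=False,  # let the caller inspect returncode instead of raising
    )
```

Calling `rerun_with_blocking()` from the repository root is roughly equivalent to running `CUDA_LAUNCH_BLOCKING=1 python train_aesthetics.py` in a shell. Expect the run to be slower, since every kernel launch waits for completion.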


nbardy commented 1 year ago

(base) paperspace@psy0glj6t:~$ nvidia-smi
Unable to determine the device handle for GPU0000:00:05.0: Unknown Error

The error also seems to have borked the GPU badly enough that the machine needs a restart.
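
Since the GPU can end up in a state where even `nvidia-smi` fails, it may help to check that the driver still responds before starting another run. A rough sketch, assuming `nvidia-smi` exits with a nonzero status when it cannot reach the device; `gpu_responds` is a hypothetical helper, not part of DRLX.

```python
import shutil
import subprocess


def gpu_responds(timeout_s=30):
    """Return True if `nvidia-smi` runs and exits with status 0.

    Assumption: a failure like the "Unable to determine the device handle"
    message above is reported through a nonzero exit status.
    """
    if shutil.which("nvidia-smi") is None:
        return False  # tool not installed or not on PATH
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False  # hung or could not be executed
    return result.returncode == 0
```

A training wrapper could call `gpu_responds()` first and stop with a clear message when it returns `False`, instead of failing later inside the sampling loop.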