RuntimeError: CUDA error: invalid argument

OrangeSodahub commented 3 months ago

Hi, I followed your installation instructions:

(hifa) $ python -V
Python 3.9.19
(hifa) $ pip show torch
Name: torch
Version: 2.0.0+cu117
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /home/.../envs/hifa/lib/python3.9/site-packages
Requires: filelock, jinja2, networkx, sympy, triton, typing-extensions
Required-by: accelerate, carvekit-colab, invisible-watermark, pytorch-lightning, taming-transformers, torch-ema, torchmetrics, torchvision, triton
(hifa) $ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
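
The output above shows a PyTorch wheel tagged `cu117` and an nvcc release of 11.7, so the wheel and the local toolkit appear to agree. A minimal sketch of that comparison in plain Python, assuming the version strings are copied by hand from the output above and that the last digit of the `cuXYZ` tag is the minor version (the helper names `cuda_tag_from_torch_version` and `cuda_release_from_nvcc` are hypothetical, not part of PyTorch or HiFA):

```python
import re
from typing import Optional


def cuda_tag_from_torch_version(version: str) -> Optional[str]:
    """Return e.g. '11.7' from '2.0.0+cu117', or None if there is no +cuXYZ suffix.

    Assumption: the last digit of the tag is the minor version.
    """
    match = re.search(r"\+cu(\d+)(\d)$", version)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


def cuda_release_from_nvcc(output: str) -> Optional[str]:
    """Return e.g. '11.7' from the 'release 11.7' part of `nvcc -V` output."""
    match = re.search(r"release (\d+\.\d+)", output)
    return match.group(1) if match else None


# Strings copied from the `pip show torch` and `nvcc -V` output above.
torch_version = "2.0.0+cu117"
nvcc_line = "Cuda compilation tools, release 11.7, V11.7.99"

wheel_cuda = cuda_tag_from_torch_version(torch_version)
toolkit_cuda = cuda_release_from_nvcc(nvcc_line)
versions_match = wheel_cuda == toolkit_cuda
print(wheel_cuda, toolkit_cuda, versions_match)
```

A match here only rules out a wheel/toolkit mismatch; it does not by itself explain a CUDA runtime error.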

However, when I run main.py, this error occurs:

│    921 │   │   │   │   self.save_checkpoint(full=True, best=False)                               │
│                                                                                                  │
│ /home/.../HiFA/nerf/utils.py:1192 in train_one_epoch                                        │
│                                                                                                  │
│   1189 │   │   │                                                                                 │
│   1190 │   │   │   # loss.backward()                                                             │
│   1191 │   │   │   start = time.time()                                                           │
│ ❱ 1192 │   │   │   self.scaler.scale(loss).backward()                                            │
│   1193 │   │   │                                                                                 │
│   1194 │   │   │   self.post_train_step()                                                        │
│   1195 │   │   │   # self.optimizer.step()                                                       │
│                                                                                                  │
│ /home/.../envs/hifa/lib/python3.9/site-packages/torch/_tensor.py:487 in     │
│ backward                                                                                         │
│                                                                                                  │
│    484 │   │   │   │   create_graph=create_graph,                                                │
│    485 │   │   │   │   inputs=inputs,                                                            │
│    486 │   │   │   )                                                                             │
│ ❱  487 │   │   torch.autograd.backward(                                                          │
│    488 │   │   │   self, gradient, retain_graph, create_graph, inputs=inputs                     │
│    489 │   │   )                                                                                 │
│    490                                                                                           │
│                                                                                                  │
│ /home/.../envs/hifa/lib/python3.9/site-packages/torch/autograd/__init__.py: │
│ 200 in backward                                                                                  │
│                                                                                                  │
│   197 │   # The reason we repeat same the comment below is that                                  │
│   198 │   # some Python versions print out the first line of a multi-line function               │
│   199 │   # calls in the traceback and some print out the last line                              │
│ ❱ 200 │   Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the bac   │
│   201 │   │   tensors, grad_tensors_, retain_graph, create_graph, inputs,                        │
│   202 │   │   allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to ru   │
│   203                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: invalid argument
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
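
CUDA kernels are launched asynchronously, so the Python frame shown in a traceback like the one above is not necessarily the operation that actually failed. One common way to narrow it down is to set the `CUDA_LAUNCH_BLOCKING` environment variable to `1` before CUDA is initialized. A minimal sketch, assuming it is placed at the very top of the entry script, before `torch` is imported (the helper name `enable_blocking_cuda_launches` is hypothetical):

```python
import os


def enable_blocking_cuda_launches() -> None:
    """Ask the CUDA runtime to run kernel launches synchronously.

    Must be called before CUDA is initialized, so in practice before
    the first `import torch` in the process.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_blocking_cuda_launches()
# import torch  # import only after the variable has been set
```

Setting the variable on the command line, as in `CUDA_LAUNCH_BLOCKING=1 python main.py`, has the same effect. Training will run more slowly with this setting, so it is only meant for debugging.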

JunzheJosephZhu commented 3 months ago

https://github.com/JunzheJosephZhu/HiFA/issues/9#issuecomment-2014161812

JiuTongBro commented 3 months ago

+1
