ashawkey / stable-dreamfusion

Text-to-3D & Image-to-3D & Mesh Exportation with NeRF + Diffusion.
Apache License 2.0

CUDA Error: Unknown Error #70

Open aptrn opened 1 year ago

aptrn commented 1 year ago

Hello,

I'm trying to run stable-dreamfusion on Windows via Docker with WSL2, running on an RTX 3070 Ti (8 GB). The issue is that whenever I run any script without the "--test" flag, I always end up with "RuntimeError: CUDA error: unknown error".
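For reference, a quick way to confirm that PyTorch inside the container can reach the GPU at all is something like the snippet below (just a minimal sketch of mine, not part of the repo). If this already fails, the problem is the CUDA/WSL2 setup rather than stable-dreamfusion itself.

import torch

print(torch.__version__, torch.version.cuda)                      # torch build and its CUDA version
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))   # should report the 3070 Ti
x = torch.ones(1024, 1024, device="cuda")
print((x @ x).sum().item())                                        # expect 1073741824.0 if the GPU works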

Here's the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.05    Driver Version: 522.25       CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| 30%   42C    P5    26W / 310W |    658MiB /  8192MiB |     22%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This is the command I used to test:

CUDA_LAUNCH_BLOCKING=1 python3 main.py --text "a hamburger" --workspace trial --fp16 --save_mesh

And here's the output:

Namespace(H=800, O=False, O2=False, W=800, albedo_iters=1000, angle_front=60, angle_overhead=30, backbone='grid', bg_radius=1.4, bound=1, ckpt='latest', cuda_ray=False, density_thresh=10, dir_text=False, dt_gamma=0, eval_interval=10, fovy=60, fovy_range=[40, 70], fp16=True, gui=False, guidance='stable-diffusion', h=64, iters=10000, jitter_pose=False, lambda_entropy=0.0001, lambda_opacity=0, lambda_orient=0.01, lambda_smooth=0, light_phi=0, light_theta=60, lr=0.001, max_ray_batch=4096, max_spp=1, max_steps=1024, min_near=0.1, negative='', negative_dir_text=False, num_steps=64, radius=3, radius_range=[1.0, 1.5], save_mesh=True, seed=0, test=False, text='a hamburger', update_extra_interval=16, upsample_steps=64, w=64, workspace='trial')
NeRFNetwork(
  (encoder): GridEncoder: input_dim=3 num_levels=16 level_dim=2 resolution=16 -> 2048 per_level_scale=1.3819 params=(903480, 2) gridtype=tiled align_corners=False
  (sigma_net): MLP(
    (net): ModuleList(
      (0): Linear(in_features=32, out_features=64, bias=True)
      (1): Linear(in_features=64, out_features=64, bias=True)
      (2): Linear(in_features=64, out_features=4, bias=True)
    )
  )
  (encoder_bg): FreqEncoder: input_dim=3 degree=6 output_dim=39
  (bg_net): MLP(
    (net): ModuleList(
      (0): Linear(in_features=39, out_features=64, bias=True)
      (1): Linear(in_features=64, out_features=3, bias=True)
    )
  )
)
[INFO] loaded hugging face access token from ./TOKEN!
[INFO] loading stable diffusion...
[INFO] loaded stable diffusion!
[INFO] Trainer: df | 2022-11-02_14-40-19 | cuda | fp16 | trial
[INFO] #parameters: 1816247
[INFO] Loading latest checkpoint ...
[WARN] No checkpoint found, model randomly initialized.
==> Start Training trial Epoch 1, lr=0.010000 ...
  0% 0/100 [00:00<?, ?it/s]Traceback (most recent call last):
  File "main.py", line 157, in <module>
    trainer.train(train_loader, valid_loader, max_epoch)
  File "/home/folder/nerf/utils.py", line 486, in train
    self.train_one_epoch(train_loader)
  File "/home/folder/nerf/utils.py", line 706, in train_one_epoch
    pred_rgbs, pred_ws, loss = self.train_step(data)
  File "/home/folder/nerf/utils.py", line 379, in train_step
    loss = self.guidance.train_step(text_z, pred_rgb)
  File "/home/folder/nerf/sd.py", line 98, in train_step
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/unet_2d_condition.py", line 296, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/unet_blocks.py", line 563, in forward
    hidden_states = attn(hidden_states, context=encoder_hidden_states)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/attention.py", line 162, in forward
    hidden_states = block(hidden_states, context=context)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/attention.py", line 211, in forward
    hidden_states = self.attn1(self.norm1(hidden_states)) + hidden_states
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/attention.py", line 283, in forward
    hidden_states = self._attention(query, key, value)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/attention.py", line 291, in _attention
    attention_scores = torch.matmul(query, key.transpose(-1, -2)) * self.scale
RuntimeError: CUDA error: unknown error
  0% 0/100 [00:01<?, ?it/s]

I've read that you need 12 GB of VRAM to run this, but I wanted to try anyway. The error doesn't explicitly say I'm running out of VRAM, so I'm asking here for a double check from more expert eyes than mine.
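To see whether it really is a memory issue, one thing I can try is an attention-sized matmul on its own, roughly matching the shapes at the failing line (a sketch of mine with approximate shapes, not code from the repo):

import torch

free, total = torch.cuda.mem_get_info()
print(f"free {free / 1024**3:.2f} GiB of {total / 1024**3:.2f} GiB")

try:
    # roughly the size of the self-attention score matrix in the SD UNet at 64x64 latents
    q = torch.randn(16, 4096, 40, device="cuda", dtype=torch.float16)
    k = torch.randn(16, 4096, 40, device="cuda", dtype=torch.float16)
    scores = torch.matmul(q, k.transpose(-1, -2))      # [16, 4096, 4096] ~ 0.5 GiB in fp16
    torch.cuda.synchronize()
    print("ok, peak allocated:", torch.cuda.max_memory_allocated() / 1024**3, "GiB")
except RuntimeError as e:
    # under WSL2 an out-of-memory condition may surface as "unknown error" instead of a clear OOM
    print("CUDA failure:", e)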

bunswoDS commented 1 year ago

I ran out of memory running it on a Google Colab NVIDIA Tesla T4 after about 7 epochs, so I'm switching to an A100 with high memory to see if that works.

yuanzhi-zhu commented 1 year ago

It seems that under WSL the error message is not always explicit (most probably it's OOM in your case), but you can check the GPU memory usage in Task Manager.
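If Task Manager isn't convenient (e.g. from inside the container), you can also poll device-wide memory usage from Python while training runs in another terminal; this is just a rough sketch, and note that importing torch and touching CUDA itself reserves a few hundred MB for its own context:

import time
import torch

while True:
    free, total = torch.cuda.mem_get_info()            # device-wide, so it also reflects other processes
    print(f"used {(total - free) / 1024**3:.2f} GiB / {total / 1024**3:.2f} GiB", end="\r")
    time.sleep(1)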