ashawkey / stable-dreamfusion

Text-to-3D & Image-to-3D & Mesh Exportation with NeRF + Diffusion.

3D object generation is slower and getting NaNs with the flag --cuda_ray #144

Closed. marcinplata closed this issue 1 year ago.

marcinplata commented 1 year ago

Description

Hi, I am using an RTX 2080 16 GB (laptop version), and while generating a 3D object I get NaNs pretty quickly (during the first epoch). Moreover, training is slower with --cuda_ray: around 2.8 it/s, versus around 3.1 it/s with the PyTorch raymarching path.

I installed everything following the description in the readme.md and had no issues.

Steps to Reproduce

Execute the script: python main.py --text "a beef hamburger on a ceramic plate" --workspace trial -O.

Then I am getting in the console:

==> Start Training trial Epoch 1, lr=0.050000 ...
  0% 0/100 [00:00<?, ?it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: :   1% 1/100 [00:00<01:24,  1.17it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: :   2% 2/100 [00:01<01:00,  1.62it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: :   3% 3/100 [00:01<00:53,  1.83it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: :   4% 4/100 [00:02<00:49,  1.94it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: :   5% 5/100 [00:02<00:47,  2.01it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: :   6% 6/100 [00:03<00:45,  2.05it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: :   7% 7/100 [00:03<00:44,  2.08it/s]NaN or Inf found in input tensor.

After changing only line 92 in main.py to opt.cuda_ray = False, I get the following in the console:

==> Start Training trial Epoch 1, lr=0.050000 ...
loss=0.0000 (0.0000), lr=0.050000: : 100% 100/100 [00:31<00:00,  3.14it/s]
==> Finished Epoch 1.
  0% 0/5 [00:00<?, ?it/s]++> Evaluate trial_sd_xffa at epoch 1 ...
loss=0.0000 (0.0000): : 100% 5/5 [00:02<00:00,  1.76it/s]
++> Evaluate epoch 1 Finished.
==> Start Training trial_sd_xffa Epoch 2, lr=0.050000 ...
loss=0.0000 (0.0000), lr=0.050000: : 100% 100/100 [00:32<00:00,  3.12it/s]
==> Finished Epoch 2.

and I am able to generate a nice 3D object.

Expected Behavior

No NaNs during training, and faster training when using --cuda_ray.

Environment

Ubuntu 20.04, conda environment, Python 3.10, PyTorch 1.13.1, CUDA 11.7.1, cudnn 8.5.0

claforte commented 1 year ago

I'm facing a similar issue on an A100. The code works fine on my RTX4090, although I set up the environment a few weeks ago. I wonder if there's a recent regression in the dependencies? What version of python are you using?

marcinplata commented 1 year ago

I'm facing a similar issue on an A100. The code works fine on my RTX4090, although I set up the environment a few weeks ago. I wonder if there's a recent regression in the dependencies? What version of python are you using?

Sorry, I missed it in the environment description. I am using Python 3.10.

claforte commented 1 year ago

In case it may help, here are the key differences between the two PCs. However, I'm using a heavily modified fork of Stable-DreamFusion, so it's very possible that the problem I'm facing is specific to my code and hyperparameters. Left is the config that doesn't work, right is the config that works: [image: side-by-side comparison of the two environments]

claforte commented 1 year ago

I updated the A100's venv to use the exact same versions of Python (3.9.13) and pip packages as my local RTX4090... but unfortunately that didn't resolve the problem. I'm starting to wonder if the problem is specific to the NVIDIA driver (510.73.08)...

marcinplata commented 1 year ago

Thanks @claforte. I observed that NaNs now happen faster (in the first epoch) after the latest updates to raymarching, among other changes. Before (commit d0517e2c0d1f13de9f35472d45d8da98acfa1777) I was getting NaNs around the 6th epoch with pretty much the same dependencies (I had just installed xformers). Surprisingly, on an A100 I was able to train 3D objects without getting NaNs using the code from d0517e2c0d1f13de9f35472d45d8da98acfa1777 with the latest packages and the Lambda Labs driver configuration.

claforte commented 1 year ago

What NVIDIA driver do you have in each environment? I think older versions might be the root cause. It works fine for me on 525.x, but fails on 510.x.
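
For reference, a quick way to print the installed driver version from Python (just a sketch, not part of this repo; it assumes nvidia-smi is on the PATH):

```python
# Sketch: query the installed NVIDIA driver version (assumes nvidia-smi is on PATH).
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # e.g. "525.x.x"
```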

claforte commented 1 year ago

BTW in my fork I'm almost current (i.e. it includes commit https://github.com/ashawkey/stable-dreamfusion/commit/d0517e2c0d1f13de9f35472d45d8da98acfa1777) and it works fine on my local PC. Never gets any NaNs.

MathieuTuli commented 1 year ago

I don't have this issue, but you might benefit from stepping through the program and seeing which tensors become NaN. It could help you identify which component is causing this and what may be incompatible with the drivers, etc.
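
For example, a small hook-based helper along these lines (just a sketch I'm suggesting, not code from this repo) can flag the first module whose output goes non-finite:

```python
# Sketch: register forward hooks that report any module whose output contains NaN/Inf.
import torch

def add_nan_hooks(model: torch.nn.Module):
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f"NaN/Inf detected in output of: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Hypothetical usage: call add_nan_hooks(model) on your NeRF model before training starts.
```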

elliottzheng commented 1 year ago

Hi, I have the NaN issue with both the new raymarching (https://github.com/ashawkey/stable-dreamfusion/commit/0198976c8cdaab20a0f7c4cb217f708ee3603c45) and the old raymarching (https://github.com/ashawkey/stable-dreamfusion/commit/d0517e2c0d1f13de9f35472d45d8da98acfa1777). The issue occurs very randomly (the repo itself cannot guarantee determinism with the same seed): with the exact same setup, I tried 8 runs and got 0-3 NaN failures, while the rest were normal.

The issue might be related to mixed-precision training: when I disabled FP16, training was very stable and all 8 runs succeeded, although I got degraded performance (worse shapes). Anyway, that's another issue.

It might be related to this issue: https://github.com/pytorch/pytorch/issues/40497#issuecomment-669011975

If you print out the scale of the GradScaler right before the NaNs occur, you will find it is very small, around 1e-39, and such a small scale causes everything downstream to become NaN.

The reason for the small scale is that whenever the GradScaler finds NaN/Inf in the gradients (which is common with AMP, since FP16 can overflow), it skips optimizer.step and halves the scale (backoff factor of 0.5).

In these NaN cases there are simply too many bad steps, so the scale shrinks too quickly and eventually everything becomes NaN.
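
To illustrate, here is a minimal generic AMP loop (a sketch with a toy model, not this repo's trainer) that logs scaler.get_scale() each step; when many steps are skipped, you can watch the scale shrink toward the ~1e-39 range described above:

```python
# Sketch: log the GradScaler's scale during a generic AMP training loop.
import torch

model = torch.nn.Linear(16, 1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    x = torch.randn(8, 16, device="cuda")
    with torch.cuda.amp.autocast():
        loss = model(x).square().mean()
    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skipped internally if any grad is non-finite
    scaler.update()          # halves the scale after a skipped step (backoff_factor=0.5)
    print(f"step {step}: grad scale = {scaler.get_scale():.3e}")
```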

ashawkey commented 1 year ago

The speed issue can partly be explained: during training, NeRF rendering is not the major bottleneck (the Stable Diffusion denoising step is). I get similar training speed (5 it/s) on a V100 with cuda_ray on or off, but cuda-ray mode should still be faster at pure rendering (5 it/s vs. 1 it/s). Besides, the non-cuda-ray mode currently only samples 64+32 points per ray, which is fewer than cuda-ray mode (at most 1024, on average ~100 points per ray).

ashawkey commented 1 year ago

Sorry, I don't have another type of GPU to test the NaN issue on, but turning off fp16 mode could be helpful. You could also uncomment the anomaly detection line to spot where the first NaN appears.
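
In a generic PyTorch setup that usually means enabling autograd anomaly detection before the training loop (a sketch; the exact line and its location in this repo may differ):

```python
# Sketch: enable autograd anomaly detection so the backward pass raises
# an error at the first op that produces NaN/Inf (slow, debug only).
import torch

torch.autograd.set_detect_anomaly(True)
```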

MathieuTuli commented 1 year ago

Interesting. Note that mixed-precision training isn't compatible with all types of GPUs (I believe you need Pascal or newer?). For sanity, I would check whether your GPU specifically can run mixed-precision training, i.e., whether it supports FP16.
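
A quick way to sanity-check this (just a sketch; the capability thresholds are my assumption: FP16 arithmetic from Pascal / compute capability 6.x, fast Tensor Core mixed precision from Volta / 7.0):

```python
# Sketch: report the GPU's compute capability and rough FP16 support.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
print("fp16 arithmetic (Pascal+):", (major, minor) >= (6, 0))
print("tensor cores for fast mixed precision (Volta+):", (major, minor) >= (7, 0))
```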

marcinplata commented 1 year ago

Interesting. Note that mixed-precision training isn't compatible with all types of GPUs (I believe you need Pascal or newer?). For sanity, I would check whether your GPU specifically can run mixed-precision training, i.e., whether it supports FP16.

@claforte mentioned that the issue occurred on an A100 as well.

MathieuTuli commented 1 year ago

Good point. Were you able to step through the program and pinpoint the source of the NaN?

claforte commented 1 year ago

On my side at least, the NaNs were caused in run_cuda because raymarching.composite_rays_train(sigmas, rgbs, ts, rays, T_thresh) returned an empty weights tensor. I'm pretty sure it's caused by the old NVIDIA 510.x driver installed on that machine. I'm now trying to debug the non-CUDA, FP32 code path, since it doesn't produce the results I'm expecting in my scenario.

marcinplata commented 1 year ago

I also updated the NVIDIA drivers to the latest 525.x.x and the NaN issue is gone. It no longer occurs in either mode: --cuda_ray or PyTorch raymarching.

Moreover, training is now faster with --cuda_ray. I achieve 3.5 it/s (vs. 2.8 it/s with the older drivers) in --cuda_ray mode, which is similar to PyTorch raymarching.