Closed marcinplata closed 1 year ago
I'm facing a similar issue on an A100. The code works fine on my RTX4090, although I set up the environment a few weeks ago. I wonder if there's a recent regression in the dependencies? What version of python are you using?
Sorry, I missed it in the environment description. I am using Python 3.10.
In case it may help, here are the key differences between the two PCs. However, I'm using a heavily-modified fork of Stable-DreamFusion so it's very possible that the problem I'm facing is specific to my code and hyperparams... left is the config that doesn't work, right is the config that works...
I updated the A100's venv to use the exact same versions of Python (3.9.13) and pip packages as my local RTX4090... but unfortunately that didn't resolve the problem. I'm starting to wonder if the problem is specific to the NVIDIA driver (510.73.08)...
Thanks @claforte. I observed that NaNs happen faster (in the first epoch) after the latest updates to raymarching, among others. Before (commit d0517e2c0d1f13de9f35472d45d8da98acfa1777) I was getting the NaNs around the 6th epoch using pretty much the same dependencies (I just installed xformers). Surprisingly, on the A100 I was able to train 3D objects without getting NaNs using the code from d0517e2c0d1f13de9f35472d45d8da98acfa1777 with the latest packages and Lambda Labs' driver configuration.
What NVIDIA driver do you have in each environment? I think older versions might be the root cause. It works fine for me on 525.x but fails on 510.x.
BTW in my fork I'm almost current (i.e. it includes commit https://github.com/ashawkey/stable-dreamfusion/commit/d0517e2c0d1f13de9f35472d45d8da98acfa1777) and it works fine on my local PC. Never gets any NaNs.
I don't have this issue, but you might benefit from stepping through the program and seeing which tensors become/are NAN. It might help debug what component is causing this, and what may be incompatible with drivers etc.
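One low-effort way to do that stepping automatically is to register forward hooks that flag the first module whose output contains NaN. This is a generic sketch, not code from this repo; the helper name `add_nan_hooks` is hypothetical:

```python
import torch

def add_nan_hooks(model):
    # Raise as soon as any submodule produces a NaN output, so the
    # traceback names the offending component directly.
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and torch.isnan(output).any():
            raise RuntimeError(
                f"NaN detected in output of {module.__class__.__name__}")
    for m in model.modules():
        m.register_forward_hook(hook)
```

Calling `add_nan_hooks(model)` once before training is enough; the first NaN-producing submodule then fails loudly instead of silently propagating NaNs into the loss.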
Hi, I have the NaN issue with both the new raymarching (https://github.com/ashawkey/stable-dreamfusion/commit/0198976c8cdaab20a0f7c4cb217f708ee3603c45) and the old raymarching (https://github.com/ashawkey/stable-dreamfusion/commit/d0517e2c0d1f13de9f35472d45d8da98acfa1777). The issue occurs very randomly (the repo itself cannot guarantee determinism with the same seed): with the exact same setup, I tried 8 runs and got anywhere from 0 to 3 NaN failures, while the rest were normal.
The issue might be related to mixed-precision training: when I disabled FP16 I got very stable training, with all 8 runs completing successfully, although with degraded performance (worse shapes); anyway, that's another issue.
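For context, the FP16 toggle usually amounts to wrapping the forward pass in autocast and routing the backward pass through a GradScaler, both of which can be disabled with a single flag. This is a generic AMP sketch, not this repo's actual training code; `train_step` is a hypothetical helper:

```python
import torch

def train_step(model, optimizer, scaler, x, y, use_fp16):
    # With use_fp16=False, autocast is disabled and the (disabled)
    # GradScaler passes the loss and gradients through unscaled,
    # i.e. plain FP32 training.
    optimizer.zero_grad()
    device_type = "cuda" if x.is_cuda else "cpu"
    with torch.autocast(device_type=device_type, enabled=use_fp16):
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

The scaler would be constructed to match, e.g. `scaler = torch.cuda.amp.GradScaler(enabled=use_fp16)`, so flipping one flag switches between AMP and full FP32.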
It might be related to this issue: https://github.com/pytorch/pytorch/issues/40497#issuecomment-669011975
If you print the scale of the GradScaler right before the NaNs occur, you will find it is very small (around 1e-39), and such a tiny scale causes everything to become NaN.
The reason the scale gets so small is that whenever the GradScaler finds NaN in the gradients (which is common under AMP, since fp16 can overflow), it skips optimizer.step and halves the scale (backoff factor 0.5).
In these NaN cases there are simply too many NaN gradients, so the scale shrinks too quickly and eventually everything becomes NaN.
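To make the backoff arithmetic concrete, here is a minimal pure-Python simulation (assuming GradScaler's default init_scale of 65536 and backoff_factor of 0.5) of how many consecutive skipped steps it takes for the scale to collapse below 1e-39:

```python
# Simulate GradScaler's backoff: every step whose gradients contain
# NaN/inf is skipped and the loss scale is multiplied by backoff_factor.
init_scale = 2.0 ** 16      # GradScaler default (65536)
backoff_factor = 0.5        # GradScaler default

scale = init_scale
skipped_steps = 0
while scale > 1e-39:
    scale *= backoff_factor
    skipped_steps += 1

print(skipped_steps)  # -> 146 consecutive bad steps collapse the scale
```

So roughly 146 consecutive NaN steps are enough to drive the scale into denormal territory, at which point the scaled loss itself underflows and everything downstream becomes NaN.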
The speed issue can be partly explained: during training, NeRF rendering is not the major speed bottleneck (the stable-diffusion denoising step is). I get similar training speed (~5 it/s) on a V100 with cuda_ray on or off, but the cuda-ray mode should be faster at rendering alone (5 it/s vs. 1 it/s). Besides, the non-cuda-ray mode currently samples only 64+32 points per ray, which is relatively few compared to the cuda-ray mode (at most 1024, about 100 points per ray on average).
Sorry that I don't have another type of GPU to test the NaN issue on, but turning off fp16 mode could be helpful. You could also uncomment the anomaly detection line to spot where the first NaN appears.
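For anyone unfamiliar, the anomaly detection referred to is torch.autograd.set_detect_anomaly; a minimal standalone example of how it surfaces the first NaN-producing op in the backward pass:

```python
import torch

# With anomaly detection enabled, the backward pass raises a
# RuntimeError naming the forward op whose gradient became NaN,
# instead of silently propagating NaNs into the optimizer step.
torch.autograd.set_detect_anomaly(True)

x = torch.tensor([-1.0], requires_grad=True)
y = torch.sqrt(x)        # sqrt of a negative -> NaN
try:
    y.backward()         # SqrtBackward produces NaN -> RuntimeError
except RuntimeError as e:
    print("anomaly detected:", e)
```

Note that anomaly detection slows training down considerably, so it is best enabled only while hunting the bug.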
Interesting, note that mixed precision training isn't compatible with all types of gpus (I believe you need Pascal or newer?). For sanity, I would check whether your GPU specifically can run mixed precision training, or in other words, support fp16
@claforte mentioned that the issue occurred on an A100 as well.
Good point. Were you able to step through the program and pinpoint the source of the NaNs?
On my side at least, NaNs were caused in run_cuda because raymarching.composite_rays_train(sigmas, rgbs, ts, rays, T_thresh) returned an empty weights tensor. I'm pretty sure it's caused by the old NVIDIA 510.x driver installed on that machine. I'm now trying to debug the non-CUDA, FP32 code path, since it doesn't produce the results I'm expecting in my scenario.
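Given that failure mode, a small defensive check after the compositing call could fail fast with a readable error instead of letting the bad tensor propagate. The function name in the tag comes from the comment above; the check itself is generic and the helper `check_weights` is hypothetical:

```python
import torch

def check_weights(weights, tag="composite_rays_train"):
    # Fail immediately rather than letting an empty or NaN weights
    # tensor poison the loss several steps later.
    if weights.numel() == 0:
        raise RuntimeError(f"{tag}: returned an empty weights tensor")
    if torch.isnan(weights).any():
        raise RuntimeError(f"{tag}: NaN in weights")
    return weights
```

Dropping `weights = check_weights(weights)` right after the raymarching call would have turned the silent empty-tensor case into an immediate, driver-attributable error.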
I also updated the NVIDIA drivers to the latest 525.x.x and the NaN issue is gone. It no longer occurs in either mode: --cuda_ray or PyTorch raymarching. Moreover, training is now faster with --cuda_ray: I achieve 3.5 it/s (vs. 2.8 it/s on the older drivers), which is similar to PyTorch raymarching.
Description
Hi, I am using an RTX 2080 16 GB (laptop version), and while generating a 3D object I get NaNs pretty quickly (during the first epoch). Moreover, training is slower with --cuda_ray: around 2.8 it/s, versus around 3.1 it/s using PyTorch raymarching. I installed everything following the description in the readme.md and had no issues.
Steps to Reproduce
Execute the script: python main.py --text "a beef hamburger on a ceramic plate" --workspace trial -O. Then NaNs appear in the console output.
After changing only line number 92 in main.py to opt.cuda_ray = False, the console output is normal and I am able to generate a nice 3D object.
Expected Behavior
No NaNs during training, and faster training with --cuda_ray.
Environment
Ubuntu 20.04, conda environment, Python 3.10, PyTorch 1.13.1, CUDA 11.7.1, cudnn 8.5.0