NVlabs / nvdiffrec

Official code for the CVPR 2022 (oral) paper "Extracting Triangular 3D Models, Materials, and Lighting From Images".

Why does reg_loss become NaN on some datasets? #79

Open zubeyirgenc opened 2 years ago

zubeyirgenc commented 2 years ago

When I run the repo on bob.json with the default parameters in the bob.json file, it runs for about 30 steps but then prints reg_loss=nan. When I retry with a bigger batch size and multiple GPUs, it finishes successfully and exports the obj files.

But when I try the NeRF dataset, reg_loss always becomes nan after the 10th iteration. I thought it was caused by the small batch size, so I increased the batch size and the number of GPUs, but no parameter set runs successfully (using 4 V100 GPUs with 16 GB memory each).

The error log is:

Loading extension module renderutils_plugin...
iter=    0, img_loss=0.393313, reg_loss=nan, lr=0.02999, time=195.8 ms, rem=16.32 m
Traceback (most recent call last):
  File "/home/Desktop/nvdiffrec/train.py", line 594, in <module>
    geometry, mat = optimize_mesh(glctx, geometry, mat, lgt, dataset_train, dataset_validate, 
  File "/home/Desktop/nvdiffrec/train.py", line 415, in optimize_mesh
    img_loss, reg_loss = trainer(target, it)
  File "/root/miniconda3/envs/dmodel/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/Desktop/nvdiffrec/train.py", line 299, in forward
    return self.geometry.tick(glctx, target, self.light, self.material, self.image_loss_fn, it)
  File "/home/Desktop/nvdiffrec/geometry/dmtet.py", line 219, in tick
    buffers = self.render(glctx, target, lgt, opt_material)
  File "/home/Desktop/nvdiffrec/geometry/dmtet.py", line 210, in render
    return render.render_mesh(glctx, opt_mesh, target['mvp'], target['campos'], lgt, target['resolution'], spp=target['spp'], 
  File "/home/Desktop/nvdiffrec/render/render.py", line 214, in render_mesh
    assert mesh.t_pos_idx.shape[0] > 0, "Got empty training triangle mesh (unrecoverable discontinuity)"
AssertionError: Got empty training triangle mesh (unrecoverable discontinuity)
emperor1412 commented 2 years ago

I got the same issue running on Colab with a single Tesla P100 GPU (16 GB VRAM).


JHnvidia commented 1 year ago

Hi,

Unfortunately I cannot recreate this error locally. The error means that the training/optimization diverges: once optimization removes all geometry from the scene, you get stuck in a non-recoverable state where no geometry can be created. Typically, this happens when you get NaNs during optimization.

Since it's not happening for me, I suspect some conflict with the 3rd party packages, so I would recommend creating a new anaconda container and doing the installation steps again. If you want to debug, you can uncomment the line at: https://github.com/NVlabs/nvdiffrec/blob/main/train.py#L42, which enables NaN tracking in PyTorch.
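
For reference, here is a minimal sketch of what that switch does, assuming the commented-out line in train.py is PyTorch's standard anomaly-detection call (check the file itself to confirm):

import torch

# Enable PyTorch's anomaly mode: every backward op is checked for NaN
# outputs, and the forward traceback of the offending op is printed.
torch.autograd.set_detect_anomaly(True)

# Alternatively, scope it to a single backward pass:
# with torch.autograd.detect_anomaly():
#     total_loss.backward()

Anomaly mode slows training down noticeably, so it is only meant for debugging runs.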

zubeyirgenc commented 1 year ago

Thanks for your reply and advice.

> Unfortunately I cannot recreate this error locally. The error means that the training/optimization diverges: once optimization removes all geometry from the scene, you get stuck in a non-recoverable state where no geometry can be created. Typically, this happens when you get NaNs during optimization.

This came to mind when I saw the error message, so I tried new parameter sets with a reduced learning rate, in case the optimization converges too quickly. But the different learning rates did not run successfully either.

> Since it's not happening for me, I suspect some conflict with the 3rd party packages, so I would recommend creating a new anaconda container and doing the installation steps again. If you want to debug, you can uncomment the line at: https://github.com/NVlabs/nvdiffrec/blob/main/train.py#L42, which enables NaN tracking in PyTorch.

When I uncomment the mentioned line, it generates a more detailed error, but I cannot solve that error either.

DatasetNERF: 200 images with shape [800, 800]
Encoder output: 32 dims
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu116/renderutils_plugin/build.ninja...
Building extension module renderutils_plugin...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module renderutils_plugin...
iter=    0, img_loss=0.397599, reg_loss=0.333960, lr=0.02999, time=364.9 ms, rem=30.41 m
/root/miniconda3/envs/dmodel/lib/python3.9/site-packages/torch/autograd/__init__.py:173: UserWarning: Error detected in MulBackward0. Traceback of forward call that caused the error:
  File "/home/Desktop/nvdiffrec/train.py", line 594, in <module>
    geometry, mat = optimize_mesh(glctx, geometry, mat, lgt, dataset_train, dataset_validate,
  File "/home/Desktop/nvdiffrec/train.py", line 415, in optimize_mesh
    img_loss, reg_loss = trainer(target, it)
  File "/root/miniconda3/envs/dmodel/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/Desktop/nvdiffrec/train.py", line 299, in forward
    return self.geometry.tick(glctx, target, self.light, self.material, self.image_loss_fn, it)
  File "/home/Desktop/nvdiffrec/geometry/dmtet.py", line 236, in tick
    reg_loss += torch.mean(buffers['kd_grad'][..., :-1] * buffers['kd_grad'][..., -1:]) * 0.03 * min(1.0, iteration / 500)
 (Triggered internally at  /opt/conda/conda-bld/pytorch_1659484806139/work/torch/csrc/autograd/python_anomaly_mode.cpp:102.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "/home/Desktop/nvdiffrec/train.py", line 594, in <module>
    geometry, mat = optimize_mesh(glctx, geometry, mat, lgt, dataset_train, dataset_validate, 
  File "/home/Desktop/nvdiffrec/train.py", line 428, in optimize_mesh
    total_loss.backward()
  File "/root/miniconda3/envs/dmodel/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/miniconda3/envs/dmodel/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'MulBackward0' returned nan values in its 1th output.

If you want to recreate the error on your local machine, maybe our Docker image can help: docker pull zubeyirgenc/nerf:v2.1. After pulling the image, the repo directory is /home/Desktop/nvdiffrec.

JHnvidia commented 1 year ago

Hi,

We've tried a bit more to recreate the error, but are unable to. The error is quite strange, as the kd_grad regularizer just directly takes the texture albedo. My best guess is that tinycudann for some reason returns ±inf values, causing NaNs in the mul operator.
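
To illustrate that guess, here is a hand-made toy example (not nvdiffrec or tinycudann code, just hypothetical tensors): a single inf in a forward tensor turns into a NaN after a multiply by zero, and the next mul in the graph then reports NaN gradients under anomaly mode, matching the MulBackward0 message above.

import torch

torch.autograd.set_detect_anomaly(True)

# Stand-in for a network output that overflowed to +inf.
overflowed = torch.tensor([0.5, float("inf")])
alpha = torch.tensor([1.0, 0.0])            # e.g. a coverage/alpha channel

kd_like = overflowed * alpha                # inf * 0 -> nan already in the forward pass
weights = torch.ones(2, requires_grad=True)
loss = (kd_like * weights).mean()           # this mul's backward multiplies grad_out by kd_like
loss.backward()                             # RuntimeError: Function 'MulBackward0' returned nan values ...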

If you want to spend further time debugging, I would suggest spamming in some assert torch.all(torch.isfinite(x)) tests, particularly checking whether there are infs in kd_grad: https://github.com/NVlabs/nvdiffrec/blob/main/render/render.py#L50 or in the alpha mask: https://github.com/NVlabs/nvdiffrec/blob/main/render/render.py#L101
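
A minimal sketch of such a check (check_finite is a hypothetical helper, not something that exists in the repo; the lines linked above are the natural places to call it):

import torch

def check_finite(name, x):
    # Abort as soon as a tensor contains inf or NaN, naming the tensor so the
    # first bad quantity (kd_grad, the alpha mask, ...) is easy to identify.
    assert torch.all(torch.isfinite(x)), f"non-finite values in {name}"
    return x

Dropping e.g. kd_grad = check_finite('kd_grad', kd_grad) right after kd_grad is computed would show whether the bad values originate there or further upstream.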