bennyguo / instant-nsr-pl

Neural Surface reconstruction based on Instant-NGP. Efficient and customizable boilerplate for your research projects. Train NeuS in 10min!
MIT License

Windows: Example code doesn't run #90

Open antithing opened 1 year ago

antithing commented 1 year ago

Hi, thank you for making this available!

I am running on Windows, and when running the example:

python launch.py --config configs/neus-dtu.yaml --gpu 0 --train

I see the following error:


  File "D:\NERF\NEUS\instant-nsr-pl\systems\neus.py", line 95, in training_step
    train_num_rays = int(self.train_num_rays * (self.train_num_samples / out['num_samples_full'].sum().item()))
ZeroDivisionError: division by zero

The full output is below. What can I do to resolve this?

Thank you!


Global seed set to 42
Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type      | Params
------------------------------------
0 | model | NeuSModel | 28.0 M
------------------------------------
28.0 M    Trainable params
0         Non-trainable params
28.0 M    Total params
55.913    Total estimated model params size (MB)
Traceback (most recent call last):
  File "D:\NERF\NEUS\instant-nsr-pl\launch.py", line 128, in <module>
    main()
  File "D:\NERF\NEUS\instant-nsr-pl\launch.py", line 117, in main
    trainer.fit(system, datamodule=dm)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\trainer\call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1112, in _run
    results = self._run_stage()
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1191, in _run_stage
    self._run_train()
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1214, in _run_train
    self.fit_loop.run()
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\loops\loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\loops\loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py", line 213, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\loops\loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\loops\loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py", line 202, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py", line 249, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py", line 370, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1356, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\core\module.py", line 1754, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\core\optimizer.py", line 169, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\strategies\strategy.py", line 234, in optimizer_step
    return self.precision_plugin.optimizer_step(
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\plugins\precision\native_amp.py", line 75, in optimizer_step
    closure_result = closure()
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py", line 149, in __call__
    self._result = self.closure(*args, **kwargs)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py", line 135, in closure
    step_output = self._step_fn()
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py", line 419, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1494, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\strategies\dp.py", line 134, in training_step
    return self.model(*args, **kwargs)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\parallel\data_parallel.py", line 169, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\overrides\data_parallel.py", line 77, in forward
    output = super().forward(*inputs, **kwargs)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\overrides\base.py", line 98, in forward
    output = self._forward_module.training_step(*inputs, **kwargs)
  File "D:\NERF\NEUS\instant-nsr-pl\systems\neus.py", line 95, in training_step
    train_num_rays = int(self.train_num_rays * (self.train_num_samples / out['num_samples_full'].sum().item()))
ZeroDivisionError: division by zero
Epoch 0: : 0it [01:12, ?it/s]
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
antithing commented 1 year ago

I have edited neus.py to add a check:

        # update train_num_rays
        # skip the update entirely when the renderer returned no samples,
        # which avoids the division by zero above
        if out['num_samples_full'].sum().item() > 0:
            if self.config.model.dynamic_ray_sampling:
                train_num_rays = int(self.train_num_rays * (self.train_num_samples / out['num_samples_full'].sum().item()))
                self.train_num_rays = min(int(self.train_num_rays * 0.9 + train_num_rays * 0.1), self.config.model.max_train_num_rays)
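
For context, a minimal standalone illustration of why the unguarded update fails (the ray/sample counts below are made up, not the repository's defaults):

import torch

# When the occupancy grid culls every ray, the per-ray sample counts sum to
# zero and the dynamic ray-count update divides by zero.
num_samples_full = torch.zeros(256, dtype=torch.int64)  # zero samples kept for every ray
train_num_rays, train_num_samples = 256, 65536

try:
    train_num_rays = int(train_num_rays * (train_num_samples / num_samples_full.sum().item()))
except ZeroDivisionError as e:
    print(f"ZeroDivisionError: {e}")  # the failure seen in training_step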

And training is now working! However, it crashes while trying to create the mesh:

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing DataLoader 0: 100%|████████████████████████████████████████████████████████████| 60/60 [07:52<00:00,  7.88s/it]
Traceback (most recent call last):
  File "D:\NERF\NEUS\instant-nsr-pl\launch.py", line 128, in <module>
    main()
  File "D:\NERF\NEUS\instant-nsr-pl\launch.py", line 118, in main
    trainer.test(system, datamodule=dm)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 794, in test
    return call._call_and_handle_interrupt(
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\trainer\call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 842, in _test_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1112, in _run
    results = self._run_stage()
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1188, in _run_stage
    return self._run_evaluate()
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1228, in _run_evaluate
    eval_loop_results = self._evaluation_loop.run()
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\loops\loop.py", line 206, in run
    output = self.on_run_end()
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 180, in on_run_end
    self._evaluation_epoch_end(self._outputs)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 288, in _evaluation_epoch_end
    self.trainer._call_lightning_module_hook(hook_name, output_or_outputs)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1356, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "D:\NERF\NEUS\instant-nsr-pl\systems\neus.py", line 259, in test_epoch_end
    self.export()
  File "D:\NERF\NEUS\instant-nsr-pl\systems\neus.py", line 262, in export
    mesh = self.model.export(self.config.export)
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "D:\NERF\NEUS\instant-nsr-pl\models\neus.py", line 315, in export
    mesh = self.isosurface()
  File "D:\NERF\NEUS\instant-nsr-pl\models\neus.py", line 114, in isosurface
    mesh = self.geometry.isosurface()
  File "C:\Users\B\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "D:\NERF\NEUS\instant-nsr-pl\models\geometry.py", line 108, in isosurface
    vmin, vmax = mesh_coarse['v_pos'].amin(dim=0), mesh_coarse['v_pos'].amax(dim=0)
IndexError: amin(): Expected reduction dim 0 to have non-zero size.
Testing DataLoader 0: 100%|██████████| 60/60 [08:27<00:00,  8.45s/it]
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
bennyguo commented 1 year ago

Hi! Which scene are you working on? Does the rendered video look correct? It seems that the scene is empty.
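
The IndexError itself comes from taking amin/amax over an empty vertex tensor: mesh_coarse['v_pos'] having zero rows typically means the coarse isosurface extraction found no surface, which is consistent with an empty scene. A minimal guard along these lines (a sketch with a hypothetical helper name, not the repository's code) would make that failure explicit instead of raising inside the reduction:

import torch

def safe_vertex_bounds(v_pos: torch.Tensor):
    # Hypothetical helper: return (vmin, vmax) over an (N, 3) vertex tensor,
    # or None when the tensor is empty. An empty mesh_coarse['v_pos'] means
    # the coarse isosurface extraction found no geometry, so the bounding-box
    # refinement should be skipped (or an informative error raised) rather
    # than calling amin/amax on a zero-size dimension.
    if v_pos.numel() == 0:
        return None
    return v_pos.amin(dim=0), v_pos.amax(dim=0)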

antithing commented 1 year ago

Hi, thanks for getting back to me. I have tried a few different DTU datasets. I am on Windows, could this be related to the fix in the fix-data-win branch?

Thanks!

bennyguo commented 1 year ago

Hi! This could be related to https://github.com/bennyguo/instant-nsr-pl/issues/45. You may try running the fix-data-win branch and see whether you have any luck. Note that this branch has not been updated for a very long time, so if you can successfully train with it, I'll try to update it to the latest HEAD.

antithing commented 1 year ago

Hi @bennyguo, I can confirm that the fix-data-win branch runs perfectly (I needed to downgrade the PyTorch Lightning version); I can train on the nerf-synthetic dataset and export a model. If you can add this fix to the latest code, that would be amazing! Thank you again.

Idonotno commented 5 months ago

Hello! I still get an error; may I ask why?