bennyguo / instant-nsr-pl

Neural Surface reconstruction based on Instant-NGP. Efficient and customizable boilerplate for your research projects. Train NeuS in 10min!
MIT License

ZeroDivisionError: division by zero error #45

Open kwin120 opened 1 year ago

kwin120 commented 1 year ago

Hi, I ran both NeuS and NeRF, and I got the same ZeroDivisionError in systems\neus.py and systems\nerf.py. Here's the cmd output from the NeuS run:

Global seed set to 42
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type      | Params
----------------------------------------
0 | model | NeuSModel | 12.6 M
----------------------------------------
12.6 M    Trainable params
0         Non-trainable params
12.6 M    Total params
25.221    Total estimated model params size (MB)

Traceback (most recent call last):
  File "G:\GitHub\instant-nsr-pl\launch.py", line 123, in <module>
    main()
  File "G:\GitHub\instant-nsr-pl\launch.py", line 112, in main
    trainer.fit(system, datamodule=dm)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1166, in _run
    results = self._run_stage()
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\loops\loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\loops\loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py", line 203, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\loops\loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py", line 87, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\loops\loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py", line 201, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py", line 248, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py", line 358, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1550, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\core\module.py", line 1705, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\core\optimizer.py", line 168, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\strategies\strategy.py", line 216, in optimizer_step
    return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\plugins\precision\native_amp.py", line 85, in optimizer_step
    closure_result = closure()
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py", line 146, in __call__
    self._result = self.closure(*args, **kwargs)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py", line 132, in closure
    step_output = self._step_fn()
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\loops\optimization\optimizer_loop.py", line 407, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1704, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\strategies\dp.py", line 134, in training_step
    return self.model(*args, **kwargs)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\parallel\data_parallel.py", line 169, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\overrides\data_parallel.py", line 65, in forward
    output = super().forward(*inputs, **kwargs)
  File "C:\Users\halbe\AppData\Local\Programs\Python\Python310\lib\site-packages\pytorch_lightning\overrides\base.py", line 79, in forward
    output = self.module.training_step(*inputs, **kwargs)
  File "G:\GitHub\instant-nsr-pl\systems\neus.py", line 86, in training_step
    train_num_rays = int(self.train_num_rays * (self.train_num_samples / out['num_samples'].sum().item()))
ZeroDivisionError: division by zero
Epoch 0: : 0it [01:22, ?it/s]
[W ..\torch\csrc\CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: invalid device ordinal (function uncheckedSetDevice)
[... the last two warnings repeat many more times while the process shuts down ...]

bennyguo commented 1 year ago

Hi! Which CUDA, PyTorch, and tinycudann versions are you using? Could you replace all the FullyFusedMLP entries in the config file with VanillaMLP and try again?

kwin120 commented 1 year ago

PyTorch: 1.13.1+cu117, CUDA: 11.7, tinycudann: 1.7. I've replaced FullyFusedMLP with VanillaMLP in both neus-blender.yaml and nerf-blender.yaml. I still got the same error.

MvWouden commented 1 year ago

I'm having the exact same issues when running both NeRF and NeuS. I'll let you know if I find something.

I ran the commands:

# For NeRF
python launch.py --config configs/nerf-blender.yaml --gpu 0 --train dataset.scene=lego tag=lego-nerf

# For NeuS
python launch.py --config configs/neus-blender.yaml --gpu 0 --train dataset.scene=lego tag=lego-neus system.loss.lambda_mask=0.0

I'm thinking this may have to do with either the configuration file or the nerf_synthetic dataset. Will try to find the answer.

bennyguo commented 1 year ago

@MvWouden Which OS are you on? Linux or Windows?

MvWouden commented 1 year ago

@MvWouden Which OS are you on? Linux or Windows?

Currently running on Windows, unfortunately. I've tried replacing the forward slash with a backslash in the config path; that didn't seem to help, so I've ruled that one out. I'll now try debugging the out variable to see if it's missing 'num_samples'.

I do think the issue may be Windows-specific.

Update:

The out variable when the error occurs is:

{
    'comp_rgb': tensor([...], device='cuda:0'),
    'opacity': tensor([...], device='cuda:0'),
    'depth': tensor([...], device='cuda:0'),
    'rays_valid': tensor([...], device='cuda:0'),
    'num_samples': tensor([0], device='cuda:0', dtype=torch.int32),
    'weights': tensor([], device='cuda:0', grad_fn=<ViewBackward0>),
    'points': tensor([], device='cuda:0'),
    'intervals': tensor([], device='cuda:0'),
    'ray_indices': tensor([], device='cuda:0', dtype=torch.int64)
}

Note that I've not included the content of all tensors to reduce verbosity.

We can see here that out['num_samples'] is tensor([0], device='cuda:0', dtype=torch.int32). This leads to a division by zero error in the line: train_num_rays = int(self.train_num_rays * (self.train_num_samples / out['num_samples'].sum().item()))
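For reference, here's a minimal, self-contained sketch of the failing arithmetic and one possible guard. The ray counts are made up for illustration, and the guard is only a sketch of how the crash could be avoided, not the repository's code:

import torch

# Reproduce the failing arithmetic: num_samples sums to 0, so the division raises.
num_samples = torch.tensor([0], dtype=torch.int32).sum().item()

train_num_rays, train_num_samples = 256, 65536  # illustrative values only
if num_samples > 0:
    train_num_rays = int(train_num_rays * (train_num_samples / num_samples))
else:
    # Guard: keep the current ray budget instead of raising ZeroDivisionError.
    print("ray marching returned no samples; keeping train_num_rays =", train_num_rays)

Of course this would only hide the symptom; the real question is why ray marching produces zero samples in the first place.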

bennyguo commented 1 year ago

I think it could have something to do with Windows. Let me try on my PC and I'll let you guys know how it turns out.

MvWouden commented 1 year ago

I think it could have something to do with Windows. Let me try on my PC and I'll let you guys know how it turns out.

Thanks! Please see my updated previous comment for more info.

MvWouden commented 1 year ago

More info (printed variables):

batch_idx=0
out["num_samples"].sum().item()=0
batch={'rays': tensor([[0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        ...,
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.]], device='cuda:0'), 'rgb': tensor([[0.7296, 0.1194, 0.0511],
        [0.7296, 0.1194, 0.0511],
        [0.7296, 0.1194, 0.0511],
        ...,
        [0.7296, 0.1194, 0.0511],
        [0.7296, 0.1194, 0.0511],
        [0.7296, 0.1194, 0.0511]], device='cuda:0'), 'fg_mask': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        ...,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       device='cuda:0')}

I also see that rays_o and rays_d before ray marching are zero-filled tensors. This causes ray_indices to be [], which I think leads to the problem we're having.
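A quick sanity check along these lines is easy to drop in (just a sketch, not code from the repo; it assumes the batch stores ray origins and directions concatenated in an (N, 6) tensor, as the dump above suggests):

import torch

# Hypothetical debugging helper: fail fast if the rays handed to the ray marcher
# are all zeros, which is exactly what we observe on Windows.
def check_rays(batch: dict) -> None:
    rays = batch['rays']                       # assumed shape (N, 6)
    rays_o, rays_d = rays[:, :3], rays[:, 3:]
    if rays_o.abs().sum() == 0 and rays_d.abs().sum() == 0:
        raise RuntimeError("rays are all zeros; the GPU-resident dataset tensors look clobbered")

# This raises, mirroring the zero-filled batch printed above:
check_rays({'rays': torch.zeros(8, 6)})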

Just to confirm I have the right data setup: I have the folder load/nerf_synthetic/<scene>, where each scene contains the folders test, train, val and the files transforms_test.json, transforms_train.json, transforms_val.json.

MvWouden commented 1 year ago

I think it could have something to do with Windows. Let me try on my PC and I'll let you guys know how it turns out.

If there's anything I can do to help, please let me know! I'll be happy to test things out on my system.

bennyguo commented 1 year ago

@MvWouden Which PyTorch and CUDA version are you using? I want to build the same environment. Thanks!

MvWouden commented 1 year ago

@MvWouden Which PyTorch and CUDA version are you using? I want to build the same environment. Thanks!

python: 3.10.5, cuda: 11.7, torch: 1.13.1+cu117, tinycudann: 1.7

This seems to be the same as @kwin120.

It may also be worth mentioning that I've installed the tiny-cuda-nn PyTorch extension from a local clone, following their instructions:

tiny-cuda-nn$ cd bindings/torch
tiny-cuda-nn/bindings/torch$ python setup.py install

While installing the tiny-cuda-nn PyTorch extension, there were some issues I had to resolve first, which is why I installed from a local clone: the "simple" pip install may not work out-of-the-box. If you encounter any issues during installation, please let me know; I may already know how to fix them.

I appreciate your help!

bennyguo commented 1 year ago

I just did a little digging and found that PyTorch-Lightning may have messed up the data I transferred to GPU when initializing the dataset (variables below):

https://github.com/bennyguo/instant-nsr-pl/blob/2d8970ddf2cf405e99d09652560ed62e4d1aa7a5/datasets/blender.py#L68-L70

I don't know how this happened, and it seems one has to dig very deep into the pl source code to find out why (and it appears to be platform-related based on current observations).

The most intuitive approach would be to leave the data on the CPU and transfer the processed data to GPU before each iteration starts. I'm testing whether this would affect training efficiency.
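Roughly, what I have in mind looks like this (a sketch with placeholder names and shapes, not the actual patch):

import torch

# Keep the full ray/color buffers on the CPU when the dataset is set up...
all_rays = torch.zeros(800 * 800, 6)   # placeholder data, stays on the CPU
all_rgb = torch.zeros(800 * 800, 3)

def training_batch(indices: torch.Tensor, device: str = 'cuda'):
    # ...and move only the rays sampled for this iteration to the GPU.
    return {
        'rays': all_rays[indices].to(device),
        'rgb': all_rgb[indices].to(device),
    }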

MvWouden commented 1 year ago

I just did a little digging and found that PyTorch-Lightning may have messed up the data I transferred to GPU when initializing the dataset (variables below):

https://github.com/bennyguo/instant-nsr-pl/blob/2d8970ddf2cf405e99d09652560ed62e4d1aa7a5/datasets/blender.py#L68-L70

I don't know how this happened, and it seems one has to dig very deep into the pl source code to find out why (and it appears to be platform-related based on current observations).

The most intuitive approach would be to leave the data on the CPU and transfer the processed data to GPU before each iteration starts. I'm testing whether this would affect training efficiency.

Thanks so much for the investigation, keep me posted! It's a rather strange issue.

bennyguo commented 1 year ago

I just pushed a new branch to fix this issue. It seems to be ~20% slower if we do it this way (additional data transfer time). I'll leave this issue open while thinking about better solutions.
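One thing I may try to win some of that back (just an idea at this point, not what the new branch does) is pinning the CPU-side buffers and using non-blocking copies, so the host-to-device transfer can overlap with compute:

import torch

# Sketch: pinned host memory plus non_blocking=True makes the copy asynchronous.
rays_cpu = torch.zeros(8192, 6).pin_memory()
rays_gpu = rays_cpu.to('cuda', non_blocking=True)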

#43 is also related.

MvWouden commented 1 year ago

I just pushed a new branch to fix this issue. It seems to be ~20% slower if we do it this way (additional data transfer time). I'll leave this issue open while thinking about better solutions.

#43 is also related.

I can confirm that this fix does resolve the issues I was having. Thanks so much for taking the time to have a look!

kwin120 commented 1 year ago

I just pushed a new branch to fix this issue. It seems to be ~20% slower if we do it this way (additional data transfer time). I'll leave this issue open while thinking about better solutions.

#43 is also related.

The issue is solved on my end as well. Thank you so much!