autonomousvision / monosdf

[NeurIPS'22] MonoSDF: Exploring Monocular Geometric Cues for Neural Implicit Surface Reconstruction

_hash_encoder Error while trying to train #96

Open HamzaOuajhain opened 5 months ago

HamzaOuajhain commented 5 months ago

First of all, thank you for your work. I am trying to run nicer-slam, which uses the monosdf hash-encoder implementation, and I am having difficulty doing so.

I get this error:

[screenshot of the error message]

This is the call stack:

[screenshot of the call stack]

And this is my gcc / CUDA / cuDNN configuration:

[screenshot of the version information]

I have tried the suggestion from Issue #19, but with no success sadly.

Would you be able to help?

niujinshuchong commented 5 months ago

Hi, it seems like the 1070 is not on the list here: https://github.com/cvg/nicer-slam/blob/main/code/hashencoder/backend.py#L10-L26. Maybe you need to add it there.
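For reference, the lines being pointed to are the CUDA architecture flags passed to PyTorch's JIT extension builder. Below is a minimal sketch of what adding the GTX 1070 (compute capability 6.1) might look like; the source file names and the surrounding flags are placeholders, not the actual contents of nicer-slam's backend.py.

```python
# Minimal sketch, assuming the hash encoder is JIT-built with
# torch.utils.cpp_extension.load as in backend.py. Source names and the
# existing flag list are placeholders; the sm_61 line is the relevant addition.
from torch.utils.cpp_extension import load

_backend = load(
    name='_hash_encoder',
    sources=['hashencoder.cu', 'bindings.cpp'],    # hypothetical source files
    extra_cuda_cflags=[
        '-O3',
        # ... existing -gencode entries for other GPUs ...
        '-gencode=arch=compute_61,code=sm_61',     # GTX 1070 (Pascal, sm_61)
    ],
)
```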

HamzaOuajhain commented 5 months ago

Thank you for your reply, I managed to fix that problem, but when I try to train the network I get an out-of-memory error. The README mentions that we should lower the batch size, since I only have an 8 GB GPU.

There is batch_size=1 in eval_rendering, but I doubt that is what is meant.

I also tried changing 'batch_size = ground_truth["rgb"].shape[0]' in line 285 of volsdf_train.py, but without success. This is the full error:

python training/exp_runner.py --conf confs/runconf_demo_1.conf
shell command : training/exp_runner.py --conf confs/runconf_demo_1.conf
Loading data ...
Finish loading data.
build_directory ./tmp_build_1070/
Detected CUDA files, patching ldflags
Emitting ninja build file ./tmp_build_1070/build.ninja...
Building extension module _hash_encoder_1070...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module _hash_encoder_1070...
running...
  0%|          | 0/200 [00:00<?, ?it/s]
/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/autograd/__init__.py:173: UserWarning: Error detected in _hash_encodeBackward. Traceback of forward call that caused the error:
  File "training/exp_runner.py", line 54, in <module>
    trainrunner.run()
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/training/volsdf_train.py", line 558, in run
    model_outputs = self.model(
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/model/network.py", line 129, in forward
    rgb_flat = self.rendering_network(
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/model/base_networks.py", line 336, in forward
    grid_feature = self.encoding(points / self.divide_factor)
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/hashencoder/hashgrid.py", line 210, in forward
    outputs = hash_encode(inputs, self.embeddings, self.offsets, self.per_level_scale, self.base_resolution, inputs.requires_grad)
(Triggered internally at /opt/conda/conda-bld/pytorch_1646755903507/work/torch/csrc/autograd/python_anomaly_mode.cpp:104.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  0%|          | 0/200 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "training/exp_runner.py", line 54, in <module>
    trainrunner.run()
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/training/volsdf_train.py", line 577, in run
    loss.backward()
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/home/aspegique/anaconda3/envs/nicer-slam/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 135, in decorate_bwd
    return bwd(*args, **kwargs)
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/hashencoder/hashgrid.py", line 64, in backward
    grad_inputs, grad_embeddings = _hash_encode_second_backward.apply(grad, inputs, embeddings, offsets, B, D, C, L, S, H, calc_grad_inputs, dy_dx)
  File "/home/aspegique/Desktop/repos/nicer-slam/code/../code/hashencoder/hashgrid.py", line 85, in forward
    grad_embeddings = torch.zeros_like(embeddings)
RuntimeError: CUDA out of memory. Tried to allocate 1016.00 MiB (GPU 0; 7.91 GiB total capacity; 4.51 GiB already allocated; 1000.12 MiB free; 5.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
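Not an authoritative fix, but the last line of the error itself suggests one thing to try on an 8 GB card: limiting the allocator's split size to reduce fragmentation. Below is a minimal sketch, assuming the variable is set before torch allocates any CUDA memory (for example at the very top of training/exp_runner.py). It does not shrink the memory actually needed by the hash-encoder backward pass, so lowering the number of rays sampled per iteration in the config may still be necessary.

```python
# Sketch of the workaround hinted at in the error message: configure the CUDA
# caching allocator before any GPU allocation happens.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after the env var so the caching allocator picks it up
```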

doddodod commented 2 months ago

Similar problem here:

/root/miniconda3/envs/monosdf/lib/python3.8/site-packages/torch/include/ATen/ATen.h:4:2: error: #error C++17 or later compatible compiler is required to use ATen.
    4 | #error C++17 or later compatible compiler is required to use ATen.
      | ^~~~~
ninja: build stopped: subcommand failed.
rank0: Traceback (most recent call last):
rank0:   File "/root/miniconda3/envs/monosdf/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2107, in _run_ninja_build
rank0:   File "/root/miniconda3/envs/monosdf/lib/python3.8/subprocess.py", line 516, in run
rank0:     raise CalledProcessError(retcode, process.args,
rank0: subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '1']' returned non-zero exit status 1.

rank0: The above exception was the direct cause of the following exception:

rank0: Traceback (most recent call last):
rank0:   File "training/exp_runner.py", line 58, in <module>
rank0:     trainrunner = MonoSDFTrainRunner(conf=opt.conf,
rank0:   File "/root/autodl-tmp/monosdf/code/../code/training/monosdf_train.py", line 107, in __init__
rank0:     self.model = utils.get_class(self.conf.get_string('train.model_class'))(conf=conf_model)
rank0:   File "/root/autodl-tmp/monosdf/code/../code/utils/general.py", line 18, in get_class
rank0:     m = __import__(module)
rank0:   File "/root/autodl-tmp/monosdf/code/../code/model/network.py", line 140, in <module>
rank0:     from hashencoder.hashgrid import _hash_encode, HashEncoder
rank0:   File "/root/autodl-tmp/monosdf/code/../code/hashencoder/__init__.py", line 1, in <module>
rank0:     from .hashgrid import HashEncoder
rank0:   File "/root/autodl-tmp/monosdf/code/../code/hashencoder/hashgrid.py", line 12, in <module>
rank0:     from .backend import _backend
rank0:   File "/root/autodl-tmp/monosdf/code/../code/hashencoder/backend.py", line 10, in <module>
rank0:     _backend = load(name='_hash_encoder',
rank0:   File "/root/miniconda3/envs/monosdf/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1309, in load
rank0:     return _jit_compile(
rank0:   File "/root/miniconda3/envs/monosdf/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1719, in _jit_compile
rank0:   File "/root/miniconda3/envs/monosdf/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1832, in _write_ninja_file_and_build_library
rank0:   File "/root/miniconda3/envs/monosdf/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2123, in _run_ninja_build
rank0:     raise RuntimeError(message) from e
rank0: RuntimeError: Error building extension '_hash_encoder'
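The "#error C++17 or later compatible compiler is required to use ATen" message usually means the host compiler picked up by torch.utils.cpp_extension is too old for this PyTorch build, so this looks like a toolchain issue rather than a problem in the monosdf code. As far as I know the JIT builder honours the CC/CXX environment variables on Linux, so a minimal sketch of pointing the build at a newer compiler is below; the compiler paths and version are assumptions for illustration.

```python
# Sketch: force the JIT extension build to use a g++ with full C++17 support.
# Paths/versions are examples only; set them before the extension is imported.
import os
os.environ["CC"] = "/usr/bin/gcc-9"    # assumed location of a C++17-capable gcc
os.environ["CXX"] = "/usr/bin/g++-9"

# Importing the encoder afterwards triggers the ninja build with that compiler:
# from hashencoder.hashgrid import HashEncoder
```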