YerldSHO opened this issue 5 months ago
You are running out of VRAM.
The code and parameters were tested with an RTX 4090 (24 GB). I believe there are a few easy ways to save memory (reducing the number of fields optimized in parallel, reducing the number of rays per field), but getting it down to your 6 GB might require a few changes to the code (currently we preallocate a buffer for the keyframes; they could instead be loaded on demand from disk and/or downsampled without noticeable loss in quality).
I'll spend a bit of time on this to see if I can come up with a low VRAM configuration.
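For illustration only (this is not the repository's actual implementation), a minimal sketch of the last point above: keep keyframes off the GPU and upload only the sampled ones, optionally downsampled. All class and method names below are hypothetical.

```python
# Minimal sketch, assuming keyframes can be kept in CPU memory and uploaded
# to the GPU on demand; names here (OnDemandKeyframeStore, add, get) are
# hypothetical and not part of neural_graph_mapping.
import torch
import torch.nn.functional as F


class OnDemandKeyframeStore:
    """Holds RGB-D keyframes on the CPU and uploads batches on demand."""

    def __init__(self, downsample: int = 2, device: str = "cuda"):
        self.downsample = downsample
        self.device = device
        self._frames: list[torch.Tensor] = []

    def add(self, rgbd: torch.Tensor) -> None:
        # rgbd: (4, H, W) tensor (RGB + depth). Stored on the CPU only, so GPU
        # memory no longer scales with the length of the sequence.
        rgbd = rgbd.detach().cpu()
        if torch.cuda.is_available():
            rgbd = rgbd.pin_memory()  # enables faster async host-to-device copies
        self._frames.append(rgbd)

    def get(self, indices: list[int]) -> torch.Tensor:
        # Upload just the requested keyframes, optionally downsampled.
        batch = torch.stack([self._frames[i] for i in indices])
        batch = batch.to(self.device, non_blocking=True)
        if self.downsample > 1:
            batch = F.interpolate(batch, scale_factor=1.0 / self.downsample,
                                  mode="nearest")
        return batch


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    store = OnDemandKeyframeStore(downsample=2, device=device)
    for _ in range(10):
        store.add(torch.rand(4, 480, 640))
    print(store.get([0, 3, 7]).shape)  # torch.Size([3, 4, 240, 320])
```

The trade-off is an extra host-to-device copy per batch, which pinned memory and non-blocking transfers help amortize.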
Thanks for your answer, I'm looking forward to it. I will also keep working on the memory issue on my side, since it affects more than one of my projects.
I would be grateful if you could point out a few places in the code where I could start looking.
✨ Pixi task (nrgbd_wr in default): python -m neural_graph_mapping.run_mapping --config nrgbd_dataset.yaml neural_graph_map.yaml coslam_eval.yaml --dataset_config.root_dir $NGM_DATA_DIR/nrgbd/ --dataset_config.scene whiteroom $NGM_EXTRA_ARGS --rerun_vis True
/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/site-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 32 worker processes in total. Our suggested max number of worker in current system is 12, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Traceback (most recent call last):
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/alex/projects/neural_graph_mapping/src/neural_graph_mapping/run_mapping.py", line 2428, in <module>
main()
File "/home/alex/projects/neural_graph_mapping/src/neural_graph_mapping/run_mapping.py", line 2421, in main
neural_graph_map.fit()
File "/home/alex/projects/neural_graph_mapping/src/neural_graph_mapping/run_mapping.py", line 1032, in fit
self._init_mv_training_data()
File "/home/alex/projects/neural_graph_mapping/src/neural_graph_mapping/utils.py", line 83, in wrapper
result = f(*args, **kwargs)
File "/home/alex/projects/neural_graph_mapping/src/neural_graph_mapping/run_mapping.py", line 1692, in _init_mv_training_data
self._nc_rgbd_tensor = torch.empty(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.58 GiB. GPU 0 has a total capacity of 5.78 GiB of which 66.44 MiB is free. Process 11099 has 5.30 GiB memory in use. Including non-PyTorch memory, this process has 116.00 MiB memory in use. Of the allocated memory 9.94 MiB is allocated by PyTorch, and 12.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Exception in thread Thread-1 (_pin_memory_loop):
Traceback (most recent call last):
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 53, in _pin_memory_loop
do_one_step()
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 30, in do_one_step
r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 495, in rebuild_storage_fd
fd = df.detach()
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/multiprocessing/resource_sharer.py", line 86, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/multiprocessing/connection.py", line 508, in Client
answer_challenge(c, authkey)
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/multiprocessing/connection.py", line 752, in answer_challenge
message = connection.recv_bytes(256) # reject large message
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/home/alex/projects/neural_graph_mapping/.pixi/envs/default/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Good afternoon, I was running your code and came across this problem. What could be the cause, and how can I solve it?
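A note on the numbers in the log, in case it helps with the search: the failing allocation is the preallocated keyframe buffer (`self._nc_rgbd_tensor`), whose size appears to grow with the number of keyframes and the per-frame resolution, so the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint is unlikely to help here (it targets fragmentation, while this is a single 4.58 GiB request on a card with less than 6 GiB total). A rough sketch of how a buffer of that size can arise, with all concrete values assumed for illustration only:

```python
# Back-of-envelope size of a dense, preallocated RGB-D keyframe buffer.
# The frame count, resolution, channel count, and dtype below are assumptions
# picked so the full-resolution total roughly matches the 4.58 GiB in the log;
# they are not taken from the repository or the whiteroom sequence.
def buffer_size_gib(num_keyframes: int, height: int, width: int,
                    channels: int = 4, bytes_per_value: int = 4) -> float:
    """Size in GiB of a dense float32 buffer holding all keyframes."""
    return num_keyframes * height * width * channels * bytes_per_value / 1024**3


print(buffer_size_gib(1000, 480, 640))  # ~4.58 GiB, too large for a 6 GB GPU
print(buffer_size_gib(1000, 240, 320))  # ~1.14 GiB after 2x downsampling
```

Halving the resolution cuts the buffer by a factor of four, which is consistent with the suggestion above that downsampled keyframes could fit a smaller GPU.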