JunyuanDeng / NeRF-LOAM

[ICCV2023] NeRF-LOAM: Neural Implicit Representation for Large-Scale Incremental LiDAR Odometry and Mapping
MIT License
491 stars · 29 forks

torch.cuda.OutOfMemoryError #15

Open iason-r opened 6 months ago

iason-r commented 6 months ago

I'm using a 16 GB GPU. Is this error related to the size of the dataset?

JunyuanDeng commented 6 months ago

16 GB should be enough. You can try lowering the batch size; the simplest way is to divide the `chunk_size` on that line by 10: `chunk_size // 10`. Of course, this will significantly slow down rendering.
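For intuition, here is a minimal sketch (not the repo's actual code; `render_in_chunks` and the doubling step are stand-ins) of why a smaller `chunk_size` lowers peak memory: rays are processed in fixed-size slices, so the largest intermediate buffer scales with the chunk size rather than the total ray count.

```python
def render_in_chunks(rays, chunk_size):
    """Process `rays` in slices of at most `chunk_size` and concatenate results."""
    outputs = []
    for i in range(0, len(rays), chunk_size):
        chunk = rays[i:i + chunk_size]        # peak memory scales with len(chunk)
        outputs.extend(r * 2 for r in chunk)  # stand-in for the real render step
    return outputs

# Dividing chunk_size by 10 means 10x more, but 10x smaller, slices:
# render_in_chunks(rays, chunk_size // 10)
```

The trade-off the reply mentions follows directly: more slices means more kernel launches and loop iterations, hence slower rendering.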

iason-r commented 6 months ago

Which file's `chunk_size` should I modify?

JunyuanDeng commented 6 months ago

You can click the hyperlink above.

iason-r commented 6 months ago

After changing `chunk_size` to `chunk_size // 10` it still errors out:

```
Traceback (most recent call last):
  File "/home/sucronav/.conda/envs/torch/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/sucronav/.conda/envs/torch/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/sucronav/renbin/NeRF-LOAM/src/mapping.py", line 112, in spin
    self.do_mapping(share_data, tracked_frame)
  File "/home/sucronav/renbin/NeRF-LOAM/src/mapping.py", line 179, in do_mapping
    bundle_adjust_frames(
  File "/home/sucronav/renbin/NeRF-LOAM/src/variations/render_helpers.py", line 398, in bundle_adjust_frames
    final_outputs = render_rays(
  File "/home/sucronav/renbin/NeRF-LOAM/src/variations/render_helpers.py", line 211, in render_rays
    intersections, hits = ray_intersect(
  File "/home/sucronav/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/sucronav/renbin/NeRF-LOAM/src/variations/voxel_helpers.py", line 534, in ray_intersect
    pts_idx, min_depth, max_depth = svo_ray_intersect(
  File "/home/sucronav/.local/lib/python3.8/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/sucronav/renbin/NeRF-LOAM/src/variations/voxel_helpers.py", line 108, in forward
    children = children.expand(S * G, *children.size()).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.61 GiB (GPU 0; 7.79 GiB total capacity;
979.16 MiB already allocated; 2.48 GiB free; 1016.00 MiB reserved in total by PyTorch) If reserved memory is
>> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory
Management and PYTORCH_CUDA_ALLOC_CONF
```

While the program was running I refreshed `nvidia-smi` once per second; from what I observed, GPU memory usage never reached 100%.
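As the error message itself suggests, fragmentation can sometimes be reduced by capping the caching allocator's split size via `PYTORCH_CUDA_ALLOC_CONF` before launching. A minimal sketch; the value 128 is an example to tune, not a value recommended by the repo:

```shell
# Set the allocator option for this shell session, then launch NeRF-LOAM
# from the same shell as usual.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
echo "$PYTORCH_CUDA_ALLOC_CONF"
```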

iason-r commented 6 months ago

Screenshot from 2024-01-03 16-18-50

JunyuanDeng commented 6 months ago

8 GB of VRAM is indeed a bit small. I've never run the program split across two GPUs, so I don't know how to modify it for that. If possible, use a GPU with at least 16 GB of VRAM.

iason-r commented 5 months ago

I'm running it on 24 GB of VRAM, but after nearly 24 hours it has only reached 68%. Is that normal?
Screenshot from 2024-01-16 09-33-17

JunyuanDeng commented 5 months ago

Yes, it does currently slow down the longer it runs; that's one of our current optimization targets. You can use the subscene branch to speed things up. Remember to fetch the latest git updates.

iason-r commented 5 months ago

Hmm, why does a 24 GB card also report OutOfMemory?

```
insert keyframe
** current num kfs: 18 **
tracking frame:  99%|███████████████████████████████████████████████████████████████████████████████████▎| 4501/4540 [47:26:23<44:59, 69.22s/it]
Process Process-2:
Traceback (most recent call last):
  File "/home/rb/anaconda3/envs/torch/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/rb/anaconda3/envs/torch/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rb/NeRF-LOAM/src/mapping.py", line 112, in spin
    self.do_mapping(share_data, tracked_frame)
  File "/home/rb/NeRF-LOAM/src/mapping.py", line 179, in do_mapping
    bundle_adjust_frames(
  File "/home/rb/NeRF-LOAM/src/variations/render_helpers.py", line 398, in bundle_adjust_frames
    final_outputs = render_rays(
  File "/home/rb/NeRF-LOAM/src/variations/render_helpers.py", line 211, in render_rays
    intersections, hits = ray_intersect(
  File "/home/rb/anaconda3/envs/torch/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/rb/NeRF-LOAM/src/variations/voxel_helpers.py", line 534, in ray_intersect
    pts_idx, min_depth, max_depth = svo_ray_intersect(
  File "/home/rb/anaconda3/envs/torch/lib/python3.8/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/rb/NeRF-LOAM/src/variations/voxel_helpers.py", line 108, in forward
    children = children.expand(S * G, *children.size()).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.50 GiB (GPU 0; 23.69 GiB total capacity;
5.14 GiB already allocated; 5.22 GiB free; 7.45 GiB reserved in total by PyTorch) If reserved memory is
>> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory
Management and PYTORCH_CUDA_ALLOC_CONF
```

iason-r commented 5 months ago

By the way, earlier `sampled_rays_d = frame.rays_d[sample_mask].cuda()` raised `RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)`, so I changed

```python
sample_mask = frame.sample_mask.cuda()
sampled_rays_d = frame.rays_d[sample_mask].cuda()
```

to

```python
sample_mask = frame.sample_mask.cuda()
sample_mask = sample_mask.cuda()
frame.rays_d = frame.rays_d.cuda()
sampled_rays_d = frame.rays_d[sample_mask]
```

Does this affect anything overall?
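The change matches PyTorch's indexing rule: a boolean mask must live on the same device as the tensor it indexes. A toy sketch of that rule (variable names here are illustrative, not the repo's actual attributes):

```python
import torch

# Pick whatever device is available; the rule holds on CPU and GPU alike.
device = "cuda" if torch.cuda.is_available() else "cpu"

rays_d = torch.randn(8, 3).to(device)           # move the indexed tensor...
sample_mask = (torch.rand(8) > 0.5).to(device)  # ...and the mask to match

# Both tensors now share a device, so boolean indexing succeeds.
sampled_rays_d = rays_d[sample_mask]
```

Functionally the result is the same rows as before; the practical difference is that `frame.rays_d` now lives permanently on the GPU instead of being copied per access, which costs some extra resident VRAM but should not change the values.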

JunyuanDeng commented 5 months ago

`Tried to allocate 5.50 GiB (GPU 0; 23.69 GiB total capacity; 5.14 GiB already allocated; 5.22 GiB free; 7.45 GiB reserved in total by PyTorch)` — with 5 GiB allocated and 7.5 GiB reserved, a 24 GB card should in theory still have about 11.5 GB of VRAM left. Are you running other programs on the GPU at the same time? You can also run the subscene branch, which is faster.

iason-r commented 5 months ago

Hello, I've been trying to run your package for a while now but keep getting stuck on this OutOfMemory problem. While working through it I'd like to study your code. Do you have any advice on learning this codebase and deep learning in general? I'm a first-year master's student; I've been doing engineering projects since enrolling and am only just starting research, and I haven't studied deep learning in depth before.

JunyuanDeng commented 5 months ago

If you haven't studied deep learning, start with the book Dive-into-DL-PyTorch, which is available in both Chinese and English. You could also study machine learning first, though you can skip that if you're short on time. Once you have the deep-learning basics, if you want to get into SLAM, read the book 14 Lectures on Visual SLAM for the fundamentals. With basic SLAM and deep-learning knowledge in place, read a PyTorch implementation of NeRF to understand how NeRF works, and finally look at NeRF-SLAM work such as this repository, picking out the most-cited and most-starred repositories to study.

iason-r commented 5 months ago

Got it, thank you. I got it running over the last couple of days, but the trajectory diverged on several runs; the localization doesn't seem very good.

JunyuanDeng commented 5 months ago

Is that on KITTI? If it's a different scene, you may need to tune the learning rate.

iason-r commented 5 months ago

It's KITTI sequence 00.

hhongwei1009 commented 2 weeks ago

> Got it, thank you. I got it running over the last couple of days, but the trajectory diverged on several runs; the localization doesn't seem very good.

How did you get it to run? I'm also stuck on out of memory.