JunyuanDeng / NeRF-LOAM

[ICCV2023] NeRF-LOAM: Neural Implicit Representation for Large-Scale Incremental LiDAR Odometry and Mapping

The Process Freezes When Dealing with a Large Number of Frames #17

Closed jxbb233 closed 8 months ago

jxbb233 commented 9 months ago

Hi, I tried running the demo on a full KITTI sequence (over 1000 frames) and found that the whole process often freezes for no apparent reason after running for a while; this does not happen when only a small number of frames needs to be processed. The freeze mostly happens during the post-processing steps: no error message is printed after it hangs, and ps -ef shows the process itself has not exited. Below are the last few lines of the runtime log.

********** current num kfs: 20 **********
frame id 613
trans  tensor([-68.3958,  15.3260,  -1.3550], device='cuda:0', grad_fn=<SubBackward0>)
frame id 614
trans  tensor([-69.7998,  15.3240,  -1.3688], device='cuda:0', grad_fn=<SubBackward0>)
frame id 615
trans  tensor([-71.2139,  15.3187,  -1.3821], device='cuda:0', grad_fn=<SubBackward0>)
frame id 616
trans  tensor([-72.5929,  15.3241,  -1.3938], device='cuda:0', grad_fn=<SubBackward0>)
frame id 617
trans  tensor([-73.9971,  15.3187,  -1.4026], device='cuda:0', grad_fn=<SubBackward0>)
frame id 618
trans  tensor([-75.4045,  15.3136,  -1.4108], device='cuda:0', grad_fn=<SubBackward0>)
insert keyframe
********** current num kfs: 21 **********
frame id 619
trans  tensor([-76.7938,  15.3104,  -1.4320], device='cuda:0', grad_fn=<SubBackward0>)
frame id 620
trans  tensor([-78.1841,  15.2990,  -1.4462], device='cuda:0', grad_fn=<SubBackward0>)
frame id 621
trans  tensor([-79.5537,  15.2961,  -1.4579], device='cuda:0', grad_fn=<SubBackward0>)
********** post-processing steps **********

  0%|          | 0/22 [00:00<?, ?it/s]
 post-processing steps:   0%|          | 0/22 [00:00<?, ?it/s]

Here is the command I ran.

python demo/run.py configs/kitti/kitti_06.yaml

To enable visualization, I made the following change to kitti.yaml.

debug_args:
  mesh_freq: 10

Thanks for your help!

JunyuanDeng commented 9 months ago

Hi, I have not run into this problem before. My guess is that too many keyframes are being cached, which drives resource usage too high. You could try adjusting this value: change the 20 here to 2, and if that works, increase it again.
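For reference, a minimal sketch of that change, assuming the keyframe window is exposed as a window_size-style key in the KITTI config (the exact key name in configs/kitti/kitti_06.yaml may differ):

mapper_specs:
  # hypothetical key name: point this at whichever parameter caps the cached keyframes
  window_size: 2   # was 20; increase again once the run stays stable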

jxbb233 commented 9 months ago

After changing this value the run does last longer, but it then fails with the following error.

********** current num kfs: 2 **********
frame id 956
trans  tensor([150.7139,   2.9822,   0.5144], device='cuda:0', grad_fn=<SubBackward0>)
frame id 957
trans  tensor([152.0837,   3.0972,   0.5227], device='cuda:0', grad_fn=<SubBackward0>)
/usr/local/lib/python3.8/dist-packages/torch/functional.py:599: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2315.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/data/NeRF-LOAM-master/src/mapping.py", line 98, in spin
    tracked_frame = kf_buffer.get()
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/reductions.py", line 297, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 508, in Client
    answer_challenge(c, authkey)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 752, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

I tried twice and this error is triggered consistently. Also, looking at the log, the run never enters the post-processing steps at all. Is that expected? Thanks for your help!
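For what it's worth, here is a minimal sketch of how I might guard the queue read in src/mapping.py's spin() so a dead tracking process shows up as a clear message instead of the raw EOFError above (kf_buffer and spin come from the traceback; everything else is my own assumption):

import queue

def get_tracked_frame(kf_buffer, timeout_s=60):
    """Fetch the next tracked keyframe, failing loudly if the producer process died."""
    try:
        # kf_buffer is the torch.multiprocessing queue seen in the traceback
        return kf_buffer.get(timeout=timeout_s)
    except (EOFError, ConnectionResetError):
        raise RuntimeError(
            "Tracking process exited before handing over its shared CUDA tensors; "
            "it was probably killed, e.g. by an out-of-memory condition in the container."
        )
    except queue.Empty:
        raise RuntimeError(f"No keyframe received for {timeout_s}s; the tracker may be stuck.")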

JunyuanDeng commented 8 months ago

After he switched the GPU to an RTX 8000, the problem seems to be solved; the cause was likely insufficient memory in the Docker environment, so I am closing this issue.
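For anyone who runs into the same symptom, a minimal sketch of per-frame memory logging that could help confirm the memory hypothesis (it assumes psutil is installed; where to call it from the tracking and mapping loops is up to you):

import psutil
import torch

def log_memory(frame_id):
    """Print GPU and host memory usage so OOM-driven process kills are easier to spot."""
    gpu_alloc = torch.cuda.memory_allocated() / 2**30       # GiB currently held by tensors
    gpu_peak = torch.cuda.max_memory_allocated() / 2**30    # GiB peak since process start
    host = psutil.virtual_memory()
    print(f"frame {frame_id}: gpu {gpu_alloc:.2f} GiB (peak {gpu_peak:.2f} GiB), "
          f"host {host.used / 2**30:.2f}/{host.total / 2**30:.2f} GiB")

If the run is inside Docker, it may also be worth starting the container with a larger --shm-size, since PyTorch multiprocessing passes CPU tensors through shared memory.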