HRNet / HigherHRNet-Human-Pose-Estimation

This is an official implementation of our CVPR 2020 paper "HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation" (https://arxiv.org/abs/1908.10357)
MIT License

out of memory #80

Open Jackie-LJQ opened 3 years ago

Jackie-LJQ commented 3 years ago

I get CUDA out of memory every time I resume training from a checkpoint, but I don't get the error if I load the initial weights and train from the first epoch. I am using my own dataset, but I think it is more likely that something is wrong with the distributed training. Any suggestions on how I should check the code? (A sketch of what to check follows the logs below.)

Training from epoch 0 works fine:

Epoch: [0][0/167] Time: 9.341s (9.341s) Speed: 2.1 samples/s Data: 7.671s (7.671s) Stage0-heatmaps: 2.215e-03 (2.215e-03) Stage1-heatmaps: 6.406e-04 (6.406e-04) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 4.953e-08 (4.953e-08) Stage1-pull: 0.000e+00 (0.000e+00)
Epoch: [0][0/167] Time: 9.341s (9.341s) Speed: 2.1 samples/s Data: 7.873s (7.873s) Stage0-heatmaps: 1.990e-03 (1.990e-03) Stage1-heatmaps: 5.832e-04 (5.832e-04) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 4.789e-08 (4.789e-08) Stage1-pull: 0.000e+00 (0.000e+00)
Epoch: [0][100/167] Time: 0.539s (0.651s) Speed: 37.1 samples/s Data: 0.000s (0.101s) Stage0-heatmaps: 4.487e-04 (1.019e-03) Stage1-heatmaps: 4.257e-04 (5.118e-04) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 3.724e-07 (4.452e-07) Stage1-pull: 0.000e+00 (0.000e+00)
Epoch: [0][100/167] Time: 0.541s (0.651s) Speed: 36.9 samples/s Data: 0.000s (0.099s) Stage0-heatmaps: 4.705e-04 (1.050e-03) Stage1-heatmaps: 4.493e-04 (5.196e-04) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 3.321e-07 (4.364e-07) Stage1-pull: 0.000e+00 (0.000e+00)
=> saving checkpoint to output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3

Resuming training from the checkpoint produces the error:

Target Transforms (if any): None
=> loading checkpoint 'output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3/checkpoint.pth.tar'
=> loading checkpoint 'output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3/checkpoint.pth.tar'
=> loaded checkpoint 'output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3/checkpoint.pth.tar' (epoch 5)
=> loaded checkpoint 'output/coco_kpt/pose_higher_hrnet/w32_512_adam_lr1e-3/checkpoint.pth.tar' (epoch 5)
Epoch: [5][0/167] Time: 9.577s (9.577s) Speed: 2.1 samples/s Data: 8.164s (8.164s) Stage0-heatmaps: 1.595e-04 (1.595e-04) Stage1-heatmaps: 7.866e-05 (7.866e-05) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 6.155e-08 (6.155e-08) Stage1-pull: 0.000e+00 (0.000e+00)
Epoch: [5][0/167] Time: 9.665s (9.665s) Speed: 2.1 samples/s Data: 7.976s (7.976s) Stage0-heatmaps: 1.904e-04 (1.904e-04) Stage1-heatmaps: 8.872e-05 (8.872e-05) Stage0-push: 0.000e+00 (0.000e+00) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 5.090e-08 (5.090e-08) Stage1-pull: 0.000e+00 (0.000e+00)
Traceback (most recent call last):
  File "tools/dist_train.py", line 323, in <module>
    main()
  File "tools/dist_train.py", line 115, in main
    args=(ngpus_per_node, args, final_output_dir, tb_log_dir)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/kpoints/HigherHRNet-Human-Pose-Estimation/tools/dist_train.py", line 285, in main_worker
    final_output_dir, tb_log_dir, writer_dict, fp16=cfg.FP16.ENABLED)
  File "/kpoints/HigherHRNet-Human-Pose-Estimation/tools/../lib/core/trainer.py", line 76, in do_train
    loss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 52.00 MiB (GPU 0; 7.80 GiB total capacity; 5.73 GiB already allocated; 27.31 MiB free; 5.86 GiB reserved in total by PyTorch) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:289)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fa6f3122536 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: + 0x1cf1e (0x7fa6f336bf1e in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x1df9e (0x7fa6f336cf9e in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
frame #3: at::native::empty_cuda(c10::ArrayRef, c10::TensorOptions const&, c10::optional) + 0x135 (0x7fa6f5f00535 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xf7a66b (0x7fa6f44f866b in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0xfc3f57 (0x7fa6f4541f57 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0x1075389 (0x7fa730a7c389 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: + 0x10756c7 (0x7fa730a7c6c7 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: + 0xe3c42e (0x7fa73084342e in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: at::TensorIterator::fast_set_up() + 0x5cf (0x7fa7308442af in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #10: at::TensorIterator::build() + 0x4c (0x7fa730844b6c in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #11: at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool) + 0x146 (0x7fa730845216 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #12: at::native::mul(at::Tensor const&, at::Tensor const&) + 0x3a (0x7fa730564eba in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #13: + 0xf76ef8 (0x7fa6f44f4ef8 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cuda.so)
frame #14: + 0x10c3ec0 (0x7fa730acaec0 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #15: + 0x2d2e779 (0x7fa732735779 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #16: + 0x10c3ec0 (0x7fa730acaec0 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #17: at::Tensor::mul(at::Tensor const&) const + 0xf0 (0x7fa73f108ab0 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #18: torch::autograd::generated::PowBackward0::apply(std::vector<at::Tensor, std::allocator >&&) + 0x1a6 (0x7fa7322caa06 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #19: + 0x2d89c05 (0x7fa732790c05 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::Engine::evaluate_function(std::shared_ptr&, torch::autograd::Node, torch::autograd::InputBuffer&) + 0x16f3 (0x7fa73278df03 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #21: torch::autograd::Engine::thread_main(std::shared_ptr const&, bool) + 0x3d2 (0x7fa73278ece2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #22: torch::autograd::Engine::thread_init(int) + 0x39 (0x7fa732787359 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so)
frame #23: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7fa73eec64d8 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #24: + 0xbd66f (0x7fa73ff9766f in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #25: + 0x76db (0x7fa742c906db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #26: clone + 0x3f (0x7fa742fc988f in /lib/x86_64-linux-gnu/libc.so.6)

root@a2bff378da93:/kpoints/HigherHRNet-Human-Pose-Estimation#
/usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 20 leaked semaphores to clean up at shutdown
  len(cache))
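
One thing worth checking here is how the resume path loads the checkpoint. If it is a plain `torch.load(checkpoint_file)` with no `map_location`, every spawned worker deserializes the saved CUDA tensors back onto the GPU they were written from (typically GPU 0), so resuming stacks extra copies of the model and optimizer state on that card before `loss.backward()` runs, which would explain an OOM that only shows up on resume. Below is a minimal, hypothetical sketch of a safer resume helper (the function name `resume_checkpoint` and the checkpoint keys `epoch` / `state_dict` / `optimizer` are assumptions, not the repository's exact code):

```python
# Hypothetical sketch (not the repository's exact code): resume logic for one
# DDP worker spawned by torch.multiprocessing.spawn.
import torch


def resume_checkpoint(checkpoint_file, model, optimizer, gpu):
    # map_location='cpu' keeps the deserialized tensors off the GPU; without it,
    # CUDA tensors are restored onto the device they were saved from (GPU 0),
    # once per worker process.
    checkpoint = torch.load(checkpoint_file, map_location='cpu')

    begin_epoch = checkpoint['epoch']
    model.load_state_dict(checkpoint['state_dict'])
    # Optimizer.load_state_dict moves the state onto each parameter's device,
    # so loading from CPU is safe here.
    optimizer.load_state_dict(checkpoint['optimizer'])

    # Free the CPU copy and any cached blocks before training resumes.
    del checkpoint
    torch.cuda.empty_cache()

    # Sanity check: print per-rank memory right after resuming.
    allocated_mb = torch.cuda.memory_allocated(gpu) / 1024 ** 2
    print(f'rank {gpu}: {allocated_mb:.0f} MiB allocated after resume')
    return begin_epoch
```

If rank 0 reports much more allocated memory than the other ranks right after resuming, mapping the checkpoint to CPU (and deleting the dict) is usually enough; otherwise, lowering the per-GPU batch size in the experiment config is the fallback.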

chenmingjian commented 3 years ago

I had the same problem. :(

wusaisa commented 10 months ago

I had the same problem. Did you solve it?