Thanks for this work. I was trying to train the model using the conda environment, but it ended up consuming all of the RAM after a few hours; when I downgraded my torch, training worked. Here is the error message after the memory got filled:
...
[GPU0] Training epoch: 14
0%| | 0/2948 [00:00<?, ?it/s][GPU1] Training epoch: 14
8%|████████ | 228/2948 [01:49<12:51, 3.53it/s][GPU0] Model saved
25%|█████████████████████████▋ | 728/2948 [04:39<16:02, 2.31it/s][GPU0] Model saved
42%|██████████████████████████████████████████▉ | 1228/2948 [07:49<10:03, 2.85it/s][GPU0] Model saved
59%|████████████████████████████████████████████████████████████▎ | 1728/2948 [10:39<06:38, 3.06it/s][GPU0] Model saved
76%|█████████████████████████████████████████████████████████████████████████████▊ | 2228/2948 [13:54<04:46, 2.51it/s][GPU0] Model saved
93%|███████████████████████████████████████████████████████████████████████████████████████████████▎ | 2728/2948 [17:08<01:15, 2.92it/s][GPU0] Model saved
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2948/2948 [18:27<00:00, 2.66it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2948/2948 [18:27<00:00, 2.66it/s]
[GPU0] Validating at the start of epoch: 15
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2032/2032 [02:59<00:00, 11.34it/s]
[GPU0] Validation set average loss: 0.2255345582962036
[GPU0] Training epoch: 15
[GPU1] Training epoch: 15
0%| | 0/2948 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/xx/DL/Repos/RobustVideoMatting/train_xx.py", line 501, in <module>
mp.spawn(
File "/home/xx/mambaforge/envs/pytorch-pip/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/xx/mambaforge/envs/pytorch-pip/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/xx/mambaforge/envs/pytorch-pip/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGKILL
/home/xx/mambaforge/envs/pytorch-pip/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 138 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
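The SIGKILL is consistent with the kernel's OOM killer stepping in once host RAM is exhausted. A minimal sketch of how one could log host-side memory from inside the training loop to watch the growth (assumes psutil is installed; the helper name and interval are arbitrary, not part of train_xx.py):

```python
# Minimal sketch: log host-side memory from inside the training loop.
# Assumes psutil is installed; helper name and logging interval are arbitrary.
import os
import psutil

def log_host_memory(step, every=100):
    if step % every == 0:
        rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024**3
        sys_pct = psutil.virtual_memory().percent
        print(f"[step {step}] process RSS: {rss_gib:.2f} GiB, system RAM used: {sys_pct:.1f}%")
```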
Is there any reason for PyTorch 2 to leak memory when combined with distributed training?
I disabled the distributed training by modifying the code, which avoids the memory leak when using PyTorch 2. PyTorch 2 trains a lot faster than 1.x.
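Roughly, the change amounts to calling the training function directly in the main process instead of going through mp.spawn. A minimal sketch of that shape (train() and the flag are placeholders, not the actual code in train_xx.py):

```python
# Minimal sketch: run the training entry point in a single process instead of
# spawning DDP workers. train() and USE_DISTRIBUTED are placeholders, not the
# real code from train_xx.py.
import torch.multiprocessing as mp

def train(rank, world_size):
    # stand-in for the real training loop / DDP setup
    print(f"training on rank {rank} of {world_size}")

USE_DISTRIBUTED = False  # flip to True to restore the original mp.spawn path

if __name__ == "__main__":
    world_size = 2
    if USE_DISTRIBUTED:
        # original path: one worker process per GPU
        mp.spawn(train, args=(world_size,), nprocs=world_size)
    else:
        # workaround: single process, no process group, no inter-process tensors
        train(rank=0, world_size=1)
```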