PeterL1n / RobustVideoMatting

Robust Video Matting in PyTorch, TensorFlow, TensorFlow.js, ONNX, CoreML!
https://peterl1n.github.io/RobustVideoMatting/
GNU General Public License v3.0

Distributed training with PyTorch 2+ results in memory leakage #262

Open YJonmo opened 4 months ago

YJonmo commented 4 months ago

Thanks for this work.

I was trying to train the model using the following conda environment:

pytorch                   2.1.2           py3.11_cuda11.8_cudnn8.7.0_0    pytorch
pytorch-cuda              11.8                 h7e8668a_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch

But it ended up consuming all of the system RAM after a few hours.

When I downgraded PyTorch, it worked:

pytorch                   1.13.1          py3.10_cuda11.7_cudnn8.5.0_0    pytorch
pytorch-cuda              11.7                 h778d358_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch

Here is the error message after the memory was exhausted:

...
[GPU0] Training epoch: 14
  0%|                                                                                                                  | 0/2948 [00:00<?, ?it/s][GPU1] Training epoch: 14
  8%|████████                                                                                                | 228/2948 [01:49<12:51,  3.53it/s][GPU0] Model saved
 25%|█████████████████████████▋                                                                              | 728/2948 [04:39<16:02,  2.31it/s][GPU0] Model saved
 42%|██████████████████████████████████████████▉                                                            | 1228/2948 [07:49<10:03,  2.85it/s][GPU0] Model saved
 59%|████████████████████████████████████████████████████████████▎                                          | 1728/2948 [10:39<06:38,  3.06it/s][GPU0] Model saved
 76%|█████████████████████████████████████████████████████████████████████████████▊                         | 2228/2948 [13:54<04:46,  2.51it/s][GPU0] Model saved
 93%|███████████████████████████████████████████████████████████████████████████████████████████████▎       | 2728/2948 [17:08<01:15,  2.92it/s][GPU0] Model saved
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2948/2948 [18:27<00:00,  2.66it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2948/2948 [18:27<00:00,  2.66it/s]
[GPU0] Validating at the start of epoch: 15
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2032/2032 [02:59<00:00, 11.34it/s]
[GPU0] Validation set average loss: 0.2255345582962036
[GPU0] Training epoch: 15
[GPU1] Training epoch: 15
  0%|                                                                                                                  | 0/2948 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/xx/DL/Repos/RobustVideoMatting/train_xx.py", line 501, in <module>
    mp.spawn(
  File "/home/xx/mambaforge/envs/pytorch-pip/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/xx/mambaforge/envs/pytorch-pip/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/xx/mambaforge/envs/pytorch-pip/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGKILL
 /home/xx/mambaforge/envs/pytorch-pip/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 138 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Is there any reason why PyTorch 2 would leak memory when combined with distributed training? As a workaround, I modified the code to disable distributed training when using PyTorch 2, which avoids the leak. PyTorch 2 trains a lot faster than 1.x.
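
To confirm that the leak is in host RAM rather than GPU memory, one option is to log each spawned worker's resident set size once per epoch. A minimal sketch, assuming psutil is installed; `log_host_memory` is an illustrative helper, not code from this repo:

```python
# Minimal sketch: log host RSS per epoch inside each mp.spawn worker
# to see which process is accumulating CPU memory.
import os
import psutil

def log_host_memory(rank, epoch):
    # Resident set size of the current worker process, in GiB.
    rss_gib = psutil.Process(os.getpid()).memory_info().rss / 2**30
    print(f'[GPU{rank}] epoch {epoch}: host RSS = {rss_gib:.2f} GiB')

# Called at the end of each training epoch in every spawned worker, e.g.:
# for epoch in range(start_epoch, num_epochs):
#     train_one_epoch(...)
#     log_host_memory(rank, epoch)
```

If the RSS of both workers grows by a roughly constant amount per epoch, that would be consistent with the OOM killer terminating process 1 with SIGKILL as shown above.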

onmygame2 commented 4 weeks ago

I tried training with a single GPU but it still has a memory leak. I've tried a lot of methods but still can't fix it.
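
For reference, these are the PyTorch-side settings most commonly suggested for host-memory growth during training. None of them is a confirmed fix for this issue, and the DataLoader arguments shown are illustrative rather than taken from this repo's train.py:

```python
# Commonly suggested knobs for CPU-memory growth in PyTorch training loops
# (illustrative only; not verified to resolve this issue).
import torch.multiprocessing as mp

# 1) Use the 'file_system' sharing strategy for tensors passed between
#    DataLoader workers, which avoids exhausting file-descriptor based
#    shared memory segments.
mp.set_sharing_strategy('file_system')

# 2) When accumulating losses for logging, keep detached scalars so the
#    autograd graph is not retained across iterations:
# running_loss += loss.item()      # instead of: running_loss += loss

# 3) DataLoader worker settings that affect host RAM usage:
# loader = DataLoader(dataset, num_workers=4, persistent_workers=False,
#                     pin_memory=True)
```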