Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/kaggle/working/program_ml/rvc/train/train.py", line 509, in run
    train_and_evaluate(
  File "/kaggle/working/program_ml/rvc/train/train.py", line 707, in train_and_evaluate
    scaler.scale(loss_disc).backward()
  File "/kaggle/tmp/.venv/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/kaggle/tmp/.venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/kaggle/tmp/.venv/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [172.19.2.2]:48294
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 42 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Project Version
Latest
Platform and OS Version
Kaggle
Affected Devices
Kaggle Latest Environment
Existing Issues
No response
What happened?
Training crashes in the backward pass of loss_disc (scaler.scale(loss_disc).backward(), line 707 of rvc/train/train.py) with:
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [172.19.2.2]:48294
At shutdown, resource_tracker also warns that there appear to be 42 leaked semaphore objects. The full traceback is at the top of this report.
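The RuntimeError is raised on the rank that was still running; a Gloo "Connection closed by peer" message usually means the other worker process exited first, so the root cause is probably in that other rank's output rather than in this traceback. One way to capture it is to have every worker report its own exception to the parent before exiting. The following is a minimal sketch, assuming the workers are started with multiprocessing.Process as the traceback indicates; run_logged and the (rank, pid, traceback) tuple layout are hypothetical and not part of the project:

```python
import os
import traceback


def run_logged(rank, target, queue, *args, **kwargs):
    """Call target(rank, *args, **kwargs) and report the outcome on queue.

    Hypothetical helper: puts (rank, pid, None) on success, or
    (rank, pid, formatted_traceback) on failure, then re-raises so the
    worker process still exits with a non-zero exit code.
    """
    try:
        target(rank, *args, **kwargs)
    except BaseException:
        # BaseException also covers KeyboardInterrupt and SystemExit.
        queue.put((rank, os.getpid(), traceback.format_exc()))
        raise
    queue.put((rank, os.getpid(), None))
```

Each worker would then be started with target=run_logged and a shared multiprocessing.Queue passed as the queue argument, and the parent would read the reports after joining the processes. A rank that is killed outright (for example by the kernel's out-of-memory killer) cannot report anything; in that case its Process.exitcode is negative, which indicates the signal that terminated it.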
Steps to reproduce
Run training. The error occurs at some point between epoch 100 and epoch 500.
Expected behavior
Training should continue without this error.
Attachments
No response
Screenshots or Videos
No response
Additional Information
No response