Closed · BibouOnAir closed this issue 2 months ago
Restart your environment; this is not normal behavior.
I installed Applio on blank PaperSpace machines following these steps:
cd
wget -q https://repo.anaconda.com/miniconda/Miniconda3-py310_24.4.0-0-Linux-x86_64.sh
chmod +x Miniconda3-py310_24.4.0-0-Linux-x86_64.sh
./Miniconda3-py310_24.4.0-0-Linux-x86_64.sh -b -f -p /usr/local
rm Miniconda3-py310_24.4.0-0-Linux-x86_64.sh && export LD_LIBRARY_PATH=/usr/local/lib/python3.10/site-packages/nvidia/nvjitlink/lib:$LD_LIBRARY_PATH
cd /notebooks/
git clone https://github.com/IAHispano/Applio.git || true
cd Applio/
make run-install
make run-applio
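For what it's worth, a quick way to double-check that PyTorch sees the PaperSpace GPU after these steps (a minimal sanity-check sketch, assuming the standard PyTorch that make run-install pulls in):

import torch

# Quick sanity check of the environment before training.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

Training does run for a while (see the logs below), so the GPU itself seems to be picked up correctly.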
After launching the training, it stops far too early. Here is the error I get:
Process Process-1:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/notebooks/Applio/rvc/train/train.py", line 403, in run
train_and_evaluate(
File "/notebooks/Applio/rvc/train/train.py", line 779, in train_and_evaluate
os.remove(file)
FileNotFoundError: [Errno 2] No such file or directory: '/notebooks/Applio/logs/GIMS/GIMS_41e_1271s_best_epoch.pth'
Process Process-2:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/notebooks/Applio/rvc/train/train.py", line 417, in run
train_and_evaluate(
File "/notebooks/Applio/rvc/train/train.py", line 638, in train_and_evaluate
scaler.scale(loss_disc).backward()
File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.38.157.218]:56964
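The crash starts at line 779 of train.py, where os.remove fails because the old best-epoch checkpoint is already gone. A guard like the following would tolerate that (just a hypothetical sketch on my side, not Applio's actual code; remove_checkpoint_safely is my own name):

import contextlib
import os

def remove_checkpoint_safely(path: str) -> None:
    # Hypothetical guard: ignore the case where another process (or an
    # earlier save/delete cycle) already removed the old best-epoch file.
    with contextlib.suppress(FileNotFoundError):
        os.remove(path)

remove_checkpoint_safely("/notebooks/Applio/logs/GIMS/GIMS_41e_1271s_best_epoch.pth")

I don't know why the file is missing in the first place, though.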
I get the same problem on several different machines, even after reinstalling a fresh environment:
PLK | epoch=25 | step=650 | time=07:49:30 | training_speed=0:00:25 | lowest_value=28.921 (epoch 18 and step 446) | Number of epochs remaining for overtraining: 93
Saved model '/notebooks/Applio/logs/PLK/PLK_26e_676s_best_epoch.pth' (epoch 26 and step 676)
Process Process-1:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/notebooks/Applio/rvc/train/train.py", line 403, in run
train_and_evaluate(
File "/notebooks/Applio/rvc/train/train.py", line 779, in train_and_evaluate
os.remove(file)
FileNotFoundError: [Errno 2] No such file or directory: '/notebooks/Applio/logs/PLK/PLK_18e_468s_best_epoch.pth'
Process Process-2:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/notebooks/Applio/rvc/train/train.py", line 417, in run
train_and_evaluate(
File "/notebooks/Applio/rvc/train/train.py", line 638, in train_and_evaluate
scaler.scale(loss_disc).backward()
File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.37.213.40]:11435
Saved index file '/notebooks/Applio/logs/PLK/added_PLK_v2.index'
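As far as I can tell, the Gloo "Connection closed by peer" in Process-2 looks like a secondary symptom: once Process-1 dies, the other rank loses its TCP peer in the middle of the distributed backward pass. A tiny standalone sketch of that mechanism with torch.distributed (my own example, nothing from Applio, and the exact message can vary with timing):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    if rank == 0:
        # Simulate the first process crashing (like the FileNotFoundError above).
        raise RuntimeError("rank 0 crashed")
    # The surviving rank then loses its peer; depending on timing this shows up
    # as a Gloo "Connection closed by peer" RuntimeError on the next collective.
    tensor = torch.ones(1)
    dist.all_reduce(tensor)

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2, join=True)

So the real problem seems to be whatever makes the first process fail, not Gloo itself.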
This is the error I get when I try to restart the training:
GIMS | epoch=1 | step=31 | time=07:59:56 | training_speed=0:00:24
Process Process-1:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/notebooks/Applio/rvc/train/train.py", line 403, in run
train_and_evaluate(
File "/notebooks/Applio/rvc/train/train.py", line 638, in train_and_evaluate
scaler.scale(loss_disc).backward()
File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.38.157.218]:40337
Saved index file '/notebooks/Applio/logs/GIMS/added_GIMS_v2.index'
This is on the new version uploaded on August 7, 2024, running on a PaperSpace GPU.
Voice training stops far too early with the overtraining detector enabled; the error output is shown above.
Can you tell me more? I don't understand what is going on. Thank you very much for your work.