IAHispano / Applio

A simple, high-quality voice conversion tool focused on ease of use and performance
https://applio.org
MIT License

[BUG] Voice training stops very early with the overtraining detector + Python error #554

Closed BibouOnAir closed 2 months ago

BibouOnAir commented 2 months ago

On the new version uploaded on August 7, 2024, running on a Paperspace GPU machine.

Voice training stops very early with the overtraining detector.

Can you tell me more? I don't understand. Thank you very much for your work. Here is the error output:

Stopping training due to possible overtraining. Lowest generator loss: 8.961639404296875 at epoch 54, step 1338                                                                                       
Process Process-2:                                                                                                                                                                                    
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/notebooks/Applio/rvc/train/train.py", line 417, in run
    train_and_evaluate(
  File "/notebooks/Applio/rvc/train/train.py", line 638, in train_and_evaluate
    scaler.scale(loss_disc).backward()
  File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:525] Read error [10.36.141.29]:13703: Connection reset by peer
/usr/local/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
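
If it helps, the RuntimeError looks like a secondary failure: once one rank of the torch.distributed group dies (here, the one that triggered the overtraining stop), the surviving rank's collective call fails with a Gloo connection error. A minimal sketch that reproduces the same class of error (assuming a two-rank Gloo group, as the traceback suggests; the address and port are placeholders):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Hypothetical two-rank Gloo group for illustration only.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    if rank == 0:
        return  # simulate the early stop: rank 0 simply exits
    t = torch.ones(1)
    # The surviving rank fails here with "Connection reset/closed by peer"
    # once rank 0's process is gone, mirroring the traceback above.
    dist.all_reduce(t)

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)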
blaisewf commented 2 months ago

Restart your environment; this is not normal behavior.

BibouOnAir commented 2 months ago

I installed Applio on blank Paperspace machines following these steps:

cd
wget -q https://repo.anaconda.com/miniconda/Miniconda3-py310_24.4.0-0-Linux-x86_64.sh
chmod +x Miniconda3-py310_24.4.0-0-Linux-x86_64.sh
./Miniconda3-py310_24.4.0-0-Linux-x86_64.sh -b -f -p /usr/local
rm Miniconda3-py310_24.4.0-0-Linux-x86_64.sh && export LD_LIBRARY_PATH=/usr/local/lib/python3.10/site-packages/nvidia/nvjitlink/lib:$LD_LIBRARY_PATH

cd /notebooks/
git clone https://github.com/IAHispano/Applio.git
true
cd Applio/

make run-install

make run-applio

After launching the training, it stops much too early. Here is the error I get:

Process Process-1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/notebooks/Applio/rvc/train/train.py", line 403, in run
    train_and_evaluate(
  File "/notebooks/Applio/rvc/train/train.py", line 779, in train_and_evaluate
    os.remove(file)
FileNotFoundError: [Errno 2] No such file or directory: '/notebooks/Applio/logs/GIMS/GIMS_41e_1271s_best_epoch.pth'
Process Process-2:                                                                                                                                                               
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/notebooks/Applio/rvc/train/train.py", line 417, in run
    train_and_evaluate(
  File "/notebooks/Applio/rvc/train/train.py", line 638, in train_and_evaluate
    scaler.scale(loss_disc).backward()
  File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.38.157.218]:56964
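
The first traceback suggests the checkpoint cleanup at train.py line 779 calls os.remove() on a best-epoch file that is already gone; the second is the other rank failing once its peer crashed. A hedged sketch of a guarded removal (remove_if_exists is a hypothetical helper, not Applio's actual code):

import contextlib
import os

def remove_if_exists(path):
    # Ignore the case where the checkpoint was already deleted or renamed,
    # e.g. by a concurrent training process writing to the same logs folder.
    with contextlib.suppress(FileNotFoundError):
        os.remove(path)

# e.g. instead of os.remove(file) in train_and_evaluate:
# remove_if_exists("/notebooks/Applio/logs/GIMS/GIMS_41e_1271s_best_epoch.pth")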
BibouOnAir commented 2 months ago

On several different machines, after reinstalling a fresh environment, I still get the same problem.

PLK | epoch=25 | step=650 | time=07:49:30 | training_speed=0:00:25 | lowest_value=28.921 (epoch 18 and step 446) | Number of epochs remaining for overtraining: 93               
Saved model '/notebooks/Applio/logs/PLK/PLK_26e_676s_best_epoch.pth' (epoch 26 and step 676)                                                                                     
Process Process-1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/notebooks/Applio/rvc/train/train.py", line 403, in run
    train_and_evaluate(
  File "/notebooks/Applio/rvc/train/train.py", line 779, in train_and_evaluate
    os.remove(file)
FileNotFoundError: [Errno 2] No such file or directory: '/notebooks/Applio/logs/PLK/PLK_18e_468s_best_epoch.pth'
Process Process-2:                                                                                                                                                               
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/notebooks/Applio/rvc/train/train.py", line 417, in run
    train_and_evaluate(
  File "/notebooks/Applio/rvc/train/train.py", line 638, in train_and_evaluate
    scaler.scale(loss_disc).backward()
  File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.37.213.40]:11435
Saved index file '/notebooks/Applio/logs/PLK/added_PLK_v2.index'
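
For reference, the log format above suggests early-stop bookkeeping along these lines (a sketch with hypothetical names, not Applio's actual code; a patience of 100 epochs would match "93 remaining" at epoch 25 with the best value at epoch 18):

class OvertrainingDetector:
    def __init__(self, patience_epochs=100):
        self.patience = patience_epochs
        self.lowest_loss = float("inf")
        self.best_epoch = 0

    def update(self, gen_loss, epoch):
        # Track the lowest generator loss seen so far; training stops once
        # `patience` epochs pass without a new best. The "epochs remaining
        # for overtraining" counter in the log counts down to that stop.
        if gen_loss < self.lowest_loss:
            self.lowest_loss = gen_loss
            self.best_epoch = epoch
        return epoch - self.best_epoch >= self.patience  # True -> stop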
BibouOnAir commented 2 months ago

This is the error I get when I try to restart the training:

GIMS | epoch=1 | step=31 | time=07:59:56 | training_speed=0:00:24
Process Process-1:                                                                                                                                                               
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/notebooks/Applio/rvc/train/train.py", line 403, in run
    train_and_evaluate(
  File "/notebooks/Applio/rvc/train/train.py", line 638, in train_and_evaluate
    scaler.scale(loss_disc).backward()
  File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.38.157.218]:40337
Saved index file '/notebooks/Applio/logs/GIMS/added_GIMS_v2.index'