RVC-Project / Retrieval-based-Voice-Conversion-WebUI

Easily train a good VC model with voice data <= 10 mins!
MIT License

Dual GPU - RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer #1557

Closed — hpnyaggerman closed this issue 6 months ago

hpnyaggerman commented 12 months ago

Issue: Training Process Halts with Errors

Environment Specifications:

Steps to Reproduce:

  1. Created a conda environment with Python version 3.10.
  2. Cloned RVC's git repository.
  3. Ran run.sh to install dependencies, then launched the RVC WebUI (see the command sketch after this list).
  4. Navigated to the training tab, performed preprocessing, and then performed feature extraction.
  5. Started the training process.
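
For reference, the steps above correspond roughly to the following commands (the environment name is an example; the clone URL follows from the repository name, and run.sh installs dependencies and starts the WebUI as described above):

conda create -n rvc-webui python=3.10 -y
conda activate rvc-webui
git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI.git
cd Retrieval-based-Voice-Conversion-WebUI
sh ./run.sh   # installs dependencies and launches the WebUI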

Issue Encountered: During training, a series of errors caused the process to halt. The detailed error log is as follows:

2023-11-18 02:32:55 | INFO | __main__ | Use gpus: 0-1
2023-11-18 02:32:55 | INFO | __main__ | "python3" infer/modules/train/train.py -e "model-test" -sr 48k -f0 1 -bs 12 -g 0-1 -te 1000 -se 1 -pg /home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained_v2/f0G48k.pth -pd /home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained_v2/f0D48k.pth -l 0 -c 1 -sw 0 -v v2
INFO:model-test:{'data': {'filter_length': 2048, 'hop_length': 480, 'max_wav_value': 32768.0, 'mel_fmax': None, 'mel_fmin': 0.0, 'n_mel_channels': 128, 'sampling_rate': 48000, 'win_length': 2048, 'training_files': './logs/model-test/filelist.txt'}, 'model': {'filter_channels': 768, 'gin_channels': 256, 'hidden_channels': 192, 'inter_channels': 192, 'kernel_size': 3, 'n_heads': 2, 'n_layers': 6, 'p_dropout': 0, 'resblock': '1', 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'resblock_kernel_sizes': [3, 7, 11], 'spk_embed_dim': 109, 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [24, 20, 4, 4], 'upsample_rates': [12, 10, 2, 2], 'use_spectral_norm': False}, 'train': {'batch_size': 12, 'betas': [0.8, 0.99], 'c_kl': 1.0, 'c_mel': 45, 'epochs': 20000, 'eps': 1e-09, 'fp16_run': True, 'init_lr_ratio': 1, 'learning_rate': 0.0001, 'log_interval': 200, 'lr_decay': 0.999875, 'seed': 1234, 'segment_size': 17280, 'warmup_epochs': 0}, 'model_dir': './logs/model-test', 'experiment_dir': './logs/model-test', 'save_every_epoch': 1, 'name': 'model-test', 'total_epoch': 1000, 'pretrainG': '/home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained_v2/f0G48k.pth', 'pretrainD': '/home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained_v2/f0D48k.pth', 'version': 'v2', 'gpus': '0-1', 'sample_rate': '48k', 'if_f0': 1, 'if_latest': 0, 'save_every_weights': '0', 'if_cache_data_in_gpu': 1}
/home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/.venv/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
/home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/.venv/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
DEBUG:infer.lib.infer_pack.models:gin_channels: 256, self.spk_embed_dim: 109
DEBUG:infer.lib.infer_pack.models:gin_channels: 256, self.spk_embed_dim: 109
INFO:model-test:loaded pretrained /home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained_v2/f0G48k.pth
Process Process-2:
Traceback (most recent call last):
  File "/home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/infer/modules/train/train.py", line 213, in run
    utils.latest_checkpoint_path(hps.model_dir, "D_*.pth"), net_d, optim_d
  File "/home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/infer/lib/train/utils.py", line 213, in latest_checkpoint_path
    x = f_list[-1]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hkyouma/miniconda3/envs/voicegen-rvc/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/hkyouma/miniconda3/envs/voicegen-rvc/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/infer/modules/train/train.py", line 232, in run
    logger.info(
UnboundLocalError: local variable 'logger' referenced before assignment
INFO:model-test:<All keys matched successfully>
INFO:model-test:loaded pretrained /home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/assets/pretrained_v2/f0D48k.pth
INFO:model-test:<All keys matched successfully>
/home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/.venv/lib/python3.10/site-packages/torch/functional.py:650: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:863.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/.venv/lib/python3.10/site-packages/torch/functional.py:650: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:863.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/.venv/lib/python3.10/site-packages/torch/functional.py:650: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:863.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/.venv/lib/python3.10/site-packages/torch/functional.py:650: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:863.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/.venv/lib/python3.10/site-packages/torch/functional.py:650: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:863.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
Process Process-1:
Traceback (most recent call last):
  File "/home/hkyouma/miniconda3/envs/voicegen-rvc/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/hkyouma/miniconda3/envs/voicegen-rvc/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/infer/modules/train/train.py", line 271, in run
    train_and_evaluate(
  File "/home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/infer/modules/train/train.py", line 484, in train_and_evaluate
    scaler.scale(loss_disc).backward()
  File "/home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/.venv/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/hkyouma/ai/voicegen/Retrieval-based-Voice-Conversion-WebUI/.venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:27868

Any assistance or guidance on resolving these errors would be greatly appreciated.

hpnyaggerman commented 12 months ago

The error seems to stem from using two GPUs for training. Using just one makes the issue go away.
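
In case it helps anyone else, this is roughly how I force single-GPU training (a sketch, assuming the standard run.sh launcher and the GPU index field of the training tab; I have not verified it against every version):

CUDA_VISIBLE_DEVICES=0 sh ./run.sh   # expose only GPU 0 so the trainer starts a single worker
# equivalently, enter "0" instead of "0-1" in the GPU index field before starting training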

Irennnne commented 11 months ago

I changed the setting to use only GPU 0, but I still get the same error:

(cvr) kk1@kk1:/media/disk1_ssd/Rvc$ python infer-web.py
2023-11-19 23:53:30 | INFO | configs.config | Found GPU NVIDIA GeForce RTX 3090
is_half:True, device:cuda:0
2023-11-19 23:53:31 | INFO | __main__ | Use Language: en_US
Running on local URL:  http://0.0.0.0:7865
2023-11-19 23:54:12 | INFO | httpx | HTTP Request: POST http://localhost:7865/api/predict "HTTP/1.1 200 OK"
2023-11-19 23:54:12 | INFO | __main__ | "/home/miniconda3/envs/cvr/bin/python" infer/modules/train/preprocess.py "/media/disk1_ssd/voc" 40000 22 "/media/disk1_ssd/Rvc/logs/trial" False 3.0
['infer/modules/train/preprocess.py', '/media/disk1_ssd/voc', '40000', '22', '/media/disk1_ssd/Rvc/logs/trial', 'False', '3.0']
start preprocess
['infer/modules/train/preprocess.py', '/media/disk1_ssd/voc', '40000', '22', '/media/disk1_ssd/Rvc/logs/trial', 'False', '3.0']
/media/disk1_ssd/voc/.DS_Store->Traceback (most recent call last):
  File "/media/disk1_ssd/Rvc/infer/lib/audio.py", line 63, in load_audio
    audio2(f, out, "f32le", sr)
  File "/media/disk1_ssd/Rvc/infer/lib/audio.py", line 34, in audio2
    inp = av.open(i, "rb")
  File "av/container/core.pyx", line 401, in av.container.core.open
  File "av/container/core.pyx", line 265, in av.container.core.Container.__cinit__
  File "av/container/core.pyx", line 285, in av.container.core.Container.err_check
  File "av/error.pyx", line 336, in av.error.err_check
av.error.InvalidDataError: [Errno 1094995529] Invalid data found when processing input: '/media/disk1_ssd/voc/.DS_Store'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "infer/modules/train/preprocess.py", line 87, in pipeline
    audio = load_audio(path, self.sr)
  File "/media/disk1_ssd/Rvc/infer/lib/audio.py", line 73, in load_audio
    raise RuntimeError(traceback.format_exc())
RuntimeError: Traceback (most recent call last):
  File "/media/disk1_ssd/Rvc/infer/lib/audio.py", line 63, in load_audio
    audio2(f, out, "f32le", sr)
  File "/media/disk1_ssd/Rvc/infer/lib/audio.py", line 34, in audio2
    inp = av.open(i, "rb")
  File "av/container/core.pyx", line 401, in av.container.core.open
  File "av/container/core.pyx", line 265, in av.container.core.Container.__cinit__
  File "av/container/core.pyx", line 285, in av.container.core.Container.err_check
  File "av/error.pyx", line 336, in av.error.err_check
av.error.InvalidDataError: [Errno 1094995529] Invalid data found when processing input: '/media/disk1_ssd/voc/.DS_Store'

/media/disk1_ssd/voc/vocal_6.mp3_10.wav->Suc.
/media/disk1_ssd/voc/vocal_5.mp3_10.wav->Suc.
/media/disk1_ssd/voc/vocal_1.mp3_10.wav->Suc.
/media/disk1_ssd/voc/vocal_4.mp3_10.wav->Suc.
/media/disk1_ssd/voc/vocal_3.mp3_10.wav->Suc.
/media/disk1_ssd/voc/vocal_7.mp3_10.wav->Suc.
/media/disk1_ssd/voc/vocal_2.mp3_10.wav->Suc.
end preprocess
2023-11-19 23:54:18 | INFO | __main__ | start preprocess
['infer/modules/train/preprocess.py', '/media/disk1_ssd/voc', '40000', '22', '/media/disk1_ssd/Rvc/logs/trial', 'False', '3.0']
/media/disk1_ssd/voc/.DS_Store->Traceback (most recent call last):
  File "/media/disk1_ssd/Rvc/infer/lib/audio.py", line 63, in load_audio
    audio2(f, out, "f32le", sr)
  File "/media/disk1_ssd/Rvc/infer/lib/audio.py", line 34, in audio2
    inp = av.open(i, "rb")
  File "av/container/core.pyx", line 401, in av.container.core.open
  File "av/container/core.pyx", line 265, in av.container.core.Container.__cinit__
  File "av/container/core.pyx", line 285, in av.container.core.Container.err_check
  File "av/error.pyx", line 336, in av.error.err_check
av.error.InvalidDataError: [Errno 1094995529] Invalid data found when processing input: '/media/disk1_ssd/voc/.DS_Store'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "infer/modules/train/preprocess.py", line 87, in pipeline
    audio = load_audio(path, self.sr)
  File "/media/disk1_ssd/Rvc/infer/lib/audio.py", line 73, in load_audio
    raise RuntimeError(traceback.format_exc())
RuntimeError: Traceback (most recent call last):
  File "/media/disk1_ssd/Rvc/infer/lib/audio.py", line 63, in load_audio
    audio2(f, out, "f32le", sr)
  File "/media/disk1_ssd/Rvc/infer/lib/audio.py", line 34, in audio2
    inp = av.open(i, "rb")
  File "av/container/core.pyx", line 401, in av.container.core.open
  File "av/container/core.pyx", line 265, in av.container.core.Container.__cinit__
  File "av/container/core.pyx", line 285, in av.container.core.Container.err_check
  File "av/error.pyx", line 336, in av.error.err_check
av.error.InvalidDataError: [Errno 1094995529] Invalid data found when processing input: '/media/disk1_ssd/voc/.DS_Store'

/media/disk1_ssd/voc/vocal_6.mp3_10.wav->Suc.
/media/disk1_ssd/voc/vocal_5.mp3_10.wav->Suc.
/media/disk1_ssd/voc/vocal_1.mp3_10.wav->Suc.
/media/disk1_ssd/voc/vocal_4.mp3_10.wav->Suc.
/media/disk1_ssd/voc/vocal_3.mp3_10.wav->Suc.
/media/disk1_ssd/voc/vocal_7.mp3_10.wav->Suc.
/media/disk1_ssd/voc/vocal_2.mp3_10.wav->Suc.
end preprocess

2023-11-19 23:54:18 | INFO | httpx | HTTP Request: POST http://localhost:7865/api/predict "HTTP/1.1 200 OK"
2023-11-19 23:54:18 | INFO | __main__ | "/home/miniconda3/envs/cvr/bin/python" infer/modules/train/extract/extract_f0_rmvpe.py 4 0 0 "/media/disk1_ssd/Rvc/logs/trial" True 
2023-11-19 23:54:18 | INFO | __main__ | "/home/miniconda3/envs/cvr/bin/python" infer/modules/train/extract/extract_f0_rmvpe.py 4 1 1 "/media/disk1_ssd/Rvc/logs/trial" True 
2023-11-19 23:54:18 | INFO | __main__ | "/home/miniconda3/envs/cvr/bin/python" infer/modules/train/extract/extract_f0_rmvpe.py 4 2 0 "/media/disk1_ssd/Rvc/logs/trial" True 
2023-11-19 23:54:18 | INFO | __main__ | "/home/miniconda3/envs/cvr/bin/python" infer/modules/train/extract/extract_f0_rmvpe.py 4 3 1 "/media/disk1_ssd/Rvc/logs/trial" True 
['infer/modules/train/extract/extract_f0_rmvpe.py', '4', '0', '0', '/media/disk1_ssd/Rvc/logs/trial', 'True']
todo-f0-82
f0ing,now-0,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/1_0.wav
['infer/modules/train/extract/extract_f0_rmvpe.py', '4', '3', '1', '/media/disk1_ssd/Rvc/logs/trial', 'True']
todo-f0-81
f0ing,now-0,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/1_12.wav
['infer/modules/train/extract/extract_f0_rmvpe.py', '4', '1', '1', '/media/disk1_ssd/Rvc/logs/trial', 'True']
todo-f0-82
f0ing,now-0,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/1_10.wav
['infer/modules/train/extract/extract_f0_rmvpe.py', '4', '2', '0', '/media/disk1_ssd/Rvc/logs/trial', 'True']
todo-f0-81
f0ing,now-0,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/1_11.wav
Loading rmvpe model
Loading rmvpe model
Loading rmvpe model
Loading rmvpe model
f0ing,now-16,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_15.wav
f0ing,now-16,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_17.wav
f0ing,now-32,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_95.wav
f0ing,now-16,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_16.wav
f0ing,now-16,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_148.wav
f0ing,now-32,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_98.wav
f0ing,now-48,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/4_19.wav
f0ing,now-32,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_94.wav
f0ing,now-32,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_96.wav
f0ing,now-48,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/4_20.wav
f0ing,now-64,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_10.wav
f0ing,now-48,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/4_18.wav
f0ing,now-64,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_12.wav
f0ing,now-48,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/4_2.wav
f0ing,now-80,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_90.wav
f0ing,now-64,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_1.wav
f0ing,now-64,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_11.wav
f0ing,now-80,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_92.wav
f0ing,now-80,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_9.wav
f0ing,now-80,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_91.wav
2023-11-19 23:54:26 | INFO | __main__ | ['infer/modules/train/extract/extract_f0_rmvpe.py', '4', '0', '0', '/media/disk1_ssd/Rvc/logs/trial', 'True']
todo-f0-82
f0ing,now-0,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/1_0.wav
['infer/modules/train/extract/extract_f0_rmvpe.py', '4', '3', '1', '/media/disk1_ssd/Rvc/logs/trial', 'True']
todo-f0-81
f0ing,now-0,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/1_12.wav
['infer/modules/train/extract/extract_f0_rmvpe.py', '4', '1', '1', '/media/disk1_ssd/Rvc/logs/trial', 'True']
todo-f0-82
f0ing,now-0,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/1_10.wav
['infer/modules/train/extract/extract_f0_rmvpe.py', '4', '2', '0', '/media/disk1_ssd/Rvc/logs/trial', 'True']
todo-f0-81
f0ing,now-0,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/1_11.wav
f0ing,now-16,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_15.wav
f0ing,now-16,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_17.wav
f0ing,now-32,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_95.wav
f0ing,now-16,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_16.wav
f0ing,now-16,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_148.wav
f0ing,now-32,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_98.wav
f0ing,now-48,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/4_19.wav
f0ing,now-32,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_94.wav
f0ing,now-32,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_96.wav
f0ing,now-48,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/4_20.wav
f0ing,now-64,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_10.wav
f0ing,now-48,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/4_18.wav
f0ing,now-64,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_12.wav
f0ing,now-48,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/4_2.wav
f0ing,now-80,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_90.wav
f0ing,now-64,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_1.wav
f0ing,now-64,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_11.wav
f0ing,now-80,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_92.wav
f0ing,now-80,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_9.wav
f0ing,now-80,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_91.wav

2023-11-19 23:54:26 | INFO | __main__ | "/home/miniconda3/envs/cvr/bin/python" infer/modules/train/extract_feature_print.py cuda:0 1 0 0 "/media/disk1_ssd/Rvc/logs/trial" v2
['infer/modules/train/extract_feature_print.py', 'cuda:0', '1', '0', '0', '/media/disk1_ssd/Rvc/logs/trial', 'v2']
/media/disk1_ssd/Rvc/logs/trial
load model(s) from assets/hubert/hubert_base.pt
2023-11-19 23:54:26 | INFO | fairseq.tasks.hubert_pretraining | current directory is /media/disk1_ssd/Rvc
2023-11-19 23:54:26 | INFO | fairseq.tasks.hubert_pretraining | HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': 'metadata', 'fine_tuning': False, 'labels': ['km'], 'label_dir': 'label', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': False, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
2023-11-19 23:54:26 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}
move model to cuda
all-feature-326
now-326,all-0,1_0.wav,(149, 768)
now-326,all-32,2_11.wav,(155, 768)
now-326,all-64,2_148.wav,(134, 768)
now-326,all-96,2_56.wav,(105, 768)
now-326,all-128,2_94.wav,(149, 768)
now-326,all-160,3_42.wav,(149, 768)
now-326,all-192,4_18.wav,(149, 768)
now-326,all-224,5_16.wav,(149, 768)
now-326,all-256,7_1.wav,(149, 768)
now-326,all-288,7_50.wav,(113, 768)
now-326,all-320,7_9.wav,(140, 768)
all-feature-done
2023-11-19 23:54:33 | INFO | __main__ | ['infer/modules/train/extract/extract_f0_rmvpe.py', '4', '0', '0', '/media/disk1_ssd/Rvc/logs/trial', 'True']
todo-f0-82
f0ing,now-0,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/1_0.wav
['infer/modules/train/extract/extract_f0_rmvpe.py', '4', '3', '1', '/media/disk1_ssd/Rvc/logs/trial', 'True']
todo-f0-81
f0ing,now-0,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/1_12.wav
['infer/modules/train/extract/extract_f0_rmvpe.py', '4', '1', '1', '/media/disk1_ssd/Rvc/logs/trial', 'True']
todo-f0-82
f0ing,now-0,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/1_10.wav
['infer/modules/train/extract/extract_f0_rmvpe.py', '4', '2', '0', '/media/disk1_ssd/Rvc/logs/trial', 'True']
todo-f0-81
f0ing,now-0,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/1_11.wav
f0ing,now-16,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_15.wav
f0ing,now-16,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_17.wav
f0ing,now-32,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_95.wav
f0ing,now-16,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_16.wav
f0ing,now-16,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_148.wav
f0ing,now-32,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_98.wav
f0ing,now-48,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/4_19.wav
f0ing,now-32,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_94.wav
f0ing,now-32,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/2_96.wav
f0ing,now-48,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/4_20.wav
f0ing,now-64,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_10.wav
f0ing,now-48,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/4_18.wav
f0ing,now-64,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_12.wav
f0ing,now-48,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/4_2.wav
f0ing,now-80,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_90.wav
f0ing,now-64,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_1.wav
f0ing,now-64,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_11.wav
f0ing,now-80,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_92.wav
f0ing,now-80,all-82,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_9.wav
f0ing,now-80,all-81,-/media/disk1_ssd/Rvc/logs/trial/1_16k_wavs/7_91.wav
['infer/modules/train/extract_feature_print.py', 'cuda:0', '1', '0', '0', '/media/disk1_ssd/Rvc/logs/trial', 'v2']
/media/disk1_ssd/Rvc/logs/trial
load model(s) from assets/hubert/hubert_base.pt
move model to cuda
all-feature-326
now-326,all-0,1_0.wav,(149, 768)
now-326,all-32,2_11.wav,(155, 768)
now-326,all-64,2_148.wav,(134, 768)
now-326,all-96,2_56.wav,(105, 768)
now-326,all-128,2_94.wav,(149, 768)
now-326,all-160,3_42.wav,(149, 768)
now-326,all-192,4_18.wav,(149, 768)
now-326,all-224,5_16.wav,(149, 768)
now-326,all-256,7_1.wav,(149, 768)
now-326,all-288,7_50.wav,(113, 768)
now-326,all-320,7_9.wav,(140, 768)
all-feature-done

2023-11-19 23:54:33 | INFO | httpx | HTTP Request: POST http://localhost:7865/api/predict "HTTP/1.1 200 OK"
2023-11-19 23:54:33 | INFO | __main__ | Use gpus: 0
2023-11-19 23:54:33 | INFO | __main__ | "/home/miniconda3/envs/cvr/bin/python" infer/modules/train/train.py -e "trial" -sr 40k -f0 1 -bs 12 -g 0 -te 20 -se 5 -pg assets/pretrained_v2/f0G40k.pth -pd assets/pretrained_v2/f0D40k.pth -l 0 -c 0 -sw 0 -v v2
INFO:trial:{'data': {'filter_length': 2048, 'hop_length': 400, 'max_wav_value': 32768.0, 'mel_fmax': None, 'mel_fmin': 0.0, 'n_mel_channels': 125, 'sampling_rate': 40000, 'win_length': 2048, 'training_files': './logs/trial/filelist.txt'}, 'model': {'filter_channels': 768, 'gin_channels': 256, 'hidden_channels': 192, 'inter_channels': 192, 'kernel_size': 3, 'n_heads': 2, 'n_layers': 6, 'p_dropout': 0, 'resblock': '1', 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'resblock_kernel_sizes': [3, 7, 11], 'spk_embed_dim': 109, 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'upsample_rates': [10, 10, 2, 2], 'use_spectral_norm': False}, 'train': {'batch_size': 12, 'betas': [0.8, 0.99], 'c_kl': 1.0, 'c_mel': 45, 'epochs': 20000, 'eps': 1e-09, 'fp16_run': False, 'init_lr_ratio': 1, 'learning_rate': 0.0001, 'log_interval': 200, 'lr_decay': 0.999875, 'seed': 1234, 'segment_size': 12800, 'warmup_epochs': 0}, 'model_dir': './logs/trial', 'experiment_dir': './logs/trial', 'save_every_epoch': 5, 'name': 'trial', 'total_epoch': 20, 'pretrainG': 'assets/pretrained_v2/f0G40k.pth', 'pretrainD': 'assets/pretrained_v2/f0D40k.pth', 'version': 'v2', 'gpus': '0', 'sample_rate': '40k', 'if_f0': 1, 'if_latest': 0, 'save_every_weights': '0', 'if_cache_data_in_gpu': 0}
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Process Process-2:
Traceback (most recent call last):
  File "/home/miniconda3/envs/cvr/lib/python3.8/multiprocessing/process.py", line 313, in _bootstrap
    self.run()
  File "/home/miniconda3/envs/cvr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/media/disk1_ssd/Rvc/infer/modules/train/train.py", line 137, in run
    torch.cuda.set_device(rank)
  File "/home/miniconda3/envs/cvr/lib/python3.8/site-packages/torch/cuda/__init__.py", line 314, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
DEBUG:infer.lib.infer_pack.models:gin_channels: 256, self.spk_embed_dim: 109
Process Process-1:
Traceback (most recent call last):
  File "/home/miniconda3/envs/cvr/lib/python3.8/multiprocessing/process.py", line 313, in _bootstrap
    self.run()
  File "/home/miniconda3/envs/cvr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/media/disk1_ssd/Rvc/infer/modules/train/train.py", line 205, in run
    net_g = DDP(net_g, device_ids=[rank])
  File "/home/miniconda3/envs/cvr/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/miniconda3/envs/cvr/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [fca1:de0c:6dd9:56e9:448a::1]:21721
jnjnnjzch commented 11 months ago

My solution: train with a single GPU for one or more epochs to create a checkpoint, then continue training on multiple GPUs. It seems RVC can train on multiple GPUs when it loads an existing checkpoint.
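
Roughly, the workflow looks like this (the experiment name "model-test" and the checkpoint patterns come from the log earlier in this thread; adjust to your own experiment):

# 1) train at least one epoch with a single GPU (e.g. "0" in the GPU field)
# 2) check that the generator/discriminator checkpoints were written:
ls logs/model-test/G_*.pth logs/model-test/D_*.pth
# 3) restart training with "0-1" in the GPU field; it should now resume from
#    the latest checkpoint instead of failing in latest_checkpoint_path with
#    the IndexError shown above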

ospreyclaw commented 10 months ago

I have the same problem. Has anyone fixed it?

1044690543 commented 9 months ago

I also have the same problem and am trying to solve it.

Irennnne commented 9 months ago

I have solved this problem; I think adding CUDA_LAUNCH_BLOCKING=1 is what made it work.
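
For anyone trying the same thing, I just set the variable when launching the WebUI (same launch command as in my log above). CUDA_LAUNCH_BLOCKING=1 mainly forces synchronous CUDA calls so errors are reported at the real call site; whether it fixes the root cause or only changes timing, I cannot say for certain:

CUDA_LAUNCH_BLOCKING=1 python infer-web.py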

1044690543 commented 9 months ago

I also had the same problem and was trying to solve it.

Using the latest tag solved the problem.
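
In case it is not obvious, by "latest tag" I mean checking out the newest release tag and reinstalling if needed (the exact tag name depends on the current release list, so I am not hard-coding one here):

git fetch --tags
git tag --sort=-creatordate | head -n 1   # show the newest tag
git checkout <newest-tag>                 # replace with the tag printed above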

vitt95 commented 9 months ago

same error for me, did anyone find a solution ?

ymmbb8882ymmbb commented 8 months ago

> I also had the same problem and was trying to solve it.
>
> Using the latest tag solved the problem.

I used the tar package from October and still ran into this problem, and it still does not support multi-GPU training: IndexError: list index out of range

github-actions[bot] commented 6 months ago

This issue was closed because it has been inactive for 15 days since being marked as stale.