AlexandaJerry / whisper-vits-japanese

VITS Japanese with Whisper as the data processor (you can train your own VITS even if you only have audio files)

Error when resuming training from a checkpoint #14

Open Hyatt-L opened 1 year ago

Hyatt-L commented 1 year ago

Here is part of the log (screenshot). Could the problem be that the first training run never fully finished?

[INFO] {'train': {'log_interval': 200, 'eval_interval': 1000, 'seed': 1234, 'epochs': 800, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 24, 'fp16_run': True, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'training_files': 'filelists/train_filelist.txt.cleaned', 'validation_files': 'filelists/val_filelist.txt.cleaned', 'text_cleaners': ['japanese_cleaners'], 'max_wav_value': 32768.0, 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 0, 'cleaned_text': True}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False}, 'model_dir': './logs/isla_base'}
2023-04-14 13:21:23.301526: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-14 13:21:24.298365: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py:563: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
./logs/isla_base/G_0.pth
[INFO] Loaded checkpoint './logs/isla_base/G_0.pth' (iteration 1)
./logs/isla_base/D_0.pth
[INFO] Loaded checkpoint './logs/isla_base/D_0.pth' (iteration 1)
/usr/local/lib/python3.9/dist-packages/torch/functional.py:606: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:803.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/usr/local/lib/python3.9/dist-packages/torch/functional.py:606: UserWarning: ComplexHalf support is experimental and many operators don't support it yet. (Triggered internally at ../aten/src/ATen/EmptyTensor.cpp:32.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/usr/local/lib/python3.9/dist-packages/torch/autograd/__init__.py:173: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance. grad.sizes() = [1, 9, 96], strides() = [51936, 96, 1] bucket_view.sizes() = [1, 9, 96], strides() = [864, 96, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:326.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[INFO] Train Epoch: 1 [0%]
[INFO] [6.065904140472412, 6.065133094787598, 0.47868022322654724, 108.19261169433594, 1.6783794164657593, 228.80638122558594, 0, 0.0002]
/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py:563: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
[INFO] Saving model and optimizer state at iteration 1 to ./logs/isla_base/G_0.pth
[INFO] Saving model and optimizer state at iteration 1 to ./logs/isla_base/D_0.pth
[INFO] ====> Epoch: 1
[INFO] ====> Epoch: 2
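
(For context, the "Loaded checkpoint ... (iteration N)" lines above come from a VITS-style checkpoint loader roughly like the minimal sketch below. The dictionary keys 'model', 'optimizer', 'learning_rate', and 'iteration' are assumptions based on the upstream VITS utils.py; this fork's code may differ.)

```python
import torch

def load_checkpoint(checkpoint_path, model, optimizer=None):
    # Minimal sketch of a VITS-style checkpoint loader. The keys below
    # ('model', 'optimizer', 'learning_rate', 'iteration') follow the
    # upstream VITS utils.py convention and are assumptions for this fork.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    iteration = checkpoint["iteration"]
    learning_rate = checkpoint["learning_rate"]
    if optimizer is not None:
        optimizer.load_state_dict(checkpoint["optimizer"])
    model.load_state_dict(checkpoint["model"])
    print(f"Loaded checkpoint '{checkpoint_path}' (iteration {iteration})")
    return model, optimizer, learning_rate, iteration
```

Separately, the repeated DataLoader warning in the log is only a performance note: on this 2-core machine it suggests lowering the number of worker processes (via the script's num_workers setting, if exposed), but it would not by itself break checkpoint resuming.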

AlexandaJerry commented 1 year ago

There's no error here; training is clearly up and running. If resuming from a checkpoint fails, you should check whether the G and D checkpoints from the previous run were saved properly.
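
A quick way to run that check is the sketch below: locate the newest G_*.pth / D_*.pth files and confirm they load cleanly. The ./logs/isla_base path and the G_/D_ naming are taken from the log above; the 'iteration' key is assumed from upstream VITS utils.py.

```python
import glob
import os
import re

import torch

def latest_checkpoint(model_dir, prefix):
    # Find the G_*.pth / D_*.pth file with the highest global step,
    # e.g. G_19000.pth. Returns None if nothing was saved.
    paths = glob.glob(os.path.join(model_dir, f"{prefix}_*.pth"))
    steps = [(int(re.search(r"_(\d+)\.pth$", p).group(1)), p) for p in paths]
    return max(steps)[1] if steps else None

for prefix in ("G", "D"):
    path = latest_checkpoint("./logs/isla_base", prefix)
    if path is None:
        print(f"No {prefix} checkpoint found -- nothing to resume from")
    else:
        # torch.load will raise if the file is truncated or corrupt
        ckpt = torch.load(path, map_location="cpu")
        print(path, "iteration:", ckpt.get("iteration"))
```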

AlexandaJerry commented 1 year ago

I just spent over half an hour testing this again. How is it not working? It clearly restores and resumes training from the checkpoint just fine???

[INFO] {'train': {'log_interval': 100, 'eval_interval': 100, 'seed': 1234, 'epochs': 800, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 24, 'fp16_run': True, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'training_files': 'filelists/train_filelist.txt.cleaned', 'validation_files': 'filelists/val_filelist.txt.cleaned', 'text_cleaners': ['japanese_cleaners'], 'max_wav_value': 32768.0, 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 0, 'cleaned_text': True}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False}, 'model_dir': './logs/isla_base'}
2023-04-14 14:59:30.046027: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-14 14:59:31.298123: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py:563: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
./logs/isla_base/G_19000.pth
[INFO] Loaded checkpoint './logs/isla_base/G_19000.pth' (iteration 214)
./logs/isla_base/D_19000.pth
[INFO] Loaded checkpoint './logs/isla_base/D_19000.pth' (iteration 214)
/usr/local/lib/python3.9/dist-packages/torch/functional.py:606: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:803.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/usr/local/lib/python3.9/dist-packages/torch/functional.py:606: UserWarning: ComplexHalf support is experimental and many operators don't support it yet. (Triggered internally at ../aten/src/ATen/EmptyTensor.cpp:32.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/usr/local/lib/python3.9/dist-packages/torch/autograd/__init__.py:173: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance. grad.sizes() = [1, 9, 96], strides() = [44640, 96, 1] bucket_view.sizes() = [1, 9, 96], strides() = [864, 96, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:326.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[INFO] ====> Epoch: 214
[INFO] ====> Epoch: 215
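
One detail in this log worth noting: the checkpoint file name carries the global step (G_19000.pth), while the stored 'iteration' field carries the epoch (214). That is the convention in upstream VITS train.py, and the fact that training resumes at "====> Epoch: 214" suggests this fork does the same. A hedged sketch of that save path, under those assumptions:

```python
import os

import torch

def save_checkpoint(model, optimizer, learning_rate, epoch, global_step,
                    model_dir, prefix):
    # Sketch of the upstream-VITS save convention (assumed to match this
    # fork): the file name uses the global step, while the 'iteration'
    # field stores the epoch -- hence "Loaded checkpoint 'G_19000.pth'
    # (iteration 214)" followed by "====> Epoch: 214" on resume.
    path = os.path.join(model_dir, f"{prefix}_{global_step}.pth")
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "learning_rate": learning_rate,
        "iteration": epoch,  # epoch is stored under the 'iteration' key
    }, path)
    print(f"Saving model and optimizer state at iteration {epoch} to {path}")
```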