AlexandaJerry / whisper-vits-japanese

VITS Japanese with Whisper as the data processor (you can train your own VITS even if you only have audio files)

Error when resuming training from a checkpoint #14

Open Hyatt-L opened 1 year ago

Hyatt-L commented 1 year ago

Here is part of the log (screenshot). Could the problem be that the first training run never fully finished?

[INFO] {'train': {'log_interval': 200, 'eval_interval': 1000, 'seed': 1234, 'epochs': 800, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 24, 'fp16_run': True, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'training_files': 'filelists/train_filelist.txt.cleaned', 'validation_files': 'filelists/val_filelist.txt.cleaned', 'text_cleaners': ['japanese_cleaners'], 'max_wav_value': 32768.0, 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 0, 'cleaned_text': True}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False}, 'model_dir': './logs/isla_base'}
2023-04-14 13:21:23.301526: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-14 13:21:24.298365: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py:563: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
./logs/isla_base/G_0.pth
[INFO] Loaded checkpoint './logs/isla_base/G_0.pth' (iteration 1)
./logs/isla_base/D_0.pth
[INFO] Loaded checkpoint './logs/isla_base/D_0.pth' (iteration 1)
/usr/local/lib/python3.9/dist-packages/torch/functional.py:606: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:803.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/usr/local/lib/python3.9/dist-packages/torch/functional.py:606: UserWarning: ComplexHalf support is experimental and many operators don't support it yet. (Triggered internally at ../aten/src/ATen/EmptyTensor.cpp:32.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/usr/local/lib/python3.9/dist-packages/torch/autograd/__init__.py:173: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance. grad.sizes() = [1, 9, 96], strides() = [51936, 96, 1] bucket_view.sizes() = [1, 9, 96], strides() = [864, 96, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:326.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[INFO] Train Epoch: 1 [0%]
[INFO] [6.065904140472412, 6.065133094787598, 0.47868022322654724, 108.19261169433594, 1.6783794164657593, 228.80638122558594, 0, 0.0002]
/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py:563: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
[INFO] Saving model and optimizer state at iteration 1 to ./logs/isla_base/G_0.pth
[INFO] Saving model and optimizer state at iteration 1 to ./logs/isla_base/D_0.pth
[INFO] ====> Epoch: 1
[INFO] ====> Epoch: 2
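
(For context, the "Loaded checkpoint ... (iteration N)" lines above come from a VITS-style checkpoint loader roughly like the minimal sketch below. The dictionary keys 'model', 'optimizer', 'learning_rate', and 'iteration' are assumptions based on the upstream VITS utils.py; this fork's code may differ.)

```python
import torch

def load_checkpoint(checkpoint_path, model, optimizer=None):
    # Minimal sketch of a VITS-style checkpoint loader. The keys below
    # ('model', 'optimizer', 'learning_rate', 'iteration') follow the
    # upstream VITS utils.py convention and are assumptions for this fork.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    iteration = checkpoint["iteration"]
    learning_rate = checkpoint["learning_rate"]
    if optimizer is not None:
        optimizer.load_state_dict(checkpoint["optimizer"])
    model.load_state_dict(checkpoint["model"])
    print(f"Loaded checkpoint '{checkpoint_path}' (iteration {iteration})")
    return model, optimizer, learning_rate, iteration
```

Separately, the repeated DataLoader warning in the log is only a performance note: on this 2-core machine it suggests lowering the number of worker processes (via the script's num_workers setting, if exposed), but it would not by itself break checkpoint resuming.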

AlexandaJerry commented 1 year ago

There's no error here; training is clearly up and running. If resuming from a checkpoint fails, you should check whether the G and D checkpoints from the previous run were saved properly.
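
A quick way to run that check is the sketch below: locate the newest G_*.pth / D_*.pth files and confirm they load cleanly. The ./logs/isla_base path and the G_/D_ naming are taken from the log above; the 'iteration' key is assumed from upstream VITS utils.py.

```python
import glob
import os
import re

import torch

def latest_checkpoint(model_dir, prefix):
    # Find the G_*.pth / D_*.pth file with the highest global step,
    # e.g. G_19000.pth. Returns None if nothing was saved.
    paths = glob.glob(os.path.join(model_dir, f"{prefix}_*.pth"))
    steps = [(int(re.search(r"_(\d+)\.pth$", p).group(1)), p) for p in paths]
    return max(steps)[1] if steps else None

for prefix in ("G", "D"):
    path = latest_checkpoint("./logs/isla_base", prefix)
    if path is None:
        print(f"No {prefix} checkpoint found -- nothing to resume from")
    else:
        # torch.load will raise if the file is truncated or corrupt
        ckpt = torch.load(path, map_location="cpu")
        print(path, "iteration:", ckpt.get("iteration"))
```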

AlexandaJerry commented 1 year ago

I just spent over half an hour testing this again. How is it not working? It clearly restores and resumes training from the checkpoint just fine???

[INFO] {'train': {'log_interval': 100, 'eval_interval': 100, 'seed': 1234, 'epochs': 800, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 24, 'fp16_run': True, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'training_files': 'filelists/train_filelist.txt.cleaned', 'validation_files': 'filelists/val_filelist.txt.cleaned', 'text_cleaners': ['japanese_cleaners'], 'max_wav_value': 32768.0, 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 0, 'cleaned_text': True}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'n_layers_q': 3, 'use_spectral_norm': False}, 'model_dir': './logs/isla_base'}
2023-04-14 14:59:30.046027: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-14 14:59:31.298123: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py:563: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
./logs/isla_base/G_19000.pth
[INFO] Loaded checkpoint './logs/isla_base/G_19000.pth' (iteration 214)
./logs/isla_base/D_19000.pth
[INFO] Loaded checkpoint './logs/isla_base/D_19000.pth' (iteration 214)
/usr/local/lib/python3.9/dist-packages/torch/functional.py:606: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:803.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/usr/local/lib/python3.9/dist-packages/torch/functional.py:606: UserWarning: ComplexHalf support is experimental and many operators don't support it yet. (Triggered internally at ../aten/src/ATen/EmptyTensor.cpp:32.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/usr/local/lib/python3.9/dist-packages/torch/autograd/__init__.py:173: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance. grad.sizes() = [1, 9, 96], strides() = [44640, 96, 1] bucket_view.sizes() = [1, 9, 96], strides() = [864, 96, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:326.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[INFO] ====> Epoch: 214
[INFO] ====> Epoch: 215
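
One detail in this log worth noting: the checkpoint file name carries the global step (G_19000.pth), while the stored 'iteration' field carries the epoch (214). That is the convention in upstream VITS train.py, and the fact that training resumes at "====> Epoch: 214" suggests this fork does the same. A hedged sketch of that save path, under those assumptions:

```python
import os

import torch

def save_checkpoint(model, optimizer, learning_rate, epoch, global_step,
                    model_dir, prefix):
    # Sketch of the upstream-VITS save convention (assumed to match this
    # fork): the file name uses the global step, while the 'iteration'
    # field stores the epoch -- hence "Loaded checkpoint 'G_19000.pth'
    # (iteration 214)" followed by "====> Epoch: 214" on resume.
    path = os.path.join(model_dir, f"{prefix}_{global_step}.pth")
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "learning_rate": learning_rate,
        "iteration": epoch,  # epoch is stored under the 'iteration' key
    }, path)
    print(f"Saving model and optimizer state at iteration {epoch} to {path}")
```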