MasayaKawamura / MB-iSTFT-VITS

Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform

'terminate called without an active exception' during training #21

Closed: katya-bateeva closed this issue 1 year ago

katya-bateeva commented 1 year ago

When I start training (on the LJSpeech dataset), the run is terminated during the first epoch. The log output is:

```
python train_latest.py -c configs/ljs_mini_mb_istft_vits.json -m ljs_mini_mb_istft_vits
[INFO] {'train': {'log_interval': 200, 'eval_interval': 10000, 'seed': 1234, 'epochs': 20000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 4, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0, 'fft_sizes': [384, 683, 171], 'hop_sizes': [30, 60, 10], 'win_lengths': [150, 300, 60], 'window': 'hann_window'}, 'data': {'training_files': 'filelists/ljs_audio_text_train_filelist.txt.cleaned', 'validation_files': 'filelists/ljs_audio_text_val_filelist.txt.cleaned', 'text_cleaners': ['english_cleaners2'], 'max_wav_value': 32768.0, 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 0, 'cleaned_text': True}, 'model': {'ms_istft_vits': False, 'mb_istft_vits': True, 'istft_vits': False, 'subbands': 4, 'gen_istft_n_fft': 16, 'gen_istft_hop_size': 4, 'inter_channels': 192, 'hidden_channels': 96, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 3, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [4, 4], 'upsample_initial_channel': 256, 'upsample_kernel_sizes': [16, 16], 'n_layers_q': 3, 'use_spectral_norm': False, 'use_sdp': False}, 'model_dir': './logs/ljs_mini_mb_istft_vits'}
Mutli-band iSTFT VITS
Mutli-band iSTFT VITS
./logs/ljs_mini_mb_istft_vits/G_0.pth
./logs/ljs_mini_mb_istft_vits/G_0.pth
[INFO] Loaded checkpoint './logs/ljs_mini_mb_istft_vits/G_0.pth' (iteration 1)
./logs/ljs_mini_mb_istft_vits/D_0.pth
./logs/ljs_mini_mb_istft_vits/D_0.pth
[INFO] Loaded checkpoint './logs/ljs_mini_mb_istft_vits/D_0.pth' (iteration 1)
/home/kbateeva/Projects/TTS/MB-iSTFT-VITS/venv/lib/python3.10/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance. grad.sizes() = [1, 9, 48], strides() = [17904, 48, 1] bucket_view.sizes() = [1, 9, 48], strides() = [432, 48, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/home/kbateeva/Projects/TTS/MB-iSTFT-VITS/venv/lib/python3.10/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance. grad.sizes() = [1, 9, 48], strides() = [17712, 48, 1] bucket_view.sizes() = [1, 9, 48], strides() = [432, 48, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[INFO] Train Epoch: 1 [0%]
[INFO] [4.6328253746032715, 2.759337902069092, 0.28215909004211426, 111.24901580810547, 0.9382677674293518, 90.15510559082031, 4.346133232116699, 0, 0.0002]
terminate called without an active exception
[INFO] Saving model and optimizer state at iteration 1 to ./logs/ljs_mini_mb_istft_vits/G_0.pth
[INFO] Saving model and optimizer state at iteration 1 to ./logs/ljs_mini_mb_istft_vits/D_0.pth
```

Has anybody else encountered this 'terminate called without an active exception' error?
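
For anyone debugging a similar abort: 'terminate called without an active exception' is a C++-level message (a thread torn down without a clean shutdown), so it typically comes from something like NCCL/DDP or a worker process rather than from Python itself. One first check worth doing is listing the CUDA devices the training process can see, which turns out to be relevant to the answer below. A minimal sketch, assuming a standard PyTorch install:

```python
import torch

# Enumerate every CUDA device visible to this process. Mismatched
# device names here are a red flag for multi-GPU (DDP/NCCL) training.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {props.name}, "
          f"{props.total_memory / 1024**3:.1f} GiB, "
          f"compute capability {props.major}.{props.minor}")
```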

katya-bateeva commented 1 year ago

In my case, it was caused by having GPUs of different models in the machine: parallel (multi-GPU) training cannot be started across mismatched GPUs.
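
If the machine does have mixed GPU models, one common workaround (an assumption about this setup, not something from this repo's docs) is to expose only a set of identical GPUs to the trainer via the `CUDA_VISIBLE_DEVICES` environment variable. A sketch, with device indices 0 and 1 standing in for a matching pair:

```python
import os
import subprocess
import sys

# Hypothetical launcher: restrict the run to two identical GPUs
# (the indices are placeholders) and start training as in the log above.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0,1")
subprocess.run(
    [sys.executable, "train_latest.py",
     "-c", "configs/ljs_mini_mb_istft_vits.json",
     "-m", "ljs_mini_mb_istft_vits"],
    env=env,
    check=True,
)
```

Equivalently, prefixing the shell command with `CUDA_VISIBLE_DEVICES=0,1` does the same thing; the point is that DDP then only ever sees one GPU model.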