coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] VITS gpu utilization #3710

Closed · maryawwm closed this issue 2 days ago

maryawwm commented 2 months ago

Describe the bug

I'm training a VITS model (Persian and English). My dataset consists of audio clips ranging from 1 to 25 seconds. I'm training on an A100 GPU, but most of the time GPU memory usage is below half and utilization is lower than I expect.

(Screenshot: 2024-04-28 091235)
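
For reference, a minimal way to quantify the "memory is not even half" observation from inside the training process (a generic PyTorch sketch, not part of the Coqui trainer; the function name and where you call it are up to you):

    import torch

    def log_gpu_memory(tag: str = "", device: int = 0) -> None:
        """Print allocated/reserved GPU memory versus total capacity."""
        total = torch.cuda.get_device_properties(device).total_memory
        allocated = torch.cuda.memory_allocated(device)  # memory held by live tensors
        reserved = torch.cuda.memory_reserved(device)    # memory cached by the allocator
        print(
            f"[{tag}] allocated {allocated / 1e9:.1f} GB / "
            f"reserved {reserved / 1e9:.1f} GB / total {total / 1e9:.1f} GB"
        )

    # e.g. call log_gpu_memory("after step") periodically inside the training loop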

To Reproduce

I modified my code based on this recipe script from the Coqui TTS library:

https://github.com/coqui-ai/TTS/blob/dev/recipes/multilingual/vits_tts/train_vits_tts_phonemes.py

and these are the parameters that I set:

    audio_config = VitsAudioConfig(
        sample_rate=16000, win_length=1024, hop_length=256,
        num_mels=80, mel_fmin=0, mel_fmax=None,
    )

    vitsArgs = VitsArgs(
        use_language_embedding=True, embedded_language_dim=2,
        use_speaker_embedding=True, use_sdp=False,
    )

    config = VitsConfig(
        model_args=vitsArgs,
        audio=audio_config,
        run_name="A6_vits_multi_language_10_spk_5_ordibehesht",
        use_speaker_embedding=True,
        batch_size=48, eval_batch_size=32, batch_group_size=128,
        num_loader_workers=12, num_eval_loader_workers=8, precompute_num_workers=12,
        run_eval=True, test_delay_epochs=-1, epochs=1000,
        text_cleaner="multilingual_cleaners",
        use_phonemes=True, phoneme_language=None, phonemizer="multi_phonemizer",
        phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
        compute_input_seq_cache=True,
        print_step=25, use_language_weighted_sampler=True, print_eval=False,
        mixed_precision=True,
        output_path=output_path,
        datasets=dataset_config,
        cudnn_enable=True, cudnn_benchmark=True, cudnn_deterministic=True,
    )
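
For context, the linked recipe wires a config like this into the model and the Trainer roughly as follows (a condensed sketch based on the recipe's structure in TTS 0.17.x; `dataset_config` and `output_path` are assumed to be defined as in the recipe, and exact signatures may vary between versions):

    from trainer import Trainer, TrainerArgs

    from TTS.tts.datasets import load_tts_samples
    from TTS.tts.models.vits import Vits
    from TTS.tts.utils.languages import LanguageManager
    from TTS.tts.utils.speakers import SpeakerManager
    from TTS.tts.utils.text.tokenizer import TTSTokenizer
    from TTS.utils.audio import AudioProcessor

    # load the train/eval splits and initialize audio processing + tokenization from the config
    train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)
    ap = AudioProcessor.init_from_config(config)
    tokenizer, config = TTSTokenizer.init_from_config(config)

    # speaker and language managers for the multi-speaker / multilingual setup
    speaker_manager = SpeakerManager()
    speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")
    config.model_args.num_speakers = speaker_manager.num_speakers

    language_manager = LanguageManager(config=config)
    config.model_args.num_languages = language_manager.num_languages

    model = Vits(config, ap, tokenizer, speaker_manager=speaker_manager, language_manager=language_manager)

    # the Trainer owns the DataLoaders, the AMP scaler, and the train/eval loop
    trainer = Trainer(
        TrainerArgs(), config, output_path,
        model=model, train_samples=train_samples, eval_samples=eval_samples,
    )
    trainer.fit()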

Expected behavior

Higher GPU utilization and faster training time.

Logs

Here is the log from one of my steps:

   --> TIME: 2024-04-27 09:15:52 -- STEP: 124/3006 -- GLOBAL_STEP: 1750125
     | > loss_disc: 2.7141058444976807  (2.7415779617524914)
     | > loss_disc_real_0: 0.2915174067020416  (0.22191733380238854)
     | > loss_disc_real_1: 0.2596714198589325  (0.2545961029827594)
     | > loss_disc_real_2: 0.25090914964675903  (0.2519173812601836)
     | > loss_disc_real_3: 0.2509034276008606  (0.2488831561659612)
     | > loss_disc_real_4: 0.2618330121040344  (0.24871416005396074)
     | > loss_disc_real_5: 0.23049794137477875  (0.2413994044726414)
     | > loss_0: 2.7141058444976807  (2.7415779617524914)
     | > grad_norm_0: tensor(2.3359, device='cuda:0')  (tensor(4.0910, device='cuda:0'))
     | > loss_gen: 1.8159717321395874  (1.9762149626208896)
     | > loss_kl: 5.008370399475098  (42.11719334894611)
     | > loss_feat: 1.7703579664230347  (2.0269679972721693)
     | > loss_mel: 30.50223731994629  (41.7430907526324)
     | > loss_duration: 9.647953033447266  (2.5745641668477357)
     | > amp_scaler: 256.0  (509.9354838709682)
     | > loss_1: 48.74489212036133  (90.438032304087)
     | > grad_norm_1: tensor(73.7072, device='cuda:0')  (tensor(215.5241, device='cuda:0'))
     | > current_lr_0: 0.0002 
     | > current_lr_1: 0.0002 
     | > step_time: 5.8922  (3.467874986510123)
     | > loader_time: 0.006  (0.005929248948251048)
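
As a rough read of the timings above (a back-of-the-envelope check, not a profiler trace), the averaged `loader_time` is a tiny fraction of the averaged `step_time`, which suggests the DataLoader is not starving the GPU here and most of the wall time is spent in the model step itself:

    # plug in the running averages printed in parentheses above
    step_time = 3.4679      # seconds per optimization step (average)
    loader_time = 0.00593   # seconds spent waiting on the DataLoader (average)

    loader_share = loader_time / (step_time + loader_time)
    print(f"data loading accounts for ~{loader_share:.2%} of each step")  # ~0.17%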

Environment

- TTS version : 0.17.8
- python : 3.9.18
- pytorch : 2.1.1
- os : Linux
- gpu : A100

Additional context

No response

stale[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.