NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Hifigan Finetuning Error #7625

Closed: potatocharlie1 closed this 9 months ago

potatocharlie1 commented 11 months ago

Describe the bug

I am trying to finetune HiFi-GAN with spectrograms created by a previously finetuned FastPitch, following the tutorials FastPitch_Finetuning.ipynb and FastPitch_GermanTTS_Training.ipynb. Finetuning FastPitch went well; however, when I try to finetune HiFi-GAN on the same data, it always raises this error:

Sanity Checking: 0it [00:00, ?it/s]Error executing job with overrides: ['model.max_steps=10', 'model.optim.lr=0.00001', '~model.optim.sched', 'train_dataset=./sad_data_manifest_train_local_mel.json', 'validation_datasets=./sad_data_manifest_test_local_mel.json', 'exp_manager.exp_dir=hifigan_ft', '+trainer.val_check_interval=5', '+init_from_pretrained_model=tts_en_hifigan', 'trainer.check_val_every_n_epoch=null', 'model/train_ds=train_ds_finetune', 'model/validation_ds=val_ds_finetune']
Traceback (most recent call last):
  File "/mnt/c/Users/charl/Documents/synvoice/hifigan_finetune.py", line 28, in main
    trainer.fit(model)
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
    results = self._run_stage()
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1021, in _run_stage
    self._run_sanity_check()
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1050, in _run_sanity_check
    val_loop.run()
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 108, in run
    batch, batch_idx, dataloader_idx = next(data_fetcher)
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 137, in __next__
    self._fetch_next_batch(self.dataloader_iter)
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 151, in _fetch_next_batch
    batch = next(iterator)
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 285, in __next__
    out = next(self._iterator)
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 123, in __next__
    out = next(self.iterators[0])
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/torch/_utils.py", line 694, in reraise
    raise exception
ValueError: Caught ValueError in DataLoader worker process 0.

Original Traceback (most recent call last):
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/site-packages/nemo/collections/tts/data/dataset.py", line 1144, in __getitem__
    start = random.randint(0, mel.shape[1] - frames - 1)
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/random.py", line 370, in randint
    return self.randrange(a, b+1)
  File "/home/charlie/anaconda3/envs/nemo/lib/python3.10/random.py", line 353, in randrange
    raise ValueError("empty range for randrange() (%d, %d, %d)" % (istart, istop, width))
ValueError: empty range for randrange() (0, -12, -12)
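
Reading the numbers out of that last line: randint(0, mel.shape[1] - frames - 1) is randrange(0, mel.shape[1] - frames) under the hood, so the reported range (0, -12, -12) means this clip's mel spectrogram is 12 frames shorter than the segment the validation dataloader tries to crop from it. A minimal sketch of the failing arithmetic (the shapes here are illustrative guesses, not values read from my actual data):

import random
import numpy as np

mel = np.zeros((80, 246))  # hypothetical: an 80-band mel with only 246 frames
frames = 258               # hypothetical: segment length the val loader crops

# 246 - 258 = -12, so randint sees the empty range (0, -12) and raises the
# same ValueError as in the traceback above.
start = random.randint(0, mel.shape[1] - frames - 1)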

Steps/Code to reproduce bug

Generated the mels using generate_mels.py:

python generate_mels.py \
    --cpu \
    --fastpitch-model-ckpt {fastpitch_model_path} \
    --input-json-manifests sad_data_manifest_train_local.json sad_data_manifest_test_local.json \
    --output-json-manifest-root ./
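
One way to sanity-check the generated mels before finetuning is to scan the output manifest for spectrograms shorter than the segment the dataloader will crop. This sketch assumes generate_mels.py wrote one JSON object per line with a mel_filepath pointing at a saved .npy array, and the 258-frame threshold is a guess based on the default validation segment length; adjust both for your setup:

import json
import numpy as np

REQUIRED_FRAMES = 258  # hypothetical: validation segment length in mel frames

with open("sad_data_manifest_test_local_mel.json") as f:
    for line in f:
        entry = json.loads(line)
        # Adjust the key if your manifest stores the mel path differently.
        mel = np.load(entry["mel_filepath"])
        if mel.shape[-1] < REQUIRED_FRAMES:
            print(f"too short ({mel.shape[-1]} frames): {entry['mel_filepath']}")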

and then started finetuning:

python hifigan_finetune.py --config-name=hifigan.yaml \
    model.max_steps=10 model.optim.lr=0.00001 ~model.optim.sched \
    train_dataset=./sad_data_manifest_train_local_mel.json \
    validation_datasets=./sad_data_manifest_test_local_mel.json \
    exp_manager.exp_dir=hifigan_ft \
    +trainer.val_check_interval=5 \
    +init_from_pretrained_model=tts_en_hifigan \
    trainer.check_val_every_n_epoch=null \
    model/train_ds=train_ds_finetune \
    model/validation_ds=val_ds_finetune

Expected behavior

HiFi-GAN finetuning starts.

Environment overview (please complete the following information)

Environment details

OS: Linux 5.15.90.1-microsoft-standard-WSL2
PyTorch version: 2.2.0.dev20230913
Python version: 3.10.12 (main, Jul 5 2023, 18:54:27) [GCC 11.2.0]

Additional context

I am using parts of the IEMOCAP dataset. The files differ in length, but generating the mels already pads/cuts them to roughly 6 seconds, and the mels from generate_mels.py look good. I have looked into other similar issues and switched to the generate_mels.py code because of them, but that did not fix the issue.
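
As a back-of-the-envelope check of that 6-second claim, assuming HiFi-GAN's usual 22050 Hz sample rate, 256-sample hop length, and the 66048-sample validation segment from the default hifigan.yaml (all of which may differ in a given config):

sample_rate = 22050
hop_length = 256
val_n_segments = 66048  # audio samples cropped per validation clip

frames_needed = val_n_segments // hop_length         # 258 mel frames
seconds_needed = val_n_segments / sample_rate        # ~3.0 s
frames_at_6s = int(6.0 * sample_rate) // hop_length  # ~516 frames
print(frames_needed, round(seconds_needed, 2), frames_at_6s)

Under those defaults a clip genuinely padded to ~6 seconds yields roughly 516 frames, well above the 258 needed, whereas the failing clip in the traceback would have 258 - 12 = 246 frames (about 2.9 seconds), which suggests at least one validation item escaped the padding.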

XuesongYang commented 11 months ago

The error happened while running the sanity check on the validation dataset; by default, this check uses one batch. Given the trace ValueError: Caught ValueError in DataLoader worker process 0., there may be several underlying issues, as described below:

  1. Number of audio files: does your validation dataset have more audio examples than the validation batch size?
  2. Duration of each audio file: by default, the minimum duration for the validation step is 3 seconds (https://github.com/NVIDIA/NeMo/blob/main/examples/tts/conf/hifigan/hifigan.yaml#L26); shorter examples are filtered out. Does your validation dataset still have enough audio examples after that filtering? (A quick check for both points is sketched at the end of this comment.)

A side note: the above diagnosis also applies to the training dataset. Please double-check your training data as well, in case the same error happens during training.
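
If it helps, here is a rough script for checking both points against a manifest. The 3-second minimum is the hifigan.yaml default cited above; the batch size of 16 is a placeholder, so read the real value from model.validation_ds.dataloader_params in your config; and the script assumes standard NeMo manifests with a duration field on each line:

import json

VAL_MIN_DURATION = 3.0  # seconds; hifigan.yaml default for validation
VAL_BATCH_SIZE = 16     # placeholder; check your config's dataloader_params

with open("sad_data_manifest_test_local_mel.json") as f:
    durations = [json.loads(line)["duration"] for line in f]

kept = [d for d in durations if d >= VAL_MIN_DURATION]
print(f"{len(kept)}/{len(durations)} clips survive the {VAL_MIN_DURATION}s filter")
if len(kept) < VAL_BATCH_SIZE:
    print("fewer surviving clips than the validation batch size")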

github-actions[bot] commented 10 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 9 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.