NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Tacotron Finetune Error(s) in loading state_dict for Tacotron2Model: size mismatch #5714

Closed: dustinjoe closed this issue 1 year ago

dustinjoe commented 1 year ago

Hi, NeMo developers: Happy New Year, first of all.

Describe the bug
I am experimenting with voice cloning, so I am trying to finetune a TTS model on a test speaker's voice clips. I have successfully done this with the FastPitch notebook, but I am stuck on the Tacotron2 finetuning trial.

Steps/Code to reproduce bug
I am running tacotron2_finetune.py for training. As my audio clips are 22050 Hz, I modified three parameters in tacotron2_44100.yaml to match the original 22050 Hz tacotron2.yaml training config:

n_window_size: 1024
n_window_stride: 256
n_fft: 1024

All other configs are untouched.

Expected behavior
I am running it as:

python tacotron2_finetune.py --config-name=tacotron2_22050.yaml \
    train_dataset=./TestFileList.json \
    validation_datasets=./TestFileList.json \
    exp_manager.exp_dir=./ljspeech_to_test_no_mixing \
    +init_from_nemo_model=./tts_en_tacotron2.nemo \
    trainer.max_epochs=20 \
    trainer.check_val_every_n_epoch=5 \
    model.train_ds.dataloader_params.batch_size=24 \
    model.validation_ds.dataloader_params.batch_size=24

So I am mainly initializing from the original pretrained Tacotron2 model. The output is shown below:

[NeMo I 2022-12-29 23:33:49 features:267] PADDING: 16
[NeMo I 2022-12-29 23:33:49 save_restore_connector:243] Model Tacotron2Model was successfully restored from /media/xyzhou/extDisk2t1/DeepFake_Audio/Nemo_Clone/tts_nemo/tts_en_tacotron2.nemo.
Error executing job with overrides: ['train_dataset=./EricFileList.json', 'validation_datasets=./EricFileList.json', 'exp_manager.exp_dir=./ljspeech_to_eric_no_mixing', '+init_from_nemo_model=./tts_en_tacotron2.nemo', 'trainer.max_epochs=20', 'trainer.check_val_every_n_epoch=5', 'model.train_ds.dataloader_params.batch_size=24', 'model.validation_ds.dataloader_params.batch_size=24']
Traceback (most recent call last):
  File "tacotron2_finetune.py", line 35, in main
    model.maybe_init_from_pretrained_checkpoint(cfg=cfg)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/lightning_utilities/core/rank_zero.py", line 24, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/nemo/core/classes/modelPT.py", line 1066, in maybe_init_from_pretrained_checkpoint
    self.load_state_dict(restored_model.state_dict(), strict=False)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Tacotron2Model: size mismatch for text_embedding.weight: copying a param with shape torch.Size([69, 512]) from checkpoint, the shape in current model is torch.Size([114, 512]).

The main error is the final part:

RuntimeError: Error(s) in loading state_dict for Tacotron2Model: size mismatch for text_embedding.weight: copying a param with shape torch.Size([69, 512]) from checkpoint, the shape in current model is torch.Size([114, 512]).

Not sure how to deal with this. Thank you!

redoctopus commented 1 year ago

Can you try using the tacotron2.yaml config instead of modifying the tacotron2_44100.yaml config?

The 44.1 kHz config is older and might be outdated. I think the existing NGC checkpoint would have been trained with the tacotron2.yaml config, which uses a 22050 Hz sampling rate by default.

dustinjoe commented 1 year ago

Hi, I actually tried this. I ran it with:

python tacotron2_finetune.py --config-name=tacotron2.yaml \
    train_dataset=./EricFileList.json \
    validation_datasets=./EricFileList.json \
    exp_manager.exp_dir=./ljspeech_to_eric_no_mixing \
    +init_from_nemo_model=./tts_en_tacotron2.nemo \
    trainer.max_epochs=20 \
    trainer.check_val_every_n_epoch=5 \
    model.train_ds.dataloader_params.batch_size=24 \
    model.validation_ds.dataloader_params.batch_size=24

I also manually changed line 26 of tacotron2_finetune.py from @hydra_runner(config_path="conf", config_name="tacotron2_44100") to @hydra_runner(config_path="conf", config_name="tacotron2"). I tried both versions, but both got errors. Here is the error, which seems to be similar:

=====================================================================================

[NeMo I 2023-01-04 22:21:22 features:267] PADDING: 16
[NeMo I 2023-01-04 22:21:22 save_restore_connector:243] Model Tacotron2Model was successfully restored from /media/xyzhou/extDisk2t1/DeepFake_Audio/Nemo_Clone/tts_nemo/tts_en_tacotron2.nemo.
Error executing job with overrides: ['train_dataset=./EricFileList.json', 'validation_datasets=./EricFileList.json', 'exp_manager.exp_dir=./ljspeech_to_eric_no_mixing', '+init_from_nemo_model=./tts_en_tacotron2.nemo', 'trainer.max_epochs=20', 'trainer.check_val_every_n_epoch=5', 'model.train_ds.dataloader_params.batch_size=24', 'model.validation_ds.dataloader_params.batch_size=24']
Traceback (most recent call last):
  File "tacotron2_finetune.py", line 35, in main
    model.maybe_init_from_pretrained_checkpoint(cfg=cfg)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/lightning_utilities/core/rank_zero.py", line 24, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/nemo/core/classes/modelPT.py", line 1066, in maybe_init_from_pretrained_checkpoint
    self.load_state_dict(restored_model.state_dict(), strict=False)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Tacotron2Model: size mismatch for text_embedding.weight: copying a param with shape torch.Size([69, 512]) from checkpoint, the shape in current model is torch.Size([114, 512]).

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

==============================================================================================

Thank you!

redoctopus commented 1 year ago

Hmm, I tried running the fine-tuning example using

python examples/tts/tacotron2_finetune.py --config-name=tacotron2.yaml \
train_dataset=/data/hi_fi_tts_v0/small_9017_train.json \
validation_datasets=/data/hi_fi_tts_v0/small_9017_val.json \
trainer.max_epochs=5 \
+init_from_nemo_model=examples/tts/tts_en_tacotron2.nemo

and wasn't able to reproduce the size mismatch error; the model loaded and trained as expected. I didn't need to change anything in the config itself.

Can you also double-check that you have the latest version of the checkpoint (from here) and are running the latest version of NeMo? There might have been changes that would cause this mismatch otherwise.

dustinjoe commented 1 year ago

Hi, thanks for your time and fast response. I just checked the versions of NeMo and the model. Training can run now, but there is some issue with inference. Let me start with the training side.

=======================================================================================

I used pip install nemo_toolkit['all'] --upgrade to ensure NeMo is the latest. However, when I run print(Tacotron2Model.list_available_models()), the output only shows the old 1.0.0 version, as below:

[PretrainedModelInfo(
    pretrained_model_name=tts_en_tacotron2,
    description=This model is trained on LJSpeech sampled at 22050Hz, and can be used to generate female English voices with an American accent.,
    location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_tacotron2/versions/1.0.0/files/tts_en_tacotron2.nemo,
    class_=<class 'nemo.collections.tts.models.tacotron2.Tacotron2Model'>
)]

Because of this, when I load the pretrained Tacotron2 model using model = Tacotron2Model.from_pretrained("tts_en_tacotron2"), the downloaded model is the old 1.0.0 version. So I manually downloaded the latest 1.1.0 version, which seems to include the dataset format change, and training works now. I guess there should be an update here for later NeMo development?
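For anyone else hitting this: once the .nemo file is downloaded manually, you can load it directly from the local path instead of going through from_pretrained():

from nemo.collections.tts.models import Tacotron2Model

# Load the manually downloaded 1.1.0 checkpoint instead of the stale NGC listing
model = Tacotron2Model.restore_from("./tts_en_tacotron2.nemo")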

============================================================================

For the inference part, I reused the get_best_ckpt_from_last_run and infer functions from the FastPitch finetuning notebook. But the inference result seems to be noise, with a different length from the correct validation audio. The output length does seem to match the number of decoder steps, though.

[Screenshot from 2023-01-05 14-03-43]

The wrong spectrogram image is shown here:

[image: generated spectrogram]

Any suggestions? Thank you very much for your time and responses!

redoctopus commented 1 year ago

However, when I run print(Tacotron2Model.list_available_models()), the output only shows the old 1.0.0 version

Oh, good catch! I'll put in a PR to fix this soon.


Regarding the second part...

But the inference result seems to be noise, with a different length from the correct validation audio.

The simple solution for the "noise" generated is that you may need to fine-tune for longer. At the beginning of fine-tuning, before the model has learned your new speaker, the predicted mels will look and sound like noise. Have you looked at the tensorboard logs? If loss is still going down, it's likely this.
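For example, assuming default exp_manager logging settings, you can point TensorBoard at the experiment directory from your command:

tensorboard --logdir ./ljspeech_to_eric_no_mixing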

As for the different lengths than the ground truth--this is expected if you are using infer() outside of training. Since no duration information is passed to the model, it predicts the output lengths itself, which will almost certainly not match up with the ground truth. (During training and validation, duration information is passed in and used, so that it will always match the GT for loss calculations, etc.)

dustinjoe commented 1 year ago

Thank you for your valuable suggestion! I increased the training to 1000 epochs and it seems to be working normally now. Actually, I am quite new to the audio ML field. Regarding the voice cloning trials, I currently feel that my generated audio has quite a noisy background; do you have any suggestions on this kind of noise issue in audio? Thank you!

redoctopus commented 1 year ago

No problem!

Noise issues can stem from a number of things. The easiest in theory (but hard in practice!) is to get cleaner data to train on. If your training audio is recorded in a quieter environment with a better microphone, the output you get will also be noticeably cleaner, and vice versa. Beyond that, sometimes noise can stem from vocoding since the generated mels are fuzzy and don't look like "real" mel spectrograms. This can often be helped by fine-tuning the vocoder (as shown in the tutorial). We also have in review #5565, which may help!

Beyond that, there are more general audio noise reduction/voice cleaning methods, but they are not TTS-specific.

Best of luck!

dustinjoe commented 1 year ago

Thanks for your suggestion! I am going to try vocoder finetuning for this. Another little question: I also tried this tool: https://github.com/neonbjb/tortoise-tts. Diffusion-based models seem to adapt well to noisy samples; does NeMo have any plans to incorporate diffusion-based models? Thanks

redoctopus commented 1 year ago

Hmm. I don't think anyone is actively working on any diffusion-based models yet, but it'll be something to keep in mind!

dustinjoe commented 1 year ago

Hi, really thank you for your suggestion on vocoder finetuning. I have finetuned a vocoder on FastPitch outputs. One interesting thing I observe is that this finetuned vocoder also seems to be effective at reducing noise with the Tacotron2 model. Do you know if there is any theoretical basis for this, or is it just my imagination? Thanks.
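For reference, I ran the vocoder finetuning roughly like this, following the FastPitch finetuning tutorial (paths are from my setup):

python hifigan_finetune.py \
    model.max_steps=1000 \
    model.optim.lr=0.00001 \
    ~model.optim.sched \
    train_dataset=./hifigan_train_ft.json \
    validation_datasets=./hifigan_train_ft.json \
    exp_manager.exp_dir=hifigan_ft \
    +init_from_pretrained_model=tts_hifigan \
    trainer.check_val_every_n_epoch=50 \
    model/train_ds=train_ds_finetune \
    model/validation_ds=val_ds_finetune \
    model.train_ds.dataloader_params.batch_size=32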

redoctopus commented 1 year ago

I wouldn't be surprised! The spectrograms generated by FastPitch don't look as sharp as ground truth spectrograms, and since it's a different speaker, the vocoder will benefit from learning what the new speaker's generated mels look like. Otherwise there will probably be some noise using an out-of-the-box vocoder. I'm guessing the new output is closer to the ground truth of the new speaker, which has less noise than the out-of-the-box inference result.

dustinjoe commented 1 year ago

Thank you for the response. I see the fix has been merged already; that was fast!

dustinjoe commented 1 year ago

Hi, sorry for interrupting. I am making further trials with FastPitch. I tried another set of audio clips and successfully finetuned the spectrogram model, and it is working properly. But when I further tried to finetune the vocoder part using the same command, it shows the error RuntimeError: stack expects each tensor to be equal size, but got [22630] at entry 0 and [66048] at entry 1, as follows:

=================================

[NeMo I 2023-01-12 13:13:56 save_restore_connector:243] Model HifiGanModel was successfully restored from /home/xyzhou/.cache/torch/NeMo/NeMo_1.15.0rc0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo.
[NeMo I 2023-01-12 13:13:56 modelPT:1163] Model checkpoint restored from pretrained checkpoint with name : tts_hifigan
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name                       | Type                     | Params
------------------------------------------------------------------
0 | audio_to_melspec_precessor | FilterbankFeatures       | 0
1 | trg_melspec_fn             | FilterbankFeatures       | 0
2 | generator                  | Generator                | 13.9 M
3 | mpd                        | MultiPeriodDiscriminator | 41.1 M
4 | msd                        | MultiScaleDiscriminator  | 29.6 M
5 | feature_loss               | FeatureMatchingLoss      | 0
6 | discriminator_loss         | DiscriminatorLoss        | 0
7 | generator_loss             | GeneratorLoss            | 0
------------------------------------------------------------------
84.7 M    Trainable params
0         Non-trainable params
84.7 M    Total params
338.643   Total estimated model params size (MB)

Sanity Checking: 0it [00:00, ?it/s]
Error executing job with overrides: ['model.train_ds.dataloader_params.batch_size=32', 'model.max_steps=1000', 'model.optim.lr=0.00001', '~model.optim.sched', 'train_dataset=./hifigan_train_ft.json', 'validation_datasets=./hifigan_train_ft.json', 'exp_manager.exp_dir=hifigan_ft', '+init_from_pretrained_model=tts_hifigan', 'trainer.check_val_every_n_epoch=50', 'model/train_ds=train_ds_finetune', 'model/validation_ds=val_ds_finetune']
Traceback (most recent call last):
  File "hifigan_finetune.py", line 28, in main
    trainer.fit(model)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
    return function(*args, **kwargs)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1190, in _run_train
    self._run_sanity_check()
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1262, in _run_sanity_check
    val_loop.run()
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 121, in advance
    batch = next(data_fetcher)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
    return self.fetching_function()
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 265, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 280, in _fetch_next_batch
    batch = next(iterator)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
    return self.collate_fn(data)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/nemo/core/classes/common.py", line 1077, in __call__
    return wrapped(*args, **kwargs)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/nemo/core/classes/dataset.py", line 59, in collate_fn
    return self._collate_fn(batch)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/nemo/collections/tts/torch/data.py", line 971, in _collate_fn
    return torch.utils.data.dataloader.default_collate(batch)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 265, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 143, in collate
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 143, in <listcomp>
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 120, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/home/xyzhou/anaconda3/envs/nemo/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 163, in collate_tensor_fn
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [22630] at entry 0 and [66048] at entry 1

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

===========================================

This feels a little strange, because the same procedure worked on the previous set of audio files. Do you have any suggestions on this? Thanks!

redoctopus commented 1 year ago

Hmm, can you double-check a few things?

If these suggestions don't resolve the issue, please open a new issue and provide your training command (to keep the issues more organized).

dustinjoe commented 1 year ago

Thanks. I have tried removing the data files and regenerating the mel files, with no luck so far. I will open a new issue for this. Thank you for your time!

dustinjoe commented 1 year ago

Really thank you for your time and help. I have actually found the mistake, and it is a really silly one. When I downloaded audio from YouTube as wav files using the youtube-dl library, I did not check the sampling rate, so the files were actually 44100 Hz rather than the needed 22050 Hz. This still works for the FastPitch finetuning part, but not for the HiFi-GAN finetuning part, where the target sampling rate is not assigned. The vocoder finetuning is working now too. Thank you!
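In case anyone else makes the same mistake, here is a quick check-and-resample sketch (assuming librosa and soundfile are installed; the file name is just a placeholder):

import librosa
import soundfile as sf

path = "clip.wav"  # placeholder
audio, sr = librosa.load(path, sr=None)  # sr=None keeps the file's native rate
if sr != 22050:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=22050)
    sf.write(path, audio, 22050)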

redoctopus commented 1 year ago

Glad you found the issue!

dustinjoe commented 1 year ago

Really thank you for your time and help! Not sure if it is a reasonable feature suggestion, but does your team plan to incorporate OpenAI's Whisper into NeMo? I personally feel Whisper's performance is better than the ASR models currently included in NeMo. Thanks.

redoctopus commented 1 year ago

No problem :)

I'm not sure there are any plans to incorporate Whisper right now, but I think the ASR team might be able to help you more with answering those questions.

dustinjoe commented 1 year ago

Thanks. I did not know ASR had a separate team within this project, but it is great to hear there is a large, systematic team behind it! Well, one little question about TTS then: is there an efficient way to configure the speech speed of the synthesized audio? Thank you.

redoctopus commented 1 year ago

Sure! FastPitch's generate_spectrogram() function has a pace argument (float, defaults to 1.0). You can raise or lower it to adjust the speed of the synthesized speech; 2.0 will make it 2x the speed.

Note: If you are not using generate_spectrogram(), pace is also an argument to forward(), so you can pass it directly there if need be.
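For example, here's a quick sketch using the pretrained checkpoints (assuming tts_en_fastpitch and tts_hifigan; swap in your finetuned models):

import soundfile as sf
import torch
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

spec_model = FastPitchModel.from_pretrained("tts_en_fastpitch").eval()
vocoder = HifiGanModel.from_pretrained("tts_hifigan").eval()

with torch.no_grad():
    tokens = spec_model.parse("This sentence is spoken a bit faster than normal.")
    spec = spec_model.generate_spectrogram(tokens=tokens, pace=1.25)  # >1.0 = faster
    audio = vocoder.convert_spectrogram_to_audio(spec=spec)

sf.write("paced.wav", audio.squeeze().cpu().numpy(), samplerate=22050)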

dustinjoe commented 1 year ago

Thanks! That is really helpful.

dustinjoe commented 1 year ago

Hi, sorry for interrupting again. I am experimenting with synthesizing a whole paragraph of multiple sentences. After some trials I found that feeding a whole paragraph into the model is not a good idea, so I wrote the function below to synthesize multiple sentences with NeMo; hope it is somehow helpful. The idea is to add a short random pause between sentences, and also to apply a little random speed variation to each sentence using the pace method you described above.

=====================================================

import re

import numpy as np

sample_rate = 22050

def tts_multisentence(text_para, sample_rate=22050):
    # Split the paragraph into sentences on end punctuation using re.split()
    text_para_list = re.split('[.?!;]', text_para)
    # Drop the empty strings that re.split leaves after trailing punctuation
    text_para_list = [s.strip() for s in text_para_list if s.strip()]
    print(text_para_list)

    # Start with a short leading silence; shape (1, T) matches infer()'s output
    audio = np.zeros((1, int(sample_rate * 0.1)))
    for sentence in text_para_list:
        # Randomize the speed a little for each sentence
        random_speedfactor = np.random.uniform(0.8, 1.2)
        # spec_model, vocoder, speaker_id and infer() come from the FastPitch
        # finetuning notebook used earlier in this thread
        spec, audio_new = infer(spec_model, vocoder, sentence,
                                speaker=speaker_id, pace=random_speedfactor)
        audio = np.concatenate((audio, audio_new), axis=1)
        # Add a short random pause between sentences
        random_blank_time = np.random.uniform(0.4, 0.7)
        audio_blank = np.zeros((1, int(sample_rate * random_blank_time)))
        audio = np.concatenate((audio, audio_blank), axis=1)
    return audio

audio = tts_multisentence(text_para, sample_rate=sample_rate)
ipd.display(ipd.Audio(audio, rate=sample_rate))

=============================================

Do you have any other suggestions for synthesizing long texts? I guess long-text synthesis could become a good feature or tutorial for NeMo development. Another question: I am seeking suggestions on quality improvement. I feel the audio I currently get sounds too flat, and there often seem to be errors at the end of sentences in the synthesis. Do you have any suggestions on these? Thank you for your time and help!

redoctopus commented 1 year ago

We have seen issues in the past when generating long texts, especially if the training input consists of a lot of short sentences. One way to mitigate this a bit if you have sequential training data is to make sure that some of it contains more than one sentence, so that the model learns a little about how to transition between sentences without being too choppy and to navigate those punctuation marks.

Other than that, we haven't really experimented much with it (but I'd like to at some point...).

dustinjoe commented 1 year ago

Thanks for your suggestion. I guess these issues are still open questions for current TTS. Even though I use the script to synthesize sentence by sentence, there are two obvious limitations: (1) the lengths of the pauses between sentences and the speeds of the sentences are randomized, which does not take the context or sentiment of each sentence into consideration; (2) the inserted pauses themselves can be a problem if the output after the vocoder still has noticeable background noise, since adding silent chunks can lead to small sudden breaks in the sound. I have not found a good solution yet, but I will try reducing the number of short sentences in the training dataset to see if that helps. I noticed that some commercial products have features to add sentiment to the audio, like 'happy', 'angry', or others. Are there any similar features we can use in NeMo? I guess this kind of feature could make the TTS more expressive. Thanks

redoctopus commented 1 year ago

Haha, I'll admit to having played around with the generated pauses by adding more punctuation in the past. It doesn't work great but sometimes I could change the duration to something closer to what I wanted. I generally don't recommend doing this, but it's kind of fun to see what happens when you switch out commas, semicolons, etc.

You might be able to get away with a more involved "hack" if you performed inference twice: once to get predicted durations for each token, and then a second time where you pass in a modified duration tensor (add or subtract time steps for the indices corresponding to your pauses). The caveat is that I've never tried this, so I'm not sure if it would work well at all.
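Very roughly, that idea might look like the sketch below. This is untested, and both the durs argument and the position of the predicted durations in the output tuple are assumptions that may not match your NeMo version, so treat it as pseudocode and check FastPitch's forward() first.

import torch

with torch.no_grad():
    # spec_model is the FastPitchModel from earlier in the thread
    tokens = spec_model.parse("A pause, right here, then the rest.")

    # Pass 1: run the underlying FastPitch module to get predicted durations.
    # NOTE: the output index is an assumption; inspect your version's return tuple.
    outputs = spec_model.fastpitch(text=tokens)
    durs_predicted = outputs[2]

    # Pass 2: stretch the duration of the token(s) you want to pause on.
    durs_edited = durs_predicted.clone()
    pause_idx = 5                    # hypothetical index of the comma's token
    durs_edited[0, pause_idx] += 20  # add ~20 mel frames (~0.23 s at 22050 Hz)

    # Assumption: forward() uses the provided durs at inference time.
    spec = spec_model.fastpitch(text=tokens, durs=durs_edited)[0]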

I noticed that some commercial products have features to add sentiment to the audio, like 'happy', 'angry', or others. Are there any similar features we can use in NeMo? I guess this kind of feature could make the TTS more expressive.

Nothing planned right now, but it is something we're interested in at some point down the line. Not sure when we'll have the resources to get around to it...

dustinjoe commented 1 year ago

Haha, it seems I am not the only one trying to add extra commas in between... I was wondering whether there could be an additional auxiliary network for adding these kinds of extra punctuation marks to make TTS outputs more stable. This might be useful for dividing long sentences into shorter ones, which seems to be a general principle for improving TTS quality. Speaking of audio quality, are there any good metrics for measuring human-level opinions other than the subjective Mean Opinion Score? I noticed there is a new research direction of predicting MOS (https://voicemos-challenge-2022.github.io/), though I am not sure how reliable this would be. Relying on MOS makes it a little difficult to improve an existing TTS pipeline, but if there really were an automated way to evaluate, it could first be used for training data selection to improve TTS training from scratch. Thank you!

redoctopus commented 1 year ago

For audio quality, there are some existing metrics like PESQ that were developed for evaluating call quality. They'd give you some rudimentary idea of how intelligible a model is, but probably not much in the way of "naturalness" or pronunciation.

We recently also ported an evaluation notebook from Riva: https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Evaluation_MelCepstralDistortion.ipynb

This one demonstrates mel cepstral distortion with dynamic time warping to compare model outputs vs. ground truth, so it's best for comparing multiple models--it won't be as useful if you only have one model to quality check.

MOS is definitely a very expensive quality metric, so it would be great to have some reliable automatic metrics that are comparable! In the meantime, we can at least narrow down what models are best with some of the existing ones like PESQ and MCD, so that we can perform fewer MOS trials.
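If you want to try PESQ quickly, here's a minimal sketch (assuming the pesq and librosa PyPI packages; PESQ only supports 8/16 kHz, so both clips are loaded at 16 kHz):

import librosa
from pesq import pesq

ref, _ = librosa.load("ground_truth.wav", sr=16000)  # reference recording
deg, _ = librosa.load("synthesized.wav", sr=16000)   # model output

# 'wb' = wideband PESQ at 16 kHz; clips should be time-aligned for a fair score
n = min(len(ref), len(deg))
print("PESQ:", pesq(16000, ref[:n], deg[:n], "wb"))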

dustinjoe commented 1 year ago

Thank you for your guidance! I am studying the metrics you suggested; I am not sure whether using them for data selection first and then TTS finetuning would make a big difference. By the way, regarding my original question about Tacotron2 finetuning, you mentioned the 44100 yaml is outdated; do you need to add a little PR to remove that file?

redoctopus commented 1 year ago

Ah yeah, good point. Going to jot that down on the TODO list... Thanks!