NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Cannot train colab with Mixer-TTS-X #4767

Closed gedefet closed 2 years ago

gedefet commented 2 years ago

Hi, I'm getting an error when running the Colab example https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/FastPitch_MixerTTS_Training.ipynb with Mixer-TTS-X.

Execution cell:

assert isinstance(spec_gen, SpectrogramGenerator)

if isinstance(spec_gen, FastPitchModel):
    tokens = spec_gen.parse(str_input="Hey, this produces speech!")
else:
    tokens = spec_gen.parse(text="Hey, this produces speech!")

spectrogram = spec_gen.generate_spectrogram(tokens=tokens)

# Now we can visualize the generated spectrogram
# If we want to generate speech, we have to use a vocoder in conjunction with a spectrogram generator.
# Refer to the Inference_ModelSelect notebook on how to convert spectrograms to speech.
imshow(spectrogram.cpu().detach().numpy()[0,...], origin="lower")
plt.show()

The same error occurs if I try with MixerTTSModel:

assert isinstance(spec_gen, SpectrogramGenerator)

if isinstance(spec_gen, MixerTTSModel):
    tokens = spec_gen.parse("Hey, this produces speech!")
else:
    tokens = spec_gen.parse(text="Hey, this produces speech!")

spectrogram = spec_gen.generate_spectrogram(tokens=tokens)

# Now we can visualize the generated spectrogram
# If we want to generate speech, we have to use a vocoder in conjunction with a spectrogram generator.
# Refer to the Inference_ModelSelect notebook on how to convert spectrograms to speech.
imshow(spectrogram.cpu().detach().numpy()[0,...], origin="lower")
plt.show()

Just in case, the MixerTTSModel loaded is Mixer-TTS-X:

# In the same way, we can load the pre-trained Mixer-TTS model as follows
pretrained_model = "tts_en_lj_mixerttsx"
spec_gen = MixerTTSModel.from_pretrained(pretrained_model)
spec_gen.eval();

Error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-e1795f3aeeee> in <module>
      6     tokens = spec_gen.parse(text="Hey, this produces speech!")
      7 
----> 8 spectrogram = spec_gen.generate_spectrogram(tokens=tokens)
      9 
     10 # Now we can visualize the generated spectrogram

1 frames
/usr/local/lib/python3.7/dist-packages/nemo/collections/tts/models/mixer_tts.py in generate_spectrogram(self, tokens, tokens_len, lm_tokens, raw_texts, norm_text_for_lm_model, lm_model)
    619         if self.cond_on_lm_embeddings and lm_tokens is None:
    620             if raw_texts is None:
--> 621                 raise ValueError("raw_texts must be specified if lm_tokens is None")
    622 
    623             lm_model_tokenizer = self._get_lm_model_tokenizer(lm_model)

ValueError: raw_texts must be specified if lm_tokens is None

Thanks,

gedefet commented 2 years ago

Adding another error:

Command:

!python mixer_tts.py \
sample_rate=22050 \
train_dataset=train.json \
validation_datasets=val.json \
sup_data_types="['align_prior_matrix', 'pitch' ]" \
sup_data_path={mixer_tts_sup_data_path} \
+phoneme_dict_path=tts_dataset_files/cmudict-0.7b_nv22.07 \
+heteronyms_path=tts_dataset_files/heteronyms-030921 \
whitelist_path=tts_dataset_files/lj_speech.tsv \
exp_manager.exp_dir=$OUTPUT_CHEKPOINTS \
pitch_mean={pitch_mean} \
pitch_std={pitch_std} \
model.train_ds.dataloader_params.batch_size=6 \
model.train_ds.dataloader_params.num_workers=0 \
model.validation_ds.dataloader_params.num_workers=0 \
trainer.max_epochs=5000 \
trainer.strategy=null \
trainer.check_val_every_n_epoch=50

Error:

[NeMo W 2022-08-18 19:20:52 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2022-08-18 19:20:53 experimental:28] Module <class 'nemo.collections.tts.torch.tts_tokenizers.IPATokenizer'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-08-18 19:20:53 experimental:28] Module <class 'nemo.collections.tts.models.radtts.RadTTSModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2022-08-18 19:20:54 exp_manager:286] Experiments will be logged at /content/drive/MyDrive/TTS/CHECKPOINTS/Mixer-TTS/inference_3/MixerTTS-X/2022-08-18_19-20-54
[NeMo I 2022-08-18 19:20:54 exp_manager:660] TensorboardLogger has been set up
[NeMo W 2022-08-18 19:20:54 nemo_logging:349] /usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py:2271: LightningDeprecationWarning: `Trainer.weights_save_path` has been deprecated in v1.6 and will be removed in v1.8.
      rank_zero_deprecation("`Trainer.weights_save_path` has been deprecated in v1.6 and will be removed in v1.8.")

[NeMo W 2022-08-18 19:20:54 exp_manager:900] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to -1. Please ensure that max_steps will run for at least 50 epochs to ensure that checkpointing will not error out.
[NeMo I 2022-08-18 19:20:56 tokenize_and_classify:87] Creating ClassifyFst grammars.
[NeMo I 2022-08-18 19:21:18 data:205] Loading dataset from train.json.

0it [00:00, ?it/s]
5it [00:00, 47.36it/s]
20it [00:00, 105.86it/s]
31it [00:00, 105.94it/s]
46it [00:00, 118.99it/s]
58it [00:00, 116.76it/s]
77it [00:00, 140.58it/s]
92it [00:00, 135.64it/s]

(trimming this...)

106it [00:00, 131.92it/s]
3468it [00:27, 134.27it/s]
3482it [00:27, 123.79it/s]
3499it [00:27, 129.83it/s]
3514it [00:27, 133.76it/s]
3531it [00:27, 142.87it/s]
3546it [00:28, 127.66it/s]
3561it [00:28, 132.69it/s]
3575it [00:28, 130.02it/s]
3589it [00:28, 130.61it/s]
3611it [00:28, 154.28it/s]
3626it [00:28, 126.89it/s]
[NeMo I 2022-08-18 19:21:46 data:242] Loaded dataset with 3626 files.
[NeMo I 2022-08-18 19:21:46 data:244] Dataset contains 2.42 hours.
[NeMo I 2022-08-18 19:21:46 data:346] Pruned 0 files. Final dataset contains 3626 files
[NeMo I 2022-08-18 19:21:46 data:349] Pruned 0.00 hours. Final dataset contains 2.42 hours.
[NeMo I 2022-08-18 19:21:46 data:205] Loading dataset from val.json.

0it [00:00, ?it/s]
20it [00:00, 177.60it/s]
38it [00:00, 148.82it/s]
54it [00:00, 136.07it/s]
69it [00:00, 139.25it/s]
84it [00:00, 121.20it/s]
97it [00:00, 116.42it/s]
113it [00:00, 125.42it/s]
135it [00:00, 147.34it/s]
151it [00:01, 150.82it/s]
167it [00:01, 151.10it/s]
183it [00:01, 150.61it/s]
199it [00:01, 140.36it/s]
218it [00:01, 150.61it/s]
230it [00:01, 143.31it/s]
[NeMo I 2022-08-18 19:21:48 data:242] Loaded dataset with 230 files.
[NeMo I 2022-08-18 19:21:48 data:244] Dataset contains 0.14 hours.
[NeMo I 2022-08-18 19:21:48 data:346] Pruned 0 files. Final dataset contains 230 files
[NeMo I 2022-08-18 19:21:48 data:349] Pruned 0.00 hours. Final dataset contains 0.14 hours.
Error executing job with overrides: ['sample_rate=22050', 'train_dataset=train.json', 'validation_datasets=val.json', "sup_data_types=['align_prior_matrix', 'pitch' ]", 'sup_data_path=mixer_tts_sup_data_folder', '+phoneme_dict_path=tts_dataset_files/cmudict-0.7b_nv22.07', '+heteronyms_path=tts_dataset_files/heteronyms-030921', 'whitelist_path=tts_dataset_files/lj_speech.tsv', 'exp_manager.exp_dir=/content/drive/MyDrive/TTS/CHECKPOINTS/Mixer-TTS/inference_3', 'pitch_mean=95.11185455322266', 'pitch_std=79.71340942382812', 'model.train_ds.dataloader_params.batch_size=6', 'model.train_ds.dataloader_params.num_workers=0', 'model.validation_ds.dataloader_params.num_workers=0', 'trainer.max_epochs=5000', 'trainer.strategy=null', 'trainer.check_val_every_n_epoch=50']
Traceback (most recent call last):
  File "mixer_tts.py", line 27, in main
    model = MixerTTSModel(cfg=cfg.model, trainer=trainer)
  File "/usr/local/lib/python3.7/dist-packages/nemo/collections/tts/models/mixer_tts.py", line 98, in __init__
    if self._train_dl is not None
AttributeError: 'MixerTTSXDataset' object has no attribute 'lm_padding_value'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Thanks!

redoctopus commented 2 years ago

The tutorial is meant to be run with Mixer-TTS rather than Mixer-TTS-X. If you would like to run inference using Mixer-TTS-X, you will need to add another argument to generate_spectrogram() like so:

spectrogram = spec_gen.generate_spectrogram(tokens=tokens, raw_texts=["Hey, this produces speech!"])
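
For a notebook cell that should work with either model, the same idea can be wrapped in a small helper. This is only a sketch: `synthesize_spectrogram` is a hypothetical name, and it assumes the `cond_on_lm_embeddings` attribute checked in the traceback above is what distinguishes Mixer-TTS-X from Mixer-TTS.

```python
def synthesize_spectrogram(spec_gen, text):
    """Hypothetical helper: call generate_spectrogram() for Mixer-TTS or Mixer-TTS-X.

    Assumes spec_gen exposes parse(text=...) and generate_spectrogram(tokens=...),
    as used in the tutorial cell quoted in this issue.
    """
    tokens = spec_gen.parse(text=text)
    kwargs = {"tokens": tokens}

    # Per the traceback, a model with cond_on_lm_embeddings set needs raw_texts
    # (or lm_tokens); otherwise generate_spectrogram() raises ValueError.
    # Models without that attribute are treated as plain Mixer-TTS (assumption).
    if getattr(spec_gen, "cond_on_lm_embeddings", False):
        kwargs["raw_texts"] = [text]

    return spec_gen.generate_spectrogram(**kwargs)


# Usage in the notebook (requires NeMo and the pretrained checkpoint):
# spec_gen = MixerTTSModel.from_pretrained("tts_en_lj_mixerttsx").eval()
# spectrogram = synthesize_spectrogram(spec_gen, "Hey, this produces speech!")
```

Passing the raw text once and letting the helper reuse it keeps the tokens and `raw_texts` in sync, which matters because both must describe the same sentence.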

If the second error stems from a different source, please create a new GitHub Issue to track it.

gedefet commented 2 years ago

Thanks!

The second one comes from the same Colab. I can open another ticket for it, but it is from a later cell in the same notebook.

redoctopus commented 2 years ago

I am not able to reproduce the error. It looks like your command is different from the one in the notebook, which trains Mixer-TTS and therefore uses TTSDataset rather than MixerTTSXDataset.

If you are trying to train Mixer-TTS-X with your own data rather than using the notebook's setup with Mixer-TTS, please open another ticket with the training details.
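
One way to check which dataset class a training config will build, before launching the script, is to read the Hydra `_target_` entry of the dataset section. The sketch below is illustrative only: `dataset_class_name` is a hypothetical helper, the `model.train_ds.dataset._target_` layout is an assumption about the config, and the module paths in the sample dicts are assumed rather than taken from this thread. It expects a plain dict, for example one obtained with `OmegaConf.to_container(cfg)`.

```python
def dataset_class_name(cfg, split="train_ds"):
    """Return the class name from a Hydra-style `_target_` entry, or None.

    Assumes the layout cfg["model"][split]["dataset"]["_target_"]; adjust the
    keys if the config in use is organized differently.
    """
    target = (
        cfg.get("model", {})
        .get(split, {})
        .get("dataset", {})
        .get("_target_")
    )
    # "_target_" holds a dotted import path; the last component is the class.
    return target.rsplit(".", 1)[-1] if target else None


# Illustrative config fragments; the module paths are assumptions.
mixer_tts_cfg = {
    "model": {"train_ds": {"dataset": {
        "_target_": "nemo.collections.tts.torch.data.TTSDataset"}}}
}
mixer_tts_x_cfg = {
    "model": {"train_ds": {"dataset": {
        "_target_": "nemo.collections.tts.torch.data.MixerTTSXDataset"}}}
}

print(dataset_class_name(mixer_tts_cfg))
print(dataset_class_name(mixer_tts_x_cfg))
```

If the printed name is MixerTTSXDataset, the run is using a Mixer-TTS-X setup rather than the notebook's Mixer-TTS setup.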