Pretrained Chinese TTS model generates bad audio

Describe the bug

I followed the sample code on NGC to generate Chinese TTS audio, but the result is bad.

I also tried other models (e.g. English, German, and Spanish), and they work fine.

Steps/Code to reproduce bug

# Load spectrogram generator
from nemo.collections.tts.models import FastPitchModel
spec_generator = FastPitchModel.from_pretrained(model_name="tts_zh_fastpitch_sfspeech")

# Load Vocoder
from nemo.collections.tts.models import HifiGanModel
model = HifiGanModel.from_pretrained(model_name="tts_zh_hifigan_sfspeech")

# Generate audio
import soundfile as sf
import torch
with torch.no_grad():
    parsed = spec_generator.parse("这些新一代的CPU不只效能惊人。")
    spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
    audio = model.convert_spectrogram_to_audio(spec=spectrogram)
    if isinstance(audio, torch.Tensor):
        audio = audio.to('cpu').numpy()

# Save the audio to disk in a file called speech.wav
sf.write("speech.wav", audio.T, 22050, format='WAV')

Environment details

OS: Ubuntu 20.04
NeMo: branch 1.20.0
Python: 3.8
PyTorch: 2.0.1

speech.zip

This is the audio file generated by the sample code. I can't make out what is being said.
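
Before blaming the model, it may help to rule out a problem with the saved file itself. The following is a minimal sketch using only the Python standard library; the helper name `describe_wav` is hypothetical, and it assumes the file is 16-bit PCM (which, as far as I know, is soundfile's default subtype for WAV).

```python
import math
import struct
import wave


def describe_wav(path):
    """Return basic statistics for a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wav.getframerate()
        n_frames = wav.getnframes()
        raw = wav.readframes(n_frames)

    # Unpack little-endian signed 16-bit samples (channels interleaved).
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    if not samples:
        return {"sample_rate": rate, "duration_s": 0.0, "peak": 0.0, "rms": 0.0}

    # Normalize to the [-1, 1] range used by floating-point audio.
    peak = max(abs(s) for s in samples) / 32768.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / 32768.0
    return {
        "sample_rate": rate,
        "duration_s": n_frames / rate,
        "peak": peak,
        "rms": rms,
    }
```

Calling `describe_wav("speech.wav")` should report a sample rate of 22050 and a duration of a few seconds for the sentence above. A near-zero peak or an implausible duration would point to the vocoder or the save step rather than to the spectrogram generator.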

I used the following steps to train the model from scratch on my own RTX 3080 Ti:

# [pre] download dataset from https://catalog.ngc.nvidia.com/orgs/nvidia/resources/sf_bilingual_speech_zh_en
python3 scripts/dataset_processing/tts/sfbilingual/get_data.py \
    --data-root data_sfbi \
    --val-size 0.1 \
    --test-size 0.2 \
    --seed-for-ds-split 100 \
    --manifests-path mani_sfbi

python3 scripts/dataset_processing/tts/extract_sup_data.py \
    --config-path sfbilingual/ds_conf \
    --config-name ds_for_fastpitch_align.yaml \
    manifest_filepath=mani_sfbi/train_manifest.json \
    sup_data_path=sup_sfbi

python3 examples/tts/fastpitch.py --config-path conf/zh/ \
    --config-name fastpitch_align_22050.yaml \
    model.train_ds.dataloader_params.batch_size=16 \
    model.validation_ds.dataloader_params.batch_size=16 \
    train_dataset=mani_sfbi/train_manifest.json \
    validation_datasets=mani_sfbi/val_manifest.json \
    sup_data_path=sup_sfbi \
    exp_manager.exp_dir=resultBi \
    trainer.max_epochs=200 \
    trainer.check_val_every_n_epoch=1 \
    pitch_mean=226.7923126220703 \
    pitch_std=59.07200622558594 \
    +exp_manager.create_wandb_logger=true \
    +exp_manager.wandb_logger_kwargs.name="tutorial" \
    +exp_manager.wandb_logger_kwargs.project="sfbi"
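
The `pitch_mean` and `pitch_std` overrides in the command above are dataset statistics. The sketch below shows one way such values could be computed. It assumes they are the mean and sample standard deviation over voiced (non-zero) pitch frames, which may not match `extract_sup_data.py` exactly, so prefer the values that script reports. The function name and the list-of-lists input are hypothetical.

```python
import statistics


def pitch_stats(contours):
    """Mean and sample std (Hz) over the voiced frames of several pitch contours."""
    # Frames with pitch 0 are treated as unvoiced and skipped (an assumption).
    voiced = [f0 for contour in contours for f0 in contour if f0 > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    return statistics.mean(voiced), statistics.stdev(voiced)
```

If the statistics passed at training time differ a lot from the ones a checkpoint was trained with, the predicted pitch will be normalized incorrectly, so it is worth recomputing them whenever the dataset split changes.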

Finally, I got this model checkpoint. You can download it to your local directory.

It can be used with the following snippet:

from nemo.collections.tts.models import FastPitchModel
from nemo.collections.tts.models import HifiGanModel

spec_generator = FastPitchModel.restore_from("~/Downloads/FastPitch.nemo")
model = HifiGanModel.from_pretrained(model_name="tts_zh_hifigan_sfspeech")

# Generate audio
import soundfile as sf
import torch
with torch.no_grad():
    parsed = spec_generator.parse("这些新一代的CPU不只效能惊人。")
    spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
    audio = model.convert_spectrogram_to_audio(spec=spectrogram)
    if isinstance(audio, torch.Tensor):
        audio = audio.to('cpu').numpy()

# Save the audio to disk in a file called speech.wav
sf.write("speech.wav", audio.T, 22050, format='WAV')

The result sounds correct (definitely Chinese): speech.zip

I therefore suspect that the default tts_zh_fastpitch_sfspeech model is corrupted. @titu1994 @XuesongYang Could you help retrain the tts_zh_fastpitch_sfspeech model? Or you can directly use mine :)
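
One way to narrow down whether the difference lies in text processing or in the weights would be to compare the token IDs that the two FastPitch models produce for the same sentence. The helper below is a hypothetical sketch that works on plain lists of integers, so it does not depend on NeMo itself.

```python
def first_token_mismatch(tokens_a, tokens_b):
    """Return the index of the first differing token, or None if identical."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    if len(tokens_a) != len(tokens_b):
        # One sequence is a strict prefix of the other.
        return min(len(tokens_a), len(tokens_b))
    return None
```

Assuming `parse` returns a batch-shaped tensor (the snippets above pass it straight to `generate_spectrogram`), the inputs could be obtained with something like `pretrained.parse(text)[0].tolist()` and `retrained.parse(text)[0].tolist()`. A result of `None` would mean the tokenization agrees, which would make the published weights the more likely culprit.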

Thank you for your help!

Why did this issue stall here? The official models are still broken. I found that many zh models are damaged, discontinued, or removed. Is it for political reasons?

You are right. Many ZH models are damaged, and it seems nobody will fix them. It is not for political reasons; it is just "big company disease": this repo and its related models are not Nvidia's core business, so Nvidia's developers will not try very hard to fix them. For us "wild developers", the only choice is to fix them by training them ourselves.

The truth is that big companies have a lot of hardware resources but don't even want to fix a damaged model that they themselves published for the open-source community, whereas individual developers want to contribute to open-source projects but don't have enough hardware resources (mainly GPUs). For example, I am trying to build an open-source multi-modal model repo, but the lack of GPUs makes progress terribly slow.

You are a warm-hearted person. I used to do game development and have recently been trying to understand AI. However, open source in the AI field is really strange. Many projects go unmaintained after they are created, and it is difficult to communicate with some maintainers in a human way. (I guess some AI assistants are acting as relay operators? Their behavior often doesn't look like that of real people.) After discovering that NeMo's Chinese voice models were not being maintained, I switched to a smaller framework, netease-youdao/EmotiVoice, but its members are even weirder. They often talk past you, and normal communication is sometimes impossible. Questions don't necessarily get a corresponding reply, and they may post many completely irrelevant replies, just like a mobile assistant from more than 10 years ago. It's weird. The atmosphere in the AI field is so different from other software fields.

How many GPUs do you need?

NVIDIA / NeMo

Pretrained Chinese TTS model generates bad audio #7389