VALL-E

Inference code for Wenetspeech4TTS/Audiodec-Valle-Wenetspeech4TTS

Installation

  git clone https://github.com/dukGuo/valle-audiodec.git
  cd valle-audiodec
  pip install -r requirements.txt

Download pre-trained models

AudioDec

We use AudioDec instead of EnCodec as the speech tokenizer to further improve audio quality.

Download the complete exp folder (released as exp.zip), unzip it, and move it to the AudioDec/exp directory:

  cd valle-audiodec
  wget https://github.com/facebookresearch/AudioDec/releases/download/pretrain_models_v02/exp.zip
  unzip exp.zip
  mv exp AudioDec/exp
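If AudioDec/exp already exists as a directory when the mv command above runs, mv moves the extracted folder inside it (producing AudioDec/exp/exp) instead of replacing it. The helper below is a minimal sketch for catching that and other placement mistakes before running inference. The function name check_audiodec_exp is hypothetical, and it only checks the directory layout, not the individual model files inside exp.

```shell
# Hypothetical helper: sanity-check where the AudioDec "exp" folder ended up.
# Assumption: the code expects the extracted folder at AudioDec/exp (as stated
# above). The model files inside it are NOT validated here.
check_audiodec_exp() {
  local target="${1:-AudioDec/exp}"

  if [ ! -d "$target" ]; then
    echo "missing: $target (did unzip/mv run from the repo root?)" >&2
    return 1
  fi

  # When AudioDec/exp already existed, `mv exp AudioDec/exp` nests the
  # folder, leaving the real contents one level too deep.
  if [ -d "$target/exp" ]; then
    echo "nested: found $target/exp; move its contents up one level" >&2
    return 2
  fi

  # An empty directory usually means the archive was never extracted into it.
  if [ -z "$(ls -A "$target")" ]; then
    echo "empty: $target contains no files" >&2
    return 3
  fi

  echo "ok: $target"
}
```

Source the snippet and run check_audiodec_exp from the repository root; pass a different path as the first argument if your layout differs.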

VALL-E

Checkpoints are available at Wenetspeech4TTS/Audiodec-Valle-Wenetspeech4TTS:

VALL-E Basic: VALL-E trained on the WenetSpeech4TTS Basic subset
VALL-E Standard: VALL-E Basic fine-tuned on the WenetSpeech4TTS Standard subset
VALL-E Premium: VALL-E Standard fine-tuned on the WenetSpeech4TTS Premium subset

Speech Samples

https://wenetspeech4tts.github.io/wenetspeech4tts

https://rxy-j.github.io/HPMD-TTS

Inference

  cd valle-audiodec
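  python infer_tts.py \
    --config config/hparams.yaml \
    --ar_ckpt ckpt/basic/ar.pt \
    --nar_ckpt ckpt/basic/nar.pt \
    --prompt_wav test/prompt_wavs/test_1.wav \
    --prompt_text "在夏日阴凉的树荫下，鸭妈妈孵着鸭宝宝。" \
    --text "负责指挥的将军在一旁交代着注意事项，每个人在上面最多只能待九十秒。"

Here --prompt_text is the transcript of the prompt audio given by --prompt_wav, and --text is the sentence to synthesize in the prompt speaker's voice.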
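To compare the Basic, Standard and Premium variants on the same prompt, the command above can be wrapped in a small function. This is a sketch under an assumed layout: only ckpt/basic appears in this README, so ckpt/standard and ckpt/premium (each holding ar.pt and nar.pt) are hypothetical paths, and run_variant is a hypothetical name. It checks that the inputs exist before calling infer_tts.py, and it does not control where infer_tts.py writes its output.

```shell
# Hypothetical wrapper around infer_tts.py.
# Assumption: checkpoints for each variant live in ckpt/<variant>/ar.pt and
# ckpt/<variant>/nar.pt, mirroring the ckpt/basic example above.
run_variant() {
  local variant="$1" prompt_wav="$2" prompt_text="$3" text="$4"
  local ar="ckpt/$variant/ar.pt" nar="ckpt/$variant/nar.pt"
  local f

  # Fail early with a clear message instead of a Python traceback.
  for f in "$ar" "$nar" "$prompt_wav"; do
    if [ ! -f "$f" ]; then
      echo "missing file: $f" >&2
      return 1
    fi
  done

  python infer_tts.py \
    --config config/hparams.yaml \
    --ar_ckpt "$ar" \
    --nar_ckpt "$nar" \
    --prompt_wav "$prompt_wav" \
    --prompt_text "$prompt_text" \
    --text "$text"
}
```

Example usage from the repository root (each argument is quoted so text containing spaces is passed intact):

  for v in basic standard premium; do
    run_variant "$v" test/prompt_wavs/test_1.wav "在夏日阴凉的树荫下，鸭妈妈孵着鸭宝宝。" "负责指挥的将军在一旁交代着注意事项，每个人在上面最多只能待九十秒。"
  done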

To improve audio quality and keep volume levels consistent across different inputs, we recommend normalizing the prompt waveform before inference. The following sox command resamples the prompt to $sample_rate, writes 16-bit samples, and normalizes its peak level to -6 dBFS:

  sox "$in_wave" -r "$sample_rate" -b 16 --norm=-6 "$out_wave"
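The sox command above can be wrapped in a function that validates its arguments before touching any audio. This is a minimal sketch: normalize_prompt is a hypothetical name, and the target sample rate is left to the caller because this README does not state which rate the AudioDec model expects.

```shell
# Hypothetical helper wrapping the sox normalization command above.
# Assumption: the caller passes the sample rate expected by the AudioDec
# model in use; no default is assumed here.
normalize_prompt() {
  if [ "$#" -ne 3 ]; then
    echo "usage: normalize_prompt IN_WAV OUT_WAV SAMPLE_RATE" >&2
    return 2
  fi
  local in_wave="$1" out_wave="$2" sample_rate="$3"

  if [ ! -f "$in_wave" ]; then
    echo "input not found: $in_wave" >&2
    return 1
  fi
  if ! command -v sox >/dev/null 2>&1; then
    echo "sox not found on PATH" >&2
    return 127
  fi

  # -r: resample to sample_rate; -b 16: 16-bit output samples;
  # --norm=-6: apply gain so the peak level sits at -6 dBFS.
  sox "$in_wave" -r "$sample_rate" -b 16 --norm=-6 "$out_wave"
}
```

For example, normalize_prompt test/prompt_wavs/test_1.wav prompt_norm.wav "$sample_rate" writes a normalized copy that can then be passed to --prompt_wav.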

References

This repository builds on the following repositories.
