MIT License
  1. Download the pretrained phonemizer checkpoint
    wget https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/DeepPhonemizer/en_us_cmudict_forward.pt

Preprocess the dataset

  1. Get the GigaSpeech dataset from the official repo
  2. Install FFmpeg with conda, then update it
    conda install ffmpeg=4.3=hf484d3e_0
    conda update ffmpeg
  3. Run the preprocessing script
    python preprocess.py --giga_speech_dir GIGASPEECH --outputdir datasets
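
As a sanity check before training, the sketch below reports which of the files used by the later commands are missing from the output directory. The list of expected entries is inferred from the paths in the training commands in this README, not from preprocess.py itself, and the helper name `missing_outputs` is hypothetical.

```python
from pathlib import Path

# Paths referenced by the training commands in this README. That
# preprocess.py produces exactly this set is an assumption.
EXPECTED = ["audios", "training.txt", "validation.txt", "train.json", "dev.json"]


def missing_outputs(outputdir):
    """Return the expected entries that are absent from outputdir."""
    root = Path(outputdir)
    return [name for name in EXPECTED if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_outputs("datasets")
    if missing:
        print("Missing from datasets/:", ", ".join(missing))
    else:
        print("All expected preprocessing outputs are present.")
```

Run it from the repository root after step 3; an empty result suggests the training commands will find their inputs.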

Train the quantizer and run inference

  1. Train

    cd quantizer/
    python train.py --input_wavs_dir ../datasets/audios \
                --input_training_file ../datasets/training.txt \
                --input_validation_file ../datasets/validation.txt \
                --checkpoint_path ./checkpoints \
                --config config.json
  2. Run inference to get the codes used to train the second stage

    python get_labels.py --input_json ../datasets/train.json \
                     --input_wav_dir ../datasets/audios \
                     --output_json ../datasets/train_q.json \
                     --checkpoint_file ./checkpoints/g_{training_steps}
    python get_labels.py --input_json ../datasets/dev.json \
                     --input_wav_dir ../datasets/audios \
                     --output_json ../datasets/dev_q.json \
                     --checkpoint_file ./checkpoints/g_{training_steps}
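
The `g_{training_steps}` placeholder stands for a generator checkpoint saved at a particular training step. A minimal sketch for picking the most recent one, assuming checkpoints are files named `g_` followed only by digits; the exact naming and the helper name `latest_generator_checkpoint` are assumptions, not taken from the repository.

```python
import re
from pathlib import Path


def latest_generator_checkpoint(ckpt_dir):
    """Return the g_<steps> file with the highest step count, or None."""
    pattern = re.compile(r"g_(\d+)")  # assumed naming scheme
    best_steps, best_path = -1, None
    for path in Path(ckpt_dir).iterdir():
        match = pattern.fullmatch(path.name)
        if match and int(match.group(1)) > best_steps:
            best_steps, best_path = int(match.group(1)), path
    return best_path


if __name__ == "__main__":
    # Only look if the directory from the training command exists.
    if Path("checkpoints").is_dir():
        print(latest_generator_checkpoint("checkpoints"))
```

The printed path can then be passed as `--checkpoint_file` to get_labels.py.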

Train the transformer (the example below is for the 100M version)

cd ..
mkdir ckpt
python train.py \
     --distributed \
     --saving_path ckpt/ \
     --sampledir logs/ \
     --vocoder_config_path quantizer/checkpoints/config.json \
     --vocoder_ckpt_path quantizer/checkpoints/g_{training_steps} \
     --datadir datasets/audios \
     --metapath datasets/train_q.json \
     --val_metapath datasets/dev_q.json \
     --use_repetition_token \
     --ar_layer 4 \
     --ar_ffd_size 1024 \
     --ar_hidden_size 256 \
     --ar_nheads 4 \
     --speaker_embed_dropout 0.05 \
     --enc_nlayers 6 \
     --dec_nlayers 6 \
     --ffd_size 3072 \
     --hidden_size 768 \
     --nheads 12 \
     --batch_size 200 \
     --precision bf16 \
     --training_step 800000 \
     --layer_norm_eps 1e-05
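
To see roughly where the "100M" figure comes from, the sketch below counts weights for the encoder and decoder sizes given above. It assumes standard Transformer encoder and decoder layers and counts only the attention and feed-forward weight matrices; biases, layer norms, embeddings, and the module configured by the `--ar_*` flags are left out, so this is an estimate rather than the model's exact size.

```python
def encoder_layer_params(hidden, ffd):
    # Self-attention (Q, K, V, output projections) plus two feed-forward matrices.
    return 4 * hidden * hidden + 2 * hidden * ffd


def decoder_layer_params(hidden, ffd):
    # Self-attention and cross-attention (4 projections each) plus feed-forward.
    return 8 * hidden * hidden + 2 * hidden * ffd


def estimate_params(enc_nlayers, dec_nlayers, hidden, ffd):
    """Approximate weight count of an encoder-decoder Transformer."""
    return (enc_nlayers * encoder_layer_params(hidden, ffd)
            + dec_nlayers * decoder_layer_params(hidden, ffd))


if __name__ == "__main__":
    # Values from --enc_nlayers, --dec_nlayers, --hidden_size, --ffd_size above.
    total = estimate_params(enc_nlayers=6, dec_nlayers=6, hidden=768, ffd=3072)
    print(f"{total / 1e6:.1f}M weights")  # 99.1M weights
```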

You can view the progress using:

tensorboard --logdir logs/

Run batched inference

You'll have to change speaker_to_text.json; it's just an example.

mkdir infer_samples
CUDA_VISIBLE_DEVICES=0 python infer.py \
    --phonemizer_dict_path en_us_cmudict_forward.pt \
    --model_path ckpt/last.ckpt \
    --config_path ckpt/config.json \
    --input_path speaker_to_text.json \
    --outputdir infer_samples \
    --batch_size {batch_size} \
    --top_p 0.8 \
    --min_top_k 2 \
    --max_output_length {Maximum Output Frames to prevent infinite loop} \
    --phone_context_window 3

Pretrained checkpoints

  1. Quantizer (put it under quantizer/checkpoints/): here

  2. Transformer (100M version) (put it under ckpt/): model, config