VALL'E

An unofficial PyTorch implementation of VALL-E, utilizing the EnCodec encoder/decoder.

Requirements

Besides a working PyTorch environment, the only hard requirement is espeak-ng for phonemizing text.
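
On Debian/Ubuntu-based systems, for example, it can usually be installed from the system package manager (the package name may differ on other distributions):

sudo apt install espeak-ng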

AMD systems with ROCm are mostly supported, but performance will vary.

Install

Simply run pip install git+https://git.ecker.tech/mrq/vall-e or pip install git+https://github.com/e-c-k-e-r/vall-e.

I've tested this repo under Python versions 3.10.9, 3.11.3, and 3.12.3.

Pre-Trained Model

My pre-trained weights can be acquired from here.

A script to set up a proper environment and download the weights can be invoked with ./scripts/setup.sh. This will automatically create a venv and download the ar+nar-llama-8 weights and config file to the right place.
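
A minimal bootstrap, assuming you are working from a clone of one of the repositories above rather than a pip install, might look like:

git clone https://git.ecker.tech/mrq/vall-e
cd vall-e
./scripts/setup.sh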

When inferencing, either through the web UI or the CLI, if no model is passed, the default model will be downloaded automatically and should keep itself up to date.

Train

Training is very dependent on:

Try Me

To quickly test if a configuration works, you can run python -m vall_e.models.ar_nar --yaml="./data/config.yaml"; a small trainer will overfit a provided utterance.

Leverage Your Own Dataset

If you already have a dataset, for example your own large corpus or one intended for finetuning, you can use it instead; the full pipeline is recapped in a short sketch after the list below.

  1. Set up a venv with https://github.com/m-bain/whisperX/.

    • At the moment only WhisperX is utilized; using other variants, such as faster-whisper, is left as an exercise for the user.
    • It's recommended to use a dedicated virtualenv specifically for transcribing, as WhisperX will break a few dependencies.
    • The following commands should work:
      python3 -m venv venv-whisper
      source ./venv-whisper/bin/activate
      pip3 install torch torchvision torchaudio
      pip3 install git+https://github.com/m-bain/whisperX/
  2. Populate your source voices under ./voices/{group name}/{speaker name}/.

  3. Run python3 -m vall_e.emb.transcribe. This will generate a transcription with timestamps for your dataset.

    • If you're interested in using a different model, edit the script's model_name and batch_size variables.
  4. Run python3 -m vall_e.emb.process. This will phonemize the transcriptions and quantize the audio.

    • If you're using a Descript-Audio-Codec based model, be sure to set the sample rate and audio backend accordingly.
  5. Run python3 -m vall_e.emb.similar. This will calculate the top-k most similar utterances for each utterance for use with sampling.

    • Doing this will help the model follow the input prompt more strongly, at the possible "cost" of the model not learning how to "infer" the target speaker AND prosody.
  6. Copy ./data/config.yaml to ./training/config.yaml. Customize the training configuration and populate your dataset.training list with the values stored under ./training/dataset/list.json.

    • Refer to ./vall_e/config.py for additional configuration details.
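
Condensed, and with steps 1 and 2 already done, the pipeline above amounts to something like the following:

python3 -m vall_e.emb.transcribe   # step 3: transcribe with timestamps (WhisperX)
python3 -m vall_e.emb.process      # step 4: phonemize transcriptions and quantize audio
python3 -m vall_e.emb.similar      # step 5: compute top-k similar utterances for sampling
cp ./data/config.yaml ./training/config.yaml   # step 6: then customize and populate dataset.training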

Dataset Formats

Two dataset formats are supported:

Training

For a single GPU, simply run python3 -m vall_e.train --yaml="./training/config.yaml".

For multiple GPUs, or exotic distributed training:
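
As a rough sketch only (the exact launcher and flags depend on your backend, hardware, and environment; treat these invocations as assumptions rather than canonical commands), a multi-GPU run might be launched with torchrun for the local backend, or with the deepspeed launcher for the deepspeed backend:

torchrun --nproc_per_node=2 -m vall_e.train --yaml="./training/config.yaml"   # local backend, 2 GPUs
deepspeed --module vall_e.train --yaml="./training/config.yaml"               # deepspeed backend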

You can enter save to save the state at any time, or quit to save and quit training.

The lr command will also let you adjust the learning rate on the fly. For example: lr 1.0e-3 will set the learning rate to 0.001.

Some additional flags can be passed as well:

Finetuning

Finetuning can be done by training the full model, or using a LoRA.

Finetuning the full model is done the same way as training a model, but be sure to have the weights in the correct spot, as if you're loading them for inferencing.

For training a LoRA, add the following block to your config.yaml:

loras:
- name: "arbitrary name" # whatever you want
  rank: 128 # dimensionality of the LoRA
  alpha: 128 # scaling factor of the LoRA
  training: True

And that's it. Training the LoRA is done with the same command. Depending on the rank and alpha specified, the loss may be higher than it should be, as the LoRA weights are initialized to appropriately random values. I found that a rank and alpha of 128 work fine.

To export your LoRA weights, run python3 -m vall_e.export --lora --yaml="./training/config.yaml". You should be able to have the LoRA weights loaded from a training checkpoint automagically for inferencing, but export them just to be safe.

Plotting Metrics

Included is a helper script to parse the training metrics. Simply invoke it with, for example: python3 -m vall_e.plot --yaml="./training/config.yaml"

You can specify which X and Y values you want to plot against by passing --xs tokens_processed --ys loss.nll stats.acc.
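
Putting both together, a typical invocation might be:

python3 -m vall_e.plot --yaml="./training/config.yaml" --xs tokens_processed --ys loss.nll stats.acc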

Notices

Training Under Windows

As training under deepspeed and Windows is not (easily) supported, simply change trainer.backend to local in your config.yaml to use the local training backend.

Creature comforts like float16, amp, and multi-GPU training should work under the local backend, but extensive testing still needs to be done to ensure it all functions.
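
For example, the relevant portion of config.yaml might look something like this (other trainer keys omitted; see ./vall_e/config.py for the full set of options):

trainer:
  backend: local # use the local training backend instead of deepspeed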

Backend Architectures

As the core of VALL-E makes use of a language model, various LLM architectures can be supported and slotted in. Currently supported LLM architectures:

For audio backends:

llama-based models also support different attention backends:

The wide support for various backends exists solely while I figure out which is the "best" for a core foundation model.

ROCm Flash Attention

ROCm/flash-attention currently does not support Navi3 cards (gfx11xx), so first-class Flash Attention support on Navi3 is a bit of a mess. Using the howiejay/navi_support branch can provide inference support, but not training support (an error is thrown during the backwards pass):
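
A sketch of installing from that branch inside your venv (the exact package spec is an assumption; adjust to your environment):

pip install -U git+https://github.com/ROCm/flash-attention@howiejay/navi_support --no-build-isolation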

Export

To export the models, run: python -m vall_e.export --yaml=./training/config.yaml.

This will export the latest checkpoints, for example under ./training/ckpt/ar+nar-retnet-8/fp32.pth, to be loaded on any system with PyTorch, and will include additional metadata such as the symmap used and training stats.

Despite being called fp32.pth, you can export it to a different precision type with --dtype=float16|bfloat16|float32.

You can also export to safetensors with --format=sft, and fp32.sft will be exported instead.
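
For example, combining the flags above to export a float16 safetensors checkpoint:

python3 -m vall_e.export --yaml="./training/config.yaml" --dtype=float16 --format=sft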

Synthesis

To synthesize speech: python -m vall_e <text> <ref_path> <out_path> --yaml=<yaml_path> (or --model=<model_path>)
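
A concrete example (the text and paths here are placeholders):

python3 -m vall_e "The quick brown fox jumps over the lazy dog." ./reference.wav ./output.wav --yaml="./training/config.yaml"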

Some additional flags you can pass are:

And some experimental sampling flags you can use too (your mileage will definitely vary, but most of these are bandaids for a bad AR):

Speech-to-Text

The ar+nar-tts+stt-llama-8 model has received additional training for a speech-to-text task against EnCodec-encoded audio.

Currently, the model only transcribes back into the IPA phonemes it was trained against, as an additional model or external program is required to translate the IPA phonemes back into text.

Web UI

A Gradio-based web UI is accessible by running python3 -m vall_e.webui. You can, optionally, pass:

Emergent Behavior

The model can be prompted in creative ways to yield some interesting behaviors:

Inference

Synthesizing speech is simple:

All the additional knobs have a description that can be correlated to the above CLI flags.

Speech-To-Text phoneme transcriptions for models that support it can be done using the Speech-to-Text tab.

Dataset

This tab currently only features exploring a dataset already prepared and referenced in your config.yaml. You can select a registered voice, and have it randomly sample an utterance.

In the future, this should contain the necessary niceties to process raw audio into a dataset to train/finetune through, without needing to invoke the above commands to prepare the dataset.

Settings

So far, this only allows you to load a different model without needing to restart. The previous model should seamlessly unload, and the new one will load in place.

To-Do

Caveats

Despite how lightweight it is in comparison to other TTS's I've meddled with, there are still some caveats, be it with the implementation or model weights:

Notices and Citations

Unless otherwise credited/noted in this README or within the designated Python file, this repository is licensed under AGPLv3.

@article{wang2023neural,
  title={Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers},
  author={Wang, Chengyi and Chen, Sanyuan and Wu, Yu and Zhang, Ziqiang and Zhou, Long and Liu, Shujie and Chen, Zhuo and Liu, Yanqing and Wang, Huaming and Li, Jinyu and others},
  journal={arXiv preprint arXiv:2301.02111},
  year={2023}
}
@article{defossez2022highfi,
  title={High Fidelity Neural Audio Compression},
  author={Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  journal={arXiv preprint arXiv:2210.13438},
  year={2022}
}