Camb-ai / MARS5-TTS

MARS5 speech model (TTS) from CAMB.AI
https://www.camb.ai
GNU Affero General Public License v3.0

MARS5: A novel speech model for insane prosody

![MARS5 Banner](assets/github-banner.png)

Why MARS5? | Model Architecture | Samples | Camb AI Website

[![GitHub Repo stars](https://img.shields.io/github/stars/Camb-ai/MARS5-TTS?style=social)](https://github.com/Camb-ai/MARS5-TTS/stargazers) [![Join our Discord](https://discordapp.com/api/guilds/1107565548864290840/widget.png)](https://discord.gg/FFQNCSKSXX) [![HuggingFace badge](https://img.shields.io/badge/%F0%9F%A4%97HuggingFace-Join-yellow)](https://huggingface.co/CAMB-AI/MARS5-TTS) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Camb-ai/mars5-tts/blob/master/mars5_demo.ipynb)

Approach

This is the repo for the MARS5 English speech model (TTS) from CAMB.AI.

The model follows a two-stage AR-NAR pipeline with a novel NAR component (see the Architecture docs for details).

With just 5 seconds of audio and a snippet of text, MARS5 can generate speech even for prosodically hard and diverse scenarios like sports commentary, anime and more. Check out our demo:

https://github.com/Camb-ai/MARS5-TTS/assets/23717819/3e191508-e03c-4ff9-9b02-d73ae0ebefdd

Watch the full video here: YouTube

Mars 5 simplified diagram

Figure: The high-level architecture flow of MARS5. Given text and a reference audio, coarse (L0) encodec speech features are obtained through an autoregressive transformer model. Then, the text, reference, and coarse features are refined in a multinomial DDPM model to produce the remaining encodec codebook values. The output of the DDPM is then vocoded to produce the final audio.

Because the model is trained on raw audio together with byte-pair-encoded text, it can be steered with punctuation and capitalization. For example, to add a pause, add a comma at that point in the transcript; to emphasize a word, write it in capital letters. This gives a fairly natural way to guide the prosody of the generated output.
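As a small illustration of the steering described above, the sketch below prepares transcript variants before they are passed to the model. The helper names `emphasize` and `add_pause_after` are hypothetical and not part of MARS5. They only edit the text, and how strongly the model reacts to each edit is something to check by listening.

```python
import re


def emphasize(transcript: str, word: str) -> str:
    """Upper-case every whole-word occurrence of `word` (case-insensitive)."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    return pattern.sub(lambda m: m.group(0).upper(), transcript)


def add_pause_after(transcript: str, word: str) -> str:
    """Insert a comma after the first whole-word occurrence of `word`
    that is not already followed by punctuation."""
    pattern = re.compile(
        rf"\b({re.escape(word)})\b(?![,.;:!?])", flags=re.IGNORECASE
    )
    return pattern.sub(r"\1,", transcript, count=1)


text = "I never said she stole the money"
print(emphasize(text, "never"))
print(add_pause_after(text, "said"))
```

Either variant can then be used as the text to synthesize in place of the plain transcript.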

Speaker identity is specified using a reference audio file between 2 and 12 seconds long, with lengths around 6 seconds giving the best results. Further, if you provide the transcript of the reference, MARS5 can do a 'deep clone', which improves the quality of the cloning and output at the cost of taking a bit longer to produce the audio. For more on this and other performance and model details, please see the docs folder.
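A quick pre-flight check on the reference clip can catch clips outside the recommended range before any model code runs. This is a minimal sketch, assuming the clip is a mono waveform whose sample count and sample rate are known; the function name `check_reference_length` is hypothetical, and the 2 and 12 second bounds are taken from the paragraph above.

```python
def check_reference_length(num_samples: int, sample_rate: int,
                           min_s: float = 2.0, max_s: float = 12.0) -> str:
    """Classify a reference clip by duration: 'too short', 'too long' or 'ok'."""
    duration_s = num_samples / sample_rate
    if duration_s < min_s:
        return "too short"
    if duration_s > max_s:
        return "too long"
    return "ok"


# Example: a 6-second clip sampled at 24 kHz.
print(check_reference_length(6 * 24000, 24000))
```

With a loaded waveform tensor `wav`, the sample count would be `wav.shape[-1]`.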

Quickstart

We use torch.hub to make loading the model easy -- no cloning of the repo needed. The steps to perform inference are simple:

  1. Installation using pip:

    Requirements:

    • Python >= 3.10
    • Torch >= 2.0
    • Torchaudio
    • Librosa
    • Vocos
    • Encodec
    • safetensors
    • regex
pip install --upgrade torch torchaudio librosa vocos encodec safetensors regex
  2. Load models: load the MARS5 AR and NAR models from torch hub:
import torch, librosa

mars5, config_class = torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', trust_repo=True)
# `mars5` contains the AR and NAR models, as well as the inference code.
# `config_class` holds tunable inference settings such as temperature.

(Optional) Load the model from Hugging Face (make sure the repository is cloned):

from inference import Mars5TTS, InferenceConfig as config_class
import torch, librosa

mars5 = Mars5TTS.from_pretrained("CAMB-AI/MARS5-TTS")
  3. Pick a reference and optionally its transcript:
# Load reference audio between 1-12 seconds.
wav, sr = librosa.load('<path to arbitrary 24kHz waveform>.wav',
                       sr=mars5.sr, mono=True)
wav = torch.from_numpy(wav)
ref_transcript = "<transcript of the reference audio>"

Note: The reference transcript is optional. Pass it if you wish to do a deep clone.

MARS5 supports two kinds of inference: a fast one that does not need the transcript of the reference (a shallow clone), and a slower but typically higher-quality one that does (a deep clone). To use the deep clone, you need the prompt transcript. See the model architecture for more info on this.
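One way to make the choice between the two modes explicit is to derive the flag from whether a usable transcript is available. The helper below is a hypothetical sketch, not part of MARS5; it only decides the boolean that is later passed as `deep_clone` to the inference config.

```python
from typing import Optional


def pick_clone_mode(ref_transcript: Optional[str]) -> bool:
    """Return True (deep clone) only when a non-empty transcript is available."""
    return bool(ref_transcript and ref_transcript.strip())


print(pick_clone_mode("We shall fight on the beaches."))  # transcript known
print(pick_clone_mode(""))                                # no transcript
```

The result can be passed straight through, for example `config_class(deep_clone=pick_clone_mode(ref_transcript), ...)`.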

  4. Perform the synthesis:
# Pick whether you want a deep or shallow clone.
# Set to False if you don't know the prompt transcript or want fast inference.
# Set to True if you know the transcript and want the highest quality.
deep_clone = True
# Below you can tune other inference settings, like top_k, temperature, top_p, etc...
cfg = config_class(deep_clone=deep_clone, rep_penalty_window=100,
                      top_k=100, temperature=0.7, freq_penalty=3)

ar_codes, output_audio = mars5.tts("The quick brown rat.", wav,
          ref_transcript,
          cfg=cfg)
# output_audio is (T,) shape float tensor corresponding to the 24kHz output audio.

That's it! These default settings provide pretty good results, but feel free to tune the inference settings to optimize the output for your particular use case. See the InferenceConfig code or the demo notebook for info and docs on all the different inference settings.
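To listen to the result you will usually want to write it to disk. The sketch below uses only the Python standard library and assumes the output is a mono sequence of float samples roughly in the range [-1, 1]; the function name `save_wav_16bit` is hypothetical, and values outside that range are clipped rather than rescaled.

```python
import struct
import wave


def save_wav_16bit(path: str, samples, sample_rate: int = 24000) -> None:
    """Write mono float samples in [-1, 1] to `path` as 16-bit PCM WAV."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)        # mono
        f.setsampwidth(2)        # 2 bytes per sample = 16 bit
        f.setframerate(sample_rate)
        f.writeframes(bytes(frames))


# Assumed usage with the tensor returned by mars5.tts(...):
# save_wav_16bit("output.wav", output_audio.cpu().tolist(), mars5.sr)
```

Libraries such as torchaudio can also write tensors to audio files directly; the standard-library route is shown here only to keep the example dependency-free.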

Some tips for best quality:

Or Use Docker

Pull from DockerHub

You can directly pull the docker image from our DockerHub page.

Build On Your Own

You can build a custom image from the provided Dockerfile in this repo by running the following command.

cd MARS5-TTS
docker build -t mars5ttsimage ./docker

Note: This image should be used as a base image on top of which you can add your custom inference script in a Dockerfile or docker-compose. Images that directly generate output will be added to Docker Hub, and as Dockerfiles in this repo, soon.

Model Details

Checkpoints

The checkpoints for MARS5 are provided under the Releases tab of this GitHub repo. We provide two checkpoints:

The checkpoints are provided both as PyTorch .pt checkpoints and as safetensors .safetensors checkpoints. By default, torch.hub.load() loads the safetensors version, but you can choose the checkpoint format with the ckpt_format='safetensors' or ckpt_format='pt' argument in the torch.hub.load() call. E.g. to force the safetensors format:

torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', ckpt_format='safetensors')

Or to force the PyTorch .pt format when loading the checkpoints:

torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', ckpt_format='pt')

Hardware Requirements:

You must be able to store at least 750M+450M params on GPU, and do inference with 750M of active parameters.
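For a rough sense of what this means in GPU memory, the arithmetic below converts the two parameter counts above into bytes. It is a back-of-the-envelope sketch: it counts weights only, ignores activations, caches and framework overhead, and the numeric precision actually used at inference is an assumption, so both 4-byte (fp32) and 2-byte (fp16) cases are shown.

```python
PARAM_COUNTS = (750e6, 450e6)  # the two parameter counts quoted above


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


total_params = sum(PARAM_COUNTS)
print(f"fp32 weights: {weight_memory_gb(total_params, 4):.1f} GB")  # 4.8 GB
print(f"fp16 weights: {weight_memory_gb(total_params, 2):.1f} GB")  # 2.4 GB
```

Actual usage during inference will be higher than these figures, so treat them as a lower bound when sizing a GPU.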

If you do not have the necessary hardware requirements and just want to use MARS5 in your applications, you can use it via our API. If you need some extra credits to test it for your use case, feel free to reach out to help@camb.ai.

Roadmap and tasks

MARS5 is not perfect at the moment, and we are working on improving its quality, stability, and performance. Rough areas we are looking to improve, and where we welcome contributions:

Specific tasks

If you would like to contribute any improvement to MARS5, please feel free to contribute (guidelines below).

Contributions

We welcome any contributions to improving the model. As you may find when experimenting, it can produce really great results, but it can still be improved to create excellent outputs consistently. We'd also love to see how you used MARS5 in different scenarios; please use the 🙌 Show and tell category in Discussions to share your examples.

Contribution format:

The preferred way to contribute to our repo is to fork the master repository on GitHub:

  1. Fork the repo on GitHub.
  2. Clone your fork and set this repo as upstream: git remote add upstream git@github.com:Camb-ai/mars5-tts.git
  3. Create a new local branch, make your changes, and commit them.
  4. Push your changes to a new branch on your fork: git push --set-upstream origin <NAME-NEW-BRANCH>
  5. On GitHub, go to your fork and click 'Pull Request' to begin the PR process. Please make sure to include a description of what you did/fixed.

License

We are open-sourcing MARS5 in English under GNU AGPL 3.0, but you can request to use it under a different license by emailing help@camb.ai.

Join Our Team

We're an ambitious team, globally distributed, with a singular aim of making everyone's voice count. At CAMB.AI, we're a research team of Interspeech-published, Carnegie Mellon, ex-Siri engineers and we're looking for you to join our team.

We're actively hiring; please drop us an email at ack@camb.ai if you're interested. Visit our careers page for more info.

Community

Join the CAMB.AI community on the Forum and Discord to share any suggestions, feedback, or questions with our team.

Acknowledgements

Parts of code for this project are adapted from the following repositories -- please make sure to check them out! Thank you to the authors of: