ggerganov / ggml

Tensor library for machine learning
MIT License

Text to speech models in GGML? #59

Open simplejackcoder opened 1 year ago

simplejackcoder commented 1 year ago

@ggerganov do you have any interest in producing more models in GGML format?

I'm now convinced your approach of a zero-dependency, no-memory-allocation, CPU-first ideology will make it accessible to everyone.

You've got LLaMA and Whisper; what's remaining is the reverse of Whisper.

What are your thoughts?

eolasd commented 7 months ago

Another upvote here for a text-to-speech cpp. Since I am a complete noob with this stuff, could someone give me the high-level steps needed for this to happen? Does it require GPU time to re-train/quantize models, or is it mostly just writing code to port encoders, etc.?

Thanks, and appreciate all the work the community has put into making this stuff work for the GPU-poor!

fakerybakery commented 7 months ago

@eolasd It shouldn't require GPU time. For example, w/ llama.cpp, you don't need to retrain the models. Probably mostly just porting the Python inference code to C++ and getting the models to work with GGML, right?
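To make that concrete, the core of a ggml port usually boils down to three steps: export the PyTorch weights to a binary file, load them into ggml tensors, and rebuild the forward pass as a compute graph. A minimal sketch of that pattern (tensor shapes are made up and the exact function names vary a bit between ggml versions):

```cpp
#include "ggml.h"

int main() {
    // fixed-size arena up front -- ggml does not allocate per-op at runtime
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // stand-ins for weights that would be read from the converted model file
    struct ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 32);
    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 64);

    // rebuild one layer of the Python forward pass: y = relu(W @ x)
    struct ggml_tensor * y = ggml_relu(ctx, ggml_mul_mat(ctx, w, x));

    // describe the computation as a graph, then run it on the CPU
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, y);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 4);

    ggml_free(ctx);
    return 0;
}
```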

eolasd commented 7 months ago

Time for me to learn C++, I think!

MichaelWengren commented 6 months ago

StyleTTS 2 is truly state-of-the-art. Just look at the quality and speed: https://huggingface.co/spaces/styletts2/styletts2
This is a great candidate for a cpp implementation.

cmp-nct commented 6 months ago

Yep, it's the best out there; as soon as multi-language support lands, nothing can stop it. The quality is better and the compute requirements are a fraction of what they used to be.

MichaelWengren commented 6 months ago

Is anyone working on a StyleTTS2.cpp? Support for other languages is important, but even in English it would be an incredibly useful thing, especially given the speed at which it works.

kskelm commented 6 months ago

StyleTTS2 really does look amazing. As you note, even the currently "just English" version would be tremendously beneficial. Presumably, support for other languages is more of a model change than a coding change anyway.

cmp-nct commented 6 months ago

What I read out of the discussions so far is that the multi-language training material has been assembled, someone promised to sponsor 8xA100s for the ~3000 hours of training time, and that last step is currently open.

fakerybakery commented 6 months ago

The main issue with StyleTTS is that it uses IPA phonemes. Right now espeak is the only lib that works with STTS, and it's GPL-licensed.
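For anyone wondering what that dependency actually does: the model consumes IPA phonemes rather than raw text, so something has to phonemize the input first. A rough sketch of that step using espeak-ng's C API (from memory, so the header path and flag values may differ between versions; this just shows the shape of the dependency, not how StyleTTS itself calls it):

```cpp
#include <espeak-ng/speak_lib.h>
#include <cstdio>

int main() {
    // initialize without opening an audio device; NULL = default data path
    espeak_Initialize(AUDIO_OUTPUT_SYNCHRONOUS, 0, NULL, 0);
    espeak_SetVoiceByName("en-us");

    const char * text = "Hello world";
    const void * ptr  = text;
    // phonememode bit 1 set -> IPA output as UTF-8 (instead of eSpeak's ASCII names)
    const char * ipa  = espeak_TextToPhonemes(&ptr, espeakCHARS_UTF8, 0x02);
    printf("%s\n", ipa);  // the IPA string a phoneme-based TTS model would consume
    return 0;
}
```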

MichaelWengren commented 6 months ago

Does anyone know of any other high-quality models suitable for realtime use? There is a GGML Bark implementation, https://github.com/PABannier/bark.cpp, but it's quite slow.

kskelm commented 6 months ago

…and not fully functional yet. It still crashes a bit and usually generates nonsense phonemes.

manmay-nakhashi commented 6 months ago

I think XTTS is the best candidate currently; it's a multimodal GPT-2, so it should be relatively easy to port from the GPT-2 code which is already implemented.

fakerybakery commented 6 months ago

But it's not permissively licensed.

bachittle commented 6 months ago

bark.cpp is good because it does not require a phoneme library; it does everything automatically. StyleTTS 2 has better-sounding voices, but it requires third-party libraries like espeak and some NLTK stuff. XTTS is not permissively licensed, which breaks the idea of building these GGML libraries under MIT.

So the best solution for StyleTTS 2 is to do one of the following:

fakerybakery commented 6 months ago

Yeah, some people on the STTS Slack channel are trying to reimplement Phonemizer.

balisujohn commented 6 months ago

I am working on tortoise.cpp for a class project. The implementation is underway. I might be able to have it ready by the end of the year, but if people are interested I can open the project to collaborators on GitHub in a week or so. I will release it with an MIT License.

manmay-nakhashi commented 6 months ago

I have a model conversion and some changes; if you create a repo, I'll contribute over there.

balisujohn commented 6 months ago

I will make a cleaner version of my repo public on Friday (it's currently just a messy fork of ggml with a new tortoise folder in examples), if that sounds good (after the submission deadline for my class). I have a partially developed GGML file format for the model that my code uses; I'm building it tensor by tensor since I'm pretty new to ggml reverse engineering. I'm still working on the autoregressive forward pass, though it looks like I might be able to use a lot of ggml code from the existing ggml gpt-2 implementation. I have numbers matching the PyTorch forward pass for the text embeddings, which isn't much, but it shows that I can load tensors from the GGML file, construct a cgraph, and get the ggml ops to work. I also added a CUDA implementation for ggml_concat, since I'm using a fork of ggml.
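For anyone curious what "tensor by tensor" means in practice: the ggml examples (gpt-2 included) store each tensor as a small header followed by raw data, and the loader reads them one at a time into the ggml context. A rough sketch of that pattern; the field layout here is illustrative, not the actual tortoise.cpp format:

```cpp
#include "ggml.h"
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Read one tensor record: n_dims, name length, dtype, shape, name, raw data.
// Returns nullptr on EOF.
static struct ggml_tensor * read_tensor(FILE * f, struct ggml_context * ctx, std::string & name) {
    int32_t n_dims = 0, name_len = 0, dtype = 0;
    if (fread(&n_dims, sizeof(n_dims), 1, f) != 1) return nullptr;
    fread(&name_len, sizeof(name_len), 1, f);
    fread(&dtype,    sizeof(dtype),    1, f);

    int64_t ne[4] = {1, 1, 1, 1};
    for (int i = 0; i < n_dims; ++i) {
        int32_t dim = 0;
        fread(&dim, sizeof(dim), 1, f);
        ne[i] = dim;
    }

    std::vector<char> buf(name_len);
    fread(buf.data(), 1, name_len, f);
    name.assign(buf.data(), name_len);

    // allocate the tensor inside the ggml context and copy the raw weights in
    struct ggml_tensor * t = ggml_new_tensor(ctx, (enum ggml_type) dtype, n_dims, ne);
    fread(t->data, 1, ggml_nbytes(t), f);
    return t;
}
```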

balisujohn commented 6 months ago

@manmay-nakhashi Sounds good regarding collaboration! I posted here precisely to avoid duplicating effort; better to work together than to unknowingly duplicate work.

Green-Sky commented 6 months ago

The text-to-speech part of SeamlessM4T could be another option, but so far they have only implemented speech-to-text translation (S2TT), automatic speech recognition (ASR), and text-to-text translation (T2TT). They also have speech-to-speech translation (S2ST) and text-to-speech translation (T2ST) models in the same family, but the GGML implementation for those is still missing. Maybe there is hope they implement those too. :) https://github.com/facebookresearch/seamless_communication/tree/main/ggml

cmp-nct commented 6 months ago

Looks interesting too; it works quite well but does not produce the same quality as StyleTTS 2 (or XTTS-2). Seamless is a much bigger project. If they integrate GGML, that would certainly be a good thing; however, their models are not permissively licensed (non-commercial).

fakerybakery commented 6 months ago

> I am working on tortoise.cpp for a class project. The implementation is underway. I might be able to have it ready by the end of the year, but if people are interested I can open the project to collaborators on GitHub in a week or so. I will release it with an MIT License.

Thank you! Will it support quantization and Metal?

balisujohn commented 6 months ago

> Thank you! Will it support quantization and Metal?

Initially, it's CUDA-only, but I'm open to merging in whatever people are willing to contribute; the goal is to get an open-source project going. We can also think about adding training and voice cloning, etc., but the first goal (and the code I am more interested in writing myself as a starting point) is just getting inference to work for arbitrary text from hardcoded voice latents for the mol voice.

fakerybakery commented 6 months ago

Hmm, makes sense. Will you make a PR to merge your fork of GGML to the main repo?

balisujohn commented 6 months ago

The goal would be to upstream any changes I make to ggml so as not to use a weird version of it. I just forked ggml because the ggml gpt-2 implementation was a really nice template to start from. So far the only upstream change I made was adding a CUDA concatenation kernel because for some reason it was CPU only previously.

balisujohn commented 6 months ago

To be clear, tortoise.cpp will get its own repo, but the goal will be to keep its ggml version consistent with normal GGML in the long run.

fakerybakery commented 6 months ago

Hi, is the tortoise.cpp repo public?

balisujohn commented 6 months ago

I technically can't make it public before 9pm today, but I was thinking Friday so I have some time to do some cleanup work. I don't see why I couldn't release it sooner. Would you prefer if I released it sooner than Friday?

fakerybakery commented 6 months ago

Hmm, would be nice to see the WIP project, but Fri works. Thank you for creating this project! Really looking forward to a faster way to run Tortoise

balisujohn commented 6 months ago

Hopefully it ends up being useful! Just to set expectations, I want to emphasize that the project is nowhere near done. It should be public and ready for contributors by Friday, but it will definitely not be anywhere close to a complete forward pass by then.

balisujohn commented 6 months ago

https://github.com/balisujohn/tortoise.cpp

The repository is public; feel free to open an issue there if you want to contribute. I will also release the GGML export script and the modified tortoise-tts that prints out intermediate values, which I'm using for reverse engineering, if people want them.

balisujohn commented 6 months ago

If people are interested in contributing to tortoise.cpp, a great first task would be getting the tokenizer to always match the tokenization tortoise-tts uses. The tokenizer I'm using in tortoise.cpp seems to be able to load the tokenizer vocab, but the regex had issues with some of the special chars, which I band-aided at least for spaces. More perplexingly, the tortoise-tts tokenizer isn't greedy with respect to always choosing the longest possible next token, while the default tokenizer I copied from the ggml gpt-2 example seems to be greedy. So the task would be studying the tokenizer tortoise-tts uses and modifying the tokenizer in tortoise.cpp to exactly match its behavior. I can also come up with some other tasks to work on.
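To illustrate what "greedy" means here, this is roughly the longest-match behavior the copied tokenizer has: at each position it takes the longest vocab entry that matches, whereas tortoise-tts apparently sometimes picks a shorter one. A toy sketch with a made-up vocab (not the actual tortoise-tts tokenizer, whose exact behavior is the thing to reproduce):

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

// Greedy longest-match tokenization: at each position, try the longest
// candidate first and shrink until some vocab entry matches.
std::vector<int> tokenize_greedy(const std::string & text,
                                 const std::unordered_map<std::string, int> & vocab,
                                 size_t max_token_len) {
    std::vector<int> ids;
    size_t i = 0;
    while (i < text.size()) {
        size_t best_len = 0;
        int    best_id  = -1;
        for (size_t len = std::min(max_token_len, text.size() - i); len > 0; --len) {
            auto it = vocab.find(text.substr(i, len));
            if (it != vocab.end()) { best_len = len; best_id = it->second; break; }
        }
        if (best_id < 0) { ++i; continue; }  // skip characters the vocab doesn't cover
        ids.push_back(best_id);
        i += best_len;
    }
    return ids;
}

int main() {
    // toy vocab: a non-greedy tokenizer might emit "th" + "e" where this one emits "the"
    std::unordered_map<std::string, int> vocab =
        {{"t", 0}, {"h", 1}, {"e", 2}, {"th", 3}, {"the", 4}, {" ", 5}};
    for (int id : tokenize_greedy("the the", vocab, 3)) printf("%d ", id);
    printf("\n");  // prints: 4 5 4
}
```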

noe commented 5 months ago

There is a new contender: WhisperSpeech:

An Open Source text-to-speech system built by inverting Whisper. Previously known as spear-tts-pytorch.

We want this model to be like Stable Diffusion but for speech – both powerful and easily customizable.

We are working only with properly licensed speech recordings and all the code is Open Source so the model will be always safe to use for commercial applications.

kskelm commented 5 months ago

Wow, if this claim is real, I’m excited (emphasis mine):

We spent the last week optimizing inference performance. We integrated torch.compile, added kv-caching and tuned some of the layers – we are now working over 12x faster than real-time on a consumer 4090!

We can mix languages in a single sentence (here the highlighted English project names are seamlessly mixed into Polish speech):


bachittle commented 5 months ago

Looks like it also uses Meta's EnCodec library. I think it will be crucial to get a good implementation of this in C++, as I have a feeling that more and more SOTA audio models are going to use this library (Bark also comes to mind).

An initial implementation of encodec is found here: https://github.com/PABannier/encodec.cpp

cmp-nct commented 5 months ago

The tiny number of examples in WhisperSpeech is concerning. Compare it with this: https://styletts.github.io/

fakerybakery commented 5 months ago

If you try it on Colab it's actually quite good (not as good as XTTS, but not bad), though definitely not as fast as StyleTTS.

bachittle commented 5 months ago

I think the main thing to consider here is that it handles multilingual very well (StyleTTS only does English) and its architecture is very similar to Whisper, so I assume we could borrow code from whisper.cpp.

IcedQuinn commented 4 months ago

Piper uses a VITS model (run through some conversion to ONNX) which runs quite quickly on chunky CPUs. It's not as shiny as some others mentioned here but is quite capable and known to compress well. (StyleTTS considers VITS a close competitor, in their examples.)

balisujohn commented 4 months ago

WhisperSpeech works really well for zero-shot voice cloning with 10-20 minutes of audio for many voices. It does a good job on JFK, for example, if you use this file as input: https://upload.wikimedia.org/wikipedia/commons/5/50/Jfk_rice_university_we_choose_to_go_to_the_moon.ogg The overall quality is less than tortoise, though.

If I have time I'm interested in making whisperspeech.cpp, but I'm busy with tortoise.cpp for now :^)

I'd also add that WhisperSpeech already seems like it could be fast enough that GGML might not improve much over it in terms of speed, but I'm not really sure.

danemadsen commented 3 months ago

For anyone interested, this is the current state of neural text-to-speech for C/C++. I'm currently searching for a decent neural TTS C or C++ library to integrate into my llama.cpp front end (MAID).

rhasspy/piper

Piper is currently the most stable C++ implementation of neural TTS; it utilizes VITS models and depends on ONNX (an ML library similar to GGML). It also requires a custom fork of espeak-ng for phonemization. I can get Piper to compile and run on Linux and Windows (though I used my own fork of Piper to get it to compile), but actual inference only seems to work on Linux. It must have worked at one point in time, though, because there's a Windows front end that uses it (Piper_UI).

Sherpa Onnx

Sherpa Onnx is very similar to Piper. As its name suggests, it utilizes the ONNX runtime for inference, and like Piper it also uses espeak-ng for phonemization. It has various implementations and APIs available and is actively maintained, but its monumental size makes it difficult to integrate into Dart, which is a requirement for what I need to do.

I'd prefer to avoid using Piper or Sherpa Onnx for my own project, as I would rather not be dependent on espeak-ng or another ML library besides GGML, which I'm already using for the llama.cpp integration.

Vits.cpp

Vits.cpp is a GGML implementation of VITS models (the same ones Piper and Sherpa Onnx use). It does not require espeak-ng and, as stated, uses GGML rather than ONNX. VITS models are good if you have a lot of data because they produce very small model files (one model file I've tested is 60 MB). For what I need to do, Vits.cpp would be ideal; however, though I can get it to compile, it immediately segfaults when launched. It also hasn't been updated for 3 months, so the maintainer has likely abandoned it.

Bark.cpp

Bark.cpp is another GGML implementation, but unlike the others it uses the Bark series of models released by suno-ai. I haven't tested whether it works, but from what I've seen there's little control over what voice is used and a limited variety of voice presets. There's also no ability to clone voices, and because Bark is a GPT model, the words spoken in the output can differ from the input. It's also worth stating that, as of writing, Bark.cpp hasn't exactly been receiving frequent updates, so it may also be abandoned.

That's it for the C++ implementations I'm aware of; if anyone knows of any more, let me know.

As for purely voice-cloning models, there are a few available now, but they all require Python, which means they can't be integrated into Dart and C++ software well. Off the top of my head there's OpenVoice, WhisperSpeech, StyleTTS, VALL-E, MetaVoice, and the already-mentioned Bark and VITS.

lin72h commented 3 months ago

Seems like the recent update of StyleTTS2 is really good: TTS Arena

balisujohn commented 1 week ago

I'm pleased to say tortoise.cpp is now ready for public testing. It's still in a pre-alpha sort of state, with print statements everywhere, but I've confirmed it works on GPU and CPU. (Test first with short phrases of 4 or 5 words on GPU; it's not very VRAM-efficient.)

https://github.com/balisujohn/tortoise.cpp