Text to speech models in GGML?

simplejackcoder commented 1 year ago

@ggerganov do you have any interest in producing more models in GGML format?

I'm now convinced your zero-dependency, no-memory-allocation, CPU-first ideology will make it accessible to everyone.

You've got LLaMA and whisper; what remains is the reverse of whisper.

What are your thoughts?

simplejackcoder commented 1 year ago

How about using vall-e?

noe commented 1 year ago

How about using vall-e?

AFAIK Microsoft has not released the weights of VALL-E. They just uploaded the paper to arxiv and set up a demo website with some generation samples.

gavsidua commented 1 year ago

@ggerganov I hope you make a text to speech example from cpp

Martin-Laclaustra commented 1 year ago

Here is a TTS PyTorch model with available weights: https://github.com/r9y9/deepvoice3_pytorch I would be particularly interested in the implemented "nyanko" model (described in https://aclanthology.org/2020.lrec-1.789.pdf). There are several stages of pre-processing in Python, but if the model can be ported, porting those to C/C++ could be done afterwards. @ggerganov, what's your assessment of the level of difficulty?

flosserblossom commented 1 year ago

UP

ggerganov commented 1 year ago

I'm interested in implementing a TTS using ggml, but don't have capacity atm - there are other priorities. Also, I don't think it is worth implementing a model from 3-4 years ago. It should be SOTA. What is SOTA atm?

VALL-E looks like a good candidate - but no weights.

Martin-Laclaustra commented 1 year ago

VALL-E looks like a good candidate - but no weights.

It seems quite demanding in terms of training data required (60k hours). Aiming for VALL-E X (multilingual) would be the natural choice (this requires 70k hours), but, apparently (per the paper), it has been tested for only 2 languages so far. I think it is very unlikely that they will release the model, and a community-based one would be difficult (at least for a breadth of languages). Also, it might be quite demanding for inference (I know ggml is reaching unbelievable achievements by quantizing, etc. but still...).

In contrast, the one I proposed (nyanko) gets acceptable quality with only ~20h (yes, hours!) of training data, and it can be trained for each language in just 3 days on a single GPU (single speaker). I trained models for 3 speakers (1 non-English language). Let me know if you would like to listen to the samples or test the Python implementation. Besides, Python inference on CPU is already real-time on modern systems. It would really have outstanding performance in C.

I believe a desirable TTS would be "universal language" direct unicode text to wav converter, but I have not been able to spot such model.

Martin-Laclaustra commented 1 year ago

With respect to VALL-E, there are 2 unofficial PyTorch implementations; neither implements VALL-E X (multilingual), and neither has released weights (due to ethical concerns?). https://github.com/enhuiz/vall-e https://github.com/lifeiteng/vall-e I do not have details on the weights size or training/inference requirements.

Compare that to a multilingual TTS with lots of available languages: larynx https://github.com/rhasspy/larynx The quality seems a bit lower. But the training work is done. One may wonder what would be the real advantage of using ggml in this case.

Green-Sky commented 1 year ago

they don't provide any code, but

https://speechresearch.github.io/naturalspeech2/ https://arxiv.org/abs/2304.09116

We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.

still more diffusion models ...

ggerganov commented 1 year ago

This looks like the best candidate now: https://github.com/suno-ai/bark

Green-Sky commented 1 year ago

This looks like the best candidate now: https://github.com/suno-ai/bark

by far. since they provide the models. (a bit over 12gig)

Green-Sky commented 1 year ago

This looks like the best candidate now: https://github.com/suno-ai/bark

their voice creation got reverse engineered. https://github.com/serp-ai/bark-with-voice-clone

x066it commented 1 year ago

What about https://github.com/snakers4/silero-models ?

mattkanwisher commented 1 year ago

A new paper called Tango came out. It looks pretty good and also uses LLMs.

Green-Sky commented 1 year ago

While tango ~looks~ sounds cool, it's a text-to-audio and not a text-to-speech model.

dennislysenko commented 1 year ago

@ggerganov is there any possibility that Bark, ported to cpp, would be feasible to run on constrained devices like iPhones? Ex. a device with 4GB RAM and a tolerable limit of model size in the low 100s of MB.

Green-Sky commented 1 year ago

@dennislysenko

by far. since they provide the models. (a bit over 12gig)

even with the best "4bit" quantization imaginable, you still end up with ~3gigs for model files
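The arithmetic behind a figure like this can be sketched as follows. The thread does not say what precision the 12 GB of Bark checkpoints are stored in, so the source bit widths below are assumptions, as is the 4-bit block layout (32 four-bit values plus one 16-bit scale per block, i.e. 4.5 bits per weight). The function name is made up for illustration.

```python
def quantized_size_gb(checkpoint_gb, source_bits, target_bits_per_weight):
    """Estimate file size after re-encoding every weight at a new bit width."""
    n_weights = checkpoint_gb * 8e9 / source_bits  # number of weights in the checkpoint
    return n_weights * target_bits_per_weight / 8e9


# Assumed block layout: 32 four-bit values plus one 16-bit scale per block.
Q4_BITS_PER_WEIGHT = (32 * 4 + 16) / 32

# The source precision is unknown here, so try both common cases.
for source_bits in (16, 32):
    size = quantized_size_gb(12.0, source_bits, Q4_BITS_PER_WEIGHT)
    print(f"{source_bits}-bit source: ~{size:.2f} GB at {Q4_BITS_PER_WEIGHT} bits/weight")
```

Under these assumptions the estimate is in the ~3 GB range only if the original weights are 16-bit; a 32-bit source would come out at roughly half that. Either way it stays well above the "low 100s of MB" asked about.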

dennislysenko commented 1 year ago

@Green-Sky

even with the best "4bit" quantization imaginable, you still end up with ~3gigs for model files

Is the quantization referring to the "smaller model" released 05-01?

Green-Sky commented 1 year ago

@dennislysenko no i was talking about ggml, not sure what changes they made in 1.5

dennislysenko commented 1 year ago

@Green-Sky Seems like they refer to smaller model cards as low as 2GB in their README now:

The full version of Bark requires around 12Gb of memory to hold everything on GPU at the same time. However, even smaller cards down to ~2Gb work with some additional settings.

05-01 release notes mention:

We also added an option for a smaller version of Bark, which offers additional speed-up with the trade-off of slightly lower quality.

In theory, could this mean with 4x quantization, it's possible to target ~500MB VRAM?

Green-Sky commented 1 year ago

it's possible to target ~500MB VRAM?

@dennislysenko ggml using vram is very optional. by default ggml only uses ram and cpu. :)

In theory, could this mean with 4x quantization,

their description is very obscure and I don't have the time to look at the code, so maybe
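For readers unsure what "4-bit quantization" means in this exchange, here is a minimal sketch of block-wise quantization: each block of weights is stored as small integers plus one shared scale. This is a simplified illustration, not ggml's actual kernel or file layout; the function names and the signed range of -7..7 are assumptions chosen for clarity.

```python
def quantize_block(values):
    """Map one block of floats to signed 4-bit integers plus a shared scale."""
    max_abs = max(abs(v) for v in values)
    # Signed range used in this sketch: -7..7 (fits in 4 bits).
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quants = [max(-7, min(7, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the stored scale and 4-bit integers."""
    return [scale * q for q in quants]


block = [0.12, -0.70, 0.33, 0.05, -0.21, 0.64, -0.02, 0.48]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(quants)
print(f"largest round-trip error: {max_error:.3f}")
```

The round-trip error is bounded by half the scale, which is why quantization trades a smaller file for slightly noisier weights.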

afyacnkep commented 1 year ago

Is there another update about text to speech?

gut4 commented 1 year ago

it's in roadmap now https://github.com/ggerganov/llama.cpp/discussions/1729

ggerganov commented 1 year ago

I personally will look into TTS after finishing the SAM implementation. Maybe someone else is already working on TTS inference

Martin-Laclaustra commented 1 year ago

It seems that the unlocked Bark with voice cloning is here now: https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer This is a necessary step for a complete system. It is worth having a look at the open and closed issues there to get an overview of the multiple models needed.

Martin-Laclaustra commented 1 year ago

Also, in llama.cpp May 2023 roadmap, a recent comment suggests "drop-in replacement for EnCodec", which may be (or not) easier to implement.

cmp-nct commented 1 year ago

The original Bark did sound artificial to me. The voice cloning repos (both) sound amazing already!

Combine that with llm text and the inference speed we see already .. then we have realtime generative speech output

kskelm commented 1 year ago

it's in roadmap now ggerganov/llama.cpp#1729

That's great news! My only complaint with bark is its speed... your magic touch would be ✨✨✨

Green-Sky commented 1 year ago

there is now a tracking issue for bark https://github.com/ggerganov/ggml/issues/388 which links https://github.com/PABannier/bark.cpp (wip) :partying_face:

TechnotechGit commented 1 year ago

While I'm not an expert by any means, VITS in CoquiTTS is almost realtime on CPU (I tested on a medium range laptop CPU). With ggml and a good quant if possible, could almost certainly be realtime, maybe even playing to speakers in realtime too. Just a thought.
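"Realtime" claims like this are usually expressed as a real-time factor (RTF): synthesis time divided by the duration of the audio produced, with RTF below 1 meaning faster than real time. A minimal sketch of measuring it follows. `fake_synthesize` is a hypothetical stand-in, not the CoquiTTS API; swap in any callable that returns a sequence of audio samples.

```python
import time


def real_time_factor(synthesize, text, sample_rate):
    """Return (seconds spent synthesizing) / (seconds of audio produced)."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return elapsed / audio_seconds


def fake_synthesize(text):
    """Hypothetical stand-in for a TTS model: 0.1 s of silence per character at 22,050 Hz."""
    return [0.0] * (2205 * len(text))


rtf = real_time_factor(fake_synthesize, "hello world", sample_rate=22050)
print(f"RTF = {rtf:.4f} ({'faster' if rtf < 1 else 'slower'} than real time)")
```

Measuring a real model this way on the target CPU, before and after quantization, would show whether a ggml port actually crosses the real-time threshold.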

vietanhdev commented 1 year ago

@ggerganov This is a good TTS with C++ code (ONNX Runtime): https://github.com/rhasspy/piper. You can try some generated samples at https://rhasspy.github.io/piper-samples/.

yorkzero831 commented 1 year ago

https://github.com/Plachtaa/VALL-E-X/blob/master/README.md

How about this?

noe commented 1 year ago

Coqui released their cross-lingual TTS model: XTTS:

It supports 13 languages (Arabic, Brazilian Portuguese, Chinese, Czech, Dutch, English, French, German, Italian, Polish, Russian, Spanish, and Turkish).

It also offers voice cloning and cross-lingual voice cloning.

manmay-nakhashi commented 1 year ago

xtts is based on tortoise-tts https://github.com/neonbjb/tortoise-tts which uses a modified gpt-2. The embedding is different and it has two heads, one for mel and one for text. I think we can convert the tortoise-tts autoregressive model to ggml.
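A toy sketch of the "two heads" idea: one shared hidden state is projected by two separate output layers, one over text tokens and one over mel tokens. It assumes both heads are plain linear projections; the sizes, names, and random weights are invented for illustration and are not Tortoise's real ones.

```python
import random


def linear(weights, hidden):
    """Multiply a (vocab x dim) weight matrix by a hidden vector to get logits."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


def make_head(vocab_size, dim, rng):
    """Random weights standing in for a trained output projection."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]


rng = random.Random(0)
DIM, TEXT_VOCAB, MEL_VOCAB = 8, 16, 32  # toy sizes, not the real model's

text_head = make_head(TEXT_VOCAB, DIM, rng)
mel_head = make_head(MEL_VOCAB, DIM, rng)

# One hidden state, as a shared transformer trunk would emit for one position.
hidden = [rng.uniform(-1, 1) for _ in range(DIM)]

text_logits = linear(text_head, hidden)  # scores over text tokens
mel_logits = linear(mel_head, hidden)    # scores over mel tokens
print(len(text_logits), len(mel_logits))
```

If the real model follows this shape, a ggml port could reuse an existing GPT-2 graph for the trunk and add the second projection plus the modified embeddings.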

noe commented 1 year ago

xtts is based on tortoise-tts https://github.com/neonbjb/tortoise-tts which uses a modified gpt-2. The embedding is different and it has two heads, one for mel and one for text. I think we can convert the tortoise-tts autoregressive model to ggml.

@manmay-nakhashi do you mean that ggml should be able to handle XTTS out of the box or would it need some adaptations?

manmay-nakhashi commented 1 year ago

It should be able to handle XTTS if ggml supports tortoise-tts. XTTS is a multilingual tortoise-tts model; we might need different model conversion scripts for XTTS, that's it.

wassname commented 1 year ago

the problem I had with tortoise (and bark) was that it wasn't consistent across inferences. E.g. it would change voices, even with the same conditioning.

manmay-nakhashi commented 1 year ago

@wassname gpt models can be unpredictable sometimes; fine-tuning on better speaker-segmented data can resolve this problem.

manmay-nakhashi commented 1 year ago

I think converting tortoise-tts to ggml makes sense. Is anyone willing to collaborate on converting tortoise-tts to ggml?

cmp-nct commented 1 year ago

the problem I had with tortoise (and bark) was that it wasn't consistent across inferences. E.g. it would change voices, even with the same conditioning.

That's just an implementation issue; look at that python package, it's a nightmare imho

tortoise, especially the latest v2 of xtts is producing very good results. In many cases flawless speech to my ears. It's transformers based, I didn't look into the source but that is likely a good candidate. The only one of today imho. xtts v2 just clones a voice in seconds (more or less closely) and then uses it for any language.

kskelm commented 1 year ago

I’ve soured on bark altogether. Even on their own discord, the thing regularly produces wildly unpredictable output, frequently with a thin resemblance to what you asked for.

cmp-nct commented 1 year ago

I’ve soured on bark altogether. Even on their own discord, the thing regularly produces wildly unpredictable output, frequently with a thin resemblance to what you asked for.On Nov 18, 2023, at 7:48 PM, John @.> wrote: the problem I had with tortise (and bark) was that it wasn't consistent across inferences. E.g. it would change voices, even with the same conditioning. That's just a implementation issue, look at that python package it's a nightmare imho tortoise, especially the latest v2 of xtts is producing very good results. In many cases flawless speech to my ears. —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you commented.Message ID: @.>

something went wrong with your reply

khimaros commented 1 year ago

coincidentally, this was just released claiming SOTA: https://github.com/yl4579/StyleTTS2

fakerybakery commented 1 year ago

Tortoise please!!!

fakerybakery commented 1 year ago

Please, implement Tortoise instead of XTTS. XTTS is licensed under the ultra-restrictive CPML which completely prohibits ALL commercial use. Please help promote open-source by supporting Tortoise instead.

cmp-nct commented 1 year ago

coincidentally, this was just released claiming SOTA: https://github.com/yl4579/StyleTTS2

Thanks for that info, I'll test that now. The audio samples in their paper are superior to the others, though the latest xtts v2 is also almost flawless (but for xtts v2 we'd need an equivalent open source model to be useful). If we go the Tortoise/XTTS route it would be best to make sure to implement the advantages of xtts as well, namely the instant voice cloning and the language-independent models.

Evaluating StyleTTS2:
1) The dataset to train is completely open and the steps to train appear very simple
2) The code is MIT
3) They supply models which are completely open, with the exception that you need to inform people about StyleTTS2 UNLESS you have permission from the voice originator, which is awesome .. AND it only applies if you don't just train your own.
4) Now testing it

fakerybakery commented 1 year ago

Yes. Perhaps someone could create a "merge" of XTTS and Tortoise, similar to the Tortoise Fast API. For example, using an autoregressive model + hifigan?

fakerybakery commented 1 year ago

StyleTTS looks great. Really hope this gets implemented. Would love to have something similar to llama.cpp that supports many models (tts.cpp?)

cmp-nct commented 1 year ago

StyleTTS looks like a very clean approach, also very good English. But .. it's not multilingual at this point. The readme reads like it's a small thing to add more languages, but it doesn't appear that easy when looking closer. If StyleTTS2 supported languages the way the others do, I'd focus on it fully

fakerybakery commented 1 year ago

Been playing around with StyleTTS 2 and it's pretty fast. IMHO it would be better to add Tortoise first since it's slower and GGML could have a more significant impact in speeding it up, but StyleTTS 2 is pretty impressive too.

ggerganov / ggml

Text to speech models in GGML? #59