simplejackcoder opened 1 year ago
How about using vall-e?
AFAIK Microsoft has not released the weights of VALL-E. They just uploaded the paper to arxiv and set up a demo website with some generation samples.
@ggerganov I hope you make a text-to-speech example in C++
Here is a PyTorch TTS model with available weights: https://github.com/r9y9/deepvoice3_pytorch I would be particularly interested in the implemented "nyanko" model (described in https://aclanthology.org/2020.lrec-1.789.pdf). There are several stages of pre-processing in Python, but if the model can be ported, porting those to C/C++ could be done afterwards. @ggerganov, what's your assessment of the level of difficulty?
UP
I'm interested in implementing a TTS using ggml, but don't have capacity atm - there are other priorities.
Also, I don't think it is worth implementing a model from 3-4 years ago. It should be SOTA.
What is SOTA atm?
VALL-E looks like a good candidate - but no weights.
It seems quite demanding in terms of training data required (60k hours). Aiming for VALL-E X (multilingual) would be the natural choice (it requires 70k hours), but apparently (per the paper) it has only been tested on 2 languages so far. I think it is very unlikely that they release the model, and it would be difficult to have a community-based one (at least for a breadth of languages). It might also be quite demanding for inference (I know ggml is reaching unbelievable achievements by quantizing, etc., but still...).
On the contrary, the one I proposed (nyanko) gets acceptable quality with only ~20h (yes, hours!) of training data, and it can be trained for each language in just 3 days on a single GPU (single speaker). I trained models for 3 speakers (1 non-English language). Let me know if you would like to listen to the samples or test the Python implementation. Besides, Python inference on CPU is already real-time on modern systems. It would have really outstanding performance if based on C.
I believe a desirable TTS would be a "universal language" direct Unicode-text-to-wav converter, but I have not been able to spot such a model.
With respect to VALL-E, there are 2 unofficial PyTorch implementations; neither implements VALL-E X (multilingual), and neither has released weights (due to ethical concerns?). https://github.com/enhuiz/vall-e https://github.com/lifeiteng/vall-e I do not have details on the weight sizes or training/inference requirements.
Compare that to a multilingual TTS with lots of available languages: larynx https://github.com/rhasspy/larynx The quality seems a bit lower, but the training work is done. One may wonder what the real advantage of using ggml would be in this case.
They don't provide any code, but:
https://speechresearch.github.io/naturalspeech2/ https://arxiv.org/abs/2304.09116
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
still more diffusion models ...
This looks like the best candidate now: https://github.com/suno-ai/bark
By far, since they provide the models (a bit over 12 GB).
Their voice creation got reverse engineered: https://github.com/serp-ai/bark-with-voice-clone
What about https://github.com/snakers4/silero-models ?
A new paper came out called Tango; it looks pretty good and also uses LLMs.
While Tango ~~looks~~ sounds cool, it's a text-to-audio model, not a text-to-speech model.
@ggerganov is there any possibility that Bark, ported to cpp, would be feasible to run on constrained devices like iPhones? Ex. a device with 4GB RAM and a tolerable limit of model size in the low 100s of MB.
@dennislysenko
Even with the best "4-bit" quantization imaginable, you still end up with ~3 GB of model files.
@Green-Sky
Is the quantization referring to the "smaller model" released 05-01?
@dennislysenko No, I was talking about ggml; not sure what changes they made in 1.5.
@Green-Sky Seems like their README now mentions cards with as little as 2 GB:
The full version of Bark requires around 12Gb of memory to hold everything on GPU at the same time. However, even smaller cards down to ~2Gb work with some additional settings.
05-01 release notes mention:
We also added an option for a smaller version of Bark, which offers additional speed-up with the trade-off of slightly lower quality.
In theory, could this mean that with 4x quantization it's possible to target ~500 MB of VRAM?
@dennislysenko ggml using VRAM is very optional; by default ggml only uses RAM and the CPU. :)
As for whether 4x quantization gets it down to ~500 MB, their description is very obscure and I don't have the time to look at the code, so maybe.
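For what it's worth, the back-of-the-envelope arithmetic behind these numbers is just bytes-per-weight scaling. A sketch in Python, assuming the ~12 GB Bark figure can be read as fp16 weights (the README number probably also covers activations, so treat these as rough lower bounds):

```python
def quantized_size_gb(base_size_gb: float, base_bits: int, target_bits: int) -> float:
    # Weights shrink roughly in proportion to bits per weight; real ggml quant
    # blocks add a few percent of overhead for scales, so these are lower bounds.
    return base_size_gb * target_bits / base_bits

# Reading the ~12 GB Bark footprint as fp16 weights (an assumption):
print(quantized_size_gb(12.0, 16, 4))   # ~3.0 -> the "~3 GB" estimate above

# The ~2 GB "smaller" setup quantized the same way:
print(quantized_size_gb(2.0, 16, 4))    # ~0.5 -> the ~500 MB ballpark, roughly
```

On that reading, ~500 MB would need the smaller Bark variant and 4-bit quantization together, not 4-bit quantization of the full model alone.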
Is there any update on text to speech?
it's in roadmap now https://github.com/ggerganov/llama.cpp/discussions/1729
I personally will look into TTS after finishing the SAM implementation. Maybe someone else is already working on TTS inference
It seems that the unlocked Bark with voice cloning is here now: https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer This is a necessary step for a complete system. It is worth having a look at the open and closed issues there to get an overview of the multiple models needed.
Also, in the llama.cpp May 2023 roadmap, a recent comment suggests a "drop-in replacement for EnCodec", which may (or may not) be easier to implement.
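For anyone scoping the porting work, the rough shape of a Bark-style pipeline, and where EnCodec sits in it, is sketched below. The function names are placeholders for illustration, not the actual APIs of bark or the voice-cloning repos:

```python
# Conceptual sketch of a Bark-style TTS pipeline (placeholder functions, not real APIs).
# Each stage is a separate model that would need its own ggml port.

def bark_like_tts(text: str):
    semantic_tokens = text_to_semantic(text)             # GPT-style text -> semantic tokens
    coarse_tokens = semantic_to_coarse(semantic_tokens)  # semantic -> coarse codec tokens
    fine_tokens = coarse_to_fine(coarse_tokens)          # coarse -> fine codec tokens
    waveform = encodec_decode(fine_tokens)                # neural codec decoder -> audio samples
    return waveform
```

If I read it correctly, the "drop-in replacement for EnCodec" comment above would correspond to the last step only.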
The original Bark did sound artificial to me. The voice cloning repos (both) sound amazing already!
Combine that with LLM text generation and the inference speed we already see, and we have real-time generative speech output.
it's in roadmap now ggerganov/llama.cpp#1729
That's great news! My only complaint with bark is its speed... your magic touch would be ✨✨✨
there is now a tracking issue for bark https://github.com/ggerganov/ggml/issues/388 which links https://github.com/PABannier/bark.cpp (wip) :partying_face:
While I'm not an expert by any means, VITS in CoquiTTS is almost real-time on CPU (I tested on a mid-range laptop CPU). With ggml and a good quant, it could almost certainly be real-time, maybe even playing to speakers in real time too. Just a thought.
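In case anyone wants to reproduce that measurement, here is a minimal sketch using the Coqui TTS Python API; the model ID and the `synthesizer.output_sample_rate` attribute are assumptions based on the TTS package, so adjust as needed:

```python
# Rough real-time-factor check of Coqui's LJSpeech VITS model on CPU (sketch).
import time
from TTS.api import TTS   # pip install TTS

tts = TTS(model_name="tts_models/en/ljspeech/vits", progress_bar=False, gpu=False)

text = "The quick brown fox jumps over the lazy dog."
start = time.perf_counter()
wav = tts.tts(text)                                   # samples at the model's output rate
elapsed = time.perf_counter() - start

sample_rate = tts.synthesizer.output_sample_rate      # assumed attribute name
rtf = elapsed / (len(wav) / sample_rate)
print(f"synthesis: {elapsed:.2f}s, RTF = {rtf:.2f}")  # RTF < 1.0 means faster than real time
```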
@ggerganov This is a good TTS with C++ code (ONNX Runtime). https://github.com/rhasspy/piper. You can try some generated sample at: https://rhasspy.github.io/piper-samples/.
How about this? https://github.com/Plachtaa/VALL-E-X/blob/master/README.md
Coqui released their cross-lingual TTS model, XTTS:
It supports 13 languages (Arabic, Brazilian Portuguese, Chinese, Czech, Dutch, English, French, German, Italian, Polish, Russian, Spanish, and Turkish).
It also offers voice cloning and cross-lingual voice cloning.
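For reference, cross-lingual cloning through Coqui's Python API looks roughly like this; a sketch where the model ID, the reference clip path, and the output path are assumptions, not verified values:

```python
# Cross-lingual voice cloning with Coqui XTTS (sketch; model ID and paths are assumptions).
from TTS.api import TTS   # pip install TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

tts.tts_to_file(
    text="Guten Morgen, wie geht es dir?",   # speak German...
    speaker_wav="reference_voice.wav",        # ...in the voice from a short reference clip
    language="de",
    file_path="output_de.wav",
)
```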
XTTS is based on tortoise-tts https://github.com/neonbjb/tortoise-tts which uses a modified GPT-2. The embedding is different and it has two heads, one for mel and one for text. I think we can convert the tortoise-tts autoregressive model to ggml.
@manmay-nakhashi do you mean that ggml should be able to handle XTTS out of the box or would it need some adaptations?
It should be able to handle XTTS if ggml supports tortoise-tts; XTTS is a multilingual tortoise-tts model. We might need different model conversion scripts for XTTS, that's it.
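To make "different model conversion scripts" concrete: the usual ggml pattern (as in the whisper.cpp and llama.cpp converters) is a small Python script that walks the PyTorch state dict and writes every tensor, with its name and shape, into one binary file. A stripped-down sketch under those assumptions follows; the checkpoint path is a placeholder, and a real tortoise/XTTS converter would also need to serialize the hyperparameters and handle the two output heads:

```python
# Minimal ggml-style export sketch (placeholder path; not a working tortoise-tts
# converter, just the general shape such a script takes).
import struct
import torch

checkpoint = torch.load("autoregressive.pth", map_location="cpu")  # placeholder path
state_dict = checkpoint.get("model", checkpoint)

with open("ggml-model-f16.bin", "wb") as fout:
    fout.write(struct.pack("i", 0x67676d6c))  # magic ("ggml")
    # ... hyperparameters (layers, heads, dims, vocab size) would be written here ...

    for name, tensor in state_dict.items():
        data = tensor.to(torch.float16).numpy()
        name_bytes = name.encode("utf-8")
        # per-tensor header: n_dims, name length, dtype tag (1 = f16 by convention)
        fout.write(struct.pack("iii", data.ndim, len(name_bytes), 1))
        for dim in reversed(data.shape):
            fout.write(struct.pack("i", dim))
        fout.write(name_bytes)
        data.tofile(fout)
```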
The problem I had with tortoise (and Bark) was that it wasn't consistent across inferences. E.g. it would change voices, even with the same conditioning.
@wassname GPT models can be unpredictable sometimes; fine-tuning on better speaker-segmented data can resolve this problem.
I think converting tortoise-tts to ggml makes sense. Anyone willing to collab on converting tortoise-tts to ggml?
the problem I had with tortoise (and bark) was that it wasn't consistent across inferences. E.g. it would change voices, even with the same conditioning.
That's just an implementation issue; look at that Python package, it's a nightmare IMHO.
Tortoise, especially the latest v2 of XTTS, is producing very good results. In many cases flawless speech to my ears. It's transformer-based; I didn't look into the source, but it is likely a good candidate. The only one of today, IMHO. XTTS v2 just clones a voice in seconds (more or less closely) and then uses it for any language.
I've soured on Bark altogether. Even on their own Discord, the thing regularly produces wildly unpredictable output, frequently with a thin resemblance to what you asked for.
coincidentally, this was just released claiming SOTA: https://github.com/yl4579/StyleTTS2
Tortoise please!!!
Please, implement Tortoise instead of XTTS. XTTS is licensed under the ultra-restrictive CPML which completely prohibits ALL commercial use. Please help promote open-source by supporting Tortoise instead.
coincidentally, this was just released claiming SOTA: https://github.com/yl4579/StyleTTS2
Thanks for that info, I'll test that now. The audio samples in their paper are superior to the others, though the latest XTTS v2 is also almost flawless (but for XTTS v2 we'd need an equivalent open-source model to be useful). If we go the Tortoise/XTTS route, it would be best to make sure to implement the advantages of XTTS as well, namely the instant voice cloning and the language-independent models.
Evaluating StyleTTS2:
1) The dataset to train on is completely open and the steps to train appear very simple.
2) The code is MIT-licensed.
3) They supply models which are completely open, with the exception that you need to inform people about StyleTTS2 UNLESS you have permission from the voice originator, which is awesome .. AND it only applies if you don't just train your own.
4) Now testing it.
Yes. Perhaps someone could create a "merge" of XTTS and Tortoise, similar to the Tortoise Fast API. For example, using an autoregressive model + hifigan?
StyleTTS looks great. Really hope this gets implemented. Would love to have something similar to llama.cpp that supports many models (tts.cpp?)
StyleTTS looks like a very clean approach, with very good English too. But it's not multilingual at this point. The README reads like adding more languages is a small thing, but it doesn't appear that easy when looking closer. If StyleTTS2 supported languages like the others do, I'd focus on it fully.
Been playing around with StyleTTS 2 and it's pretty fast. IMHO it would be better to add Tortoise first since it's slower and GGML could have a more significant impact in speeding it up, but StyleTTS 2 is pretty impressive too.
@ggerganov do you have any interest in producing more models in GGML format?
I'm now convinced your approach of a zero-dependency, no-memory-allocation, CPU-first ideology will make it accessible to everyone.
You've got LLaMA and Whisper; what's remaining is the reverse of Whisper.
What are your thoughts?