c-frame / sponsorship

Link to issues outside the c-frame organization that need sponsors
https://github.com/orgs/c-frame/projects/2/views/1

[coqui-tts] production deployment and aframe component #9

Open · vincentfretin opened 11 months ago

vincentfretin commented 11 months ago

I'm currently using the speechSynthesis API for text-to-speech, but this API doesn't work in VR on the Meta browser. Also, the voice differs from one platform to another; using a male voice on a female avatar is funny, but not for a customer :-) The API is a bit tricky because the voices list is loaded asynchronously; you can read more in this article (7 Dec 2021, so some information may not be accurate anymore).
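
The async quirk with the standard API looks roughly like this (a minimal sketch, not my component's actual code):

let voices = speechSynthesis.getVoices(); // often an empty array on the first call
if (voices.length === 0) {
  // the list is filled in asynchronously; wait for the voiceschanged event
  speechSynthesis.addEventListener("voiceschanged", () => {
    voices = speechSynthesis.getVoices();
  });
}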

I'm working on a coqui CPU integration with the official docker image, integrating it into my existing server without a GPU. The "tts_models/multilingual/multi-dataset/your_tts" model (article) is actually quite good for English and French. (Funnily enough, for French you get a good enough result with speaker_id="male-pt-3\n" and language_id="fr-fr".)

The backend part will consist of a docker-compose file and one or several docker containers to generate the audio from text, suitable for production usage (several users communicating with a gpt-3.5 agent at the same time in different rooms).
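
Roughly something like this (only a sketch, assuming the official ghcr.io/coqui-ai/tts-cpu image and its documented server script; the final file will differ):

services:
  tts:
    image: ghcr.io/coqui-ai/tts-cpu
    entrypoint: python3
    # start the bundled web server with the your_tts model; it listens on port 5002
    command: TTS/server/server.py --model_name tts_models/multilingual/multi-dataset/your_tts
    ports:
      - "5002:5002"
    restart: unless-stopped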

I'm also working on an aframe component that splits the text on punctuation into chunks, does a fetch call for each chunk to the coqui tts service, and plays the audio chunks sequentially. For the fetch call and playing the received audio file, see their code.
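
The core loop looks something like this (a simplified sketch, not the component's actual code; speakText and TTS_SERVER are placeholders, and the /api/tts query parameters are the stock coqui tts server's):

const TTS_SERVER = "https://tts.example.com"; // placeholder for your self-hosted instance

async function speakText(text) {
  // split on punctuation, then fetch and play each chunk one after the other
  const chunks = text.split(/(?<=[.!?;:])\s+/).filter((c) => c.length > 0);
  for (const chunk of chunks) {
    const url =
      `${TTS_SERVER}/api/tts?text=${encodeURIComponent(chunk)}` +
      `&speaker_id=${encodeURIComponent("male-pt-3\n")}&language_id=fr-fr`;
    const res = await fetch(url);
    const blob = await res.blob(); // the server answers with a wav file
    const audio = new Audio(URL.createObjectURL(blob));
    await new Promise((resolve) => {
      audio.onended = resolve;
      audio.play();
    });
  }
}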

I'm working on it for my current project. When I'm done implementing it, I'll publish the source in a private repo, with instructions on how to self-host it and use the aframe component, for my $10 tier monthly sponsors. Access to the repo will become public 4 months later.

Resources:

Alternatives:

vincentfretin commented 11 months ago

Also see:

vincentfretin commented 11 months ago

An example in French with the following text: "Ici, vous pouvez voir trois modules actuellement fermés. Le module blanc est le réfectoire, qui sert pour le déjeuner. Dans le module orange, vous trouverez la machine à café, ainsi que les vestiaires et les casiers. Enfin, le module bleu est destiné aux toilettes et aux douches." (English: "Here you can see three modules, currently closed. The white module is the cafeteria, used for lunch. In the orange module you will find the coffee machine, as well as the changing rooms and lockers. Finally, the blue module is for the toilets and showers.") with the "tts_models/multilingual/multi-dataset/your_tts" model, speaker_id="male-pt-3\n" and language_id="fr-fr"

https://github.com/c-frame/sponsorship/assets/112249/8dfe093a-86eb-4e9a-a812-29afec0cbbd9 (sound only, not a video)

KooIaIa commented 11 months ago

I would love to help sponsor this work - I'll look into GitHub's system. I am a huge fan of open source speech tech - my favorite right now is Festival Lite + Wasm, but the WASM ecosystem isn't there yet today for audio output.

Would your method work as an open source, cross-platform polyfill for SpeechSynthesis JavaScript support?

KooIaIa commented 11 months ago

Does the output of this work require a server?

vincentfretin commented 11 months ago

You will need a server running docker, a small VPS with Ubuntu 22.04 for example. It will work everywhere: it's just a fetch call to the hosted coqui tts webservice, then it plays the downloaded wav file with an audio element.

Currently on the Meta browser on Quest 1 (which isn't updated anymore), window.speechSynthesis and window.SpeechSynthesisUtterance are undefined. So I guess yes, we can do a polyfill. I can go in that direction. But I'll also write an alternative API so you can force it to always go through the coqui server even if window.speechSynthesis is defined.
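
A first cut of that polyfill could look roughly like this (only a sketch; the server URL and ids are placeholders, and pitch/rate would be accepted but ignored):

if (!window.speechSynthesis) {
  window.SpeechSynthesisUtterance = class {
    constructor(text) {
      this.text = text;
      this.pitch = 1; // kept for API compatibility, ignored by coqui tts
      this.rate = 1;
      this.voice = null;
    }
  };
  window.speechSynthesis = {
    speaking: false,
    speak(utterance) {
      this.speaking = true;
      const url =
        "https://tts.example.com/api/tts?text=" + encodeURIComponent(utterance.text) +
        "&speaker_id=" + encodeURIComponent("male-pt-3\n") + "&language_id=fr-fr";
      const audio = new Audio(url); // the browser streams the wav from the GET url
      audio.onended = () => { this.speaking = false; };
      audio.play();
    },
  };
}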

With the speechSynthesis API I'm currently using the speechSynthesis.onvoiceschanged callback to select the preferred voice, and the speechSynthesis.speaking flag to know whether it's currently speaking. To speak:

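// this.voice was selected earlier in the onvoiceschanged callback; chunk is one piece of the text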
const msg = new SpeechSynthesisUtterance(chunk);
msg.pitch = this.data.pitch;
msg.rate = this.data.rate;
msg.voice = this.voice;
speechSynthesis.speak(msg);

Pitch and rate wouldn't do anything in the polyfill implementation that goes through coqui tts. I see there are other flags in the speechSynthesis API, like pending and paused, and some other methods that I'm not currently using. Do you use a specific part of the API I haven't listed above?

KooIaIa commented 11 months ago

Thank you for the detailed explanation. I'm looking for a polyfill-capable solution to add SpeechSynthesis to any website in a free and open source way, and an Ubuntu Linux server based solution will not provide that (it requires two computers). I'm designing for local-first, offline-capable AR/VR websites. I totally see the benefits of speech running on a separate computer (like if it is truly an AI entity), and there are many great server-based solutions today that are already free and open source. If these AI models are what you're personally aiming for, then maybe they will be able to run locally on WebGPU cheaply in a year or two, and our goals may cross.

It looks like you're closely following the pattern of the standard JavaScript API, speak(utterance), yeah, and it's neat you're expanding the API for yourself. Thanks for explaining again - your open source activity is wonderful and this sponsorship system you have is neat. If you one day do work like this that runs locally in this domain, I'm excited to sponsor it. Hopefully new headsets like the Quest 3 or Deckard will have some extra juice to do TTS in a web worker.

vincentfretin commented 11 months ago

Thanks for the kind words. You should keep an eye on the implementation of the bark TTS model running with ggml. You should eventually be able to run it with WebAssembly, like whisper.cpp.

KooIaIa commented 11 months ago

Thanks, that looks like the scale I'm aiming for! Maybe we will end up with a good open source ecosystem that can run locally or remotely. Artificial/virtual speech is so fundamentally important for so many things - AI communication, or even just humans wanting to read a book with their ears instead of their eyes. I'm very excited to see how this space grows and becomes standard and friendly - thanks for the ggml model tip!

vincentfretin commented 11 months ago

The YourTTS model says really weird things in French when there is ":" or "?" in the text. :D It also doesn't know how to pronounce abbreviations like "etc.". So I cheat:

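// strip the punctuation that trips up the model, and spell out "etc." phonetically in French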
text = text.replaceAll(": ", "").replaceAll("?", "").replaceAll("etc.", "ète cétéra");

vincentfretin commented 11 months ago

With the pronunciation mistakes: yourtts_example1.webm

With the line that removes punctuation and forces the pronunciation of "etc.": yourtts_example2.webm

I'll now test new voices with the new YourTTS checkpoint, or generate my own.

vincentfretin commented 3 months ago

You may know that the Coqui company shut down, just after they started allowing commercial use of xtts output with a yearly license. So currently you can't purchase a license to use the xtts model commercially. That's unfortunate.

You can still use the inference engine (it's MPL-2.0 licensed) and the XTTS model (CPML, so it can't be used commercially) though. FYI, there was a streaming server for xtts: https://github.com/coqui-ai/xtts-streaming-server

Today I'm using the openai tts voices https://platform.openai.com/docs/guides/text-to-speech; they work well in English and French with the same voice. I'm using them through a small nodejs process with the fastify framework.
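
Something along these lines (a minimal sketch of such a proxy, not my actual code; the /tts route, port, model, and voice are example choices, and the request shape is from the OpenAI docs):

import Fastify from "fastify";

const fastify = Fastify();

// forward a text string to the OpenAI speech endpoint and send the audio back
fastify.post("/tts", async (request, reply) => {
  const { text } = request.body;
  const res = await fetch("https://api.openai.com/v1/audio/speech", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "tts-1", voice: "alloy", input: text }),
  });
  reply.type("audio/mpeg"); // the default response format is mp3
  return reply.send(Buffer.from(await res.arrayBuffer()));
});

fastify.listen({ port: 3000 });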

vincentfretin commented 2 months ago

Open source TTS Piper: https://rhasspy.github.io/piper-samples/. For English, en_US-amy-medium.onnx is great. For French, fr_FR-upmc-medium.onnx (the jessica and pierre voices) is best, and inference speed is fast. I think it's finally a real open source alternative to Google TTS or OpenAI TTS for French. I'll do more testing with it.
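
Usage is a one-liner from the shell, following the piper README (the model path here is one of the French voices above):

echo "Bonjour tout le monde" | \
  ./piper --model fr_FR-upmc-medium.onnx --output_file bonjour.wav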