IhorShevchuk / piper-ios-app


Building a dependable voice #2

Open · str20tbl opened 1 month ago

str20tbl commented 1 month ago

Hello! Thank you for sharing this repo, it's amazing! I finally have a "sort of" working Welsh voice on mobile 😎

I work at Bangor University developing Welsh language tools and resources, and we are looking to improve upon this app and make it a little more dependable at generating speech.

I have trained an x-low quality Piper voice and can get longer utterances to work, but I'm getting some frustrating errors such as:

```
[u DA965CED-7534-46E1-BE82-F94878C9C04F:m ] [cymru.techiaith.piperapp.pipertts(1.0)] Connection to plugin invalidated while in use.

VoiceProvider: rendering failure The operation couldn’t be completed. (com.apple.coreaudio.avfaudio error -66749.) didFinish: 0 frameCount:512

Audio Unit encountered an error after sending some valid buffer
```

I am not an iOS expert, unfortunately, and there's little to be found by googling these errors...

Would you be able to give me a few pointers on where to start with this, or on the general tasks that will be needed to improve the dependability of the voices?

IhorShevchuk commented 1 month ago

Hello @str20tbl, based on the error it might be a case of the application getting killed while generating speech: https://osstatus.com/search/results?platform=all&framework=all&search=66749 (error -66749 appears to be kAudioComponentErr_InstanceInvalidated, which is reported when the process that published the Audio Unit is no longer running).

There are a number of reasons why an application can crash (get killed by the OS); however, I think the most likely one in your case is that it uses too much RAM. To reduce memory usage you may want to have a look here: https://github.com/IhorShevchuk/piper-ios-app/issues/1 , where @S-Ali-Zaidi updated the model to reduce memory usage.

str20tbl commented 1 month ago

Thanks for the quick reply. I'm using an iPhone 12 mini for testing, is that likely to be the issue?

My x_low model is only 20MB, and quantised it is 10MB, but I still get exactly the same behaviour: hardly any text can be spoken, and often the shortened utterance is repeated.

I am regularly getting about 4 words of synthesis before it bugs out.

S-Ali-Zaidi commented 1 month ago

> Thanks for the quick reply. I'm using an iPhone 12 mini for testing, is that likely to be the issue?
>
> My x_low model is only 20MB, and quantised it is 10MB, but I still get exactly the same behaviour: hardly any text can be spoken, and often the shortened utterance is repeated.
>
> I am regularly getting about 4 words of synthesis before it bugs out.

It is likely because you are using an iPhone 12 mini. Note that for Piper, or any VITS-based TTS system, the longer the input text, the more memory is needed for the model to run inference on it.

That is because Piper and other VITS-style neural networks create and work with vector embeddings for each individual phoneme within the input text.

So if you give a small, quantized model of around 10MB a short phrase or sentence, its total RAM usage will likely not go all that much higher. But if it is given longer input sentences, it may end up creating vectors for each phoneme within them, resulting in twice or even more RAM being needed to hold both the model itself and all the intermediate vector representations of the text created by the model. I've seen my fp16 Piper models (about 20MB in size) spike as high as 500MB of RAM usage on my MacBook when I try to have them process long run-on sentences.

And it seems the constraint on how much RAM iOS allows Piper to use within the Audio Unit Extension framework is somewhat dynamic, depending on how much RAM the device has in total.

On my Mac with 16GB of RAM -- as well as in my tests on simulations of the iPhone 15 Pro, which has 8GB of RAM, I think -- there were no issues; I experienced no cut-offs.

But in my real-life tests on my iPhone 13 Pro, as well as in simulations of older iOS devices, I was seeing issues similar to yours. On my device, I can give Piper short to medium-length sentences without any issue. If given longer, multi-clause sentences, I experience the same issue as you.

The only solution right now is either to use a newer device with more RAM, or to use something like the sherpa-onnx based Piper implementations, which do not use the Audio Unit Extension framework and are thus free to use much more RAM. The downside is that you will lose the ability to integrate your Piper voice as an iOS system voice, as it will be isolated to use within the app alone.

@IhorShevchuk However, I recently tried the new ElevenLabs Reader iOS app, and they have an interesting implementation: you simply select any text (or even any PDF, ePub, or webpage) and share it to the Reader app. The app takes the text, processes it server-side, and sends back chunks of audio at a time, which are played to you just as they would be by whatever audio API typical audio players use, such as VLC, Spotify, Audible, etc., including the ability to pause, rewind, and so on. It seems the entire generated audio is cached until you have the app run TTS on another text.

It will not allow for a system voice implementation... but I wonder if a similar implementation would at least allow people to have various texts and documents read to them by their on-device Piper model?
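
As a rough illustration of that idea, here is a minimal Swift sketch of in-app chunked playback. The `synthesizeChunk` closure is hypothetical and stands in for a call into the on-device Piper model; the 22050 Hz mono format is also just an assumption, not something from this repo:

```swift
import AVFoundation

// Sketch of in-app chunked playback: schedule one PCM buffer per sentence
// on an AVAudioPlayerNode, so peak RAM is bounded by the longest single
// chunk rather than by the whole document.
final class ChunkedReader {
    private let engine = AVAudioEngine()
    private let player = AVAudioPlayerNode()
    // Assumed output format; the actual Piper voice would dictate this.
    private let format = AVAudioFormat(standardFormatWithSampleRate: 22050, channels: 1)!

    init() throws {
        engine.attach(player)
        engine.connect(player, to: engine.mainMixerNode, format: format)
        try engine.start()
    }

    // `synthesizeChunk` is a placeholder for on-device Piper inference.
    func read(sentences: [String], synthesizeChunk: (String) -> AVAudioPCMBuffer) {
        player.play()
        for sentence in sentences {
            let buffer = synthesizeChunk(sentence) // one short inference call
            player.scheduleBuffer(buffer)          // queued buffers play back gaplessly
        }
    }
}
```

Pause would come for free via `player.pause()`; rewind would require keeping the generated buffers around and re-scheduling them, much like the caching the Reader app seems to do.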

str20tbl commented 1 month ago

Thank you both for the considered responses 😀

I've been starting to think along the lines of modifying the Piper C++ code to flush the audio buffer at specific sizes, so that less data is sent at a time regardless of the model size. Given that shorter sentences work, I thought simply batching via end of sentence would be sufficient with the smaller models.
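
To illustrate the fixed-size idea, here is a rough Swift sketch of what I mean on the consuming side -- the 22050 Hz sample rate and 512-frame capacity are placeholders I picked, not values from this repo:

```swift
import AVFoundation

// Sketch: copy synthesized Float32 samples out in fixed-size PCM buffers
// instead of one allocation covering the whole utterance, so the amount
// of audio handed over per step stays bounded.
func fixedSizeBuffers(from samples: [Float],
                      sampleRate: Double = 22050,
                      frameCapacity: AVAudioFrameCount = 512) -> [AVAudioPCMBuffer] {
    let format = AVAudioFormat(standardFormatWithSampleRate: sampleRate, channels: 1)!
    var buffers: [AVAudioPCMBuffer] = []
    var start = 0
    while start < samples.count {
        let count = min(Int(frameCapacity), samples.count - start)
        let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: frameCapacity)!
        samples.withUnsafeBufferPointer { src in
            // Copy this slice of the utterance into the buffer's first channel.
            buffer.floatChannelData![0].update(from: src.baseAddress! + start, count: count)
        }
        buffer.frameLength = AVAudioFrameCount(count)
        buffers.append(buffer)
        start += count
    }
    return buffers
}
```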

The large-texts issue seems like it should be fixable regardless, though, via chunking the text at an appropriate point. I've been trying to split up the utterances, but I'm clearly out of my depth and need to read up on the Apple SDK first. Roughly, I'm imagining a sentence splitter like the sketch below. I'll check back with you guys once I make some headway.
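
Here is the rough shape of the splitter I have in mind, using Apple's NaturalLanguage framework on the Swift side (the `sentences(in:)` helper is just an illustration, not code from this repo):

```swift
import Foundation
import NaturalLanguage

// Sketch: chunk input text at sentence boundaries before synthesis, so
// each Piper inference call sees a short input sequence.
func sentences(in text: String) -> [String] {
    let tokenizer = NLTokenizer(unit: .sentence)
    tokenizer.string = text
    var result: [String] = []
    tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, _ in
        let sentence = text[range].trimmingCharacters(in: .whitespacesAndNewlines)
        if !sentence.isEmpty { result.append(sentence) }
        return true // keep enumerating
    }
    return result
}

// Usage: synthesize chunk by chunk instead of the whole text at once, e.g.
// for chunk in sentences(in: longText) { synthesize(chunk) }
```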