erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but it supports a variety of advanced features, such as a settings page, low-VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and wav-file maintenance. It can also be used with third-party software via JSON calls.
GNU Affero General Public License v3.0

Use Whisper to check the generated audio in AllTalk TTS Generator. #67

Closed — Suiyou closed this issue 8 months ago

Suiyou commented 8 months ago

Is your feature request related to a problem? Please describe. I primarily use the AllTalk TTS Generator for large quantities of text, e.g. audiobook generation (for personal use). Since the generator can regenerate a single chunk of text when the audio has problems, it's relatively easy to fix misreadings; what's not easy is tracking down which chunk has issues without listening to all of the generated audio.

Describe the solution you'd like I don't know if this is out of scope for this project, but using Whisper it would be possible to check the audio for issues.

Describe alternatives you've considered For example, I created an audiobook that resulted in a JSON file with 5558 entries, but when it finished I only had 5002 wav files. Two entries that illustrate why:

{ "id": 5552, "fileUrl": "http://127.0.0.1:7851/audio/TTS_1705084624.wav", "text": "Bonifatius appears in the color art, too.", "characterVoice": "voice.wav", "language": "en" },
{ "id": 5553, "fileUrl": "http://127.0.0.1:7851/audio/TTS_1705084624.wav", "text": "It’s raining grandfathers!", "characterVoice": "voice.wav", "language": "en" },

Additional context What I did is a bit of a hack: I used Whisper to transcribe each wav file (I had to make batches of 1,000 files at a time or the Gradio Whisper UI I used would throw an error), then I used a script to check every text entry in the JSON file.

script: compare.txt

That script lets me know, within a certain threshold (specified when I run the command), which lines I need to regenerate. I chose a threshold of less than 80% similarity, and it found 807 entries with issues:

example:
ID: 5552
JSON Text: bonifatius appears in the color art, too.
TXT Content: it's raining grandfathers.
Similarity: 26%

I decided to check for those kinds of discrepancies because Whisper is not perfect; it has trouble with names, for example.
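The attached compare.txt is not reproduced here, but the check it describes can be sketched with Python's standard difflib. This is a minimal illustration, not the original script; the function names and the shape of the transcripts mapping are assumptions.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score between two strings (case-insensitive)."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100


def find_mismatches(entries, transcripts, threshold=80.0):
    """Compare each JSON entry's text against its Whisper transcript.

    entries: list of dicts like {"id": 5552, "text": "..."} from the
    generator's JSON output.
    transcripts: dict mapping id -> Whisper-transcribed text (assumed layout).
    Returns (id, score) pairs whose similarity falls below the threshold.
    """
    flagged = []
    for entry in entries:
        transcript = transcripts.get(entry["id"], "")
        score = similarity(entry["text"], transcript)
        if score < threshold:
            flagged.append((entry["id"], round(score)))
    return flagged


# The pair of entries from the thread: the overwritten file transcribes
# to the wrong sentence, so entry 5552 is flagged with a low score.
entries = [{"id": 5552, "text": "Bonifatius appears in the color art, too."}]
transcripts = {5552: "it's raining grandfathers."}
print(find_mismatches(entries, transcripts))
```

Running this against the full 5,558-entry JSON with an 80% threshold is how mismatched or overwritten chunks can be surfaced without listening to the audio.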

erew123 commented 8 months ago

Hi @Suiyou I'm working on a larger update today, so I've taken a look at this too, and I believe I've found what's going on. I noticed that both of your sentences were quite short and, as you point out, the filenames were the same. The unique ID was not being carried for files generated at the same time: both files were generated so quickly that they got the same timestamp, causing one to overwrite the other. It looks as though the UUID was not being applied correctly.

So I've re-jigged the code and ensured the UUID is applied, which means that for any generations occurring in exactly the same second there are just over a further 1,048,000 possible file names, so the names should always remain unique.
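The scheme described above can be sketched as follows. The function name and prefix are illustrative, not the actual AllTalk code; a 5-character hex suffix is assumed because 16^5 = 1,048,576 matches the "just over 1,048,000" figure.

```python
import time
import uuid


def unique_output_name(prefix: str = "TTS") -> str:
    """Build a wav filename from the current timestamp plus a short
    UUID fragment, so two generations in the same second cannot collide."""
    timestamp = int(time.time())
    suffix = uuid.uuid4().hex[:5]  # 16**5 = 1,048,576 combinations per second
    return f"{prefix}_{timestamp}_{suffix}.wav"


# Two calls in the same second now yield distinct names.
print(unique_output_name())
print(unique_output_name())
```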

This should resolve that problem. Apologies for the issue. I will have the code uploaded to the site a little later today so you will be able to update. Below is a visual of one of the UUIDs being applied after the timestamp.

[screenshot: a UUID suffix appended after the timestamp in the generated filename]

Thanks

Suiyou commented 8 months ago

Yeah, I opted to use a chunk size of 1 to avoid the possibility of passing text that's over the character limit (I think it's 250); do you have any suggestions about this? Your solution should work great for those files that were overwritten in my case.

About what I said regarding Whisper: it would provide a way to check for misspoken text. It may be too much to ask, and I don't know how complicated it would be to implement, but it did help me identify various issues more or less quickly.

One more thing: I was going through the Generated TTS List and noticed that every time I regenerate a chunk the list jumps to the last page; I don't think that should be the default behavior. Thanks for everything.

EDIT: another issue I had is when trying to generate audio in Japanese: if I try generating a wav chunk it just doesn't show any progress in the terminal, and if I try streaming it says:

ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "C:\Users\shado\pinokio\api\text-generation-webui\installer_files\env\Lib\site-packages\starlette\responses.py", line 259, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "C:\Users\shado\pinokio\api\text-generation-webui\installer_files\env\Lib\site-packages\starlette\responses.py", line 255, in wrap
    await func()
  File "C:\Users\shado\pinokio\api\text-generation-webui\installer_files\env\Lib\site-packages\starlette\responses.py", line 232, in listen_for_disconnect
    message = await receive()
              ^^^^^^^^^^^^^^^
  File "C:\Users\shado\pinokio\api\text-generation-webui\installer_files\env\Lib\site-packages\uvicorn\protocols\http\h11_impl.py", line 538, in receive
    await self.message_event.wait()
  File "C:\Users\shado\pinokio\api\text-generation-webui\installer_files\env\Lib\asyncio\locks.py", line 213, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 1f9efa54490

During handling of the above exception, another exception occurred:

erew123 commented 8 months ago

Yeah, I opted to use a chunk size of 1 to avoid the possibility of passing text that's over the character limit (I think it's 250), do you have any suggestion about this?

Usually a sentence chunk size of 2-3 is fine; I normally go with 2. It's not an absolute that something over 250 characters will cause an issue, just more likely. 250 characters is quite a lot, though, and it's more to do with individual sentences breaching that. It's complicated, but I'd say you should be OK with 2. If you see it complaining at the command prompt about the 250-character length, that's actually the model complaining, not AllTalk. If you don't see it complaining, then things are splitting down OK. There's no absolute way to split things down without breaking the pronunciation etc.

About what I said regarding Whisper, it would provide a way to check for misspoken text, but it may be too much to ask and I don't know how complicated it would be to implement, but it did help me identify various issues more or less quickly.

Potentially in future, but not immediately. I've just made a lot of changes to the code, and I need to sit on that for a while and see if anything crops up before I dive into other things. So perhaps a future possibility; I'll make a note of it.

One more thing, I was going through the Generated TTS List and noticed that every time I regenerate a chunk the list jumps to the last page, I think that it shouldn't be the default behavior.

Will see what I can do. I hated that bit of code and was glad to see the back of it... but I'll see what I can do at some point!

another issue I had is when trying to generate audio in Japanese, if I try generating a wav chunk it just doesn't show any progress in the terminal, and if I try streaming it says:

Your system doesn't have a native Japanese character set, I'm guessing. In your Python environment, at the prompt do:

pip install cutlet>=0.3.0

pip install unidic-lite>=1.0.8

Or just git pull an update and install the requirements file again, and that should resolve it. As they are only small installs, a few megabytes, I've added them to the installer requirements files.

Suiyou commented 8 months ago

I know that the character limit is not actually rigid, but I just wanted to get the fewest errors; in fact, using a chunk size of 2, the third chunk it generates for the book I'm currently trying throws that warning. Well, nothing to do about it for now.

Potentially in future, but not immediately. I've just made a lot of changes to the code, and I need to sit on that for a while and see if anything crops up before I dive into other things. So perhaps a future possibility; I'll make a note of it.

Thanks, I just updated to the latest version, and it's currently running the same book as a test. So far it seems to be saving every file correctly.

Will see what I can do. I hated that bit of code and was glad to see the back of it... but I'll see what I can do at some point!

Sorry, just something I noticed while trying to fix the files.

I'll try what you suggested regarding Japanese text. Thanks once again, bye.