erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but it supports a variety of advanced features, such as a settings page, low-VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and wav-file maintenance. It can also be used with third-party software via JSON calls.
GNU Affero General Public License v3.0

Use Whisper to check the generated audio in AllTalk TTS Generator. #67

Closed — Suiyou closed this issue 8 months ago

Suiyou commented 8 months ago

Is your feature request related to a problem? Please describe. I primarily use the AllTalk TTS Generator for large quantities of text, e.g. audiobook generation (for personal use). Since the generator can regenerate a single chunk of text when the audio has problems, it's relatively easy to fix misreadings; what's not easy is tracking down which chunk has issues without listening to all of the generated audio.

Describe the solution you'd like I don't know if this is out of scope for this project, but using Whisper it would be possible to check the audio for issues.

Describe alternatives you've considered For example, I created an audiobook that resulted in a JSON file with 5558 entries, but when it finished I only had 5002 wav files. Two entries that illustrate why:

{ "id": 5552, "fileUrl": "http://127.0.0.1:7851/audio/TTS_1705084624.wav", "text": "Bonifatius appears in the color art, too.", "characterVoice": "voice.wav", "language": "en" },
{ "id": 5553, "fileUrl": "http://127.0.0.1:7851/audio/TTS_1705084624.wav", "text": "It’s raining grandfathers!", "characterVoice": "voice.wav", "language": "en" },

Additional context What I did is a bit of a hack: I used Whisper to transcribe each wav file (I had to make batches of 1,000 files at a time or the Gradio Whisper UI I used would throw an error), then I used a script to check every text entry in the JSON file.

script: compare.txt

That script lets me know, within a certain threshold (specified when I run the command), which lines I need to regenerate. I chose a threshold of less than 80% similarity, and it found 807 entries with issues:

example:
ID: 5552
JSON Text: bonifatius appears in the color art, too.
TXT Content: it's raining grandfathers.
Similarity: 26%

I decided to check for those kinds of discrepancies because Whisper is not perfect; it has trouble with names, for example.
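The attached compare.txt is not reproduced here, but the check it describes can be sketched with Python's standard difflib. This is a minimal illustration, not the original script; the function names and the shape of the transcripts mapping are assumptions.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score between two strings (case-insensitive)."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100


def find_mismatches(entries, transcripts, threshold=80.0):
    """Compare each JSON entry's text against its Whisper transcript.

    entries: list of dicts like {"id": 5552, "text": "..."} from the
    generator's JSON output.
    transcripts: dict mapping id -> Whisper-transcribed text (assumed layout).
    Returns (id, score) pairs whose similarity falls below the threshold.
    """
    flagged = []
    for entry in entries:
        transcript = transcripts.get(entry["id"], "")
        score = similarity(entry["text"], transcript)
        if score < threshold:
            flagged.append((entry["id"], round(score)))
    return flagged


# The pair of entries from the thread: the overwritten file transcribes
# to the wrong sentence, so entry 5552 is flagged with a low score.
entries = [{"id": 5552, "text": "Bonifatius appears in the color art, too."}]
transcripts = {5552: "it's raining grandfathers."}
print(find_mismatches(entries, transcripts))
```

Running this against the full 5,558-entry JSON with an 80% threshold is how mismatched or overwritten chunks can be surfaced without listening to the audio.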

erew123 commented 8 months ago

Hi @Suiyou I'm working on a larger update today, so I've taken a look at this too, and I believe I've found what's going on. I noticed that both of your sentences were quite short and, as you point out, the filenames were the same. The unique ID was not being carried for files generated at the same time: both files were generated so quickly that they got the same timestamp, causing one to overwrite the other. It looks as though the UUID was not being applied correctly.

So I've re-jigged the code and ensured the UUID is applied, which means that for any generations occurring in exactly the same second there are just over a further 1,048,000 possible file names, so the names should always remain unique.
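The scheme described above can be sketched as follows. The function name and prefix are illustrative, not the actual AllTalk code; a 5-character hex suffix is assumed because 16^5 = 1,048,576 matches the "just over 1,048,000" figure.

```python
import time
import uuid


def unique_output_name(prefix: str = "TTS") -> str:
    """Build a wav filename from the current timestamp plus a short
    UUID fragment, so two generations in the same second cannot collide."""
    timestamp = int(time.time())
    suffix = uuid.uuid4().hex[:5]  # 16**5 = 1,048,576 combinations per second
    return f"{prefix}_{timestamp}_{suffix}.wav"


# Two calls in the same second now yield distinct names.
print(unique_output_name())
print(unique_output_name())
```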

This should resolve that problem. Apologies for the issue. I will have the code uploaded to the site a little later today so you will be able to update. Below is a visual of one of the UUIDs being applied after the timestamp.

[screenshot: a UUID suffix appended after the timestamp in the generated filename]

Thanks

Suiyou commented 8 months ago

Yeah, I opted to use a chunk size of 1 to avoid the possibility of passing text that's over the character limit (I think it's 250); do you have any suggestions about this? Your solution should work great for those files that were overwritten in my case.

About what I said regarding Whisper: it would provide a way to check for misspoken text. It may be too much to ask, and I don't know how complicated it would be to implement, but it did help me identify various issues more or less quickly.

One more thing: I was going through the Generated TTS List and noticed that every time I regenerate a chunk the list jumps to the last page; I don't think that should be the default behavior. Thanks for everything.

EDIT: another issue I had is when trying to generate audio in Japanese: if I try generating a wav chunk it just doesn't show any progress in the terminal, and if I try streaming it says:

ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "C:\Users\shado\pinokio\api\text-generation-webui\installer_files\env\Lib\site-packages\starlette\responses.py", line 259, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "C:\Users\shado\pinokio\api\text-generation-webui\installer_files\env\Lib\site-packages\starlette\responses.py", line 255, in wrap
    await func()
  File "C:\Users\shado\pinokio\api\text-generation-webui\installer_files\env\Lib\site-packages\starlette\responses.py", line 232, in listen_for_disconnect
    message = await receive()
              ^^^^^^^^^^^^^^^
  File "C:\Users\shado\pinokio\api\text-generation-webui\installer_files\env\Lib\site-packages\uvicorn\protocols\http\h11_impl.py", line 538, in receive
    await self.message_event.wait()
  File "C:\Users\shado\pinokio\api\text-generation-webui\installer_files\env\Lib\asyncio\locks.py", line 213, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 1f9efa54490

During handling of the above exception, another exception occurred:

erew123 commented 8 months ago

Yeah, I opted to use a chunk size of 1 to avoid the possibility of passing text that's over the character limit (I think it's 250), do you have any suggestion about this?

Usually a sentence chunk size of 2-3 is fine; I normally go with 2. It's not an absolute that something over 250 characters will cause an issue, just more likely. 250 characters is quite a lot, though, and it's more to do with individual sentences breaching that. It's complicated, but I'd say you should be OK with 2. If you see it complaining at the command prompt about the 250-character length, that's actually the model complaining, not AllTalk. If you don't see it complaining, then things are splitting down OK. There's no absolute way to split things down without breaking the pronunciation etc.

About what I said regarding Whisper, it would provide a way to check for misspoken text, but it may be too much to ask and I don't know how complicated it would be to implement, but it did help me identify various issues more or less quickly.

Potentially in future, but not immediately. I've just made a lot of changes to the code, and I need to sit on that for a while and see if anything crops up before I dive into other things. So perhaps a future possibility; I'll make a note of it.

One more thing, I was going through the Generated TTS List and noticed that every time I regenerate a chunk the list jumps to the last page, I think that it shouldn't be the default behavior.

Will see what I can do. I hated that bit of code and was glad to see the back of it... but I'll see what I can do at some point!

another issue I had is when trying to generate audio in Japanese, if I try generating a wav chunk it just doesn't show any progress in the terminal, and if I try streaming it says:

Your system doesn't have a native Japanese character set, I'm guessing. In your Python environment, at the prompt do:

pip install cutlet>=0.3.0

pip install unidic-lite>=1.0.8

Or just git pull an update and install the requirements file again, and that should resolve it. As they are only small installs, a few megabytes, I've added them to the installer requirements files.

Suiyou commented 8 months ago

I know that the character limit is not actually rigid, but I just wanted to get the fewest errors; in fact, using a chunk size of 2, the third chunk it generates for the book I'm currently trying throws that warning. Well, nothing to do about it for now.

Potentially in future, but not immediately. I've just made a lot of changes to the code, and I need to sit on that for a while and see if anything crops up before I dive into other things. So perhaps a future possibility; I'll make a note of it.

Thanks, I just updated to the latest version, and it's currently running the same book as a test. So far it seems to be saving every file correctly.

Will see what I can do. I hated that bit of code and was glad to see the back of it... but I'll see what I can do at some point!

Sorry, just something I noticed while trying to fix the files.

I'll try what you suggested regarding Japanese text. Thanks once again, bye.