erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but it supports a variety of advanced features: a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and WAV file maintenance. It can also be used with third-party software via JSON calls.
GNU Affero General Public License v3.0

2000 character string limit #129

Closed: johnbenac closed this issue 4 months ago

johnbenac commented 4 months ago

Is your feature request related to a problem? Please describe. I've got a long string of text, but when I try to render it to audio, I get this:

String should have at most 2000 characters


Describe the solution you'd like I want AllTalk to chunk my audio so that I can have long generations.

Describe alternatives you've considered I can limit the length of what I want to render to audio, but I don't want to!

Additional context This was what Google Gemini came back with.

erew123 commented 4 months ago

Hi @johnbenac

I can give you 3x options, depending on the end result you want:

1) There is the TTS Generator, which can generate unlimited amounts of text and compile them into one audio file (if you want): https://github.com/erew123/alltalk_tts?tab=readme-ov-file#-alltalk-tts-generator
2) Streaming generation has no limit on it, though I am positive it will break down at some size. It will not generate WAV files and you need to use/build some kind of audio return: https://github.com/erew123/alltalk_tts?tab=readme-ov-file#-tts-generation-endpoint-streaming-generation
3) You can change the limit set on line 886 of tts_server.py to something higher; however, you may find this has undesired effects on generation.
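Incidentally, the "String should have at most 2000 characters" error in the first post is the wording of a Pydantic length constraint, so the check behind option 3 presumably looks something like the following minimal sketch (the class and field names here are illustrative, not AllTalk's actual identifiers):

```python
# Illustrative sketch only: the real limit lives around line 886 of
# tts_server.py; this just shows the general shape of such a check.
from pydantic import BaseModel, Field

class TTSGenerateRequest(BaseModel):
    # Raising max_length lifts the 2000-character cap, but very long
    # inputs may still misbehave further down the generation pipeline.
    text_input: str = Field(..., max_length=2000)
```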


Beyond that, you would preferably need to split the text into separate generation requests within your own app before sending it over.

Thanks

johnbenac commented 4 months ago

Can you update that first link? It's the same as the second link.

erew123 commented 4 months ago

Hi @johnbenac

The 2x links look different to me, but maybe something odd is occurring in the way it's forwarding you. Both go to locations on the front page of the AllTalk GitHub. If you look on the main AllTalk page you will find the AllTalk TTS Generator section, and very close to the bottom of the page, the API call for sending to streaming.

johnbenac commented 4 months ago

What if AllTalk split incoming text into chunks, and then merged those chunks together before returning them? I was thinking of doing something like this in SillyTavern, but I'm not sure that we want to encumber the app (SillyTavern in this case) with the audio libraries needed to merge segments that come back from AllTalk.

So if AllTalk gets a really long bit of text, it can search for the paragraph break closest to the middle of the text. Then, for each of those chunks, if they are small enough, it renders them and concatenates them together before returning them. If those segments are too big, it can break them down again, searching for the paragraph break nearest to the middle of the too-long chunk.

I think this might be easier and more effectively done by AllTalk than by the application. It's easy enough to split text at the application level, but for SillyTavern, splitting text is not easily done across the messages that are having audio generated.

I think it's a pretty safe bet to assume that most text is going to have paragraph breaks, and if it doesn't, AllTalk can look for sentence breaks (or periods followed by a space).
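A minimal sketch of that recursive midpoint split, assuming paragraph breaks are blank lines (illustrative only, not AllTalk code):

```python
# Illustrative: recursively split at the break nearest the middle
# until every chunk fits under the character limit.
def split_text(text: str, limit: int = 2000) -> list[str]:
    if len(text) <= limit:
        return [text]
    middle = len(text) // 2
    # Prefer paragraph breaks; fall back to ". " sentence breaks.
    candidates = [i + 2 for i in range(len(text) - 1) if text[i:i + 2] == "\n\n"]
    if not candidates:
        candidates = [i + 2 for i in range(len(text) - 1) if text[i:i + 2] == ". "]
    if not candidates:
        return [text]  # no sensible break point; leave the chunk whole
    split_at = min(candidates, key=lambda i: abs(i - middle))
    return split_text(text[:split_at].rstrip(), limit) + split_text(text[split_at:], limit)
```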

erew123 commented 4 months ago

Hi @johnbenac

SillyTavern already supports AllTalk: https://docs.sillytavern.app/extras/extensions/alltalk/ I wrote the extension about 2 months ago and it should show up if you have an up-to-date version of SillyTavern. https://github.com/erew123/alltalk_tts/tree/main/templates/STfiles (all these files are included with ST now).

Both TTS methods are fully supported.

As far as splitting down paragraphs goes, that's fully possible to do from a coding standpoint. Most of the text filtering options within the API remove paragraph breaks, as they can cause TTS audio generation, err, "strange noises" (let's call them that). There are, however, restrictions within the XTTS AI model as to how much you can push into the model at one time, so that can add complexity to how exactly you split/break down larger chunks of data.

Splitting paragraphs would also require the Narrator to be disabled, I think, as it's already quite a complicated task to split what is Narration and what is Character. So the Narrator would be a feature that couldn't be used on such a large block of text being split.

Further to that, the Narrator (if used) splits the audio into multiple separate generations, then combines them at the end before returning the WAV.

All that splitting and re-combining is handled within AllTalk and not directly within SillyTavern.

Larger than 2000 characters is a lot, though. Are you really seeing generations larger than that being sent by SillyTavern? Or could you further describe the use case as to where the issue lies (what's creating such large text generations)?


Thanks

johnbenac commented 4 months ago

I just got a generation from Claude 3 Opus that is 7886 characters and 1336 words. It's a single SillyTavern message, the 10th message in a one-character, non-group chat. It's in response to a two-word user message. The author's note is "The heart of all good drama is tension and conflict. In this case, inner conflict. Convey good storytelling through vivid imagery, kinesthetic storytelling, environmental storytelling, sartorial detail, corporeal aesthetics, and inner monologue, show don't tell, subtlety and subtext!"

There are seven brief lines of dialogue from the character within this one generation, which I would want rendered in the character voice; the rest of the 33 paragraphs in this single generation is narration, which I would want rendered in the narrator voice.

With Claude 3 opus, and other models coming, these generations are going to get longer and longer.

And it's all good quality, in character... useful. If only I could get it narrated!

It's a very good message, and I want it to be narrated!

erew123 commented 4 months ago

Hi @johnbenac

Does the generated text follow the format where narrator text has asterisks on either side of the messages? Please see here to understand what I'm asking:

https://github.com/erew123/alltalk_tts?tab=readme-ov-file#-a-note-on-character-cards--greeting-messages

Either way, that's a hell of a lot of text. Is it possible? Well, in theory, yes, but I would only do it on narrated text, which already gets split. However (off the top of my head), a chunk of code would have to be written to perform a second assessment of blocks of text and split any that are over-sized for the model to generate in one go, then run intermediate TTS generations, then concatenate those generations into 1x WAV, which then has to be handed back up into the main narrator/character routine for adding to an array that would be dealing with other generations as necessary, and then a final generation/concatenation to a WAV file that can be handed over to the user. The API would also need some revision to handle larger blocks of text with the narrator (only).

There is no way to just hand that amount of text over to the XTTS model, as it won't deal with that much text at once. The narration splitting as it stands won't split text into small enough chunks for the XTTS model to handle when dealing with such large blocks of text. I also have no idea what kind of memory overhead may occur on such large text blocks.

There are other considerations as well (coming back to my first point) about how the text from something like Claude would be generated across paragraphs, i.e. where it's putting asterisks etc. to separate narrator/characters. E.g., let's say there are 2x paragraphs that are to be narrated, one paragraph after the next... does Claude present them as:

*paragraph 1* *paragraph 2*

or

*paragraph 1 paragraph 2*

This kind of thing can majorly affect how I would have to approach narration splitting, and some existing filters.

So is it possible? Well, theoretically, yes. However, I would estimate 14 to 20 hours of work to get something in place and test (which is key) that it's doing what it should be doing, without it either doing something unexpected or causing a crash of some kind.

So I'll add it to the list of features requested, but it is a decent undertaking for sure, and if I do choose to attack it, it's certainly something where I would want to think about the implementation a little more.

erew123 commented 4 months ago

Feature request list: https://github.com/erew123/alltalk_tts/discussions/74

johnbenac commented 4 months ago

I don't have a NovelAI account, but it looks like the NovelAI integration with SillyTavern already chunks the audio. Maybe studying the code for the NovelAI TTS can show a way that this can be accomplished more easily.

erew123 commented 4 months ago

Hi @johnbenac

Breaking it into chunks is easy. Breaking it into chunks with Narration and character split, honestly that's a whole different ball game.

Novel, all it's doing is splitting the text block into chunks of max 1000 characters; it doesn't handle narration.

If you just want the 1x voice, no narration, splitting it is easy as hell. We can just say every time there is a full stop, generate that text, and at the end build it all into one WAV file.
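A minimal sketch of that single-voice path: split on full stops, generate each piece, then stitch the WAV files together with Python's standard library. Illustrative only; generate_tts() is a hypothetical stand-in for the real generation call:

```python
# Illustrative: single-voice chunking plus WAV concatenation.
import wave

def concatenate_wavs(wav_paths: list[str], out_path: str) -> None:
    """Join several same-format WAV files into one output file."""
    with wave.open(out_path, "wb") as out:
        for i, path in enumerate(wav_paths):
            with wave.open(path, "rb") as clip:
                if i == 0:
                    out.setparams(clip.getparams())  # copy format from first clip
                out.writeframes(clip.readframes(clip.getnframes()))

def render_long_text(text: str) -> str:
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    wav_paths = [generate_tts(s) for s in sentences]  # hypothetical TTS call
    concatenate_wavs(wav_paths, "combined.wav")
    return "combined.wav"
```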

Narration has to handle a LOT of issues around the asterisks, quotation marks, and text that contains neither. Narrated text is surrounded by asterisks and character text by quotations. If those asterisks or quotations get broken apart because of how you split the text down, then you won't get narration at all. The other consideration here is that the XTTS AI model can only handle generating around 250 characters at a time, so you HAVE to split the text down if it's over 250 characters. With huge blocks of paragraphs as you are suggesting Claude sends, where are the asterisks in that? E.g.:

***** PARAGRAPH 1 - Here is a huge block of text that starts with an asterisk and so is something that is to be narrated, if we can find the closing asterisk that surrounds narrated text.........

PARAGRAPH 2 - 300 words long

PARAGRAPH 3 - 140 words long

PARAGRAPH 4 - 200 words long

PARAGRAPH 5 - 270 words long

etc...................

PARAGRAPH 23 - 200 words long

PARAGRAPH 24 - and here is the final bit, which is 24 paragraphs down, but we finally only reach the closing asterisk here *****

"Finally we have something the character can say"
"and something else the character can say"

If I say, OK, just split that every 1000 characters, well, that means we send over the first 1000 characters of text (up to paragraph 4, maybe), which is narrated text. BUT, because the closing asterisks won't be in those first 1000 characters, how do we know it's the Narrator? There are now no closing asterisks in the text to be generated, nor any asterisks in the next 1000 characters to be sent.

How do we know whether the next block of text sent over is Narrator or Character? We can't just take the 1000 characters with asterisks and think that will be OK, because what if we hit a character quote in the middle of the next 1000 characters of text we split out? Etc. It gets complicated fast.

So not only would you have to handle splitting of the entire block of text for narration before it's sent to be generated, you have to deal with re-compilation, and you have to look at how/where the AI places asterisks and quotations, etc. E.g., is the AI putting asterisks around every paragraph that's narration, OR just around the start paragraph and end paragraph (as in my example shown above)?

I know this sounds like something that should be simple, but it's not when you get into the meat of it. For the narration bits I worked on, I probably put 25-30 hours into dealing with issues and outlier problems. I did something very different from the standard narration that people use. It was hard to get it working correctly and handling lots of different AI models.

So if you just want large blocks of text to be the 1x voice, then I can do that pretty simply. Narration will be a completely different issue.

I hope that explains the complexities a little better.

Thanks

johnbenac commented 4 months ago

Well, I don't think that AllTalk needs to be able to withstand any syntax. Even as it is now, I sometimes get:

*john doe walks up to the counter. "I'll take a hot dog" John waits for the hot dog*

And it won't work, because the quotes are within the asterisks.

Sometimes, I just go in and change the text before sending it to get narrated:

john doe walks up to the counter. "I'll take a hot dog" *John waits for the hot dog*

And I can do that manually. And what I usually do anyway is have the setting where things that are not in quotes or asterisks are the narrator voice, and things that are in quotes are the character.

I think that if you make this chunking work, and I had this setting where all text that isn't in quotes is narrated (unless those quotes are in asterisks), then you wouldn't have to change any of the logic. It just wouldn't work for:

*John Doe thinks about eating a hot dog. <over 2000 characters of additional text> John Doe walks up to the counter.*

"Let me think about what to order. Hmm. <over 2000 characters of additional dialogue> I'll take a hot dog"

*John waits for the hot dog. <additional 2000 characters of narration> John gets the hot dog*

If this were the incoming text, AllTalk could build the chunks not based on whichever paragraph break was in the middle, as sometimes that might be in the middle of dialogue. Rather, AllTalk could search for dialogue sections that are over 2000 characters. If there are any dialogue sections that by themselves are over 2000 characters, then AllTalk returns an error, perhaps an audio file stating the nature of the error.

If, however, there are no segments of text where single sections in quotes are over 2000 characters (which I think is pretty likely), then those are the "big rocks" in the jar of sand, if you will, and they are the first pass at how the segments break down. After that, the narration should be easier to chop up, especially if you aren't worried about asterisk sections spanning paragraphs or chunks, because this is primarily for text where anything not in quotes, asterisks or not, is done in the narrator voice.

You could have this sequence in the code: 1) parse the text according to dialogue sections; 2) parse the text between the dialogue into paragraphs; 3) combine these smaller sections of dialogue and non-dialogue paragraphs into segments as large as possible without exceeding the 2000 character limit; 4) render these <2000 character chunks, which may include dialogue, dialogue and narration, or just narration; 5) merge these audio files; 6) return them to the application.

AllTalk is never going to always get the formatting right, and I think it's OK, especially at this point, to put some burden on the user at runtime to edit the text so that it conforms.

I think the simplest thing would be to ignore all asterisks for any incoming text that is over 2000 characters, and render anything in quotes as the character and anything else as the narrator. That way, it wouldn't matter whether quotes were enclosed in asterisks, because you would be ignoring asterisks. The text frontend could still use asterisks to format the text in italics, but AllTalk wouldn't have to worry about whether something is in asterisks, just whether it is in quotes.
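A minimal sketch of that quote-only rule (steps 1-2 of the sequence above): strip asterisks entirely and alternate narrator/character on double quotes. Illustrative only, not AllTalk's actual parser:

```python
# Illustrative: split text into (voice, text) spans using only double
# quotes; asterisks are stripped and ignored, per the proposal above.
import re

def split_voices(text: str) -> list[tuple[str, str]]:
    text = text.replace("*", "")  # ignore asterisks entirely
    spans = []
    for match in re.finditer(r'"([^"]*)"|([^"]+)', text):
        quoted, plain = match.groups()
        if quoted is not None and quoted.strip():
            spans.append(("character", quoted.strip()))
        elif plain is not None and plain.strip():
            spans.append(("narrator", plain.strip()))
    return spans

# split_voices('*John walks up.* "I\'ll take a hot dog" *John waits.*')
# -> [("narrator", "John walks up."),
#     ("character", "I'll take a hot dog"),
#     ("narrator", "John waits.")]
```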


erew123 commented 4 months ago

I appreciate what you are saying, but it is more complicated than that. I honestly thought, when I set out to build a narrator, that it would be a simple task. Most narrators just flip when they hit an asterisk or a quotation mark, which is nice and simple but gives very mixed results. I worked hard to find another way to do it, to handle lots of edge cases, and to do text cleaning to try to get the best output.

1) As mentioned, the XTTS AI model can only handle generating 250 characters at a time, so you have to abide by that limit at the point of generation. Even if you send over 2000 characters, things still have to be broken down into max 250-character chunks AND they HAVE to be complete sentences, so there is a level of complexity to deal with there. You also want to ensure that a 250-character chunk is ALL character OR narrator, as the model can only deal with one voice at the time of generation (see the sketch after this list).

2) The narrated and non-narrated text splitting has to be handled all in one chunk. If you just split the text in the middle of something the character is saying, it will not be generated as one flowing sentence, resulting in it being pronounced incorrectly, with the intonation and emotion lost. Additionally, the second part of the generation, which no longer has quotes, will be generated as.............??? Splitting this stuff down to a level where it worked was a large undertaking for me, and adding the breakdown of very large chunks of text is another level to figure out.

3) Please bear in mind I have a compatibility question to deal with too. I have to ensure that any changes made don't break what already exists, so that is an additional level of testing, checking, etc. I am not going to break what already exists. If you have a model that is responding correctly, it works fantastically well, and I put a hell of a lot of effort into it.

4) Another issue you face is that if you only send over 2000 characters to be generated (let's say out of 5000 characters), well, the only thing AllTalk knows is that there are 2000 characters to turn into a WAV file and return to SillyTavern. That's literally what it will do: send a WAV file back. As far as I am aware, SillyTavern will not queue a second audio file when you send the next 2000 characters over; it will just play the next WAV file the second it's returned, even if you are midway through listening to the current WAV file. (Research and testing needed here.)
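A minimal sketch of the constraint in point 1: pack complete sentences into chunks that never exceed the per-generation limit. The 250-character figure is from the comment above; the splitting itself is illustrative, not AllTalk's code:

```python
# Illustrative: pack complete sentences into <=250-character chunks,
# never breaking mid-sentence (per the XTTS limit described above).
import re

MAX_CHARS = 250  # per-generation limit of the XTTS model

def chunk_sentences(text: str, limit: int = MAX_CHARS) -> list[str]:
    # Naive sentence split: ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not current:
            # A lone over-long sentence stays whole here; real code
            # would need a fallback split for that case.
            current = sentence
        elif len(current) + 1 + len(sentence) <= limit:
            current += " " + sentence
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```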

As mentioned earlier, I've added it to the features request list and I will continue to mull it over in my head and see where I get with the idea. Please also understand that I built AllTalk for passion and fun in my own personal time; along with dealing with support requests (here, Reddit, etc.), there is pretty much only myself doing anything with it. So if/as/when I work on it, one of my larger focuses is trying to reduce my support load, maybe fix bugs, or introduce new features. I'll see where I get to in my thoughts on it and potentially tackle it somewhere down the line.

johnbenac commented 3 months ago

Hi, so there is an error that outputs to the browser JS console log (screenshot in the first message in this thread) that says what the problem is when you have too much text, but there is no indication in the command-line terminal actually running AllTalk of what the issue is.

My use case is this: I run AllTalk on my computer, and I run SillyTavern on my android mobile device.

My computer is slow. The two biggest reasons that a generation won't happen are: 1) I haven't properly connected SillyTavern to AllTalk, or my Android to my network, in which case AllTalk never gets the request;

or

2) the length of text to be rendered to speech is too long.

I get no feedback from ST on Android in either case as to why a generation won't happen. There is no toast message. That actually might be the simplest way to let me know. Anyway...

So... looking at the AllTalk terminal (which I do using Chrome Remote Desktop from my Android) doesn't tell me which is the reason my generation isn't happening, because nothing outputs to the terminal in the case of the text being too long.

I request either:

1) update the ST code to output toast messages for the various reasons that a generation doesn't happen (e.g. no connection, or text too long), or 2) output to the terminal an error message containing the failed HTTP response that is currently displayed only in the browser console log. Obviously, I don't see the console log when running SillyTavern from my Android mobile device.

Or both!!

erew123 commented 3 months ago

Hi @johnbenac

So the error in the browser developer console is the JSON return message from the API hitting the barrier I mentioned in the 2nd post in this chain (I've included that snippet below).

I should be able to add an output at the AllTalk command prompt/terminal window to notify you that a limit was reached. As far as notifying back into SillyTavern, however, this may be possible somehow, BUT if so, it's not a documented feature in SillyTavern's TTS extension developers' guide (which can be found in \SillyTavern\public\scripts\extensions\tts\readme.md), and I don't know SillyTavern's inner code well enough to say whether there is some way to raise an error.

  1. You can change the limit set on line 886 of tts_server.py to something higher, however, you may find this has undesired effect on generation.
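A hypothetical sketch of the kind of terminal notification being discussed; this is illustrative, not the code from the actual commit:

```python
# Hypothetical: print a console warning when the cap is hit.
MAX_CHARS = 2000  # the limit discussed in this thread

def check_length(text_input: str) -> bool:
    if len(text_input) > MAX_CHARS:
        print(f"[AllTalk TTS] Request refused: {len(text_input)} "
              f"characters received, limit is {MAX_CHARS}.")
        return False
    return True
```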


johnbenac commented 3 months ago

OK, well, putting something in the terminal window should help me. Maybe I'll poke around in the ST code to see if I can get it to output something. Right now, if you type "/echo Hello World" it outputs a temporary message in the style that I think would be helpful.

I just made this pull request for SillyTavern, which should display these messages for lots of TTS errors, not just the ones that I described, and not just for AllTalk.


https://github.dev/johnbenac/SillyTavern/tree/TTS_Toastr_Error_Message

This PR may not be accepted by ST, but if you want to wait to see whether it is accepted, then I'd be fine with having my solution in the application. I do think that others may benefit from your putting more debugging info in the terminal.

erew123 commented 3 months ago

Sounds good. If they do accept it, then I'll be able to do something with that in future.

As for AllTalk terminal output: https://github.com/erew123/alltalk_tts/commit/fff01107c89cea9e7f0405e247da0f39637dc957

That should give you pretty much what you want.

Thanks

johnbenac commented 3 months ago

Great! The ST PR was also approved.

GamingDaveUk commented 2 months ago

I have the same issue at the minute. I see this issue shows as closed and that a SillyTavern pull request was approved a month ago, but this is a fresh install from today. Has an update broken the fix?

erew123 commented 2 months ago

Hi @GamingDaveUk

If you want to extend the number of characters AllTalk processes, you can do so by following option 3 from my reply above: https://github.com/erew123/alltalk_tts/issues/129#issuecomment-1998588593

I have no idea about the state of @johnbenac's PR and SillyTavern, but he may reply at some point.

That aside, AllTalk v2, which I hope to release in a few weeks, at least as a beta, will have up to a 10,000 character limit which can be set through the interface.

[screenshot: the v2 interface setting for the character limit]

Thanks

GamingDaveUk commented 2 months ago


Currently using option 1... that's working well. I will wait for the v2 fix (I am not keen on editing files, as it makes it hard to do git pulls lol).

Can I request a way to export as MP3, though? Currently I use SillyTavern to create funny stories from a Scribe, a Newspaper, and a slightly unhinged PM that does press releases, for the benefit of the guys in our gaming group. I was using XTTS directly, but the development on that is somewhat hit and miss, so I was looking for an alternative when I found your code. To put the file into Discord I have to convert it from WAV to MP3 or it's too big (and I get moaned at by the guys lol). Being able to use that TTS generator page you mention as a fix, but exporting as an MP3, would be a godsend.

On a side note, your code is appreciated a ton, thank you.

erew123 commented 2 months ago

@GamingDaveUk RE: Can I request a way to export as mp3 though?

It's been on my mind, though it will require transcoding, which, being multi-platform, makes the code a little more complex. I've a very large list of code to be getting through, so I may or may not do it by the time I get to the first v2 release. It will be a case of wait and see.

Thanks

johnbenac commented 1 month ago

Maybe December 2024, you can give me longer lengths (or indefinite lengths) with narration as a Christmas present?

erew123 commented 1 month ago

Hi @johnbenac. You can set up to a 10,000 character length within the API on v2 (as shown 2-3 posts up). There is no way for me to validate that each TTS engine is capable of handling that kind of character length, and many manufacturers' engines just will not be capable of dealing with longer lengths. Additionally, queuing, buffering, merging audio (especially streaming) and trying to feed that back to something that is expecting an individual audio file/stream is highly complicated or not possible in many cases. I may do more with queue management in future, but I've such a long list of things to implement at the moment that it's not top of the list.

johnbenac commented 1 month ago


The PR was approved, in that it tells you why the generation isn't working. On April 5th, I commented above showing a screenshot of the change where the user gets notified about the API error.

johnbenac commented 1 month ago


Hallelujah! It works! Christmas comes early this year! Thank you so much!!! This is exactly what I need, and it is going to make all the difference. Thanks a million!!!

I got v2 installed. I had one glitch:

Microsoft Windows [Version 10.0.19045.3803]
(c) Microsoft Corporation. All rights reserved.

F:\tts\alltalkv2\alltalk_tts>start_alltalk.bat
[AllTalk TTS]←[94m     _    _ _ ←[1;35m_____     _ _     ←[0m  _____ _____ ____
[AllTalk TTS]←[94m    / \  | | |←[1;35m_   _|_ _| | | __ ←[0m |_   _|_   _/ ___|
[AllTalk TTS]←[94m   / _ \ | | |←[1;35m | |/ _ | | |/ / ←[0m   | |   | | \___ \
[AllTalk TTS]←[94m  / ___ \| | |←[1;35m | | (_| | |   <  ←[0m   | |   | |  ___) |
[AllTalk TTS]←[94m /_/   \_\_|_|←[1;35m |_|\__,_|_|_|\_\ ←[0m   |_|   |_| |____/
[AllTalk TTS]
[AllTalk TTS] ←[92mConfig file update: ←[93mNo Updates required←[0m
Traceback (most recent call last):
  File "F:\tts\alltalkv2\alltalk_tts\script.py", line 190, in <module>
    import gradio as gr
  File "F:\tts\alltalkv2\alltalk_tts\alltalk_environment\env\Lib\site-packages\gradio\__init__.py", line 3, in <module>
    import gradio._simple_templates
  File "F:\tts\alltalkv2\alltalk_tts\alltalk_environment\env\Lib\site-packages\gradio\_simple_templates\__init__.py", line 1, in <module>
    from .simpledropdown import SimpleDropdown
  File "F:\tts\alltalkv2\alltalk_tts\alltalk_environment\env\Lib\site-packages\gradio\_simple_templates\simpledropdown.py", line 6, in <module>
    from gradio.components.base import FormComponent
  File "F:\tts\alltalkv2\alltalk_tts\alltalk_environment\env\Lib\site-packages\gradio\components\__init__.py", line 1, in <module>
    from gradio.components.annotated_image import AnnotatedImage
  File "F:\tts\alltalkv2\alltalk_tts\alltalk_environment\env\Lib\site-packages\gradio\components\annotated_image.py", line 9, in <module>
    import PIL.Image
  File "F:\tts\alltalkv2\alltalk_tts\alltalk_environment\env\Lib\site-packages\PIL\Image.py", line 100, in <module>
    from . import _imaging as core
ImportError: DLL load failed while importing _imaging: The specified module could not be found.

(F:\tts\alltalkv2\alltalk_tts\alltalk_environment\env) F:\tts\alltalkv2\alltalk_tts>

and so I installed

pip install --upgrade --force-reinstall pillow

And I was on my way.

Also, for some reason, I had to redo all my voice mappings in ST, because the ones I had got zeroed out. Once I remap a character to a voice in the ST interface, it works just fine. It's strange.

Here is the console log in silly tavern:

Voicemap updated to {"[Default Voice]":"Sally.wav","John":"male_02.wav","Jane Doe":"female_06"}
eventemitter.js:52 Event emitted: settings_updated
index.js:861 Voicemap updated to {"[Default Voice]":"Sally.wav","John":"male_02.wav","Jane Doe":"female_06.wav"}

Now, this was after I had already remapped the default voice and the John character, but I did catch this change in the console log to share with you all.

As you can see, with the new v2, AllTalk wants the .wav file extension explicitly included, whereas v1 didn't need that. Maybe there is something you can do to help people make the transition. For a little bit there, I didn't know why it wasn't working!

I will admit, my code PR that got accepted by ST, where ST outputs as a toast message the error that AllTalk sends back, did help me know where to start looking for the problem.

So check whether other ST users have this problem, perhaps, and if they do, maybe warn them, or just figure out a way to make the migration transparent.

My main message is THANK YOU!!!!!!!

Thank you thank you.

This is great and perfect and awesome.