erew123 / alltalk_tts


Reddit Continued #69

Closed · KungFuFurniture closed 10 months ago

KungFuFurniture commented 10 months ago

Continuing a conversation from Reddit regarding HTML.

KungFuFurniture commented 10 months ago

Just LMK what you need from me to continue the "experiment". ~ [New-Cryptographer793]

erew123 commented 10 months ago

Hi @KungFuFurniture

I've had a long think about this, as I mentioned on Reddit. Obviously testing this and trying to hunt down any issue will tie up my time and my computer's time quite a bit, so here is my suggested approach.

The reason I want to know what model + model template is used is that the model template helps decide/split what text-generation-webui treats as character/AI/assistant text versus user text. It could be that a specific model template is letting things through that others don't.

Beyond that, I want to verify whether both extensions receive the same types of text... and then figure out whether the actual Python TTS-to-file method (what Coqui uses, and what AllTalk's API TTS and API Local use) performs any additional level of filtering that could be cleaning out other data. In short, if there are any differences, I need to figure out what step they occur at (when text-generation-webui sends the data, or when Python's TTS filters text), as I am still confident that both systems get the same text to generate.

If after 100x generations on each nothing shows up and both operate the same way (and there is no further filtering I can find performed by Python's TTS-to-file method), I assume you will accept that I've looked into it and there is no difference.

If, on the other hand, I find something, it may or may not be out of my hands to do something about. I'll obviously feed back on this. If there is something I could incorporate to filter, I will, as long as it's not a 10+ days of coding type scenario. But if it's out of my hands (something in text-generation-webui or something specific to Python) and something I can do nothing about, I hope you'll accept that outcome too.

If you have any other thoughts/suggestions/disagreements on my approach, let's iron them out now before I start.

Thanks

KungFuFurniture commented 10 months ago

Alright @erew123,

Let's dive in. First of all, the last thing I want to do is monopolize your time/machine. This is just fun for me, so that being said, thank you for your time and interest, and thank you for your really cool extension. Your testing params seem super thorough, maybe even a little excessive. I get this to happen on every generation without fail, so 100x may be a little excessive of your time. If you don't get the same results in 5-10 generations, I would say we are having different experiences. The same would apply to any adjustments made.

As for model and model templates, first ooba... It does not SEEM to matter which model is used, as again, without fail, I get the same results every time. Also, chat, chat-instruct, and instruct modes make no difference. In chat or chat-instruct mode, the same results can be had by starting a new chat, which will spit out the character's greeting message, with or without an LLM model loaded. As a matter of fact, the screenshots I provided were from this method (though I did lots of testing, as requested, before taking those), so that I could ensure the bot was giving the exact same response each time and the pics would represent the same text converted to speech.

All that to say: the model I usually use is mythalion-13b.Q5_K_M.gguf, which uses the "Metharme" instruction template and comes from everybody's good friend Mr. TheBloke over on Hugging Face. It is a 13B, but a 7B will give the same result faster. I have tried it with Mistral 7B, but as I am sure you know, that model needs some serious context before you start getting reasonable responses; I would suggest almost any other 7B for testing purposes. I always load with either llama.cpp or the HF version. I do from time to time add some instruction variables in the chat area; I will include those exactly in the readme that I send with the script. (Results of TTS do not vary based on any changes to these, at least for me.)

As for SD, for my script you will get the best results from a photorealistic model. I specifically use epicrealism_pureEvolutionV5.safetensors [76be5be1b2], which is a 1.5 model; I got it on Civitai. I typically set 768x512 with 25 steps, CFG 5, no upscale (well, not that way), no face restore (again, not that way), sampler: DPM++ SDE Karras (though other samplers give fine results as well). Most of that is set as the default in my script.
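For reference, those settings map onto a txt2img request roughly like this. This is a minimal sketch against the standard AUTOMATIC1111 /sdapi/v1/txt2img endpoint (the same API the SD picture extensions talk to); the prompt, host, and output filename are placeholders, not values from the actual script:

```python
import base64
import requests

# Illustrative payload mirroring the settings above: 768x512, 25 steps,
# CFG 5, DPM++ SDE Karras. Prompt and URL are placeholders.
payload = {
    "prompt": "a photorealistic portrait",
    "width": 768,
    "height": 512,
    "steps": 25,
    "cfg_scale": 5,
    "sampler_name": "DPM++ SDE Karras",
}

response = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload)
response.raise_for_status()

# The API returns images as base64 strings, the same encoding that can
# end up embedded in the chat HTML and read aloud by the TTS engine.
image_b64 = response.json()["images"][0]
with open("output.png", "wb") as f:
    f.write(base64.b64decode(image_b64))
```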

As for the TTS model loaded: XTTS v2. Oobabooga will not acknowledge the existence of my GPU, so anything from TextGen runs on my CPU, though SD runs on the GPU, mostly.

My machine is: Ryzen 9 3900X (un-tweaked), AMD Strix RX 5700 XT GPU (also un-tweaked), 32GB RAM, Windows 10, lots of coffee and foul language.

Umm, I think that is everything. I am going to try and attach everything you need to this message. I should be available anytime; if you would like to arrange more direct correspondence on Discord or whatever, shoot me a DM and we can figure that out. Pic Script.zip

erew123 commented 10 months ago

I haven't forgotten this; I'm just stamping out a few small fires first and will be looking at this when I manage to calm the fire :)

erew123 commented 10 months ago

Hi Hi!

Well, I freed up my time and I spent my morning investigating. I don't know if I should laugh, cry, or what. Let me tell you what I did, and then let's get to the results. Everything was set up as agreed; it took maybe 30 minutes...

The first thing I got... was this (with the Coqui TTS engine loaded):

[screenshots]

Disabling (unchecking Activate in the interface) the Coqui TTS engine...

[screenshot]

Very long story short: I ran these tests for a while trying to understand what the difference is there. I even defaulted back to the standard SD extension.

I tried with AllTalk after... could I get it to "speak" all the base64 encoding stuff? 99% of the time, no...

[screenshot]

So I guess I ran tests for maybe 20-30 minutes and had a look through various bits of code, trying to figure out why there was a difference between image and no-image (code) depending on enabled/disabled status, and then why there was an intermittency (very, very rarely, etc.).

So here are my conclusions on the matter:

1) All extensions that manipulate images/text or anything else within Text-gen use output_modifier as their call routine. Depending on the load order within Text-gen, this can affect the final output, which I think is partly why some things came back as base64 code OR an image, depending on the loading order of the extensions and what was enabled/disabled by its checkbox (Activate TTS, for example). This, I believe, also impacts what is passed to the TTS engine: if your SD extension is loaded BEFORE your TTS engine extension, then what's generated by SD is passed on to the TTS engine extension; but if your TTS engine extension is loaded BEFORE your SD extension, then what's generated by SD is never seen by the TTS engine extension. That could account for some of the perceived differences (see the sketch after this list).

2) Potentially only one of the two returned output_modifiers goes through text filtering/cleaning. This, again, mixed with load orders, could confuse things.

3) In the SD script's output_modifier, it may be possible to push some extra filtering/cleaning in before it provides the image return, so that it keeps text further separated (the alt section you see in the images), though I've not looked into that.
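To make the load-order point concrete, here is a simplified sketch of the mechanism. This is an illustration only, not text-generation-webui's actual code; the two functions below are stand-ins for the hooks each extension registers:

```python
# Simplified illustration of output_modifier chaining: each extension
# receives whatever the previous one returned, so load order matters.
def sd_output_modifier(string):
    # The SD extension appends an <img> tag (base64 or file path).
    return string + '\n<img src="data:image/png;base64,iVBORw0KGgo...">'

def tts_output_modifier(string):
    # The TTS extension generates speech from the whole string,
    # including any HTML that earlier extensions injected.
    print(f"TTS will speak: {string!r}")
    return string

reply = "Here is your picture!"

# SD loaded first: the TTS extension sees the image tag.
tts_output_modifier(sd_output_modifier(reply))

# TTS loaded first: the image tag is added after TTS has already run,
# so the TTS extension never sees it.
sd_output_modifier(tts_output_modifier(reply))
```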

So your question at this point is probably: where is the issue, who/what is at fault, and how do we fix it...

Well, ultimately everything should be handled within each extension, and Text-gen, being the central point, should make sure that nothing is passed between extensions that shouldn't get passed (and it would have to account for load orders)... and I'm sure as hell not going to deep-dive into that.

BUT... there is something I can do... not that I should have to, but I can...

On my test2 branch... download script.py and tts_server.py (you will need both):

https://github.com/erew123/alltalk_tts/blob/test2/script.py
https://github.com/erew123/alltalk_tts/blob/test2/tts_server.py

(Click the download raw file button, top right)

Save these over your existing files in the alltalk_tts folder and give them a go.

In short, I've put an extra filter in to strip any JPG/PNG base64 stuff that reaches AT from TG. It should stop anything other than text getting through. Other extensions will have their own issues with what gets sent to them, but this should keep AT clean as far as TTS goes.
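For context, the filter amounts to a regex substitution along these lines. The pattern is the one quoted later in this thread; the function wrapper is a minimal sketch, not the actual script.py code:

```python
import re

# Matches <img> tags whose src is inline base64 JPEG/PNG data,
# so they can be stripped out before text reaches the TTS engine.
img_pattern = r'<img src="data:image\/(jpeg|png);base64,[^"]*" *>'

def strip_base64_images(string):
    # Remove embedded base64 images so only speakable text remains.
    return re.sub(img_pattern, '', string)
```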

Let me know how it goes, as I've not pushed this into the main AT yet and I'll need to do a bit more testing for myself with it.

Thanks

KungFuFurniture commented 10 months ago

First and foremost, let me say thank you for your time and insight. I mentioned this is a hobby for me, which makes your free time that much more valuable. Kudos on standing by the integrity of your work (which was never really in question, FYI).

If I understand correctly, it seems as though it is an "order of operations" deal within TG, based on the load order of the extensions, which makes sense to me.

It also sounds like I need to try and update my output modifier to handle itself better, in terms of any random extension that may come our way. I'll work on that for sure; any pointers are most certainly welcome. I have actually been trying to figure out how to modify the visible history (which shows all that code hoopla and makes the chat log get out of hand quickly), which I think could help... I'll play around, throw some S#!& at the wall and see what sticks.

I will test your adjustments right now and let you know the results as soon as I have some. I will also make sure to keep the originals of AT and see if I can find a solve on just my end, for example the history adjustments above. (Doubt that's truly it, though, or that I'll even figure out how.) I'll also look into that "alt" section. It's supposed to return the alt text instead of the garbage if for some reason it cannot display the image (that's what GPT told me). It clearly does not work as intended (at least as I intend, but that bit is not my code, it is original to the SD script), as it displays everything.

I do have to ask, though... Did you get an appropriate prompt/image from my script/character? It works pretty well for me, but that doesn't really mean anything if it's not repeatable. And did you have at least a little fun?

Again, thank you for your time, energy, and bitchin' extension. Keep up the good work!! I'll holler again once I have tested your adjustments.

--enable_Cheers

KungFuFurniture commented 10 months ago

So I have messed with it a bit today. I have not gotten different results with your changes (no explicit errors either), though I can see (at least some) difference in the code, so I do have the updated one. I'll keep horsing around and see if I can come to any insight. I will say, in the meanwhile, if you keep messing with it, you gotta load the image script first for sure to get similar results. If you use the native SD script, it's gonna be difficult to understand my response examples, cause I haven't had the native one since... though it should still produce the same audio result. Its output modifier is pretty different from mine, but it does still produce the image the same way; see lines 452-477 of my script to see how that happens. It's the same on the native one, just on different lines. (I didn't change any of that, but I would be happy to, if ya think it will help.)

My best insight for the day is that I don't think either of us wants a script that only works if you use the other. (It now sounds like I am claiming the native SD script, or like anybody but you and me has mine; not the case.) I am more than happy to help, test, plot, drive, dig, mix the concrete, or whatever needs to be done, to help make them both work without limitations.

erew123 commented 10 months ago

Well, let's try something else. I can't 100% replicate what's on your system... but I can get you to give me 1x copy of an example. Here's how: download the script.py again and give it another run. For this one 1x test, I would suggest you set your image size in SD to 512x512, as there will be a lot less data... and umm... make sure it's a clean photo.

https://github.com/erew123/alltalk_tts/blob/test2/script.py

[screenshot of command prompt output]

The script will dump to the command prompt (as above) what it has been sent and what it sees after trying to clean out JPG & PNG data. I suspect there is a cutoff in the data of some variety, hence maybe the filter won't clean the image, as it doesn't see the end encapsulation of the image. (This is what I was on about with the alt bit that got returned.) To put it simply, you can have other bits in there, but if you say "I want to filter anything between A and B", e.g.

A=============================B

then it can do that... but if you cut off the B, you will always get back

A=============================

Because it couldn't find a B to filter between.

So even though I can say "strip JPG or PNG", if they aren't finished off correctly, then I'll never capture it.
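A quick illustration of that failure mode, using the base64 pattern quoted later in this thread on a made-up string:

```python
import re

# The filter needs both the opening tag and the closing '" >' to match.
pattern = r'<img src="data:image\/(jpeg|png);base64,[^"]*" *>'

complete  = 'Hello <img src="data:image/png;base64,iVBORw0KGgo=" > world'
truncated = 'Hello <img src="data:image/png;base64,iVBORw0KGgo'  # end cut off

print(re.sub(pattern, '', complete))   # 'Hello  world': image stripped
print(re.sub(pattern, '', truncated))  # unchanged: no closing '" >' to match
```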

The idea is that if you send me a copy of the 1x thing that gets dumped at your command prompt (2 at most), then I can try to analyse it and see if there is a way to filter specifically for that scenario OR help with your code.

Also on my list of thoughts was literally changing the load order in the settings.yaml file or the CMD_Flags.txt and seeing if that will change the order that the output_modifier is being called in...

(Apologies for my slow replies. I've been a little up against it over the last few days.)

KungFuFurniture commented 10 months ago

To manipulate the load order of extensions, what I do is:

On the "Session" tab I make sure everything I want is ticked or checked. Click "Apply and Restart" once that's done I click "Save UI defaults..." That opens a window that is editable. In that window you will see your extensions, simply put them in the order you want. Click save. Shutdown the terminal and restart. It should load the extensions as you ordered them, forever or until you change it.

OK, so I downloaded the fresh script. I have included a shot of the terminal. Here's the thing: this shot is with the "save image and use in chat" option selected. You can see in the terminal screenshot below that, instead of all the image code, it displays the file location (WAY LESS DATA). It also reads it out just like all that code; it just doesn't take 20 mins. So I hear loud and clear what you're saying about A====B; makes perfect logical sense... but even in this shorter version, it still reads it.

[terminal screenshot]

After this I went and took off the "alt" section in my script. No change.

Then I got nosey. Hope you don't mind. I looked into your updated script and went to the obvious line dealing with images, line #571:

`img_pattern = r'<img src="data:image\/(jpeg|png);base64,[^"]*" *>'`

I updated it to this:

`img_pattern = r'<img[^>]*src="data:image\/(jpeg|png);base64,[^"]*"[^>]*>'`

This is just a bit less specific about the pattern of the image data coming in. It stopped the image data being read, but it also did not display the image, and it only works for the base64 case, not the file destination when "save image" is clicked. But it's a step.

So I get to thinking: if it's gonna hijack the image, where does it go? (Obviously nowhere.) So why don't we just take it out and put it back after the TTS? Then we can be pretty vague about the img_pattern we hijack, and just grab it if it's an image, cause we are gonna put it back anyway.

This is what I ended up with, which, with very little testing, seems to work like a charm.

First, line 571:

```python
img_pattern = r'<img[^>]*src\s*=\s*["\'][^"\'>]+["\'][^>]*>'
audio_pattern = r'<audio[^>]*>(.*?)<\/audio>'

# Extract information with img_pattern
img_matches = re.findall(img_pattern, string)
img_info = "\n".join(img_matches)
string = re.sub(img_pattern, '', string)  # unchanged from here, just left in for reference
string = re.sub(audio_pattern, '', string)
original_string = string
```

Then, on what is now Line 673

```python
if params["show_text"]:
    if img_info:  # put back img_info if not empty
        string += f"\n\n{img_info}"
    string += f"\n\n{original_string}"
shared.processing_message = "*Is typing...*"
```

And TADA!!! It seems to get the job done. Obviously you have to have the "Show text" checkbox active in AT, but it works. If you want to stretch it beyond the show-text option, or give images a checkbox or something, that'd be cool, but I just wanted to show you what worked for me today. I'll keep testing and let you know any updates.

Sorry for the long form... this message has basically been open all day for journaling. LOL.

erew123 commented 10 months ago

Heya! In principle, I've no issues stripping things out, holding them in a wait area while TTS is generated, then shoving them back in at the end.

Can I ask, just for clarity: your screenshot above showed that there was no base64 image data (all the dn8o3dl8hyd3n984y398d53495y type stuff, etc.)... so that was before you made any changes? So it HAD stripped out the base64 image data, but not the reference to the actual file name and it being an image source (hence the <img src=/file... etc.)?

1) Is that correct, that what I had done stripped the data, but not the reference to the file existing?

2) And then you added extra modifications to further strip that out and re-add it afterwards?

I'm not saying it's a bad thing to do, but I have to be 100% certain of any change I push out en masse (though I guess I could always flip it as an extra selectable parameter if I wanted to leave the options open). I just have to account for lots of other possibilities and use cases.

KungFuFurniture commented 10 months ago

All righty let's break this bad boy down.

First, to address the terminal image in my last message. With the image extension/script, mine or the native one, images are displayed in one of two ways. There is an option checkbox in the UI to "save the image and use in chat", or not. When this checkbox is active, instead of base64 you get the physical file location on your hard drive where the image was saved (extensions\sd_api_pictures\output), which is what you see in the terminal image above. When this is unchecked, you get the base64 and the image is never saved to your machine. In the readme I sent with my script, I suggested you use the checkbox so that your testing didn't try to read 18 MINUTES of gibberish base64; it's more like 38 extra SECONDS. So to answer your first question: yes, that image was from before I made changes, but no, your code did not strip the base64 off of that terminal image; it ignored it altogether, as it did not fit the base64 pattern your code was looking for. (This added confusion, as we were testing with different image return methods.) That being said, it also did not grab the base64 (at least from my script), as yours was slightly more specific about the pattern it was searching for.
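For clarity, the two shapes of <img> tag being discussed look roughly like this (the path and data below are placeholders, not output from either script):

```python
# "Save the image and use in chat" checked: the tag points at a file on disk.
file_tag = '<img src="file/extensions/sd_api_pictures/outputs/1700000000.png">'

# Unchecked: the whole image is embedded inline as base64, potentially
# hundreds of kilobytes of "gibberish" for the TTS to read aloud.
base64_tag = '<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUg...">'
```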

This brings us to the first code example I sent. Let's break it down a little more...

Your code: `img_pattern = r'<img src="data:image\/(jpeg|png);base64,[^"]*" *>'`

My first attempt: `img_pattern = r'<img[^>]*src="data:image\/(jpeg|png);base64,[^"]*"[^>]*>'`

As you can see, they are both very similar; mine is just a little more lenient about the pattern, so if there is an extra space or comma, etc., it'll still catch it. This worked to strip the string for the base64 method of image retrieval, but not for the file location, even though they both start off with the HTML image tag. Again, it was a step in the right direction, but the big issue is that it kept the image and never gave it back.

That got me thinking about a couple of things. One of those is that HTML is pretty flexible, and things may use an HTML image tag that are not actually images; for example, fonts or word-art styles can come through that way. Either way, AT is reading the HTML, and I cannot imagine a scenario where anyone wants to hear the extra nonsense that comes with it, image or not. (I assume this is also why you added an audio filter.) The other side is that it was not removing the file location instance, even though it had the HTML tag. So I updated that same line of code to this:

Current code: `img_pattern = r'<img[^>]*src\s*=\s*["\'][^"\'>]+["\'][^>]*>'`

This code will snatch any HTML tag that is an image with a source, which means it will pretty much only grab actual images, versus something like a font that uses an image tag. But it'll grab it no matter the source: base64, physical file location, URL, etc. And it won't be limited to JPEGs and PNGs. This should provide a bit of a "raincoat" for images that come to AT in a way we are not privy to at this moment; kind of a way to be open to other scenarios from other extensions. So this successfully stops the image info being read. But why use the image extension at all if it is just going to steal the picture? So now we need to put the image back.
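A small demonstration of the difference, using the generalized pattern above on some made-up tags:

```python
import re

# Generalized pattern from above: any <img> tag with a src attribute,
# whether the source is base64 data, a file on disk, or a URL.
img_pattern = r'<img[^>]*src\s*=\s*["\'][^"\'>]+["\'][^>]*>'

samples = [
    '<img src="data:image/png;base64,iVBORw0KGgo=">',               # inline base64
    '<img src="file/extensions/sd_api_pictures/outputs/pic.png">',  # saved file
    '<img src="https://example.com/pic.jpg" alt="a picture">',      # remote URL
]

for tag in samples:
    assert re.findall(img_pattern, tag)  # all three source styles are caught

# An image tag with no src attribute (the font/word-art case) is left alone:
assert not re.findall(img_pattern, '<img class="word-art">')
```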

Again, the A.D.D. kicks in in my brain and says: what if there is more than one image coming to AT? So, this bit of code:

Find all images with a source: `img_matches = re.findall(img_pattern, string)`
This will search the bot response (the string) for all instances of the image pattern described above. Now we need to put those somewhere...

Let's make a list: `img_info = "\n".join(img_matches)`
This takes all the found instances of the pattern and joins them into one string, each on its own line (a list of sorts). If we were to put a print statement here, the terminal would say something like:

Image1.jpg
image2.png

This way, everything the pattern snatches is put back together, but as its own separate entity. My script only gives you one image at a time, but what if someone were calling for a batch of images? Problem solved: they are now collected, in the order they were received.

Lovely. So at this point we have the image/s out of the string and organized for safekeeping. But remember, we are only renting here, so it's time to return them.

So now we drop down a few lines in the AT script and put them back. I chose to do this only in the part that handles delivery when the "Show text" checkbox is active in the AT UI, for simple testing reasons. I would suggest applying it to more delivery methods, because if you don't want an image, you can turn the image extension off. But I am also not trying to cannibalize your script, so if you would like to add the option elsewhere, or better yet, as you also mentioned, add a checkbox to display images or not, I think it would be a worthwhile endeavor. (I am happy to help with that.)

Right so here is how we put it back:

```python
if params["show_text"]:  # this line is for locational reference in the script
    string += f"\n\n{img_info}"         # this returns the so-called stolen images
    string += f"\n\n{original_string}"  # your original line of code, showing the text under the image / TTS audio
shared.processing_message = "*Is typing...*"  # again unchanged, just a marker for the end of my changes
```

With this method, it shows the image/s and, I think, in theory, would put things like the font example back into the text, should they be snatched inadvertently by the img_pattern. (I don't really know how to test that part.)

So I believe that this method should "future-proof" AT against other image extensions while maintaining its quality. Any other HTML filters you have should not be affected by this, or at worst will work in tandem with it. But if it were me, I would totally add a checkbox for this parameter, so that if unpredictable events cause trouble (as if anyone could plan for all outcomes), due to the flexibility of HTML, the user can simply check or uncheck the box to get the removed image "raw" or not.

I played with the bot all night, under different scenarios, and not one flaw I could see came up. AT performed perfectly, swiftly, narrator, character, all of it. The images came through, the text came through, the audio, all of it. LIFE IS GOOD! And it was pretty fast. I am soooooooo freaking STOKED!!

I am sure all of that is as clear as mud. Please don't hesitate to reach out, or, if you would like to schedule a Discord or something so as to be more articulate / comprehensive, I am down. Again whatever I can do to help.

BIG NOTE: I truly respect your request for some clarity before dumping potential digital covid out to the masses. I want to say that I am just learning Python, and I am teaching myself. What you read above is MY understanding of what is happening; that doesn't mean it is 100% accurate. I asked my Code Llama and (free) GPT if I explained it right (maybe a lil AI plagiarism in the explanation), and they both agreed.

Also, this is my first run on the Git forum. As far as I can tell (unless you feel otherwise), this is a problem solved. Feel free to close the issue if you feel it is done. If not, that's totally cool; like I said, I am in till we win, so lemme know what else we need to do. Or if I need to close it because I started it, just LMK and I will do exactly that.

I would also be curious if you have any opinions on the script I sent you. You are the only one besides me who has ever seen it, first attempts being what they are... Second opinions are valuable, but you have no obligations to that end.

Cheers, and I hope you have had as much fun with this as I have, and will continue to have.

--enable extension coffee-break

erew123 commented 10 months ago

Cheers for the breakdown! I think that gives me plenty to go on. I'm back to fighting fires again, hence my very late reply... groan, they have updated transformers and it breaks things :/ Trying to figure out if it's a bug or a breaking change.

So I will look to fully test and incorporate something like this into the main script, either as a selectable option or just as the standard way of doing it.

I'm going to close this for now, but I've added it to my todo list here: https://github.com/erew123/alltalk_tts/discussions/74

Glad we've got a solution at least :) And thanks for your time/patience with this!

Re your script: functionality seems good :) You may be able to do a bit to tidy the Gradio a little, and I remember you saying "Sometimes, you have to set it to a different mode like manual, then back again for that to kick in. I dunno why." I've not looked too much, and sorry, busy busy atm (I've had 4x emails come in as I'm typing this).

I'm wondering if this could be a Gradio 3.52 vs 4 issue of some kind... I'd be tempted to try hard-coding it initially and see if that resolves it (it should still update to new settings):

```python
mode = gr.Dropdown(modes_list, value=modes_list[2], label="Mode of operation")
```

See if that clears up the initial issue. If it does, then it's something to do with how Gradio is applying the setting. As your settings are coded at the top of the script, it probably wouldn't be an issue for now, but you may have to add some if-statement handling on the Gradio side to deal with changes if you later move things out to a settings file so this can be saved/changed on startup.

If I get a chance, I'll try to have a proper think about this.