Multiple reference files per speaker

erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, narrator, model finetuning, custom models, and wav file maintenance. It can also be used with 3rd party software via JSON calls.

GNU Affero General Public License v3.0


Multiple reference files per speaker #179

Closed Xyem closed 7 months ago

Xyem commented 7 months ago

Is there any plan to add the ability to provide multiple files per speaker/voice? When this was added to XTTS, it seemed to improve the quality of the voice (I presume because unwanted or intermittent aspects got averaged out of the result).

erew123 commented 7 months ago

Hi @Xyem

I'm happy to add it at some point, though I will have to draw some limits in places, e.g. say 3x files through the API calls, or perhaps I could look at a JSON-based grouping system for groups of WAVs that make up 1x voice.

What area of AllTalk are you looking at when you say multiple files, e.g.:

The API
Text-gen-webui
SillyTavern
etc...

Thanks

erew123 commented 7 months ago

I will just add, I'm not sure whether their calls through anything other than the TTS command line allow for multiple reference files.

https://docs.coqui.ai/en/latest/models/xtts.html#id5

It will be something I'd have to test.

Xyem commented 7 months ago

This is the PR that added it to xtts-api-server: https://github.com/daswer123/xtts-api-server/pull/24/files

I'm not that familiar with the code, but it does look like it calls the command line for the multiple files.

erew123 commented 7 months ago

Hi @Xyem

Thanks for the reference. However, it's more than just an individual bit of code to update, as other areas will also need updating (I believe), and I will need to test how the external Coqui scripts are/aren't handling it.

I think what I would look to do is update the AllTalk API to accept either a wav file OR a JSON file that is a collection of wav files. That way people would be able to put together a list of the files they want.
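To make the "wav file OR JSON file" idea concrete, here is a rough sketch of how a voice name could be resolved into a list of reference files. Everything in it is an assumption for illustration only: the helper name `resolve_reference_files`, the `voices_dir` argument, and the `{"files": [...]}` JSON layout are hypothetical and not part of AllTalk.

```python
import json
from pathlib import Path


def resolve_reference_files(voice: str, voices_dir: Path) -> list[Path]:
    """Return the reference WAV paths for a voice name (hypothetical helper)."""
    path = voices_dir / voice
    suffix = path.suffix.lower()
    if suffix == ".wav":
        # A plain WAV name is treated as a single-file voice.
        return [path]
    if suffix == ".json":
        # Assumed schema: {"files": ["a.wav", "b.wav"]}, relative to voices_dir.
        data = json.loads(path.read_text(encoding="utf-8"))
        return [voices_dir / name for name in data["files"]]
    raise ValueError(f"Unsupported voice reference: {voice}")
```

With something like this, a caller would keep sending one voice name, and a name ending in `.json` would simply expand to several WAV files before they are handed to the TTS engine.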

I'm currently deep in other code, so I will add this to the feature request list, and hopefully it will be something I can add into the next large release of AllTalk.

Thanks

Xyem commented 7 months ago

Sorry, I'm not sure I'm following what you are suggesting.

In xtts-api-server, you create a directory in the voices/speakers directory and put wav files in this speaker directory. xtts-api-server then lists the directory name as a speaker/voice when queried, and when it gets a request to generate using that speaker/voice, it passes all the files in that directory to the Coqui backend.
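As a minimal sketch of that directory-per-speaker behaviour (this is not the actual xtts-api-server code; the function names `list_speakers` and `speaker_reference_files` are made up for illustration):

```python
from pathlib import Path


def list_speakers(speakers_dir: Path) -> list[str]:
    """Offer every sub-directory name as a speaker/voice."""
    return sorted(p.name for p in speakers_dir.iterdir() if p.is_dir())


def speaker_reference_files(speakers_dir: Path, speaker: str) -> list[Path]:
    """Collect all WAV files in the speaker's directory, in a stable order."""
    return sorted((speakers_dir / speaker).glob("*.wav"))
```

The list returned by `speaker_reference_files` is what would be passed on to the backend as the reference audio, so the caller only ever needs to know the speaker name.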

Your suggestion sounds like you are offloading all that work to the caller of the AllTalk API, which sounds very awkward and unnecessary.

erew123 commented 7 months ago

Hi @Xyem

I can test that as a method, though I'll have to think it through. There are quite a few code changes I am looking at currently, one of which is introducing other TTS engines.

If I do go down the route of using JSON files, then I would make something in the interface that would build the JSON for you.

As I say, I will think it all through.

Thanks