Add ability to request specific audio types

erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, narrator, model finetuning, custom models, and wav file maintenance. It can also be used with third-party software via JSON calls.

GNU Affero General Public License v3.0

Add ability to request specific audio types #308

Closed: C0rn3j closed this issue 3 months ago

C0rn3j commented 3 months ago

I would like to get ogg output. The Web UI lets you enter a filename such as something.ogg and it generates fine (although it is not obvious or documented that you can do this), but the API only accepts an extensionless filename.

Benefits are huge - the difference is a 1MB wav vs 90KB ogg, for example.

erew123 commented 3 months ago

Hi @C0rn3j

I see you have made a couple of posts, but I'm not sure I will get to them all today.

Re ogg etc, have you checked out the transcoding on AllTalk v2? https://github.com/erew123/alltalk_tts/tree/alltalkbeta

I am assuming you are talking about v1 of AllTalk?

C0rn3j commented 3 months ago

I have not checked out v2 at all, though it seems like v1 is able to generate ogg directly - at least my own transcodes with ffmpeg resulted in hilariously bigger filesizes than when telling alltalk to just save in ogg.

erew123 commented 3 months ago

Hi @C0rn3j. The actual XTTS model output tensors are in a raw wav format, so although you can name the file something else, you are still getting a wav file, and a second step is required to transcode it. V2 addresses pretty much all the issues here, along with many others, and is a decent jump over v1. I'd advise checking out V2.
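One way to check whether a file named `.ogg` really is Ogg, or just a renamed wav, is to look at its first few bytes. The following is a minimal sketch, not part of AllTalk: the function names are hypothetical, and it relies only on the well-known container signatures (`RIFF`....`WAVE` for wav, `OggS` for Ogg).

```python
def sniff_audio_container(data: bytes) -> str:
    """Guess the audio container from the leading bytes of a file.

    Hypothetical helper for illustration; it only distinguishes wav and Ogg.
    """
    # A wav file is a RIFF container: "RIFF", a 4-byte size, then "WAVE".
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    # Every Ogg page starts with the capture pattern "OggS".
    if data[:4] == b"OggS":
        return "ogg"
    return "unknown"


def sniff_audio_file(path: str) -> str:
    """Read the first 12 bytes of a file on disk and classify them."""
    with open(path, "rb") as f:
        return sniff_audio_container(f.read(12))
```

Calling `sniff_audio_file("something.ogg")` on a generated file would then show whether the extension matches the actual contents.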

C0rn3j commented 3 months ago

So the V1 main `/` page just transcodes the wav before saving it, and the gen page and API lack that feature?

Saving as .wav vs .ogg:

EDIT: Yes -> https://github.com/erew123/alltalk_tts/blob/510bc2d1a3aa008d776172486666be5c4a38bcc9/tts_server.py#L566
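For callers of the API who only get a wav back, the same transcode-after-generation step can be done on the client side. The sketch below is an assumption-laden illustration and not the code at the linked line: the function names are hypothetical, and it assumes an `ffmpeg` binary built with `libvorbis` is on the `PATH`.

```python
import subprocess
from pathlib import Path


def build_ogg_transcode_cmd(wav_path: str, quality: int = 4) -> list[str]:
    """Build an ffmpeg command that transcodes a wav file to Ogg Vorbis.

    Hypothetical helper; the output path is the input path with a .ogg suffix.
    """
    ogg_path = str(Path(wav_path).with_suffix(".ogg"))
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", wav_path,       # input wav produced by the TTS engine
        "-c:a", "libvorbis",  # encode audio as Vorbis
        "-q:a", str(quality), # variable-bitrate quality; lower means smaller files
        ogg_path,
    ]


def transcode_to_ogg(wav_path: str, quality: int = 4) -> str:
    """Run the transcode and return the path of the resulting .ogg file."""
    cmd = build_ogg_transcode_cmd(wav_path, quality)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd[-1]
```

The resulting file size depends heavily on the codec and quality setting, which may explain why a manual ffmpeg transcode can come out much larger than the file AllTalk saves.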