SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License
5.09k stars, 498 forks

Req: a format for narration #185

Open GamingDaveUk opened 1 day ago

GamingDaveUk commented 1 day ago

Can we get a template that would allow for a narrator voice and a person speech voice...

For example:

Sergeant Maria Vasquez enters the room, saluting sharply before taking a seat across from Inquisitor David. She takes a deep breath, composing herself before beginning her account.
"It began normally enough," she starts, her voice steady despite the memories surfacing. "We'd completed our patrol along the border sectors, nothing unusual reported. The Ignavus jumped into warp, the stars stretching into lines of light outside the portholes. For hours, everything was routine - meals, check-ins, maintenance drills. Then suddenly… it wasn't.
The ship lurched violently, throwing us off balance. Alarms blared, red lights flashing everywhere. Our comms buzzed with panicked voices reporting system failures throughout the ship. Warp drive cut out unexpectedly, stranding us god knows where. Emergency lighting kicked in, casting eerie shadows on the bulkheads.
Gravity fluctuated wildly; one moment we were weightless, the next crushed beneath its force. I tried to reach the bridge, but corridors were blocked by debris or collapsed sections. I turned around, heading towards the armory instead. If we were under attack, I wanted to be ready.
But we weren't under attack, not directly anyway. Reactor core was overheating, pushing dangerously close to meltdown. Orders came flooding in via emergency broadcast - abandon ship! Everywhere, panic spread like wildfire. Marines, crewmen, everyone rushed towards escape pods. I helped evacuate injured personnel while others secured key areas before fleeing themselves.
My squad and I managed to launch three pods full of survivors before ours launched automatically due to lack of manual control. As we drifted away, I watched helplessly as the Ignavus tore itself apart. Explosions rocked the dying vessel, pieces flying off into space. Fire raged within visible sections, consuming anything left onboard…
All because some damn thing decided to tamper with our warp drive. Left us stranded, vulnerable. Lost good men and women because of it. Damned cowards, attacking anonymously…" She pauses, swallowing hard, eyes glistening slightly before continuing. "…and now here I am, alive, while most aren't."
Sergeant Vasquez looks down briefly, collecting herself before meeting David's gaze again, awaiting further questioning.

I would love to be able to select a voice for the narration parts and another voice for the speech parts.

As a bonus, it would be cool if we could choose a voice for each speaker and trigger them with cues, like one of the other templates, while keeping a separate voice for the narration, i.e.

Voices:
David = male voice 1
Maria = female voice 1
Narrator = Narrator voice

(David)
I greet the first survivor, a marine.
"Tell me in your own words, what happened, start from when the Ignavus entered warp, do not rush we have plenty of time."

(maria)
Sergeant Maria Vasquez enters the room, saluting sharply before taking a seat across from Inquisitor David. She takes a deep breath, composing herself before beginning her account.
"It began normally enough," she starts, her voice steady despite the memories surfacing. "We'd completed our patrol along the border sectors, nothing unusual reported. The Ignavus jumped into warp, the stars stretching into lines of light outside the portholes. For hours, everything was routine - meals, check-ins, maintenance drills. Then suddenly… it wasn't.
The ship lurched violently, throwing us off balance. Alarms blared, red lights flashing everywhere. Our comms buzzed with panicked voices reporting system failures throughout the ship. Warp drive cut out unexpectedly, stranding us god knows where. Emergency lighting kicked in, casting eerie shadows on the bulkheads.
Gravity fluctuated wildly; one moment we were weightless, the next crushed beneath its force. I tried to reach the bridge, but corridors were blocked by debris or collapsed sections. I turned around, heading towards the armory instead. If we were under attack, I wanted to be ready.
But we weren't under attack, not directly anyway. Reactor core was overheating, pushing dangerously close to meltdown. Orders came flooding in via emergency broadcast - abandon ship! Everywhere, panic spread like wildfire. Marines, crewmen, everyone rushed towards escape pods. I helped evacuate injured personnel while others secured key areas before fleeing themselves.
My squad and I managed to launch three pods full of survivors before ours launched automatically due to lack of manual control. As we drifted away, I watched helplessly as the Ignavus tore itself apart. Explosions rocked the dying vessel, pieces flying off into space. Fire raged within visible sections, consuming anything left onboard…
All because some damn thing decided to tamper with our warp drive. Left us stranded, vulnerable. Lost good men and women because of it. Damned cowards, attacking anonymously…" She pauses, swallowing hard, eyes glistening slightly before continuing. "…and now here I am, alive, while most aren't."
Sergeant Vasquez looks down briefly, collecting herself before meeting David's gaze again, awaiting further questioning.

That could lend itself to a lot of versatility, especially if you have a story with multiple characters, and allow for inline changing of voice, e.g. (David) "Why did you abandon your post?". Maria flashes a brief look of anger but manages to control herself (Maria) "I would never abandon my post!" (not the greatest of examples, granted, but you get my point lol)

Is this even possible? At the minute I use xtts and have to generate line by line for this sort of content. When faced with speech marks, xtts does change the tone of the voice a bit to make it clear the narrator is quoting someone, though that is not as good as having a whole new voice.

SWivid commented 1 day ago

If I got you right, the Multiple Speech-Type Generation in space demo / gradio_app.py is what you need. And also the cli version, python inference-cli.py -c samples/story.toml ~

GamingDaveUk commented 1 day ago

So (Regular) for the narrator, then instead of emotions use (personname) before every single speech mark... that could work. I will give it a go when I get some time.
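As a rough illustration of that tagging scheme, here is a minimal sketch of splitting tagged text into (voice, chunk) pairs that could then be synthesized one at a time. The `(Name)` tag syntax is taken from the examples in this thread; `split_by_voice` and the `Regular` default are hypothetical names for illustration, not the parser used in gradio_app.py.

```python
import re

# Matches a voice tag such as "(Regular)" or "(Maria)".
# Assumption: tags are a single word in parentheses, as in this thread.
TAG = re.compile(r"\((\w+)\)")


def split_by_voice(text, default_voice="Regular"):
    """Return a list of (voice, chunk) pairs in reading order."""
    segments = []
    voice = default_voice
    pos = 0
    for match in TAG.finditer(text):
        # Text before this tag belongs to the voice that was active so far.
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append((voice, chunk))
        voice = match.group(1)
        pos = match.end()
    # Whatever follows the last tag belongs to the last voice seen.
    tail = text[pos:].strip()
    if tail:
        segments.append((voice, tail))
    return segments


sample = (
    '(David)"Why did you abandon your post?"'
    "(Regular) Maria flashes a brief look of anger. "
    '(Maria)"I would never abandon my post!"'
)
for voice, chunk in split_by_voice(sample):
    print(f"{voice}: {chunk}")
```

One caveat with this sketch: any single word in parentheses in the story itself would be read as a voice tag, so a real implementation would probably check tags against the list of configured speakers.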

GamingDaveUk commented 1 day ago

I have it installed on my laptop, though I don't have very many voices on here (and the laptop has only 6 GB of VRAM). But it was enough to do a test. I used:

(Regular)I greet the first survivor, a marine.
(David)"Tell me in your own words, what happened, start from when the Ignavus entered warp, do not rush we have plenty of time."

(Regular)Sergeant Maria Vasquez enters the room, saluting sharply before taking a seat across from Inquisitor David. She takes a deep breath, composing herself before beginning her account.
(Maria)"It began normally enough,"(Regular) she starts, her voice steady despite the memories surfacing. (Maria)"We'd completed our patrol along the border sectors, nothing unusual reported. The Ignavus jumped into warp, the stars stretching into lines of light outside the portholes. For hours, everything was routine - meals, check-ins, maintenance drills. Then suddenly… it wasn't.
The ship lurched violently, throwing us off balance. Alarms blared, red lights flashing everywhere. Our comms buzzed with panicked voices reporting system failures throughout the ship. Warp drive cut out unexpectedly, stranding us god knows where. Emergency lighting kicked in, casting eerie shadows on the bulkheads.
Gravity fluctuated wildly; one moment we were weightless, the next crushed beneath its force. I tried to reach the bridge, but corridors were blocked by debris or collapsed sections. I turned around, heading towards the armory instead. If we were under attack, I wanted to be ready.
But we weren't under attack, not directly anyway. Reactor core was overheating, pushing dangerously close to meltdown. Orders came flooding in via emergency broadcast - abandon ship! Everywhere, panic spread like wildfire. Marines, crewmen, everyone rushed towards escape pods. I helped evacuate injured personnel while others secured key areas before fleeing themselves.
My squad and I managed to launch three pods full of survivors before ours launched automatically due to lack of manual control. As we drifted away, I watched helplessly as the Ignavus tore itself apart. Explosions rocked the dying vessel, pieces flying off into space. Fire raged within visible sections, consuming anything left onboard…
All because some damn thing decided to tamper with our warp drive. Left us stranded, vulnerable. Lost good men and women because of it. Damned cowards, attacking anonymously…"(Regular) She pauses, swallowing hard, eyes glistening slightly before continuing. (Maria)"…and now here I am, alive, while most aren't."
(Regular)Sergeant Vasquez looks down briefly, collecting herself before meeting David's gaze again, awaiting further questioning.

It did work, though I had some oddities. It does not treat ' - ' as a pause; I've noticed AI likes to use that a lot in generated text, along with a solid line. The same was true of '... '. It also did not seem to pause on voice changes.

The other oddity is a more system-specific one. I have two GPUs in the laptop, an NVIDIA one and an Intel one. The VRAM on the NVIDIA one was used for storing the model, but generation appeared to run on the Intel one. I am assuming it would be faster to use the NVIDIA one for generation; is there a way to fix that? I have attached the resulting wav file in a zip (can't attach .wav) so you can hear what I mean by pauses. I'll add again that I have very few voices on my laptop, so these were chosen at random; they are not ideal for the task and there is zero emotion in the reading lol.

audio.zip

SWivid commented 1 day ago

it does not see ' - ' as a pause

I think this is mainly due to training-set bias, and we were just splitting it into two words. Not sure whether you have the remove_silence option in the advanced settings turned on or not. It might be better to tweak https://github.com/SWivid/F5-TTS/blob/5600d9079a2813e9b0dc1ef9a52193604eed4828/gradio_app.py#L458-L460 as was done for podcast https://github.com/SWivid/F5-TTS/blob/5600d9079a2813e9b0dc1ef9a52193604eed4828/gradio_app.py#L108-L115 with the help of @jpgallegoar . Or you could be the pioneer here: feedback, good or bad, would help us decide whether to add this. lol.
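For anyone who wants to experiment with a pause on voice changes, one generic approach is to put a short stretch of silence between the generated clips when joining them. The sketch below is an assumption-laden illustration, not the code at the gradio_app.py lines linked above: `join_with_silence` is a hypothetical helper, clips are plain lists of float samples, and the 24000 Hz sample rate and 0.3 s gap are placeholder values.

```python
def join_with_silence(clips, sample_rate=24000, gap_seconds=0.3):
    """Concatenate clips (lists of float samples) with silence in between.

    sample_rate and gap_seconds are placeholder defaults; use whatever
    rate the generated audio actually has.
    """
    gap = [0.0] * round(sample_rate * gap_seconds)
    joined = []
    for index, clip in enumerate(clips):
        if index > 0:
            # Insert the silent gap before every clip except the first.
            joined.extend(gap)
        joined.extend(clip)
    return joined


narrator = [0.1, 0.2]
maria = [0.3]
print(join_with_silence([narrator, maria], sample_rate=4, gap_seconds=0.5))
```

If remove_silence is enabled and runs after the clips are joined, it could strip such a gap again, so the two settings are worth testing together.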

Some tricks might also help: "It began normally enough,"(Regular) she starts, -> "It began normally enough,". . .(Regular) she starts, or even with zh_punc , to introduce an extra pause.
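The ". . ." trick could be applied automatically before every voice change rather than by hand. A minimal sketch, assuming the `(Name)` tag syntax used in this thread; `pad_voice_changes` is a hypothetical helper, and whether the filler really produces a pause depends on the model.

```python
import re

# Assumption: a voice tag is a single word in parentheses, e.g. "(Regular)".
TAG = re.compile(r"\((\w+)\)")


def pad_voice_changes(text, filler=". . ."):
    """Insert filler text in front of every voice tag that follows other text."""

    def add_filler(match):
        start = match.start()
        # A tag at the start of the text or of a line has nothing before it
        # that needs a pause, so leave it alone.
        if start == 0 or match.string[start - 1] == "\n":
            return match.group(0)
        return filler + match.group(0)

    return TAG.sub(add_filler, text)


line = '"It began normally enough,"(Regular) she starts,'
print(pad_voice_changes(line))
```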

To specify which GPU is used, try CUDA_VISIBLE_DEVICES=1 python gradio_app.py or another way of selecting the specific device index.
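The same variable can also be set from Python, which avoids shell-specific syntax. This is only a sketch: `select_gpu` is a hypothetical helper, the index 1 is simply the value from the command above, and the variable has to be set before any library initializes CUDA.

```python
import os


def select_gpu(index):
    """Expose only one CUDA device to libraries that initialize CUDA afterwards."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)


# Index 1 is just the value from the comment above; the right index
# depends on how the system enumerates its devices.
select_gpu(1)
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Note that CUDA_VISIBLE_DEVICES only filters CUDA devices, so it selects among NVIDIA GPUs and has no effect on an Intel GPU.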

GamingDaveUk commented 1 day ago

I will likely edit the text that I generate the audio for, to remove the pauses that are... well, not pauses in the eyes of the software. I just mentioned it as it would likely be better to have it do that when it processes the text. I don't know Python and only know C#. To be honest I am rusty at C#, but I would have it do a string.replace(" - ",", ") when it processes the string to be sent to whatever function converts the text into audio. Maybe a section on the UI for text replacement, with the ability to add extra entries like you have for the speakers, could be handy? It might have other use cases beyond this for much longer text.
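In Python the equivalent of that C# idea is `str.replace`. A minimal sketch of a user-editable replacement list, applied before the text is handed to synthesis; `apply_replacements` and `rules` are hypothetical names, not part of F5-TTS.

```python
def apply_replacements(text, replacements):
    """Apply each (old, new) pair in order and return the cleaned text."""
    for old, new in replacements:
        text = text.replace(old, new)
    return text


# The single rule suggested above; more pairs can be appended as needed.
rules = [(" - ", ", ")]

line = "For hours, everything was routine - meals, check-ins, maintenance drills."
print(apply_replacements(line, rules))
```

Rules are applied in list order, so a later rule sees the output of earlier ones.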

I can't reload it at the moment as work is now a bit busier, but everything was pretty much default. I added the speakers and the text, so if remove silence is on by default then that may indeed be the issue (I don't recall seeing the option). I don't know enough about Python to edit the files myself.

I have a venv set up to keep the environment used by this separate from the main system. I doubt that will affect anything, but basically when installing I git cloned, ran python -m venv venv and call venv/scripts/activate, installed the two CUDA modules as per the readme, then installed the requirements,

and I use the following bat to load it:

@echo off

call venv/scripts/activate

python gradio_app.py

pause

I assume I would change the bat to:

@echo off

call venv/scripts/activate

rem In a Windows batch file the variable has to be set on its own line
set CUDA_VISIBLE_DEVICES=1

python gradio_app.py

pause

to test your last suggestion. I will give it a go when work is quiet enough that I am not stopping every 5 minutes to deal with something (as I have had to do while writing this lol)