SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License

Improve inference speed on CPU while keeping flow adherence and accuracy #403

Closed — ErfolgreichCharismatisch closed this issue 19 hours ago

ErfolgreichCharismatisch commented 2 weeks ago

Checks

1. Is this request related to a challenge you're experiencing? Tell us your story.

I was trying to generate speech on CPU with a fine-tuned, reduced safetensors model, but generation was very slow: 7 minutes and 40 seconds for an 8-word sentence of about 40 characters. This is frustrating because my goal is to use F5-TTS as a replacement for coqui-ai/TTS (TTSv2), but at that speed it's hopeless. coqui-ai/TTS is fast at cloning and generation but hallucinates every single time, and they have been unable to fix it, which is what made me switch to F5-TTS.

2. What is your suggested solution?

Increase generation speed so that F5-TTS is competitive with coqui-ai/TTS, without sacrificing flow adherence or accuracy.
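For context on why speed and "flow adherence" pull against each other: flow-matching models like F5-TTS synthesize by numerically integrating a learned velocity field, so inference cost scales with the number of function evaluations (NFE), and cutting steps saves time at the cost of integration error. The toy sketch below is plain Python (not F5-TTS code; `euler_integrate` and the velocity field are made up for illustration) using a field with a known exact solution to make that tradeoff visible:

```python
import math

def euler_integrate(velocity, x0, t1, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=t1 with fixed-step
    Euler. Each step costs one velocity evaluation (one "NFE"), so fewer
    steps means faster inference but a less accurate trajectory."""
    x, t = x0, 0.0
    dt = t1 / steps
    for _ in range(steps):
        x += dt * velocity(x, t)
        t += dt
    return x

# Toy velocity field with a known answer: dx/dt = x with x(0) = 1
# has the exact solution x(1) = e, so the error is easy to measure.
velocity = lambda x, t: x
exact = math.e

for steps in (4, 8, 32):
    approx = euler_integrate(velocity, 1.0, 1.0, steps)
    print(f"NFE={steps:2d}  x(1)={approx:.4f}  error={abs(approx - exact):.4f}")
```

Running it shows the error shrinking as NFE grows, which is the same knob a real speedup would have to respect: any CPU optimization that simply drops steps will trade away faithfulness unless it is paired with a better solver or a faster per-step evaluation.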

SWivid commented 19 hours ago

Will close this issue; feel free to reopen if you have further questions.

We will definitely consider the efficiency problem in future plans.