SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License

Can inference be accelerated and implemented in real time? #544

Closed WGS-note closed 3 days ago

WGS-note commented 4 days ago


Hi, can inference be accelerated and implemented in real time?


mame82 commented 3 days ago

I've seen similar questions here quite often, so I'll give a few (reusable) answers.

F5-TTS is realtime, but it depends on what you are trying to achieve and how you use the model.

For example, on an RTX 4070 Laptop GPU it takes about 1 minute to generate 2 minutes of audio, which comes down to a real-time factor (RTF) of 0.5 on a moderate GPU.
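For reference, the RTF above is simply generation time divided by the duration of the generated audio; a minimal sketch of that arithmetic (numbers taken from the example above):

```python
# Real-time factor (RTF) = time spent generating / duration of the generated audio.
# Values below 1.0 mean generation is faster than playback.
generation_seconds = 60.0   # ~1 minute of compute in the RTX 4070 Laptop example
audio_seconds = 120.0       # ~2 minutes of generated speech

rtf = generation_seconds / audio_seconds
print(f"RTF = {rtf:.2f}")   # RTF = 0.50
```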

What F5-TTS is not capable of is streaming audio back in chunks. In practice this means that if you provide input text like "Hello, my name is Tom", the model is not able to stream back small audio chunks like Hel ... lo ... my ... name ... is ... Tom. If that were the case, it would be possible to play back chunks immediately (low-latency output, which is not exactly the same as realtime speed).

What does this mean? Due to its flow-based nature, the model creates longer audio output (think of it as whole sentences). The issue with the current implementation is that if your text input exceeds the audio output limit (the durations of the provided reference audio and the generated output are summed and must not exceed 30 seconds), audio generation is processed in multiple slices. Unfortunately, the implementation of this "batch processing" only returns the concatenated final result, not individual slices as they finish. If you provide longish text input that has to be split into, say, 10 output slices, and processing each slice takes 5 seconds, you will receive the result after 50 seconds. That is a long delay, even though the generated output could be two minutes long and processing was still faster than realtime speech.

So if you want faster responses, I suggest slicing your input text before inference. The easiest way to do this (at least in my target language) is to slice on a per-sentence level or, if slices are still too long, to fall back to splitting between words or at commas to keep input text slices short (and avoid the slicing done by the F5-TTS inference code).

This would allow you to play back the first sentence as soon as it is processed (staying with the example above, after 5 seconds instead of 50 seconds). With an RTF of 0.5 this means playback of the first sentence takes 10 seconds; in the meantime the next two text slices can be transformed into audio (as processing a slice takes only 5 seconds). In the end, for the given example, it still takes 50 seconds to generate all the audio, but playback time is about two minutes and playback starts after 5 seconds (everything assumes RTF 0.5, which depends on your hardware).

So in short: while the inference implementation does not support streaming, you can do it on your own, on a per-sentence level instead of a per-token level.
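To illustrate the per-sentence idea, here is a minimal sketch (not part of the repo). `split_sentences`, `synthesize_slice`, and `play` are hypothetical helpers: in a real pipeline `synthesize_slice` would wrap a call to F5-TTS's infer_process with your reference audio and text, and `play` would hand the waveform to an audio output. The point is the producer/consumer structure that lets playback start after the first slice:

```python
import queue
import re
import threading

import numpy as np

SAMPLE_RATE = 24000  # assumed output sample rate


def split_sentences(text: str) -> list[str]:
    # Naive sentence splitter; good enough to keep each slice short,
    # so the inference code never has to split the text itself.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def synthesize_slice(sentence: str) -> np.ndarray:
    # Placeholder: in a real pipeline this would call F5-TTS's infer_process()
    # with your reference audio/text and return the generated waveform.
    # Here it returns a second of silence so the sketch runs end to end.
    return np.zeros(SAMPLE_RATE, dtype=np.float32)


def play(waveform: np.ndarray) -> None:
    # Placeholder for audio playback (e.g. via sounddevice); blocks until done.
    pass


def stream_tts(text: str) -> None:
    audio_q: "queue.Queue[np.ndarray | None]" = queue.Queue()

    def producer() -> None:
        for sentence in split_sentences(text):
            audio_q.put(synthesize_slice(sentence))  # ~5 s per slice in the example
        audio_q.put(None)  # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()

    # Playback starts as soon as the first slice is ready, while the
    # producer keeps generating the following slices in the background.
    while (chunk := audio_q.get()) is not None:
        play(chunk)


if __name__ == "__main__":
    stream_tts("Hello, my name is Tom. This is the second sentence. And a third one.")
```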

WGS-note commented 3 days ago

Thank you very much for your reply! I would like to ask one more question: how do I adjust the speed of the speech?

mame82 commented 3 days ago

The infer_process function takes a speed argument which defaults to 1.0. Setting it to 1.1, for example, increases the speed; setting it to 0.9 slows it down (https://github.com/SWivid/F5-TTS/blob/main/src/f5_tts/infer/utils_infer.py#L349).

It should be noted that the speed setting actually scales the prediction for the output audio duration up or down ... the model basically has to fit the inferred audio into the estimated output duration. Say you set speed to 0.5 while your reference audio is "speaking fast". The generated audio could then have repeated words or other errors (as a twice-as-long audio output has to be filled, while the reference audio indicates a high word frequency). You have to play around with speed a bit to get used to it. I personally only use it to align the generated audio with the reference audio, in case a speed of 1.0 produces errors with the provided reference audio. If you only want to increase the speed of the speaker (word frequency), just provide a reference audio with a faster-speaking speaker, if that makes sense.
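To make the "speed scales the duration prediction" point concrete, here is a simplified, illustrative sketch of how such an estimate behaves. The actual logic lives in utils_infer.py and differs in detail; the helper name and numbers below are assumptions for illustration only:

```python
def estimate_total_frames(ref_audio_frames: int, ref_text: str, gen_text: str,
                          speed: float = 1.0) -> int:
    # Illustrative duration estimate: assume the generated part needs roughly
    # as many frames per character as the reference audio, then divide by
    # `speed`. speed < 1.0 -> longer predicted output -> slower speech (or,
    # if too extreme, repeated words); speed > 1.0 -> shorter output -> faster speech.
    frames_per_char = ref_audio_frames / max(len(ref_text), 1)
    gen_frames = frames_per_char * len(gen_text) / speed
    return ref_audio_frames + int(gen_frames)


ref_frames = 500  # mel frames of the reference clip (made-up number)
for s in (0.5, 1.0, 1.5):
    total = estimate_total_frames(ref_frames, "Some reference text.",
                                  "Some text to generate.", speed=s)
    print(f"speed={s}: total frames the model has to fill = {total}")
```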

This PR gives additional info on the matter of speed/duration prediction (it wasn't pulled in, as it would depend on the model in use and add further processing overhead): https://github.com/SWivid/F5-TTS/pull/363

mame82 commented 3 days ago

I agree that some more info or "best practices" regarding reference audio should be added to the repo, to clarify some of these things (how speaker speed, pauses between words etc. influence the generated audio). Still, this can easily be found out through experimentation. Use the Gradio inference interface: it provides a "speed" slider iirc, and you could also experiment with the impact of differences in reference audio by recording your own voice as reference right from the Gradio interface.

WGS-note commented 3 days ago

Thank you for your reply, this was very useful for me! I would also like to ask: how do I control the speed of speech? Is it based on time delay?

SWivid commented 3 days ago

Many thanks @mame82 ~

how to control the speed of speech

Hi @WGS-note . Our non-autoregressive model is given a total duration within which to predict the audio output (the mel spectrogram, actually). So if you have 10 words and ask the model to generate 10 seconds, the model itself will decide how to distribute each word's position and duration. If you have the same 10 words but ask the model for 8 seconds, it does the same thing, but obviously each word's duration is shorter, which sounds like a faster speaking rate.

The speed slider just does this scaling, and you could simply set 0.8 if you want a result with 0.8x speaking rate, or 1.2 if you want it faster.
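A quick back-of-the-envelope check of that scaling, using the 10-word example from above (the numbers are purely illustrative):

```python
words = 10
base_duration = 10.0  # seconds the model is asked to fill at speed 1.0

for speed in (0.8, 1.0, 1.2):
    duration = base_duration / speed      # the slider rescales the requested duration
    rate = words / duration               # resulting speaking rate in words per second
    print(f"speed={speed}: {duration:.1f} s requested, {rate:.2f} words/s")
# speed=0.8 -> 12.5 s, 0.80 words/s; speed=1.2 -> 8.3 s, 1.20 words/s
```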

For acceleration, some engineering work can help speed up inference substantially (TensorRT, etc.), though we are not familiar with it. That is also the reason we are open-sourcing, in the hope that we can make the TTS technique better with the whole community's efforts.

WGS-note commented 3 days ago

Thank you very much for your reply! I tried it and it worked very well. I was using librosa.effects.time_stretch before and the synthesized audio sounded weird; the speed parameter solved my problem very well.

mame82 commented 3 days ago

Maybe an analogy helps to understand how the model behaves and how you can take control of things like speed (it is not exact, but it helped me get a rough understanding of the concepts). Think of this model as an image generation model whose purpose is inpainting/outpainting.

You provide a partial image as input (which is your reference audio). This image is described by a prompt (your reference text), but the image is not complete. Let's assume the right part of the image is missing, so you add a prompt describing the missing part. Say you provide an image of a checkerboard (2x4 fields, for example) and the description "The left half of a checkerboard" (the analogy to the reference text). As the generation text for the missing part of the image you provide "and the right half".

During inference the following happens: the model concatenates the reference text and your generation text into "The left half of a checkerboard" + "and the right half". Next it needs to know how large the missing part of the image has to be, based on your inputs (the reference text and the generation text are roughly the same length, so it is assumed that the resulting image will be roughly twice the width of the reference image; this corresponds to the estimated duration, which you modify via the speed value). The model then fills the missing part of the overall image (at the predicted width), based on the concatenated prompt and the partial image input. As output it would likely come up with an image of a full checkerboard (4x4 fields).

The final step is to trim away the input part: the left part of the image gets removed (== reference audio) and only the newly generated right part is returned (generated audio), which exactly fits the predicted image width minus the removed input (predicted audio duration). The resulting right part of the image should match your generation prompt, which was "and the right half".

Now if you change the speed of the audio generation to, say, 0.5, the predicted image width is divided by this value (effectively doubling the width), so the model has to fill more room based on the same input image. For a checkerboard this means the generated fields have to be stretched to fill the space (if you think of checker fields as words in the now longer audio, this explains why speech gets slower). The problem is: if the values get too extreme, the model could add repeated patterns of checkerboard fields from the input image (which corresponds to wrong/repeated words or stuttering in audio). If you just want more checkerboard fields in the generated part of the image (which corresponds to a higher word rate, i.e. faster speech), it is better to provide an input image with more checkerboard fields (which corresponds to faster spoken reference audio). If the implementation predicts the wrong output size for the generated content, such that stuttering occurs or words are excessively stretched, that is where the speed value comes in to compensate for the mispredicted output duration. So in that sense the speed argument influences word frequency more or less, but it is not the best way to tune the speed of speech (maybe it should be renamed to something like "duration_scale").

The better way to make the speech faster is to provide input audio with a higher word rate. This also impacts generation speed. Say you provide two reference audios, both with the same reference text and spoken words, but the second one spoken twice as fast. As a result, the calculated overall output duration would be half as long (think of an input image of half the size for the analogy). A shorter input+output duration also means shorter processing time and thus faster inference.

Hope that comparison is of help for those having similar issues.

mame82 commented 3 days ago

The analogy with image generation above also helps in understanding the other aspects discussed here.

Hope these comments are also helpful for others.