Closed luohao123 closed 1 hour ago
Such a system would be very meaningful. F5-TTS cannot do truly streamable generation at present; it can only generate the whole sequence at once.
That's part of the reason we're open-sourcing: to move faster in this direction with the community.
Is that because of the flow-matching approach to generation?
As far as I know, compared to naturally streamable next-token-prediction modeling, a compromise for diffusion models may be to just chunk the generation. Otherwise, a brand-new training task that enables streaming, instead of in-filling, would be required.
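The "just chunk the generation" compromise can be sketched roughly as below. This is a minimal illustration, not F5-TTS code: `synthesize_chunk` is a hypothetical stand-in for a full (non-streaming) diffusion/flow-matching model call, and the chunk size is arbitrary.

```python
# Hedged sketch: chunked generation as a streaming compromise for a
# non-streaming diffusion / flow-matching TTS model.

def synthesize_chunk(text_piece: str) -> list:
    # Placeholder: a real flow-matching model would run its full
    # ODE / denoising loop here and return an audio chunk.
    return [ord(c) % 7 for c in text_piece]  # fake "samples"

def stream_tts(text: str, chunk_chars: int = 16):
    """Yield audio chunks as each text chunk finishes synthesis,
    instead of waiting for the whole utterance."""
    for start in range(0, len(text), chunk_chars):
        piece = text[start:start + chunk_chars]
        yield synthesize_chunk(piece)

# The caller can start playback as soon as the first chunk arrives.
chunks = list(stream_tts("hello world, this is a chunked demo", chunk_chars=8))
```

The trade-off is that each chunk is generated without seeing future text, so prosody can break at chunk boundaries; that is why a dedicated streaming training objective might ultimately be needed.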
What potential approaches might F5 evolve through if the objective is to integrate it into a Large Language Model?
In my opinion, integrating 'into' an LLM would be more along the lines of Seed-TTS_ICL, MELLE, ARDiT-TTS, etc. To integrate 'with' a text LLM, we are trying to make F5 streamable first.
The latter is what I mean: using an adaptor or something similar to concatenate them together. Some work has already shown promising progress, e.g. MiniOmni.
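The adaptor route mentioned above could look roughly like the following: a small learned projection mapping LLM hidden states into the TTS model's conditioning space. This is purely illustrative; the dimensions, weights, and function names are assumptions and not taken from F5-TTS or MiniOmni.

```python
# Hedged sketch of an "integrate with" adaptor: project LLM hidden
# states into a TTS conditioning space. Toy dimensions for clarity.
import random

random.seed(0)

LLM_DIM, TTS_DIM = 8, 4  # real models use hundreds or thousands of dims

# Adaptor weights: a single linear projection (bias omitted for brevity).
# In practice this would be trained jointly or via a fine-tuning stage.
W = [[random.gauss(0, 0.1) for _ in range(TTS_DIM)] for _ in range(LLM_DIM)]

def adapt(llm_hidden: list) -> list:
    """Map one LLM hidden state (LLM_DIM) to TTS conditioning (TTS_DIM)."""
    return [sum(h * W[i][j] for i, h in enumerate(llm_hidden))
            for j in range(TTS_DIM)]

# One hidden state per generated LLM token; each becomes a
# conditioning vector fed to the TTS side.
llm_states = [[random.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(3)]
cond = [adapt(h) for h in llm_states]
```

For this to work as a streaming system, the TTS side still has to consume conditioning token-by-token, which is exactly why making F5 streamable comes first.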
In your mind, what could the first step be to make F5 streamable?
We will keep open-sourcing once we have something concrete w.r.t. this. And you would be welcome to share your insights ~
I am thinking of stacking whisper-large-v3 and F5-TTS to make an e2e voice-to-voice model. For the TTS part, do you think it's possible to integrate F5 with an LLM?
That means it would need to be streamable.
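The proposed stack is essentially an ASR → LLM → TTS pipeline. A minimal sketch of that wiring is below; all three stage functions are placeholders standing in for the actual model calls (whisper-large-v3 for ASR, some text LLM, F5-TTS for synthesis), not real APIs.

```python
# Hedged sketch of the proposed whisper-large-v3 + LLM + F5-TTS stack.
# Each stage is a placeholder; real code would call the actual models.

def asr(audio: bytes) -> str:
    # Placeholder for whisper-large-v3 transcription.
    return audio.decode("utf-8")

def llm_reply(text: str) -> str:
    # Placeholder for the text LLM producing a response.
    return f"echo: {text}"

def tts(text: str) -> bytes:
    # Placeholder for F5-TTS synthesis. For a responsive voice-to-voice
    # loop, this stage would need to emit audio chunk-by-chunk, which is
    # the streaming gap discussed above.
    return text.encode("utf-8")

def voice_to_voice(audio_in: bytes) -> bytes:
    return tts(llm_reply(asr(audio_in)))

out = voice_to_voice(b"hi there")
```

The overall latency is the sum of all three stages, so even with a fast ASR and LLM, a non-streaming TTS stage dominates the time-to-first-audio.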