SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License

Is it possible to integrate an LLM for end-to-end training? #10

Closed luohao123 closed 1 hour ago

luohao123 commented 14 hours ago

I am thinking of using a whisper-large-v3 and F5-TTS stack to make an end-to-end voice-to-voice model. For the TTS part, do you think it is possible to integrate F5 with an LLM?

This means it should be streamable.
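The cascade proposed here (ASR → text LLM → TTS) can be sketched as a simple pipeline. This is a hypothetical illustration only: the three stage functions below are stand-in stubs, not the real whisper-large-v3 or F5-TTS APIs, and their names are made up for this sketch.

```python
# Hypothetical sketch of a cascaded voice-to-voice pipeline:
# ASR (e.g. whisper-large-v3) -> text LLM -> TTS (e.g. F5-TTS).
# All three stage functions are placeholders; real model calls differ.

def asr_transcribe(audio_in: bytes) -> str:
    """Stand-in for an ASR call (e.g. whisper-large-v3)."""
    return "hello there"  # placeholder transcript

def llm_respond(text: str) -> str:
    """Stand-in for one text-LLM turn."""
    return f"you said: {text}"  # placeholder reply

def tts_synthesize(text: str) -> bytes:
    """Stand-in for TTS synthesis (e.g. F5-TTS)."""
    return text.encode("utf-8")  # placeholder "waveform" bytes

def voice_to_voice(audio_in: bytes) -> bytes:
    """Compose the three stages into one voice-to-voice turn."""
    transcript = asr_transcribe(audio_in)
    reply = llm_respond(transcript)
    return tts_synthesize(reply)
```

For the system to feel interactive, each stage would need to stream, which is exactly the constraint discussed below: the TTS stage is the hard part for a non-autoregressive model like F5.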

SWivid commented 14 hours ago

Such a system would be very meaningful. F5-TTS cannot do truly streamable generation at present, though it does generate in a sequence to some extent.

That's part of the reason we're open-sourcing: to move faster in this direction with the community.

luohao123 commented 8 hours ago

Is that because of the flow-matching approach to generation?

SWivid commented 8 hours ago

> Is that because of the flow-matching approach to generation?

As far as I know, compared to naturally streamable next-token-prediction modeling, a compromise for diffusion models may be to simply chunk the generation. Otherwise, a brand-new training task that enables streaming, instead of infilling, would be required.
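The "chunk the generation" compromise mentioned above can be sketched as follows. This is not F5-TTS code: `synthesize_chunk` is a made-up stand-in for one full flow-matching sampling pass, and the crossfade/overlap scheme is just one plausible way to hide chunk boundaries.

```python
# Illustrative sketch of chunked pseudo-streaming for a diffusion/flow
# TTS model: sample each text chunk in full, then crossfade a short
# overlap region between adjacent chunks so audio can be emitted as
# each chunk finishes, rather than after the whole utterance.

def synthesize_chunk(text: str) -> list[float]:
    """Stand-in for one flow-matching sampling pass.
    Returns a fake waveform: one sample per character."""
    return [float(ord(c)) for c in text]

def crossfade(tail: list[float], head: list[float]) -> list[float]:
    """Linearly blend the overlapping samples of two adjacent chunks."""
    n = len(tail)
    return [tail[i] * (1 - i / n) + head[i] * (i / n) for i in range(n)]

def stream_tts(sentences: list[str], overlap: int = 4):
    """Yield audio chunk by chunk; each chunk except the first
    has `overlap` samples crossfaded with its predecessor."""
    prev = None
    for s in sentences:
        wav = synthesize_chunk(s)  # chunks must be longer than `overlap`
        if prev is None:
            prev = wav
            continue
        # Emit the previous chunk with its tail blended into this one.
        yield prev[:-overlap] + crossfade(prev[-overlap:], wav[:overlap])
        prev = wav[overlap:]
    if prev is not None:
        yield prev
```

Latency then scales with chunk length rather than utterance length, at the cost of each chunk being conditioned only locally, which is why a training task designed for streaming could do better.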

luohao123 commented 7 hours ago

What potential approaches might F5 evolve through if we have the objective of integrating it into a Large Language Model?

SWivid commented 7 hours ago

In my opinion, integrating 'into' an LLM would be more along the lines of Seed-TTS_ICL, or MELLE, ARDiT-TTS, etc. To integrate 'with' a text LLM, we are trying to make F5 streamable first.

luohao123 commented 5 hours ago

The latter is what I mean: using an adaptor or something to concatenate them together. Some work has already shown promising progress, like Mini-Omni.

In your mind, what could be the first step to make F5 streamable?

SWivid commented 5 hours ago

We will keep open-sourcing once we have something concrete w.r.t. this. And it would be welcome if you could share your insights ~