X-LANCE / SLAM-LLM

Speech, Language, Audio, Music Processing with Large Language Model
MIT License

Do you have any plans for speech-to-text or speech-to-speech end-to-end models? #78

Open Irvingao opened 1 month ago

Irvingao commented 1 month ago

🚀 The feature, motivation and pitch

As we all know, GPT-4o is an end-to-end multimodal model that supports speech-to-text and speech-to-speech. I have some ideas about it:

  1. Speech to Text: Could we try combining a pretrained ASR encoder with a trainable linear projection to make speech-to-text possible?
  2. Speech to Speech: Align the pretrained ASR decoder with the main LLM backbone.
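
To make the first idea more concrete, here is a minimal shape-level sketch in plain Python. It is only an illustration under assumptions, not SLAM-LLM's actual code: the names `stack_frames`, `LinearProjector`, and `build_llm_inputs` are hypothetical, the encoder output and prompt embeddings are random stand-ins, and a real implementation would use a tensor library and a frozen pretrained encoder and LLM. It shows encoder frames being stacked to shorten the sequence, projected into the LLM embedding size by the one trainable layer, and prepended to the text prompt embeddings.

```python
import random


def stack_frames(frames, k):
    """Concatenate every k consecutive encoder frames to shorten the sequence."""
    usable = len(frames) - len(frames) % k  # drop the ragged tail, if any
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]


class LinearProjector:
    """Hypothetical trainable layer (y = W x + b); everything else stays frozen."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, frames):
        return [[sum(w * x for w, x in zip(row, frame)) + b
                 for row, b in zip(self.weight, self.bias)]
                for frame in frames]


def build_llm_inputs(encoder_frames, prompt_embeddings, projector, k):
    """Project the speech frames and place them before the prompt embeddings."""
    speech_embeddings = projector(stack_frames(encoder_frames, k))
    return speech_embeddings + prompt_embeddings


# Stand-in data: 11 encoder frames of size 4, 3 prompt tokens of size 8.
rng = random.Random(1)
encoder_frames = [[rng.random() for _ in range(4)] for _ in range(11)]
prompt_embeddings = [[rng.random() for _ in range(8)] for _ in range(3)]

projector = LinearProjector(in_dim=4 * 5, out_dim=8)
inputs = build_llm_inputs(encoder_frames, prompt_embeddings, projector, k=5)
print(len(inputs), len(inputs[0]))
```

In this sketch the 11 frames are cut to 10 and stacked in groups of 5, so the LLM would see 2 speech positions followed by the 3 prompt positions. Whether the LLM then produces a transcript or a free-form response depends only on the training targets, not on this input path.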

byrTony-Frankzyq commented 1 month ago

For your first idea, I think the ASR example already does this.

Irvingao commented 1 month ago

I mean speech inputs with LLM outputs.

byrTony-Frankzyq commented 1 month ago

By "text", you mean the LLM's response, right?

Irvingao commented 1 month ago

> By "text", you mean the LLM's response, right? I don't fully understand, though.

Exactly.

zszheng147 commented 1 month ago

Are you talking about ASR for the speech-to-text task? If so, you can try our ASR example.

We may support speech-to-speech in the future, but this task is much more difficult than ASR or TTS alone, since it amounts to combining the two seamlessly. Thank you for your advice; we will take it into consideration.

If you have any further questions or need additional assistance, feel free to ask!