espnet / espnet

End-to-End Speech Processing Toolkit
https://espnet.github.io/espnet/
Apache License 2.0

Speech-to-Speech/Audio-to-Audio support #5871

Open nanowell opened 2 months ago

nanowell commented 2 months ago

Could you add native speech-to-speech / audio-to-audio support, with an encoder (tokenizer) and a decoder (back to audio waveforms)?

I was able to implement a decoder-only model: I first used an audio codec tokenizer to tokenize the dataset, then trained on the resulting token sequences. It works unreliably, and I can't figure out how to generate from voice inputs.
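Concretely, my setup looks roughly like this (a minimal sketch using EnCodec as the codec; `AudioTokenLM` and its `generate` method are placeholders for my decoder-only transformer). The commented-out inference lines at the end are the part I can't get to work reliably:

```python
# Sketch of the pipeline: codec-tokenize audio for training, then (ideally)
# prompt the trained LM with the tokens of a voice input and decode the
# generated tokens back to a waveform.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(1.5)  # 1.5 kbps -> 2 codebooks per 75 Hz frame

def tokenize(path: str) -> torch.Tensor:
    """Waveform file -> discrete codec tokens of shape [1, n_q, T]."""
    wav, sr = torchaudio.load(path)
    wav = convert_audio(wav, sr, codec.sample_rate, codec.channels)
    with torch.no_grad():
        frames = codec.encode(wav.unsqueeze(0))
    return torch.cat([codes for codes, _ in frames], dim=-1)

def detokenize(codes: torch.Tensor) -> torch.Tensor:
    """Discrete codec tokens [1, n_q, T] -> waveform [1, channels, T']."""
    with torch.no_grad():
        return codec.decode([(codes, None)])

# Inference: the voice input becomes the prompt, as with a text LM.
prompt = tokenize("voice_input.wav")
# lm = AudioTokenLM.load("checkpoint.pt")                   # placeholder
# continuation = lm.generate(prompt, max_new_frames=750)    # ~10 s at 75 Hz
# torchaudio.save("out.wav", detokenize(continuation)[0], codec.sample_rate)
```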

Do you have plans for audio-to-audio support?

sw005320 commented 2 months ago

Yes, we are working on supporting it. @jctian98, can you respond to this?

jctian98 commented 2 months ago

@sw005320 sure.

Hi @nanowell , thanks for the question.

We are currently working on the SpeechLM module of ESPnet, which I think is close to what you describe. You may check our SpeechLM branch if you are interested; the doc is here: https://github.com/espnet/espnet/blob/speechlm/egs2/TEMPLATE/speechlm1/README.md

Could you clarify which audio-to-audio / speech-to-speech tasks you are interested in, e.g., speech enhancement, separation, or something else? Also, please be aware that many speech/audio tasks don't stand alone; they often interact heavily with textual input/output, as in speech recognition, synthesis, and translation.

Please let us know how we can further help :)

Best,

nanowell commented 2 months ago

@jctian98, thank you for your help with the documentation. I'm trying to build an audio-continuation transformer, with an audio-to-audio assistant in mind. Given sufficient scale, I believe such a model could cover speech enhancement, separation, and other tasks that would normally each require task-specific pretraining. I think autoregressive (AR) audio-completion models, analogous to vanilla AR language models, will be more versatile but also more challenging to pretrain.
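To make the analogy concrete: the objective is plain next-token prediction, just over codec tokens rather than text. A minimal sketch of one training step, assuming the n_q codebook streams have already been flattened/interleaved into a single token sequence per utterance:

```python
import torch
import torch.nn.functional as F

def train_step(model, tokens, optimizer):
    """One AR step. `tokens` is [B, T] of flattened codec indices; `model` is
    any decoder-only transformer returning logits of shape [B, T, vocab]."""
    logits = model(tokens[:, :-1])            # predict token t from tokens < t
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # [B*(T-1), vocab]
        tokens[:, 1:].reshape(-1),            # shifted targets
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```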

jctian98 commented 2 months ago

@nanowell Thanks for the reply.

If you plan to do the pre-training yourself, the current SpeechLM module should be a flexible and efficient candidate. It supports features such as HuggingFace model integration and DeepSpeed.
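For reference, DeepSpeed-managed training generally looks like the sketch below. This is generic DeepSpeed usage rather than the SpeechLM recipe (see the branch README above for that), and the config values, model, and dataloader are placeholder assumptions:

```python
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # ZeRO-2: shard grads + optimizer states
}
# `model` is assumed to be a torch.nn.Module whose forward returns the loss;
# `loader` yields batches of codec tokens as in the earlier sketch.
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
for tokens in loader:
    loss = engine(tokens)      # forward pass on the wrapped model
    engine.backward(loss)      # DeepSpeed handles loss scaling / ZeRO
    engine.step()              # optimizer step + gradient zeroing
```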

Unfortunately, we don't have a pre-trained model to release at this stage. Will let you know if we get one later :)