Open nanowell opened 2 months ago

nanowell:
Could you add native speech-to-speech / audio-to-audio support, with an encoder (a tokenizer) and a decoder (back to audio waveforms)? I was able to implement a decoder-only model: I first used an audio codec tokenizer to tokenize the dataset, then trained on the resulting tokens. It works unreliably, and I can't figure out how to generate from voice inputs. Do you have plans for audio-to-audio support?

sw005320:
Yes, we are working on supporting it. @jctian98, can you respond to this?

jctian98:
@sw005320 Sure.

Hi @nanowell, thanks for the question. We are currently working on the SpeechLM module of ESPnet. You may check our SpeechLM branch if you are interested. Here is the doc: https://github.com/espnet/espnet/blob/speechlm/egs2/TEMPLATE/speechlm1/README.md I think it's similar to what you mentioned.

Could you further clarify which audio-to-audio / speech-to-speech tasks you are interested in, e.g., speech enhancement, separation, or something else? Also, please be aware that many speech/audio tasks don't stand alone; they often interact heavily with textual input/output, as in speech recognition, synthesis, translation, etc.

Please let us know how we can further help :)

Best,

nanowell:
@jctian98, thank you for your help with the documentation. I'm trying to model an audio continuation transformer with an audio-to-audio assistant in mind. As a result, I believe it will be capable of modeling speech enhancement, separation, or other tasks that would typically require task-specific pretraining (assuming sufficient scaling). I think autoregressive (AR) audio completion models, similar to vanilla AR language models, will be more versatile but also more challenging to pretrain.

jctian98:
@nanowell Thanks for the reply. If you plan to do the pre-training yourself, the current SpeechLM module should be a flexible and efficient candidate. We support useful features such as Hugging Face model integration and DeepSpeed. Unfortunately, we don't have a pre-trained model to release at this stage. We'll let you know if we get one later :)
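The generation-from-voice-input step nanowell describes is essentially prompted continuation: encode the voice input with the codec tokenizer, use those tokens as the language model's prefix, autoregressively sample new tokens, and decode everything back to a waveform. Here is a minimal Python sketch of that pipeline; the scalar quantizer and the next-token stub are toy placeholders standing in for a real neural codec and a trained decoder-only LM, and none of the names below are ESPnet or codec-library APIs:

```python
# Toy sketch: prompted audio continuation with a codec tokenizer.
# encode/decode stand in for a neural codec (e.g., EnCodec-style);
# next_token stands in for a trained decoder-only transformer.
from typing import List

N_BINS = 256  # codebook size of the toy scalar codec

def encode(wave: List[float]) -> List[int]:
    """Quantize samples in [-1, 1] into N_BINS discrete codec tokens."""
    return [min(N_BINS - 1, int((s + 1.0) / 2.0 * N_BINS)) for s in wave]

def decode(tokens: List[int]) -> List[float]:
    """Map codec tokens back to waveform samples (bin centers)."""
    return [(t + 0.5) / N_BINS * 2.0 - 1.0 for t in tokens]

def next_token(context: List[int]) -> int:
    """Stub AR model: a real system samples from a trained LM here."""
    return context[-1]  # trivially repeats the last token

def continue_audio(prompt_wave: List[float], n_new: int) -> List[float]:
    """Tokenize a voice prompt, autoregressively extend, decode to audio."""
    tokens = encode(prompt_wave)           # 1. voice input -> codec tokens
    for _ in range(n_new):                 # 2. AR generation conditioned
        tokens.append(next_token(tokens))  #    on the prompt tokens
    return decode(tokens)                  # 3. tokens -> waveform

prompt = [0.0, 0.5, -0.5, 0.25]
out = continue_audio(prompt, n_new=4)
print(len(out))  # 8 samples: 4 (reconstructed) prompt + 4 generated
```

The key point of the sketch is step 1: the trained model never sees waveforms, so a voice prompt must pass through the same codec tokenizer used at training time before it can condition generation.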