elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)
Apache License 2.0
1.26k stars 90 forks

Support Text to Speech #209

Open zolrath opened 1 year ago

zolrath commented 1 year ago

Hello! As Speech to Text models such as Whisper are added, having access to some of the impressive AI Text to Speech models would be a nice way to close the loop!

My current suggestion for a model to support would be Bark.

fredwu commented 1 year ago

+1

Would also love to see support for Coqui TTS.

bartekupartek commented 5 months ago

It would be great to run Bark in Elixir. Also, this TTS model has attracted a lot of attention recently: https://github.com/collabora/WhisperSpeech

Jdyn commented 3 months ago

I hate to reiterate what's already been said, but TTS in Bumblebee using Bark would be super valuable. Any chance of supporting it?

Hugging Face: https://huggingface.co/suno/bark

josevalim commented 3 months ago

Pull requests are always welcome. Starting with one of the models in Hugging Face Transformers is probably the easiest way to get started: https://huggingface.co/docs/transformers/en/tasks/text-to-speech

nickkaltner commented 3 months ago

Just adding this as another interesting model to support: https://huggingface.co/coqui/XTTS-v2

bartekupartek commented 2 months ago

I tried to port Bark and later WhisperSpeech. They use multiple models to convert text to semantics, semantics to audio, and encode... Anyway, more promising models have been released recently: https://huggingface.co/parler-tts/parler_tts_mini_v0.1, https://github.com/jasonppy/VoiceCraft, and https://github.com/myshell-ai/OpenVoice. After reviewing their architectures, I think they might be easier to integrate.

michelson commented 2 months ago

@bartekupartek, is your implementation open? I'm trying to do the same. I've read the docs, but I'm not sure where to start.

bartekupartek commented 2 months ago

@michelson not yet, but I'm working on it. These models aren't using standard layers, or if they are at all, they come in pickle format. I needed to step back and understand simpler models with Axon first.

bartekupartek commented 2 months ago

I'm currently playing around with Tacotron 2 text-to-speech, and since it's the simplest TTS model I've found, I'm trying to reproduce it in Elixir. I used nx_signal to process audio files and generate mel spectrograms, but during my research I noticed there is no vocoder support in the Elixir ecosystem for converting spectrograms back to audio. Or am I missing something? Vocoders are typically separate models, so I think they could be integrated into Bumblebee. As far as I can tell, all TTS models rely on vocoders to produce audio from their outputs, but they are yet another layer of complexity.
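
For readers unfamiliar with the mel spectrogram step mentioned above, here is a minimal sketch of the mel filterbank that maps a linear-frequency power spectrum onto mel bands. It is plain Python rather than Elixir, it is not the nx_signal API, and it assumes the HTK-style mel formula; all function names are hypothetical and chosen only for illustration.

```python
import math


def hz_to_mel(hz):
    """HTK-style mel scale (an assumption): mel = 2595 * log10(1 + hz / 700)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Return n_mels triangular filters over n_fft // 2 + 1 frequency bins."""
    n_bins = n_fft // 2 + 1
    max_mel = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 points evenly spaced on the mel scale:
    # left edge, n_mels filter centres, right edge.
    edges_hz = [mel_to_hz(max_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    filters = []
    for m in range(n_mels):
        left, centre, right = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - left) / (centre - left)
            falling = (right - f) / (right - centre)
            # Triangle: ramps up to 1 at the centre, back down to 0 at the right edge.
            row.append(max(0.0, min(rising, falling)))
        filters.append(row)
    return filters


def apply_filterbank(filters, power_spectrum):
    """Project one frame's power spectrum onto the mel bands."""
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in filters]
```

The projection discards phase and merges neighbouring frequency bins, which is why a separate vocoder is needed to get a waveform back from a mel spectrogram.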

josevalim commented 2 months ago

Correct. We would need to implement them in Elixir. Maybe @polvalente knows of an implementation that could be ported; otherwise we need to check whether there are any JAX implementations. If not, maybe it needs to be a separate library we invoke.

polvalente commented 2 months ago

There are many kinds of vocoders. I think the best way to approach this would be to choose a specific model we want to support and work towards porting the one it uses.

bartekupartek commented 2 months ago

I was thinking it might be one of the torchaudio vocoders, such as Griffin-Lim (the output sounds robotic), WaveRNN (most likely this one), or NVIDIA WaveGlow, to turn mel spectrograms into audio, but I just read through the VALL-E paper, which Bark is based on:

We propose VALL-E, the first TTS framework with strong in-context learning capabilities as GPT-3, which treats TTS as a language model task with audio codec codes as an intermediate representation to replace the traditional mel spectrogram

It would be fun to have Tacotron 2 working end to end, or to hear how mel spectrograms sound, but it looks like that doesn't make sense for any of the recent models mentioned above, which use facebook/encodec to turn outputs into audio codes directly 🙇‍♂️
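
To make "audio codec codes" a bit more concrete: EnCodec represents each latent audio frame with residual vector quantization (RVQ), where every stage quantizes whatever error the previous stage left behind, so a frame becomes a short list of integer codes. The toy sketch below is plain Python with tiny hand-written codebooks (real codebooks are learned); the names are hypothetical and this is not the EnCodec API.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to vector (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(codebooks, vector):
    """Quantize vector stage by stage; each stage encodes the remaining residual."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        index = nearest(codebook, residual)
        codes.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return codes


def rvq_decode(codebooks, codes):
    """Reconstruct the vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, index in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out


# Hypothetical two-stage codebooks: a coarse grid, then a finer correction.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]],
]
```

With these codebooks, `rvq_encode(CODEBOOKS, [1.2, 0.3])` gives the codes `[1, 3]`, and `rvq_decode(CODEBOOKS, [1, 3])` reconstructs `[1.25, 0.25]`. A language-model-style TTS system predicts such integer codes instead of a mel spectrogram, and the codec's decoder turns them back into a waveform.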