Open zolrath opened 1 year ago
It would be great to run Bark in Elixir. This TTS model has also attracted a lot of attention recently: https://github.com/collabora/WhisperSpeech
I hate to reiterate what's already been said but TTS in Bumblebee using Bark would be super valuable. Any chance of supporting it?
Hugging Face: https://huggingface.co/suno/bark
Pull requests are always welcome. Starting with one of the models in Hugging Face Transformers is probably the easiest way to get started: https://huggingface.co/docs/transformers/en/tasks/text-to-speech
Just adding this as an interesting model to support too https://huggingface.co/coqui/XTTS-v2
I tried to port Bark and later WhisperSpeech. They use multiple models to convert text to semantics, semantics to audio, and encode... Anyway, some more promising models were released recently: https://huggingface.co/parler-tts/parler_tts_mini_v0.1, https://github.com/jasonppy/VoiceCraft, or https://github.com/myshell-ai/OpenVoice. After reviewing their architectures, I think they might be easier to integrate.
@bartekupartek, is your implementation open? I'm trying to do the same. I've read the docs, but I'm not sure where to start.
@michelson not yet, but I'm working on it. These models don't use standard layers, or if they do at all, they are in pickle format. I needed to step back and understand simpler models with Axon first.
I'm currently playing around with Tacotron 2 text-to-speech. Since it's the simplest TTS model I've found, I'm trying to reproduce it in Elixir. I used nx_signal to process audio files and generate mel spectrograms, but during my research I noticed there is no support in the Elixir ecosystem for a vocoder to convert spectrograms back to audio. Or am I missing something?
Vocoders are typically separate models, so I think they could be integrated into Bumblebee. I found that all TTS models use vocoders to generate audio from their outputs, but they are yet another layer of complexity.
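For readers unfamiliar with the mel spectrogram step mentioned above, here is a minimal sketch in plain Python. It is not nx_signal's API and not Tacotron 2's exact front end. It assumes a Hann window, a naive DFT, and the HTK-style mel formula (one of several conventions). The function names and the defaults `n_fft=256`, `hop=128`, `n_mels=20` are hypothetical choices for illustration only.

```python
import cmath
import math


def hz_to_mel(hz):
    # HTK-style mel scale (an assumption; other conventions exist)
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    # Hann-window one frame, then take a naive DFT; returns n // 2 + 1 power bins
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    bins = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        bins.append(abs(acc) ** 2)
    return bins


def mel_filterbank(n_mels, n_fft, sample_rate):
    # Triangular filters whose edges are evenly spaced on the mel scale
    max_mel = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(max_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mel_spectrogram(signal, sample_rate, n_fft=256, hop=128, n_mels=20):
    # One list of n_mels band energies per frame
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        power = power_spectrum(signal[start:start + n_fft])
        frames.append([sum(w * p for w, p in zip(row, power)) for row in bank])
    return frames


# Usage: a 440 Hz test tone sampled at 8 kHz
sr = 8000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(1024)]
spec = mel_spectrogram(tone, sr)
print(len(spec), "frames x", len(spec[0]), "mel bands")
```

The sketch only covers the forward direction (audio to mel spectrogram). The magnitudes discard phase and the filterbank merges frequency bins, which is why going back to audio needs a vocoder.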
Correct. We would need to implement them in Elixir. Maybe @polvalente knows of an implementation that could be ported; otherwise we need to check whether there are any JAX implementations. If not, maybe it needs to be a separate library we invoke.
There are many kinds of vocoders. I think the best way to approach this would be to choose a specific model we want to support and work towards porting the one it uses.
I was thinking it might be one of the torchaudio vocoders, like Griffin-Lim (output sounds robotic), WaveRNN (most likely this), or NVIDIA WaveGlow, to turn mel spectrograms into audio, but I just read through the VALL-E paper that Bark is based on:
We propose VALL-E, the first TTS framework with strong in-context learning capabilities as GPT-3, which treats TTS as a language model task with audio codec codes as an intermediate representation to replace the traditional mel spectrogram
It would be fun to have Tacotron 2 working end to end, or to hear how mel spectrograms sound, but it looks like that doesn't make sense for any of the recent models mentioned above, which use facebook/encodec to turn outputs into audio codes directly :bowing_man:
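To illustrate what "audio codec codes" means here: EnCodec quantizes its latent frames with residual vector quantization (RVQ), so each frame becomes a short list of integer codebook indices rather than a mel spectrogram column. The toy sketch below shows the RVQ idea only. The codebooks, vector sizes, and function names are hypothetical, and nothing here reflects EnCodec's actual learned codebooks or API.

```python
def nearest(codebook, vec):
    # Index of the codebook entry with the smallest squared distance to vec
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    # Each stage quantizes whatever the previous stages left over
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    # Reconstruction is the sum of the selected entries from every stage
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy two-stage codebooks (hypothetical values, not learned)
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],      # coarse stage
    [[0.0, 0.0], [0.25, -0.25]],   # refinement stage
]
codes = rvq_encode([1.2, 0.8], codebooks)
print(codes, rvq_decode(codes, codebooks))
```

A language-model-style TTS system predicts sequences of such indices, and the codec's decoder turns them back into a waveform, which is the role a vocoder plays for mel-spectrogram models.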
+1
Hello! Now that speech-to-text models such as Whisper are being added, having access to some of the impressive AI text-to-speech models would be a nice way to close the loop!
My current suggestion for a model to support would be Bark.