collabora / WhisperSpeech

An Open Source text-to-speech system built by inverting Whisper.
https://collabora.github.io/WhisperSpeech/
MIT License

Semantic -> acoustic modeling #4

Closed: jpc closed this issue 9 months ago

jpc commented 1 year ago

We got #3 working, so now it's time to try converting the Whisper-based semantic tokens (#3) into EnCodec-based acoustic tokens (#2).
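For reference, here is a rough sketch of how the acoustic-token side can be extracted with the encodec package (illustrative code with an example file name, not the exact notebook pipeline):

```python
# Rough sketch: extract acoustic tokens with EnCodec (assumes the `encodec`
# and `torchaudio` packages; preprocessing in the notebooks may differ).
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(1.5)  # 1.5 kbps -> 2 RVQ codebooks per frame

wav, sr = torchaudio.load("sample.wav")  # example path
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))
# Each frame is a (codes, scale) pair; codes has shape [B, n_codebooks, T]
atoks = torch.cat([codes for codes, _ in frames], dim=-1)
print(atoks.shape)  # these token sequences are the S->A prediction target
```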

We found out that better semantic tokens (from Whisper medium) make this task a lot easier and even tiny models sound great. Multilingual semantic token training helps and cross-language voice cloning works great.

There are a couple of hypotheses to test:

We also still have a couple of engineering challenges:

jpc commented 1 year ago

I pushed the first version of the semantic-to-acoustic model, based on the Whisper transformer architecture, but it does not train, so I probably still have some bugs somewhere. I'm going to create a synthetic dataset and debug it the same way I debugged the quantization bottleneck.
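The idea behind the synthetic dataset is simple: make the acoustic targets a trivial deterministic function of the semantic inputs, so any correctly wired seq2seq model should drive the loss close to zero. A sketch with made-up names and sizes:

```python
# Illustrative synthetic S->A debugging dataset: the "acoustic" targets are a
# trivial function of the "semantic" inputs, so a working model should overfit
# it almost perfectly. All names and dimensions here are made up.
import torch
from torch.utils.data import Dataset

class SyntheticS2ADataset(Dataset):
    def __init__(self, n_samples=1000, seq_len=250, n_stoks=512, n_atoks=1024):
        self.stoks = torch.randint(0, n_stoks, (n_samples, seq_len))
        # deterministic mapping: scale the semantic id and add a position-parity offset
        self.atoks = (self.stoks * 7 + torch.arange(seq_len) % 2) % n_atoks

    def __len__(self):
        return len(self.stoks)

    def __getitem__(self, idx):
        return self.stoks[idx], self.atoks[idx]

# If the S->A model cannot reach near-perfect token accuracy on this,
# the bug is in the model or training code, not in the real data.
```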

jpc commented 1 year ago

I found some bugs in the code and now it trains successfully:

  1. Overfits quickly on 2 hrs of speech
  2. Trains without overfitting on my 160hr single-speaker dataset

The performance is still not great, but it's a step in the right direction. :) It's still based on the old VQ/RQ tokens, so switching to the improved ones should help a bit (see #3).

I also experimented with using Whisper embeddings directly (without quantization), and it works. It made it easy to try extracting the embeddings from other layers of the encoder. That seems like a promising way to balance the difficulty between the two translation tasks: text to semantic tokens vs. semantic tokens to acoustic tokens. For reference, in SPEAR TTS the semantic-to-acoustic task was a lot easier (a decoder-only model with 12 layers, about the size of Whisper Base) than the text-to-semantic task (T5-Large: a 24-layer encoder plus a 24-layer decoder, the same size as Whisper Medium).
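A sketch of how intermediate encoder activations can be pulled out of Whisper with a forward hook (uses the openai-whisper package; the model size, file name, and layer index are just examples):

```python
# Sketch: capture embeddings from an intermediate Whisper encoder layer with a
# forward hook. Model size, file path, and layer index are arbitrary examples.
import torch
import whisper

model = whisper.load_model("base.en")
layer_idx = 3  # which encoder block to tap
captured = {}

def hook(module, inputs, output):
    captured["emb"] = output.detach()

handle = model.encoder.blocks[layer_idx].register_forward_hook(hook)

audio = whisper.load_audio("sample.wav")
mel = whisper.log_mel_spectrogram(whisper.pad_or_trim(audio)).unsqueeze(0)
with torch.no_grad():
    model.encoder(mel.to(model.device))

handle.remove()
print(captured["emb"].shape)  # [1, 1500, n_audio_state] for a 30-second window
```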

So right now we will focus on trying to understand the balance between these two tasks.

jpc commented 1 year ago

I've trained a new S->A model and fixed the autoregressive sampling, and it has started generating some recognizable speech.

There is still a serious bug (only the first 10 seconds are generated; everything afterwards is noise), but common phrases ("This is a LibriVox recording", "Gentleman") already sound quite good (modulo the quality of the EnCodec speech codec at 1.5 kbps). Once I figure out this bug, training should get a lot easier, so I expect a big jump in quality in my next update. :)

jpc commented 1 year ago

I fixed the 10-second generation bug (it was in the sampling code). I also found that lowering the multinomial sampling temperature to 0.8 improves the quality quite a lot.
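For reference, the temperature just rescales the logits before multinomial sampling; a minimal sketch:

```python
# Minimal sketch of temperature-scaled multinomial sampling over token logits.
import torch
import torch.nn.functional as F

def sample_token(logits: torch.Tensor, temperature: float = 0.8) -> torch.Tensor:
    # logits: [batch, vocab]; a lower temperature sharpens the distribution,
    # trading diversity for fewer "bad" token choices.
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # [batch, 1]
```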

I also trained another model that replaces cross-attention with adding rescaled encoder features to the input of the decoder's middle layer (both streams run at a fixed rate, so we don't need to learn a mapping between them), and got pretty good quality (a rough sketch of this conditioning follows the sample below):

https://user-images.githubusercontent.com/107984/229446991-2b0cbcff-24ab-4423-9776-e245d39bdb3c.mov
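Here is a rough sketch of that additive conditioning, with illustrative names and the assumption that the encoder features have already been resampled to the decoder's frame rate (not the exact notebook code):

```python
# Rough sketch (illustrative names/shapes): condition the decoder by adding
# rescaled encoder features at the middle layer instead of cross-attending.
# Assumes the encoder features are already at the decoder's frame rate.
import torch
import torch.nn as nn

class AdditiveCondDecoder(nn.Module):
    def __init__(self, layers: nn.ModuleList, d_model: int):
        super().__init__()
        self.layers = layers
        self.enc_scale = nn.Parameter(torch.tensor(1.0))  # learned rescaling
        self.mid = len(layers) // 2

    def forward(self, x: torch.Tensor, enc: torch.Tensor) -> torch.Tensor:
        # x:   [B, T, d_model] decoder hidden states (acoustic token stream)
        # enc: [B, T, d_model] encoder features at the same frame rate
        for i, layer in enumerate(self.layers):
            if i == self.mid:
                x = x + self.enc_scale * enc
            x = layer(x)
        return x
```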

jpc commented 1 year ago

Oh, I forgot to mention that the new PyTorch 2.0 optimized attention implementation is amazing. With a very simple replacement I got a 4x speedup on an A100.
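The replacement amounts to calling torch.nn.functional.scaled_dot_product_attention instead of computing the attention weights by hand; a sketch:

```python
# Sketch: swap a hand-rolled attention computation for PyTorch 2.0's fused
# scaled_dot_product_attention (can dispatch to FlashAttention-style kernels).
import torch
import torch.nn.functional as F

def attention(q, k, v, causal=True):
    # q, k, v: [batch, heads, seq, head_dim]
    # old way, kept for comparison:
    #   w = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    #   w = w.masked_fill(causal_mask, float("-inf")).softmax(dim=-1)
    #   return w @ v
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
```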

EmbraceAir commented 1 year ago

Hi @jpc, thanks for this excellent work! I have a small question about the semantic-to-acoustic model. I noticed that you set unique to False in your data loader, which differs from the paper. Does that mean the semantic tokens contain prosodic information?

By the way, does the above audio result come from "3. Semantic to acoustic token modeling.ipynb" or from the "3B *.ipynb" notebook? Could you provide some pre-trained models?

Thanks

jpc commented 9 months ago

Yup, our semantic tokens also carry prosody information. This makes the S2A model's job easier and the overall solution faster, but it also means that prosody cannot be changed with voice cloning.

The newest samples (in the README) sound a lot better.