collabora / WhisperSpeech

An Open Source text-to-speech system built by inverting Whisper.
https://collabora.github.io/WhisperSpeech/
MIT License

Improve the EnCodec speech quality #10

Closed jpc closed 6 months ago

jpc commented 1 year ago

Right now the EnCodec speech quality at 1.5kbps is pretty terrible (far from what Google shows for their SoundStream-based codec). I am pretty sure the problem is caused by EnCodec being a universal sound codec, because the official samples for SoundStream at 1.5kbps sound quite similar (Lyra-v2 sounds even worse than that). That's why I suspect SPEAR TTS is based on an unreleased speech-only codec.

Since EnCodec has multi-rate capability, the overall model already knows how to represent high-quality speech. The pretty good results we had compressing the Whisper embeddings suggest we might get away with retraining just the quantization layer to reprioritize the bandwidth allocation and improve speech quality (at the cost of ignoring music and other audio).
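
A minimal sketch of what that quantizer-only fine-tuning could look like, assuming the encodec package's EncodecModel with its encoder, decoder and quantizer submodules (speech_dataloader and the plain L1 reconstruction loss are placeholders; real EnCodec training also uses spectral and adversarial losses):

    import torch
    import torch.nn.functional as F
    from encodec import EncodecModel

    # Sketch: keep the universal encoder/decoder frozen and retrain only the RVQ bottleneck
    # on speech, so the 1.5kbps budget gets spent on speech instead of general audio.
    model = EncodecModel.encodec_model_24khz()
    model.set_target_bandwidth(1.5)
    for p in model.encoder.parameters():
        p.requires_grad_(False)
    for p in model.decoder.parameters():
        p.requires_grad_(False)

    opt = torch.optim.Adam(model.quantizer.parameters(), lr=1e-4)
    for wav in speech_dataloader:                      # placeholder: (B, 1, T) batches of 24 kHz speech
        emb = model.encoder(wav)                       # continuous latents
        q = model.quantizer(emb, model.frame_rate, model.bandwidth)  # straight-through RVQ
        recon = model.decoder(q.quantized)
        loss = F.l1_loss(recon, wav) + q.penalty       # stand-in reconstruction + commitment loss
        opt.zero_grad(); loss.backward(); opt.step()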

rishikksh20 commented 1 year ago

Hi, I am new to vector-quantization-based TTS and VC, and I am planning to use your code for prompt-based VC, so I only need to train the S->A network, but I need to train it with 6 kbps EnCodec. Can we train the S->A network on 6 kbps EnCodec tokens without changing the architecture, or do I have to change something to support 6 kbps? If the S->A network can be trained directly on 6 kbps EnCodec tokens, I can train it on my server.

jpc commented 1 year ago

Hi, your question arrived just in time :). I just finished training a 3kbps S->A model and I'll push it today.

The downside is that the training is a bit more than 2x slower and it seems we need 2x more epochs to get good results. I trained it for 15 epochs (1.5x more than the 1.5kbps model) and it took me 15 hours on a single A100. So in the end it has a lot better audio quality but is not yet as good as the 1.5kbps model in terms of speech legibility (it sometimes mumbles).

We could train on 6kbps directly if we reduced the window size to 15s. But before that I would love to check whether we can fine-tune the EnCodec bottleneck on speech only, since the SPEAR TTS samples showed great speech quality at merely 1.5kbps.

Could you share more about your "prompt based VC" idea? Sounds interesting.

rishikksh20 commented 1 year ago

The idea is "prompt-based any-to-any voice conversion". Since it's a voice-to-voice problem, we don't need a T->S model. We just need a voice-to-semantic-token model (HuBERT, VQ-Wav2Vec, or maybe the Whisper encoder) that removes the voice and expression components from the input audio and preserves only the semantic content. We use that model to extract semantic tokens from the source speaker's voice and then, similar to SPEAR-TTS, append them to the target voice's acoustic tokens and pass everything to the S->A model. This way we can copy the target speaker's voice and expression onto the source speaker's speech, and finally decode the result with EnCodec. This gives us language-independent any-to-any voice conversion, and since we don't need to deal with text data it is easy to train.
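
A rough sketch of that pipeline (all the names below — semantic_encoder, codec, s2a_model, source_wav, target_wav — are hypothetical placeholders for whatever concrete models get plugged in, e.g. a Whisper-based semantic tokenizer, this repo's S->A model and EnCodec; none of them are real APIs from this repo):

    # Hypothetical prompt-based any-to-any voice conversion pipeline.
    semantic_tokens = semantic_encoder(source_wav)     # content only, speaker/expression removed
    acoustic_prompt = codec.encode(target_wav)         # acoustic tokens carrying the target voice
    acoustic_tokens = s2a_model.generate(semantic_tokens, prompt=acoustic_prompt)
    converted_wav = codec.decode(acoustic_tokens)      # target speaker saying the source content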

Another interesting use case is low-resource languages. VALL-E or SPEAR-TTS require lots of data to get perfect results, but if a language only has a small amount of text and speaker data, we can use TTS + voice conversion to achieve voice cloning: use a traditional TTS method (or the Google TTS service, since it supports many languages) to generate speech in the target language, and then pass that generated audio to the VC module to paste the target speaker's voice and expression over it.

rishikksh20 commented 1 year ago

@jpc Suno just open sourced a SPEAR-TTS-like model, Bark: https://github.com/suno-ai/bark/blob/main/model-card.md They trained the model on 6kbps EnCodec.

jpc commented 1 year ago

@rishikksh20 Thanks for the explanation. Yeah, that sounds great and we definitely support that with our S->A model.

One problem I see with this is that I think the semantic tokens have to be trained with speech from the target language, otherwise they may lack important phoneme representations. I have to test that with my current implementation.

Thanks for the suno-ai link. Cool model, and they are a bit ahead of us (and have a better dataset). OTOH we have not scaled our approach yet (I trained 2 layer enc-dec models (4 total) vs. their 12 layer decoder-only models) and we released the full training and data preparation code and not just the final inference model. I love that they verified an idea I also considered – to predict the 6kbps EnCodec tokens from the low-quality token stream (0.75kbps in their case).

rishikksh20 commented 1 year ago

@jpc Yes, the semantic tokenizer needs to be trained in a multi-lingual way, otherwise it will not have good pronunciation.

Also, for the speech codec I found this repo which might be helpful: https://github.com/yangdongchao/AcademiCodec

jpc commented 1 year ago

I trained 3 encoder-only transformers for quality enhancement. (https://github.com/collabora/spear-tts-pytorch/commit/82902a69826e14f819f56f3ef2612ff9f6fd6ec8)

They work perfectly on the training data, I can also fine-tune them on high-quality recordings from LibriTTS and they are even better, but they completely fail to transfer to the domain of tokens generated by the autoregressive S2A model.

I think if you read the recent Google and Microsoft papers carefully they are noticing this train/inference mismatch as well.

I plan to look into the SoundStorm architecture since it is quite similar in principle to how I designed my A2A enhancement models, but since it would be single-stage S2A it should solve the issues I am encountering.

rishikksh20 commented 1 year ago

Hi @jpc ,

I have implemented SoundStorm: https://github.com/rishikksh20/SoundStorm-pytorch . The one place I am struggling is the semantic token part. I looked at HuBERT and VQ-Wav2Vec, but those are trained on 16 kHz data and generate 50 tokens per second, whereas for the acoustic tokens I am using EnCodec, which is trained on 24 kHz and produces 75 tokens per second, so there is a mismatch between them. So I am planning to use your Whisper-based semantic token network. What are your thoughts on that?

jpc commented 1 year ago

@rishikksh20 I would love to collaborate on this part. I have semantic and acoustic tokens extracted for a large part of LibriLight (and whole LibriTTS) that we could use for training.

I also have 50 semantic tokens per second and 75 acoustic timesteps per second. Previously I used 150 acoustic tokens (2 quantization levels serialized), so I just repeated each semantic token 3 times. This is going to be a bit trickier. We can probably get away with inserting a dummy padding token between every 2 semantic tokens to align them.

rishikksh20 commented 1 year ago

Yeah, it sounds a bit complicated to me. I think 50 semantic tokens per second would be fine, but for the acoustic tokens we can use the EnCodec (24 kHz, 240 hop size) from here with 100 tokens/sec; they also provide a trained checkpoint here, and as per the author this EnCodec config sounds better than the original EnCodec and SoundStream. We can just repeat each semantic token twice to achieve 100 tokens/sec.

What are your thoughts on that?
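
For reference, the rates in that config line up like this (a small illustrative snippet, not code from either repo):

    import torch
    # 24 kHz audio with a 240-sample hop -> 24000 / 240 = 100 acoustic frames per second.
    # Semantic tokens arrive at 50/s, so repeating each one twice matches the acoustic rate.
    semantic = torch.randint(0, 1024, (1, 50))         # 1 second of 50 tok/s semantic ids
    upsampled = semantic.repeat_interleave(2, dim=-1)  # shape (1, 100): 100 tok/s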

rishikksh20 commented 1 year ago

@jpc I have completed my implementation of SoundStorm. Now we can directly drop SoundStorm into this repo; the only problem that I think remains is the sampling method.

jpc commented 1 year ago

@rishikksh20 Hey, I was thinking about how we could collaborate on this. Two weeks ago I did try to train SoundStorm but had trouble with the @lucidrains implementation and yours wasn't finished yet.

I have semantic and acoustic tokens (EnCodec, 8 quantizers) for 1300 hours of a single speaker extracted from LibriLight. Do you think it would make sense for you to try and train your SoundStorm implementation on that? I also have an 8k-hour multi-speaker subset and I am working on processing the rest of the LibriLight dataset.

jpc commented 1 year ago

Regarding the sampling – I ran some tests with a simple auto-regressive model and this 3/2 padding method (inserting one padding token between every two source tokens) worked quite well:

        # converts 50 toks/s to 75 toks/s by inserting a padding token between every two tokens
        b, n = Stoks.shape                          # Stoks: (batch, n) semantic token ids
        x = Stoks.reshape(b, n//2, 2)               # group the tokens into pairs
        x = x.repeat_interleave(2, -1)[:,:,:3]      # [a, b] -> [a, a, b]
        x[:,:,1] = 1024                             # overwrite the middle copy with the padding id
        x = x.reshape(b, n//2*3)                    # flatten back: every 2 tokens became 3
        return self.semantic_embedding(x.to(torch.long))
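
To make the shape gymnastics concrete, here is a tiny worked example of that snippet (assuming Stoks is a (batch, n) tensor of semantic token ids and 1024 is the dummy padding id, as in the code above):

    import torch
    Stoks = torch.tensor([[7, 3, 9, 5]])               # (1, 4): two pairs of semantic tokens
    b, n = Stoks.shape
    x = Stoks.reshape(b, n // 2, 2).repeat_interleave(2, -1)[:, :, :3]
    x[:, :, 1] = 1024                                  # padding id between each pair
    print(x.reshape(b, n // 2 * 3))                    # tensor([[7, 1024, 3, 9, 1024, 5]]) -> 4 tokens become 6
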
rishikksh20 commented 1 year ago

Hi @jpc, hope you are doing well! I have completed the implementation of SoundStorm and created a dataloader for 50 toks/s to 100 toks/s. I can also adopt your suggestion for 50 toks/s to 75 toks/s, but I need to test first whether my code is working or not. How can you share your data with me? I can train on my servers.

jpc commented 1 year ago

I was thinking about uploading the dataset to Huggingface tomorrow. Would that work for you?

rishikksh20 commented 1 year ago

yeah sure

rishikksh20 commented 1 year ago

@jpc have you uploaded the data?

jpc commented 1 year ago

@rishikksh20 Hey, I am working on it right now. I got a bit sick last week and also underestimated the time it takes to clean, test and compress/upload the whole thing. I ended up with almost 1TB of uncompressed EnCodec tokens (6kbps).

I should finish it today. Do you have a Discord account or some other app where we could sync?

rishikksh20 commented 1 year ago

My Discord username is rishikksh20

jpc commented 6 months ago

Hey, this was solved by combining several approaches: using the Vocos vocoder, training on data with 4 EnCodec quantizers, and implementing MusicGen-like time-shifting to cut the sequence length.

Overall this made the voice quality very nice.
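
For context, the MusicGen-like time shift mentioned above typically means delaying each quantizer stream by one extra step relative to the previous one, so all quantizer levels can be predicted in a single pass over a sequence of length T + n_q - 1 instead of modeling n_q * T serialized tokens. A hedged sketch of the idea (not the actual WhisperSpeech code):

    import torch

    def delay_pattern(codes: torch.Tensor, pad_id: int) -> torch.Tensor:
        # codes: (n_q, T) EnCodec token ids; quantizer q is shifted right by q steps,
        # so a causal model can predict all n_q levels over T + n_q - 1 timesteps.
        n_q, T = codes.shape
        out = torch.full((n_q, T + n_q - 1), pad_id, dtype=codes.dtype)
        for q in range(n_q):
            out[q, q:q + T] = codes[q]
        return out

    codes = torch.randint(0, 1024, (4, 6))            # 4 quantizers, 6 timesteps
    print(delay_pattern(codes, pad_id=1024).shape)    # torch.Size([4, 9])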