ex3ndr / supervoice

VoiceBox neural network implementation

Architecture of Supervoice #1

Open · rishikksh20 opened this issue 4 months ago

rishikksh20 commented 4 months ago

Hi, I just saw your repo and I'm a bit confused about the architecture and philosophy behind your TTS model. Could you please add a little detail about the architecture? It looks like you are training an LLM for TTS, but you are also training a separate duration model, which seems new, since most large TTS models rely on the autoregressive model itself for duration.

That said, I will go through your code and try to figure it out myself.

ex3ndr commented 4 months ago

Hey, I am mostly reproducing Meta's VoiceBox paper. There is no LLM yet; it is just a transformer that translates phonemes to sound, plus a duration model that predicts the number of audio segments for each phoneme. An LLM might appear later to emit the phonemes and durations.
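For illustration, the duration model in this kind of setup can be very small. Here is a minimal PyTorch sketch in the spirit of VoiceBox/FastSpeech (not the actual supervoice code; all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """Regress a duration (audio-frame count) for each phoneme embedding.

    Minimal sketch, not the actual supervoice implementation. Durations
    are predicted in log space, as is common in the literature.
    """

    def __init__(self, dim: int = 256, kernel: int = 3, dropout: float = 0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.drop = nn.Dropout(dropout)
        self.proj = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_phonemes, dim) phoneme embeddings
        y = self.conv1(x.transpose(1, 2)).transpose(1, 2)
        y = self.drop(self.norm1(torch.relu(y)))
        y = self.conv2(y.transpose(1, 2)).transpose(1, 2)
        y = self.drop(self.norm2(torch.relu(y)))
        return self.proj(y).squeeze(-1)  # log-durations, (batch, num_phonemes)

# Train with MSE against log ground-truth durations; at inference,
# round exp(log_dur) to whole frame counts.
model = DurationPredictor()
log_dur = model(torch.randn(1, 12, 256))
frames = log_dur.exp().round().clamp(min=1)
```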

rishikksh20 commented 4 months ago

I am interested in the duration predictor. FastSpeech's duration predictor is quite naive and cannot model expressive prosody. I would prefer an autoregressive duration predictor with Gaussian upsampling for expressive, natural-sounding speech. Do you have any thoughts on duration prediction, or have you run any experiments on it?
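For readers unfamiliar with it, Gaussian upsampling (from the Non-Attentive Tacotron paper) replaces hard repetition of each phoneme encoding with a soft, Gaussian-weighted mixture centered on each phoneme. A minimal sketch, illustrative only:

```python
import torch

def gaussian_upsample(h: torch.Tensor, dur: torch.Tensor, sigma: float = 1.0):
    """Gaussian upsampling: a soft alternative to hard length regulation.

    h:   (num_phonemes, dim) phoneme encodings
    dur: (num_phonemes,) predicted frame counts
    Returns (total_frames, dim) frame-level features; each output frame
    is a softmax-weighted mix of all phonemes, which smooths boundaries
    compared to simply repeating each phoneme dur[i] times.
    """
    centers = torch.cumsum(dur, dim=0) - 0.5 * dur           # phoneme midpoints (P,)
    total = int(dur.sum().round().item())
    t = torch.arange(total, dtype=torch.float32) + 0.5       # frame midpoints (T,)
    dist2 = (t[:, None] - centers[None, :]) ** 2             # (T, P)
    w = torch.softmax(-dist2 / (2.0 * sigma ** 2), dim=-1)   # attention weights
    return w @ h                                             # (T, dim)

# Usage: 4 phonemes, 8-dim features, durations in frames.
frames = gaussian_upsample(torch.randn(4, 8), torch.tensor([3.0, 5.0, 2.0, 4.0]))
print(frames.shape)  # torch.Size([14, 8])
```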

ex3ndr commented 4 months ago

I also didn't like the duration predictor, but I blame my dataset; perhaps the data is too simple to train on. I feel that some kind of context is needed to properly train the duration network.

rishikksh20 commented 4 months ago

Completely agreed. I think NaturalSpeech 2's duration predictor, which takes a prompt and does cross-attention between prompt features and text features, is one of the good ways to predict duration, since it considers voice and prosody from the prompt and linguistic features from the text. A new paper from Microsoft (https://arxiv.org/pdf/2402.07383.pdf) is also based on a VoiceBox-like architecture with the same duration predictor.
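A rough PyTorch sketch of that idea (my own illustration, not NaturalSpeech 2's actual code): each phoneme queries the acoustic features of a reference prompt, so predicted durations can absorb speaking rate and prosody from the prompt.

```python
import torch
import torch.nn as nn

class PromptedDurationPredictor(nn.Module):
    """Duration predictor conditioned on a voice prompt via cross-attention."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, 1)

    def forward(self, text: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        # text:   (batch, num_phonemes, dim) linguistic features
        # prompt: (batch, prompt_frames, dim) acoustic features of reference audio
        x, _ = self.self_attn(text, text, text)
        x = self.norm1(text + x)
        # Each phoneme attends over the prompt, picking up its speaking
        # rate and prosody.
        y, _ = self.cross_attn(x, prompt, prompt)
        x = self.norm2(x + y)
        return self.proj(x).squeeze(-1)  # log-durations per phoneme

# Usage: 12 phonemes, 50 prompt frames.
model = PromptedDurationPredictor()
log_dur = model(torch.randn(1, 12, 256), torch.randn(1, 50, 256))
print(log_dur.shape)  # torch.Size([1, 12])
```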

ex3ndr commented 4 months ago

Nice paper! It confirms my feeling that these models are the future, though we will want to control more and more features. Honestly, I am playing around with vocoders right now. I have tested Vocos and HiFi-GAN (training both from scratch), and only HiFi-GAN works well for me; I am also trying to upsample from 16kHz to 24kHz in these vocoders. All the papers are confusing, since they claim to outperform HiFi-GAN, but in my tests HiFi-GAN converges reliably and outperforms the other models.

rishikksh20 commented 4 months ago

I have a lot of experience with vocoders; I have implemented roughly all of the good GAN-based vocoders, and HiFi-GAN v1 and UnivNet are the best I have encountered. Another vocoder, Fre-GAN, sometimes performed equal to or better than HiFi-GAN, but it depends on the data. Some vocoders are noise-robust, some generalize better, some perform well with large datasets, some with small ones, and some are good for fine-tuning. Overall, on average, HiFi-GAN v1 and UnivNet are the best; Vocos is good, but only when trained on a high volume of diverse data. I prefer to use this: https://github.com/rishikksh20/iSTFTNet-pytorch, as it converges easily and trains on a small amount of data, though it sometimes produces mid-tone frequency lines which hurt the quality; otherwise it is as good as HiFi-GAN v1 and 2x faster. For your use case, HiFi-GAN v1 will be best.

ex3ndr commented 4 months ago

I just tried Vocos and it turns a crisp voice into a dull one, which is exactly the effect I am trying to avoid. My current goal is to raise the bar for quality, and I think the first low-hanging fruit is to make the voice crisp first, then natural.

Have you tried this one? https://github.com/sony/bigvsan Their demo page is weird, but they kept training it further; I just tested it and it performed really well.

rishikksh20 commented 4 months ago

BigVSAN and BigVGAN are both good, but I'm not sure whether they are crisper, because I have also struggled a lot to find crisp vocoders.

ex3ndr commented 4 months ago

I have tested BigVSAN and I am really impressed. They are also the only team that published weights trained for 10M iterations instead of 1M, so I am using them now, and I have prepared a repo to make them easier to use: https://github.com/ex3ndr/supervoice-vocoder

You can hear how good the quality is:
Source: https://github.com/ex3ndr/supervoice-vocoder/blob/master/sample.wav
Resynthesized: https://github.com/ex3ndr/supervoice-vocoder/blob/master/resynth.wav
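The resynthesis test behind those two files is: waveform → mel spectrogram → vocoder → waveform. A generic sketch of the pipeline (the mel parameters below are common defaults, not taken from the repo, and `DummyVocoder` stands in for the real BigVSAN generator, whose loading code lives in supervoice-vocoder):

```python
import torch
import torchaudio

class DummyVocoder(torch.nn.Module):
    """Stand-in for a real GAN vocoder: maps (B, n_mels, frames) log-mels
    to (B, frames * hop_length) waveform samples."""

    def __init__(self, hop_length: int = 256):
        super().__init__()
        self.hop = hop_length

    def forward(self, log_mel: torch.Tensor) -> torch.Tensor:
        b, _, frames = log_mel.shape
        return torch.zeros(b, frames * self.hop)  # a real vocoder synthesizes audio here

wav, sr = torchaudio.load("sample.wav")
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=256, n_mels=100,
)(wav)
log_mel = torch.log(mel.clamp(min=1e-5))  # log compression, as most vocoders expect

with torch.no_grad():
    resynth = DummyVocoder()(log_mel)
torchaudio.save("resynth.wav", resynth, sr)
```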

rishikksh20 commented 4 months ago

https://github.com/ex3ndr/supervoice/blob/1bb4a32f0628afd57e909257bb0be29362c9fdc2/supervoice/model.py#L24 Please update this to use your new vocoder, as the model_vocoder file is not in the repo.

rishikksh20 commented 4 months ago

https://arxiv.org/pdf/2402.12208.pdf

rishikksh20 commented 4 months ago

Hi @ex3ndr, I checked your latest commit on the duration predictor. Have you trained the duration predictor?

ex3ndr commented 4 months ago

It's in progress here: https://github.com/ex3ndr/supervoice-gpt It is a phonemizer + duration model in one.

rishikksh20 commented 4 months ago

I am also planning to implement the same.

rishikksh20 commented 4 months ago

Are you treating phoneme duration as a classification task? Phoneme durations are discrete values, not continuous, and more or less they range from 0 to 50 at most.

ex3ndr commented 4 months ago

I treat them as normal tokens, with durations from 0 to 100.

For some reason the output feels too fast somehow; I don't understand why. Do you have a similar experience?
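To make the "duration as a normal token" idea concrete, here is a toy sketch (the real supervoice-gpt is a full transformer; only the head and the sampling step are shown, and all sizes are illustrative):

```python
import torch
import torch.nn as nn

MAX_DUR = 100  # durations clamped to the 0-100 token range mentioned above

def durations_to_tokens(frames: torch.Tensor) -> torch.Tensor:
    return frames.round().long().clamp(0, MAX_DUR)

# Classification head over duration classes on top of per-phoneme states.
head = nn.Linear(256, MAX_DUR + 1)
hidden = torch.randn(1, 12, 256)              # per-phoneme hidden states
logits = head(hidden)                         # (1, 12, 101)

# Training: plain cross-entropy against the quantized ground truth.
targets = durations_to_tokens(torch.rand(1, 12) * 60)
loss = nn.functional.cross_entropy(logits.transpose(1, 2), targets)

# Inference: *sample* a duration per phoneme instead of taking the argmax,
# which is the predict-vs-sample distinction raised later in the thread.
probs = torch.softmax(logits, dim=-1)
sampled = torch.multinomial(probs.view(-1, MAX_DUR + 1), 1).view(1, 12)
```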

rishikksh20 commented 4 months ago

If you use a standard token and predict that token, you are treating it as a classification task, which I also support. It should be fast, because I don't think it is a complicated task for the model. My thought is that we should pass a voice prompt along with the text for prosody modeling, because duration is part of prosody.

ex3ndr commented 4 months ago

No, I mean the phonemes themselves feel too fast (too short) compared to human-generated ones. I feel that something is missing here.

rishikksh20 commented 4 months ago

Yes, when you predict duration using a duration predictor it almost always comes out fast; only in some cases does it come out normal. One way to tackle this problem is to use an MoE-based duration predictor, as in this paper: https://arxiv.org/pdf/2107.02530.pdf
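A generic sketch of what an MoE duration head can look like (my own illustration, not the model from the linked paper): several small expert regressors plus a gating network that mixes their predictions per phoneme, so different experts can specialize in different prosodic regimes.

```python
import torch
import torch.nn as nn

class MoEDurationHead(nn.Module):
    """Mixture-of-experts regression head over per-phoneme features."""

    def __init__(self, dim: int = 256, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_phonemes, dim)
        gates = torch.softmax(self.gate(x), dim=-1)              # (B, P, E)
        preds = torch.cat([e(x) for e in self.experts], dim=-1)  # (B, P, E)
        return (gates * preds).sum(-1)                           # log-durations

log_dur = MoEDurationHead()(torch.randn(1, 12, 256))
print(log_dur.shape)  # torch.Size([1, 12])
```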

ex3ndr commented 4 months ago

Interesting, but I am not convinced: 1) the GPT learns the full distribution, not only the optimal one, 2) the GPT samples durations rather than predicting them, 3) the GPT also inserts durations between words, which are sampled as well.

It is just weirdly fast: I multiply the durations by 1.1 to 1.2 and it works better, which is doubly weird, because the audio model is trained on 12.5ms tokens but the GPT on 10ms ones, which means the GPT already emits longer durations.
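To make the frame-rate mismatch concrete (my arithmetic, using the 10ms and 12.5ms figures above):

```python
# Converting GPT durations (10 ms frames) to audio-model frames (12.5 ms).
GPT_FRAME_MS = 10.0
AUDIO_FRAME_MS = 12.5

def gpt_to_audio_frames(n_gpt: int) -> int:
    return round(n_gpt * GPT_FRAME_MS / AUDIO_FRAME_MS)

# 50 GPT tokens = 500 ms of speech = 40 audio frames. Reusing the raw
# count (50) in the audio model would stretch it to 625 ms, i.e. make
# speech *slower*, so the mismatch cannot explain output that sounds too
# fast; hence the ad-hoc 1.1-1.2x multiplier mentioned above.
print(gpt_to_audio_frames(50))  # 40
```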

I might need to avoid generating the two sequences in parallel and instead alternate between duration and phoneme prediction, so that each duration depends on its phoneme...
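A toy sketch of the interleaving idea (token names are illustrative, not the supervoice-gpt vocabulary):

```python
# Interleaving instead of two parallel streams: each duration token is
# emitted immediately after, and conditioned on, its phoneme token.
phonemes = ["HH", "AH", "L", "OW"]
durations = [4, 6, 5, 9]  # frame counts

interleaved = []
for p, d in zip(phonemes, durations):
    interleaved += [f"PH_{p}", f"DUR_{d}"]
print(interleaved)
# ['PH_HH', 'DUR_4', 'PH_AH', 'DUR_6', 'PH_L', 'DUR_5', 'PH_OW', 'DUR_9']
```

With this factorization the model learns p(duration_i | phoneme_i, history) directly, instead of predicting the two streams side by side.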

rishikksh20 commented 4 months ago

> I might need to avoid generating the two sequences in parallel and instead alternate between duration and phoneme prediction, so that each duration depends on its phoneme...

Yes.

rishikksh20 commented 3 months ago

@ex3ndr The samples sound decent 👍🏽

rishikksh20 commented 3 months ago

Some initial feedback:

Otherwise, the voice sounds exactly like a human, with a very natural flow; amazing job 👍🏽. Maybe training a bigger model with a greater variety of data will help overcome the above issue.