rishikksh20 opened 4 months ago
Hey, I am mostly reproducing the VoiceBox paper from Meta. I don't have an LLM yet; it is just a transformer that translates phonemes to sound, plus a duration model that predicts the number of audio segments per phoneme. An LLM might appear later to emit phonemes and durations.
I am interested in the duration predictor. FastSpeech's duration predictor is quite naive and not able to model expressive prosody. I would prefer an autoregressive duration predictor with Gaussian upsampling for expressive, natural-sounding speech. Do you have any thoughts on duration prediction, or have you done any experimentation on it?
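For readers unfamiliar with Gaussian upsampling (as used in Non-Attentive Tacotron): instead of hard-repeating each phoneme encoding for its duration, every output frame is a soft mixture of phoneme encodings, weighted by a Gaussian centred at each phoneme's midpoint. A minimal numpy sketch, with illustrative names and a fixed sigma:

```python
import numpy as np

def gaussian_upsample(encodings, durations, sigma=1.0):
    """Spread each phoneme encoding over output frames with Gaussian
    weights centred at the phoneme's temporal midpoint."""
    durations = np.asarray(durations, dtype=np.float64)
    total = int(durations.sum())                      # total output frames
    ends = np.cumsum(durations)
    centers = ends - durations / 2.0                  # midpoint of each phoneme
    t = np.arange(total, dtype=np.float64) + 0.5      # frame positions
    # (frames, phonemes) Gaussian weight matrix, normalised per frame
    logits = -((t[:, None] - centers[None, :]) ** 2) / (2.0 * sigma ** 2)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ encodings                        # (frames, dim)

# Three phonemes with durations 2, 3, 1 -> 6 frames
enc = np.eye(3)
frames = gaussian_upsample(enc, [2, 3, 1])
print(frames.shape)  # (6, 3)
```

Because the weights are soft, gradients flow through the durations, which is what makes this friendlier to expressive prosody than hard repetition.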
I also didn't like the duration predictor, but I am blaming my dataset, or the data being too simple to train on. I feel that some kind of context is needed to properly train the duration network.
Completely agreed. I think NaturalSpeech 2's duration predictor, which takes a prompt and does cross-attention between the prompt features and the text features, is one of the good ways to predict duration, as it considers the input voice and prosody from the prompt and the linguistic features from the text. A new paper from Microsoft (https://arxiv.org/pdf/2402.07383.pdf) is also based on a Voicebox-like architecture with the same duration predictor.
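The idea can be sketched in a few lines: each text/phoneme feature cross-attends to the speech-prompt features, so the predicted durations pick up the prompt's speaking rate and prosody. This is a single-head, weight-as-you-like illustration (all names, shapes, and the linear output head are assumptions, not NaturalSpeech 2's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, key, value):
    """Single-head scaled dot-product attention: query rows attend to key/value rows."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores, axis=-1) @ value

def predict_log_durations(text_feats, prompt_feats, w_out):
    """Text features (T_text, D) cross-attend to prompt features (T_prompt, D);
    a linear head w_out (D,) maps the fused features to per-phoneme log-durations."""
    fused = text_feats + cross_attend(text_feats, prompt_feats, prompt_feats)
    return fused @ w_out  # (T_text,)

rng = np.random.default_rng(0)
text = rng.normal(size=(12, 64))
prompt = rng.normal(size=(50, 64))
log_d = predict_log_durations(text, prompt, rng.normal(size=64))
print(log_d.shape)  # (12,)
```

The key design point is the query/key split: text is the query, the prompt is key/value, so durations are conditioned on the reference voice without the prompt leaking its phoneme sequence.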
Nice paper! It confirms my feeling that these models are the future, though we want to adjust more and more features. Honestly, I am playing around with vocoders right now. I have tested Vocos and HiFi-GAN (training both from scratch), and only HiFi-GAN works well for me; I am also trying to upsample from 16 kHz to 24 kHz in such vocoders. All the papers are confusing, since they claim to outperform HiFi-GAN, but in my tests HiFi-GAN converges reliably and outperforms the other models.
I have an extreme level of expertise in vocoders; I have implemented approximately all the good GAN-based vocoders, and HiFi-GAN v1 and UnivNet are the best I have ever encountered. Another vocoder, Fre-GAN, also performed equal to or sometimes better than HiFi-GAN, but it depends on the data. Some vocoders are noise-robust, some generalize better, some perform well with large data, some with small data, and some are good for fine-tuning. Overall, on average, HiFi-GAN v1 and UnivNet are the best; Vocos is good, but only when trained on a high volume of diverse data. I prefer to use this: https://github.com/rishikksh20/iSTFTNet-pytorch, as it converges easily and trains on a small amount of data, but it sometimes produces mid-tone frequency lines which hurt the quality; otherwise it's as good as HiFi-GAN v1 and 2x faster. For your use case, HiFi-GAN v1 will be best.
I just tried Vocos, and it turns a crisp voice into a dull one. This is exactly the effect I am trying to avoid. My current goal is to raise the bar for quality, and I think the first low-hanging fruit is to make the voice crisp first, then natural.
Have you tried this one? https://github.com/sony/bigvsan Their demo page is weird, but they trained it further, and I just tested it and it performed really well.
BigVSAN and BigVGAN are both good, but I'm not sure whether they are crisper, because I have also struggled a lot to find crisp vocoders.
I have tested BigVSAN and I am really impressed. They are also the only team that published weights trained for 10M iterations instead of 1M, so I am using them now, and I have prepared a nice repo to make it easier to use: https://github.com/ex3ndr/supervoice-vocoder
You can hear how nice its quality is:
Source: https://github.com/ex3ndr/supervoice-vocoder/blob/master/sample.wav
Resynthesis: https://github.com/ex3ndr/supervoice-vocoder/blob/master/resynth.wav
https://github.com/ex3ndr/supervoice/blob/1bb4a32f0628afd57e909257bb0be29362c9fdc2/supervoice/model.py#L24
Update this to use your vocoder as `model_vocoder`.
The file is not there.
Hi @ex3ndr, I checked your latest commit on the duration predictor. Have you trained the duration predictor?
It's in progress here: https://github.com/ex3ndr/supervoice-gpt It is a phonemizer + duration model in one.
I am also planning to implement the same.
Are you treating phoneme duration as a classification task, since phoneme duration is a discrete value, not a continuous one, and more or less ranges between 0 and 50 at max?
I treat them as normal tokens, with durations 0-100.
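In this scheme the duration vocabulary is just the frame counts themselves, clamped to the range the model was trained on, and the GPT predicts them with an ordinary classification head. A minimal sketch, assuming a 0-100 range and a hypothetical `duration_to_token` mapping:

```python
# Treating durations as plain tokens: clamp frame counts into a fixed
# vocabulary range so the model can classify them like any other token.
MAX_DURATION = 100

def duration_to_token(frames: int) -> int:
    # Out-of-range counts are clamped rather than dropped.
    return min(max(frames, 0), MAX_DURATION)

def token_to_duration(token: int) -> int:
    return token  # tokens are the frame counts themselves

durations = [3, 7, 120, -1]
tokens = [duration_to_token(d) for d in durations]
print(tokens)  # [3, 7, 100, 0]
```

One trade-off of clamping is that very long phonemes (pauses, sustained vowels) all collapse into the top bucket, which may matter for expressive speech.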
For some reason it feels too fast somehow; I don't understand why. Do you have a similar experience?
If you use a standard token and predict that token, you treat it as a classification task, which I also support. It should be fast because I don't think it is a complicated task for the model. I have a thought that we could pass a voice prompt along with the text for prosody modeling, because duration is part of prosody.
No, I mean the phonemes feel too fast (short) compared to human-generated ones. I feel that something is missing here.
Yes, when you predict duration using a duration predictor it almost always comes out fast; only in some cases does it come out normal. One way to tackle this problem is to use an MoE-based duration predictor like in this paper: https://arxiv.org/pdf/2107.02530.pdf
Interesting, but I am not convinced:
1) The GPT learns the full distribution, not only the optimal one.
2) The GPT samples durations rather than predicting a single value.
3) The GPT also inserts durations between words, which are sampled as well.
It is just weirdly fast; I multiply by 1.1 to 1.2 and it works better, which is doubly weird because the audio model is trained on 12.5 ms tokens but the GPT on 10 ms ones, meaning the durations the GPT emits should already render longer.
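The frame-rate mismatch is worth making explicit: a count of 10 ms GPT frames fed directly to a 12.5 ms audio model stretches durations by 1.25x, so the proper fix is a 10/12.5 = 0.8 rescale before any tempo multiplier. A small sketch (constants are the ones mentioned above; the function name is my own):

```python
# Converting duration counts between frame rates: the GPT counts 10 ms
# frames, the audio model consumes 12.5 ms frames, so counts must be
# rescaled by 10/12.5 = 0.8 before an optional tempo multiplier.
GPT_FRAME_MS = 10.0
AUDIO_FRAME_MS = 12.5

def convert_duration(gpt_frames: int, tempo: float = 1.0) -> int:
    audio_frames = gpt_frames * GPT_FRAME_MS / AUDIO_FRAME_MS
    return round(audio_frames * tempo)

print(convert_duration(10))        # 8  (both spans cover 100 ms)
print(convert_duration(10, 1.2))   # 10 (slowed down by ~20%)
```

If the 0.8 rescale is skipped, the audio already comes out 1.25x slower, which is why an additional 1.1-1.2x multiplier sounding better is surprising.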
I might need to avoid two parallel sequences and instead alternate between duration and phoneme prediction, to make the duration dependent on the phoneme...
Yes.
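The alternating scheme discussed above would interleave phoneme and duration tokens in a single stream, so each duration prediction conditions on the phoneme it belongs to. A minimal sketch of the sequence layout (the tuple encoding and function name are illustrative, not from the repo):

```python
# Interleave phonemes and durations into one token stream:
# p0, d0, p1, d1, ... so the model sees the phoneme before its duration.
def interleave(phonemes, durations):
    assert len(phonemes) == len(durations)
    seq = []
    for p, d in zip(phonemes, durations):
        seq.append(("phoneme", p))
        seq.append(("duration", d))
    return seq

stream = interleave(["h", "ə", "l", "oʊ"], [3, 2, 4, 6])
print(stream[:2])  # [('phoneme', 'h'), ('duration', 3)]
```

Compared to two parallel sequences, this doubles the sequence length but gives the duration head direct autoregressive access to the phoneme identity.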
@ex3ndr The samples sound decent 👍🏽
Some initial feedback:

- For example, it takes a long pause between "open" and "source" while pronouncing "open-source"; similarly for ".HTML", "CEO", etc.

Otherwise, the voice sounds exactly like a human with a very natural flow, amazing job 👍🏽. Maybe training a bigger model with a greater variety of data will help overcome the above issue.
Hi, I just saw your repo, and I am a bit confused about the architecture and philosophy behind your TTS model. Could you please add a little about your architecture? You are training an LLM for TTS, but you also train a separate model for duration, which seems new, as most large-model TTS systems rely on the autoregressive model itself for duration.
Although, I will go through your code and try to figure it out myself.