NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

How to train a spectral codec needed for an AR TTS model? #10254

Closed: JohnHerry closed this issue 2 weeks ago

JohnHerry commented 1 month ago

Hi, I have read the paper "Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis". The paper uses HiFi-GAN + FSQ as the mel-spectrogram codec, with 8 codebooks of codebook size 1000 each. I think this cannot be used directly as the codec in an autoregressive LLM-based TTS; we would not like to build 8 AR models to fit that codec. So is there an improved design for AR LLM-based TTS? What would be a better config?
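For reference, here is a minimal sketch of how an FSQ codebook of size 1000 can arise, assuming the commonly used per-dimension level layout [8, 5, 5, 5] (8 x 5 x 5 x 5 = 1000). This is an illustration only, not NeMo's actual implementation, and it omits the straight-through gradient trick used during training:

```python
# Minimal FSQ sketch (assumed levels [8, 5, 5, 5]; not NeMo's implementation).
# Each of the 8 codebooks would be one such FSQ group over a 4-dim latent slice.
import torch

LEVELS = torch.tensor([8.0, 5.0, 5.0, 5.0])  # assumed quantization levels per dim

def fsq_quantize(z: torch.Tensor):
    """z: (..., 4) continuous latents -> (dequantized latents, code in [0, 1000))."""
    # Squash each dim into [0, L_i - 1] and round to the nearest level.
    digits = torch.round((torch.tanh(z) + 1) / 2 * (LEVELS - 1))
    # Mixed-radix encoding: pack the 4 per-dim digits into one integer code.
    bases = torch.cumprod(torch.cat([torch.ones(1), LEVELS[:-1]]), dim=0)
    codes = (digits * bases).sum(dim=-1).long()
    # Map the digits back to [-1, 1] as the decoder-side representation.
    dequant = digits / (LEVELS - 1) * 2 - 1
    return dequant, codes
```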

rlangman commented 3 weeks ago

Hey, sorry for the slow response. If I understand your question, you can have 1 AR model predict all 8 codebooks at the same time (e.g., feed the final output of your network into 8 independent softmax functions). You do not need 8 separate AR models.
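A minimal sketch of that suggestion, with hypothetical dimensions (the 8 codebooks and the codebook size 1000 come from the codec above; the hidden width is made up):

```python
# One AR backbone whose final hidden state feeds 8 independent classification
# heads, one per codebook, so a single model predicts all 8 codes at each step.
import torch
import torch.nn as nn

NUM_CODEBOOKS = 8      # from the spectral codec described above
CODEBOOK_SIZE = 1000   # entries per codebook
HIDDEN = 512           # hypothetical backbone width

class MultiHeadCodePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        # One linear head (softmax applied at sampling time) per codebook.
        self.heads = nn.ModuleList(
            nn.Linear(HIDDEN, CODEBOOK_SIZE) for _ in range(NUM_CODEBOOKS)
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, HIDDEN) from the AR backbone.
        # Returns logits of shape (batch, seq, NUM_CODEBOOKS, CODEBOOK_SIZE).
        return torch.stack([head(hidden) for head in self.heads], dim=2)

# Usage: sample one code per codebook independently at each decoding step.
logits = MultiHeadCodePredictor()(torch.randn(1, 1, HIDDEN))
codes = torch.distributions.Categorical(logits=logits).sample()  # (1, 1, 8)
```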

JohnHerry commented 3 weeks ago

Hey, sorry for the slow response. If I understand your question, you can have 1 AR model predict all 8 codebooks at the same time (e.g., feed the final output of your network into 8 independent softmax functions). You do not need 8 separate AR models.

Yes, we can build an AR model with 8 prediction heads that predicts the 8 codes in a single step, but that is sometimes inconvenient. Nearly all AR LLM TTS systems suffer from inherent defects such as repetition, deletion, and other sampling problems; some work, like VALL-E 2, checks for repetition during synthesis. When there is only one prediction output sequence that is feasible, but when there are 8 output sequences things become troublesome.
Also, in single-sequence mode an AR LLM uses a special token to mark the end of the predicted sequence. What is the sequence-end token when there are 8 output sequences? Is it when all 8 'next tokens' are the end token, or is any one of the 8 'next tokens' being an end token enough?

rlangman commented 3 weeks ago

We have a few other works (and code/PRs in NeMo) that cover how exactly to do AR LLM TTS with different codecs:

https://arxiv.org/abs/2406.17957
https://arxiv.org/abs/2409.12117

To be clear, there are not 8 output sequences. There is 1 output sequence, where each element in the sequence has 8 values.

So far as I am aware, every audio codec today has more than 1 codebook/output sequence. What may be confusing is that most LLMs, given an audio codec with N codebooks, typically treat codebook 1 as special and predict it using one inference stream (where they might inject an end of speech token), while predicting codebooks 2 through N using different inference streams. Any algorithm you use with other codecs would work with the spectral codec (or with any of our other audio codecs in NeMo).

The only difference is that other codecs, which use RVQ codebooks, have the requirement that predicting codebook M at timestep T requires conditioning on the predictions for codebooks 1 through (M - 1). With our FSQ codebooks, you no longer need to condition on codebooks 1 through (M - 1), simplifying the inference.

In other words, RVQ based codecs typically require up to N inferences per timestep while FSQ codec only requires 1 inference per timestep. Though I guess this does mean where you choose to inject the end of speech token is arbitrary.
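As a sketch of that difference (illustrative only; `rvq_head` and `fsq_heads` are hypothetical callables, not NeMo APIs):

```python
# Contrasting the two per-timestep decoding loops described above.
import torch

def predict_rvq_step(hidden: torch.Tensor, rvq_head, num_codebooks: int = 8):
    """RVQ-style: codebook M is conditioned on codebooks 1..(M-1), so one
    timestep costs up to `num_codebooks` sequential inferences."""
    codes = []
    for m in range(num_codebooks):
        logits = rvq_head(hidden, previous_codes=codes)  # depends on earlier codes
        codes.append(int(logits.argmax(-1)))
    return codes

def predict_fsq_step(hidden: torch.Tensor, fsq_heads):
    """FSQ-style: all codebooks are independent given the hidden state, so one
    timestep costs a single forward pass (the heads can be batched in parallel)."""
    return [int(head(hidden).argmax(-1)) for head in fsq_heads]
```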

Does that make sense, or am I misunderstanding part of the question?

JohnHerry commented 2 weeks ago

Thank you for the kind help! As far as I know, there are mainly two types of AR LLMs for TTS. The first is VALL-E-like: it uses a codec with many codebooks but predicts only one of them autoregressively; VALL-E predicts the first codebook sequence with an AR Transformer and the remaining code sequences with a separate NAR Transformer. The second is TorToiSe-TTS-like: it uses a single-codebook VQ-VAE codec to encode the audio and a GPT-like LLM to predict the single code sequence. My question above was about the second type. Still, the fact that an AR LLM equipped with the spectral codec can predict all 8 codes in parallel at each step, whereas RVQ codec codes may be time-dependent, is real progress.
We prefer a single-codebook codec in production, and not only because of the end-of-sequence token prediction problem. One sequence in, one sequence out is the most natural NLP LLM mode, so experience from NLP can be carried over directly to TTS AR LLMs. For example, to evaluate the performance of an AR LLM we can directly reuse NLP metrics such as top-5 or top-10 hit rate with a single-codebook TTS AR LLM. But when each timestep of the output sequence contains 8 values, things are different: requiring all 8 label/prediction pairs to match is harder than comparing a single pair, so intuitively the two kinds of AR LLM do not start from a fair baseline. That is only one example, about model evaluation; I think there may be other cases where a single-value token sequence works better.
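As an illustration of that evaluation gap, here is a small sketch (shapes and names are hypothetical) of top-k hit rate for a single-codebook token versus requiring all 8 codebook values to hit at once:

```python
# Top-k "hit rate": single-codebook vs. 8-codebook predictions.
import torch

def topk_hit_rate_single(logits, labels, k=5):
    # logits: (N, V), labels: (N,) -> fraction of steps whose label is in the top-k.
    topk = logits.topk(k, dim=-1).indices                 # (N, k)
    return (topk == labels.unsqueeze(-1)).any(-1).float().mean()

def topk_hit_rate_multi(logits, labels, k=5):
    # logits: (N, 8, V), labels: (N, 8) -> a step counts as a hit only if
    # every one of its 8 codebook labels lands in that head's top-k.
    topk = logits.topk(k, dim=-1).indices                 # (N, 8, k)
    hits = (topk == labels.unsqueeze(-1)).any(-1)         # (N, 8)
    return hits.all(-1).float().mean()
```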
There are also some RVQ codec papers arguing that we can prefer the codebook of the last layer when training the LLM, because the residual structure helps the last VQ layer learn more while keeping information from all the earlier RVQ layers. I am not sure about that.
Single-codebook codecs may be too compressed to recover the speech (or spectrogram) signal; yes, that is a really hard problem.
Thank you for the good work and for your kind help.