Paulmzr opened this issue 2 months ago
Dear @Paulmzr, hello.
The training itself has not been successful. I subsequently ran a few additional tests on my own and have drawn the following conclusions.
I believe that, at least in my current environment, the NaturalSpeech2 code in this repository cannot be trained properly when combined with Encodec.
The possible causes of this issue could be the following:
Regarding the first and second possible causes mentioned above: given that a certain level of training is possible when using Hifi-codec, I believe they are unlikely to be the main reasons, even if they contribute to the issue.
The increased complexity from the many RVQ stacks could be a potential cause of the problem. In fact, Hifi-codec, which does train, uses only 4 VQ codebooks, and even splits the latent dimension in half into 2 groups with 2 residual stacks each, so its structure is comparatively simple.
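For reference, here is a minimal sketch of what I mean by the extra RVQ depth. It uses the `encodec` pip package; the chosen target bandwidth and the dummy waveform are only illustrative assumptions, not the exact setting I trained with.

```python
# Minimal sketch: count how many residual codebooks Encodec 24K produces at a
# given target bandwidth. Bandwidth and dummy input are illustrative only.
import torch
from encodec import EncodecModel

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(24.0)  # 24 kbps -> 32 codebooks (6 kbps -> 8, 1.5 kbps -> 2)

wav = torch.zeros(1, 1, model.sample_rate)  # 1 second of silence, shape [B, C, T]
with torch.no_grad():
    encoded_frames = model.encode(wav)
codes = torch.cat([codes for codes, _ in encoded_frames], dim=-1)  # [B, n_q, T]
print("Encodec residual codebooks:", codes.shape[1])

# Hifi-codec, by contrast, uses 4 codebooks arranged as 2 groups x 2 residual
# stacks over a split latent dimension, so the target is far less deep.
```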
The need for a high batch size may be linked to this complexity and could also be a cause of the problem. However, it is difficult to verify with the time and GPU resources I have; even with gradient accumulation, I could not fully validate it within those constraints.
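For what it's worth, the accumulation I tried was the standard kind sketched below. This is a minimal, self-contained PyTorch example; the tiny linear model, random data, and `ACCUM_STEPS` value are toy stand-ins, so you would swap in the real NaturalSpeech2 model, dataloader, and loss.

```python
# Minimal gradient-accumulation sketch. Effective batch size =
# (loader batch size) * ACCUM_STEPS. All components here are toy stand-ins.
import torch
import torch.nn as nn

model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
train_loader = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(32)]

ACCUM_STEPS = 8  # illustrative: effective batch = 4 * 8 = 32

optimizer.zero_grad()
for step, (x, y) in enumerate(train_loader):
    loss = nn.functional.mse_loss(model(x), y) / ACCUM_STEPS  # scale so gradients average
    loss.backward()  # gradients accumulate until the next optimizer.step()
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()       # one update per ACCUM_STEPS micro-batches
        optimizer.zero_grad()
```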
If you have any feedback on this matter, I would greatly appreciate it.
Thank you.
@CODEJIN Thank you for your detailed response! I will try to train it and share my findings!
Hi, thanks for your great efforts. I notice that you wrote "Meta's Encodec 24K version was also tested, but it could not be trained." Does that mean that using Meta's Encodec leads to poor performance?