CODEJIN / NaturalSpeech2

MIT License

Question about Codec #12

Open Paulmzr opened 2 months ago

Paulmzr commented 2 months ago

Hi, thanks for your great efforts. I noticed that you wrote "Meta's Encodec 24K version was also tested, but it could not be trained.". Does that mean that using Meta's Encodec leads to poor performance?

CODEJIN commented 2 months ago

Dear @Paulmzr,

Hello,

The training itself did not succeed. Afterward, I ran a few tests of my own and arrived at the following personal conclusions.

  1. In my current environment, I think the NaturalSpeech2 code in this repository does not train properly when combined with Encodec.

  2. The possible causes of this issue are the following:

    • The code in this repository is incomplete.
    • A codec trained on a much wider range of external audio produces latents that are too complex for diffusion to handle.
    • As the number of RVQ stacks increases, the final latent becomes more complex, making it difficult for diffusion to handle.
    • Learning the relationship between text and codec latent does not converge without a very large batch size.

  3. Regarding the first two possible causes above (incomplete code and a codec trained on a wider range of audio): since a certain level of training is possible with Hifi-codec, I believe they are unlikely to be the main reasons, even if they contribute to the issue.

  4. The increase in complexity due to many RVQ stacks could be a cause of the problem. Hifi-codec, which does train, uses only 4 VQs, and its structure is simple: the dimension is split in half, with 2 stacks for each half.

  5. The need for a large batch size may be linked to this complexity and could also be a cause of the problem. However, I cannot verify this with the time and GPU resources I have. Even with gradient accumulation, a full validation would take too long.
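
For reference, "accumulation" in item 5 means summing gradients over several small micro-batches before taking a single optimizer step, so that the step approximates one taken on a large batch. Below is a minimal sketch in plain Python, assuming a toy one-parameter linear model with a mean-squared-error loss. The function names, the data, and the model are all hypothetical and are not taken from this repository.

```python
# Toy demonstration of gradient accumulation.
# Illustrative only: the model, data, and names are assumptions,
# not code from this repository.
# Model: y_hat = w * x, loss = mean squared error over the batch.

def mse_grad(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Accumulate micro-batch gradients into a full-batch gradient."""
    total = 0.0
    for start in range(0, len(batch), micro_batch_size):
        micro = batch[start:start + micro_batch_size]
        # Weight each micro-batch by its share of the full batch so that
        # the accumulated sum equals the full-batch mean gradient.
        total += mse_grad(w, micro) * len(micro) / len(batch)
    return total


# 8 hypothetical (x, y) samples drawn from y = 3x + 1.
data = [(float(x), 3.0 * x + 1.0) for x in range(1, 9)]
w = 0.5

full = mse_grad(w, data)                               # one large batch
accum = accumulated_grad(w, data, micro_batch_size=2)  # four micro-batches

print(abs(full - accum) < 1e-9)
```

The accumulated gradient matches the full-batch gradient here, but every optimizer step still costs one forward and backward pass per micro-batch, so accumulation saves memory rather than time. Layers that depend on batch statistics, such as batch normalization, also behave differently on micro-batches than on the full batch.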

If you have any feedback on this matter, I would greatly appreciate it.

Thank you.

Paulmzr commented 2 months ago

@CODEJIN Thank you for your detailed response! I will try to train it and share my findings!