Paulmzr opened this issue 2 months ago
Dear @Paulmzr, hello.
The training itself has not been successful. I subsequently ran a few additional tests on my own and have drawn the following conclusions.
I believe that, at least in my current environment, the NaturalSpeech2 code in this repository cannot be trained properly when combined with Encodec.
The possible causes of this issue could be the following:
Regarding the first and second possible causes mentioned above: given that a certain level of training is possible when using Hifi-codec, I believe they are unlikely to be the main reasons, even if they contribute to the issue.
The increased complexity from the many RVQ stacks could be a potential cause of the problem. In fact, Hifi-codec, which does train, uses only 4 VQ codebooks, and even splits the latent dimension in half into 2 groups with 2 residual stacks each, so its structure is comparatively simple.
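For reference, here is a minimal sketch of what I mean by the extra RVQ depth. It uses the `encodec` pip package; the chosen target bandwidth and the dummy waveform are only illustrative assumptions, not the exact setting I trained with.

```python
# Minimal sketch: count how many residual codebooks Encodec 24K produces at a
# given target bandwidth. Bandwidth and dummy input are illustrative only.
import torch
from encodec import EncodecModel

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(24.0)  # 24 kbps -> 32 codebooks (6 kbps -> 8, 1.5 kbps -> 2)

wav = torch.zeros(1, 1, model.sample_rate)  # 1 second of silence, shape [B, C, T]
with torch.no_grad():
    encoded_frames = model.encode(wav)
codes = torch.cat([codes for codes, _ in encoded_frames], dim=-1)  # [B, n_q, T]
print("Encodec residual codebooks:", codes.shape[1])

# Hifi-codec, by contrast, uses 4 codebooks arranged as 2 groups x 2 residual
# stacks over a split latent dimension, so the target is far less deep.
```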
The need for a high batch size may be linked to this complexity and could also be a cause of the problem. However, it is difficult to verify with the time and GPU resources I have; even with gradient accumulation, I could not fully validate it within those constraints.
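For what it's worth, the accumulation I tried was the standard kind sketched below. This is a minimal, self-contained PyTorch example; the tiny linear model, random data, and `ACCUM_STEPS` value are toy stand-ins, so you would swap in the real NaturalSpeech2 model, dataloader, and loss.

```python
# Minimal gradient-accumulation sketch. Effective batch size =
# (loader batch size) * ACCUM_STEPS. All components here are toy stand-ins.
import torch
import torch.nn as nn

model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
train_loader = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(32)]

ACCUM_STEPS = 8  # illustrative: effective batch = 4 * 8 = 32

optimizer.zero_grad()
for step, (x, y) in enumerate(train_loader):
    loss = nn.functional.mse_loss(model(x), y) / ACCUM_STEPS  # scale so gradients average
    loss.backward()  # gradients accumulate until the next optimizer.step()
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()       # one update per ACCUM_STEPS micro-batches
        optimizer.zero_grad()
```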
If you have any feedback on this matter, I would greatly appreciate it.
Thank you.
@CODEJIN Thank you for your detailed response! I will try to train it and share my findings!
Hi, thanks for your great efforts. I notice that you wrote "Meta's Encodec 24K version was also tested, but it could not be trained." Does that mean that using Meta's Encodec leads to poor performance?