CODEJIN / NaturalSpeech2

MIT License
140 stars · 15 forks

Question about loss ce_rvq #4

Open Autonomof opened 1 year ago

Autonomof commented 1 year ago

Hello, when I try to train the model, I find that the loss ce_rvq just does not decrease. The inference results are not as good as expected: the synthesized sound is blurry, and I want to know whether this is related to the ce_rvq loss.

Autonomof commented 1 year ago

Hello, I noticed that you randomly select 4 of the 32 quantizers to compute the L2 distance in order to save memory. However, I believe the quantizers at different layers are not equally important: the earlier quantizers in the RVQ stack matter more. For example, the first quantizer is much more important than the 32nd. Therefore, randomly selecting four quantizers at each step may hurt the loss.

CODEJIN commented 1 year ago

Dear @Autonomof

That's an interesting point! So, in your opinion, should the model conduct CE-RVQ for all 32 quantizers if possible, or, if that's not feasible, should it prioritize CE-RVQ for the initial quantizers rather than selecting them randomly? Please let me know.

Best regards,

Heejo

Autonomof commented 1 year ago

Dear @CODEJIN

Yes, I think so. I have been training the model with only the first four quantizers; I haven't been running it for long, but so far the losses converge slightly better and faster. Ideally, I think using all 32 quantizer layers would give the best results, but with limited resources, using all 32 at once runs out of memory. I wonder if it is possible to split the 32 layers into several passes, for example computing 4 quantizers at a time for 8 passes and summing the 8 partial losses, which trades time for GPU memory (a rough sketch of this idea follows below). Do you have any ideas?
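For what it's worth, here is a minimal PyTorch sketch of the chunked idea, not the repository's actual implementation: the tensor shapes, the use of gradient checkpointing, and the function names are all assumptions. Checkpointing recomputes each chunk's large distance tensors during backward instead of storing them, which is what actually trades time for memory.

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def ce_rvq_chunk(residual, codebooks_chunk, codes_chunk):
    """Cross-entropy over one group of quantizer layers.
    residual:        [batch, latent_dim, time]   (predicted latent minus earlier codewords)
    codebooks_chunk: [chunk, codebook_size, latent_dim]
    codes_chunk:     [batch, chunk, time]        (ground-truth RVQ indices, long)
    Returns (summed loss, residual after removing this chunk's codewords)."""
    loss = residual.new_zeros(())
    for q in range(codebooks_chunk.size(0)):
        # Squared L2 distance to every codeword; closer codeword -> higher logit.
        distances = torch.cdist(
            residual.transpose(1, 2),            # [batch, time, latent_dim]
            codebooks_chunk[q].unsqueeze(0),     # [1, codebook_size, latent_dim]
        ).transpose(1, 2).pow(2)                 # [batch, codebook_size, time]
        loss = loss + F.cross_entropy(-distances, codes_chunk[:, q])
        # Subtract the ground-truth codeword to form the next layer's residual.
        residual = residual - codebooks_chunk[q][codes_chunk[:, q]].transpose(1, 2)
    return loss, residual

def chunked_ce_rvq_loss(predicted_latents, codebooks, target_codes, chunk_size=4):
    """Compute CE-RVQ over all quantizers, `chunk_size` layers at a time."""
    num_quantizers = codebooks.size(0)
    total_loss = predicted_latents.new_zeros(())
    residual = predicted_latents
    for start in range(0, num_quantizers, chunk_size):
        end = min(start + chunk_size, num_quantizers)
        loss, residual = checkpoint(
            ce_rvq_chunk, residual, codebooks[start:end], target_codes[:, start:end],
            use_reentrant=False,
        )
        total_loss = total_loss + loss
    return total_loss / num_quantizers
```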

Best regards,

QAB

CODEJIN commented 1 year ago

Dear @Autonomof,

Taking your suggestion into consideration, I am currently training both versions: one that randomly selects 4 quantizers (Exp2004 in the figure below) and one that trains all 32 quantizers (Exp2007 in the figure below). The random-4 version uses a batch size of 8, while the all-32 version uses a batch size of 4 because of the GPU memory limitation.

(training-curve screenshot comparing Exp2004 and Exp2007)

The all-32 version shows less fluctuation, but this seems to be due to averaging over the 32 quantizers, and further training is needed to verify the actual difference.

One approach to training the entire CE-RVQ within limited GPU resources is gradient accumulation. It is not difficult to implement in PyTorch, and I have just updated the code to include it. Additionally, if we want to reflect the different importance of each layer within the current sampling approach, we can change the sampling method itself; I have also incorporated this into the code. A rough sketch of both ideas follows.
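For illustration only, here is a minimal, self-contained sketch of the two ideas in PyTorch. It is not the repository's actual code: the stand-in model, optimizer, dummy data, accumulation step count, and the geometric decay factor for the weighted sampling are all assumptions.

```python
import torch
from torch import nn

# Stand-ins so the sketch runs on its own; the real model and data come from the repo.
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
dataloader = [torch.randn(4, 10) for _ in range(8)]   # dummy batches

# --- Accumulated gradients: step the optimizer every `accumulation_steps` batches,
#     so a batch size of 4 behaves, gradient-wise, like a batch size of 4 * steps.
accumulation_steps = 2
optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = model(batch).pow(2).mean() / accumulation_steps  # scale so gradients average
    loss.backward()                                         # gradients add up in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# --- Importance-weighted quantizer sampling: earlier RVQ layers are drawn more often.
num_quantizers, num_samples, decay = 32, 4, 0.9   # decay value is an assumption
weights = decay ** torch.arange(num_quantizers, dtype=torch.float32)
sampled_indices = torch.multinomial(weights, num_samples, replacement=False)
```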

If you have any further suggestions, please let me know.

Best regards,

Heejo

Autonomof commented 1 year ago

Thank you very much for your detailed experiment and explanation. I am not sure which training dataset you used. I trained on VCTK, and the following two charts compare randomly selecting four quantizers with using the first four.

randomly selecting four: (training-curve screenshot)

using the first four: (training-curve screenshot)

Though the two charts cover different numbers of training steps, the results show that selecting the first four quantizers makes the training process smoother. I am not sure yet how this affects the final result either; I will check the inference quality again after training for a while longer. Thank you for your reply.

Best regards,

QAB

CODEJIN commented 1 year ago

Dear @Autonomof,

The graphs I posted were from training on a single speaker (LJ). In addition to those, I also trained with random-4 sampling on VCTK but found no significant difference. Could you please let me know what batch size you are using in your training?

Best regards,

Heejo

Autonomof commented 1 year ago

Of course, I used a batch size of 12. So you haven't run into the same situation as me?

CODEJIN commented 1 year ago

Thank you for letting me know! As mentioned earlier, I used a batch size of 8 when randomly selecting 4 quantizers and a batch size of 4 when using all 32 quantizers. If training requires a batch size above a certain level, I hope gradient accumulation can help.

Autonomof commented 1 year ago

OK, I see, and I will try it. Thank you very much!

Autonomof commented 1 year ago

Sorry to bother you, but I have two other questions.

  1. I saw that the temperature parameter is set in section 4.3 of the original paper, but I did not see it in the code.
  2. I would also like to know how long this model needs to train before it produces accurate results. I have trained for about 300K steps with a batch size of 12 on 1 GPU, but the synthesized speech is still quite blurry, and I don't know whether continuing to train is meaningful. Both the original paper (16 GPUs, 6K frames, 300K steps) and the Grad-TTS paper (1 GPU, batch size 16, 1.7M steps) trained for a considerable number of steps. The Grad-TTS paper states the following:

(screenshot of the relevant passage from the Grad-TTS paper)

CODEJIN commented 1 year ago

Dear @Autonomof,

  1. I had overlooked the temperature part. I have added the functionality in a similar way to Grad-TTS, but it will only be possible to confirm if it is working properly after stable training has been achieved.
  2. I am also contemplating this issue. Even with a batch size of 4, there is no improvement in sound quality from 200K to 500K steps, which suggests that code modifications are necessary. The current point of concern is the scale of the latent variable. The Encodec latents range from approximately -40.0 to 40.0, which is far larger than Gaussian noise. Therefore, even at large diffusion steps, where the input should be close to pure random noise, the noised latents are likely still dominated by the overly large latent values. To address this, I applied latent compression yesterday: the model now uses a latent info dictionary, and the latent variable is predicted within the range -1.0 to 1.0. I plan to test this version of the code over the weekend. (A minimal sketch of this compression is below.)
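To make the compression idea concrete, here is a minimal sketch, not the repository's actual implementation: the dictionary keys, the ±40.0 range, the tensor shapes, and the temperature value are assumptions.

```python
import torch

# Hypothetical latent statistics; keys and range are assumptions based on the discussion.
latent_info = {'min': -40.0, 'max': 40.0}

def compress_latents(latents, info=latent_info):
    # Map codec latents from [min, max] to [-1, 1] before diffusion training.
    return 2.0 * (latents - info['min']) / (info['max'] - info['min']) - 1.0

def expand_latents(latents, info=latent_info):
    # Inverse mapping: bring predicted latents back to the codec's range.
    return (latents + 1.0) * 0.5 * (info['max'] - info['min']) + info['min']

# Grad-TTS-style temperature at inference: divide the starting noise by `temperature`
# (values > 1 shrink the prior and tend to give smoother, less noisy output).
temperature = 1.5                              # value is an assumption
z_T = torch.randn(1, 128, 200) / temperature   # [batch, latent_dim, frames], shapes assumed
```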

Best regards,

Heejo

PS. Additionally, regarding your mention of 6000 frames in NaturalSpeech2, I believe that was the figure when using all 16 GPUs, so I expect the per-GPU frame count to be smaller.

Autonomof commented 1 year ago

Okay, I understand. Thank you very much for your detailed answer to my question.

Autonomof commented 1 year ago

Hi, sorry to bother you. I would like to know how the training results look now. Is progress going smoothly? Is the blurriness of the synthesized speech caused by the Encodec latent range being too large?

CODEJIN commented 1 year ago

Dear @Autonomof,

Hello, currently I am conducting several tests under the LJ single speaker condition.

  1. CE-RVQ sampling conditions: random select 4, weighted random select 4, select all 32. For select-all-32 the batch size is 4; for the others it is 8.
  2. Learning rate conditions: 5e-4, 1e-4
  3. Encodec latent normalizing conditions: nothing, [-1.0, 1.0] by min-max, standard normal distribution

The rest are still in the early stages of training.

I additionally started testing the standard normal distribution method because the sound is still distorted even at 500K steps with the [-1.0, 1.0] min-max normalizing, and it is unclear how much further it can improve. (A small sketch of the standardization option follows below.)
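For clarity, a tiny sketch of what the standard-normal option could look like, assuming per-dimension statistics gathered over the training corpus; the latent dimension and stand-in values are assumptions, not the repository's numbers.

```python
import torch

latent_mean = torch.zeros(128)   # [latent_dim], placeholder statistics
latent_std = torch.ones(128)     # [latent_dim]

def standardize(latents):
    # Map codec latents toward a standard normal distribution: (x - mean) / std.
    return (latents - latent_mean[None, :, None]) / latent_std[None, :, None]

def destandardize(latents):
    # Inverse mapping applied to the diffusion model's predictions.
    return latents * latent_std[None, :, None] + latent_mean[None, :, None]
```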

If the performance improvement remains difficult even after conducting all the above test cases, the next variables to consider are as follows:

  1. Linear attention -> Multi-head attention
  2. Diffusion max step 1000 -> 100
  3. Rollback of diffusion loss terms

However, option 1 could limit the batch size even more severely than now, so I would like to avoid it if possible. If you have any suggestions, please let me know. I would greatly appreciate your opinion.

Best regards,

Heejo

Autonomof commented 1 year ago

@CODEJIN I'm glad to see you updating again; it feels like an old friend has returned. I see you have switched the codec to HiFi-Codec, and the way you use diffusion has also changed. How are the results? I have been trying to reproduce NaturalSpeech2 based on your code recently, but no matter how I modify it, it doesn't come out right. The synthesized speech is semantically correct, but there are always fuzzy, electrical artifacts, as shown in the mel-spectrogram below, and the harmonics are not clear. I don't know where the problem might lie.

(mel-spectrogram screenshot)

CODEJIN commented 1 year ago

Dear @Autonomof ,

Hi, although the effect of the diffusion calculation change is still not clear, the experimental results suggest that Encodec is not well suited to NaturalSpeech2. Encodec has a much deeper RVQ stack with a smaller latent dimension; in my opinion, such a deep RVQ increases the complexity of the final latent space, making it difficult for the diffusion model to make accurate predictions.

If you want to implement it yourself, how about testing whether the model can accurately predict mel-spectrograms instead of the codec's latent space? Mel-spectrograms are physical representations and have lower complexity than the latent space. If the model fails to predict mel-spectrograms accurately, that indicates the model architecture may need improvement; on the other hand, if it predicts mel-spectrograms successfully, that suggests changing the codec would be more beneficial. The current repository has followed this step-by-step approach for validation.

Best regards,

Heejo

Autonomof commented 1 year ago

@CODEJIN Thank you for your reply. We could choose 16 layers of Encodec instead of 32. At the moment I don't think the problem lies with Encodec. Can you elaborate on how your experiments showed that Encodec is not suitable for NaturalSpeech2? In addition, I am wondering whether we need to train the modules separately. Training everything together can make things very confusing for the diffusion model, because all modules are changing at once, which makes it difficult for the denoiser to denoise against a fixed target; as a result, the predicted latents are hard to get accurate. In my opinion, Grad-TTS training also separates alignment and diffusion until convergence, so does a model with as many parameters as ours need to consider this even more? I think this is a very important issue to consider.

CODEJIN commented 1 year ago

Dear @Autonomof ,

Hi. As I mentioned before, the validation method I used is quite simple: I verified whether the implemented model performs well on an easier task and gradually increased the difficulty. In this case, the easier task was predicting mel-spectrograms for a single speaker. Once I confirmed that the predictions were successful under this condition, I kept the other variables unchanged and switched the prediction target from mel-spectrograms to Encodec latents. Only after failing under this condition did I conclude that Encodec might not be suitable and switch to HiFi-Codec, which resulted in successful training.

Training the alignment module separately is also a good approach, in my opinion. The alignment operation in Grad-TTS was inspired by the monotonic alignment search in Glow-TTS, and those models consistently applied a stop gradient. However, the current repository uses ALF (the alignment learning framework) for alignment, and ALF alignment converges at around 5,000 steps. Considering that Noam decay is also applied (a sketch of the schedule is below), I believe the initial alignment won't be much of a problem as long as the learning rate is not excessively high, and a stop gradient may not be necessary.
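For reference, a minimal sketch of a Noam-style schedule, assuming the standard formula from "Attention Is All You Need"; the warm-up steps, model dimension, base learning rate, and optimizer here are assumptions rather than the repository's actual hyperparameters.

```python
import torch
from torch import nn

def noam_lambda(warmup_steps=4000, model_dim=512):
    """Noam schedule: linear warm-up, then inverse-square-root decay.
    warmup_steps and model_dim are placeholder values."""
    def fn(step):
        step = max(step, 1)
        return model_dim ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
    return fn

model = nn.Linear(10, 10)   # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1.0)   # base lr scaled by the lambda
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lambda())
```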

Best regards,

Heejo

Autonomof commented 1 year ago

@CODEJIN A very good experiment; I quite agree with it. Do you mean that with HiFi-Codec as the neural codec, the model successfully predicts the latents for a single speaker?

Autonomof commented 1 year ago

@CODEJIN Hello, would you mind sharing some information about the alignment learning network used in this repository so that I can understand it better? I see that some of your code seems to come from FastPitch, but some of it I cannot find in open source. Is this network different from other alignment learning networks? I am still concerned that alignment may introduce unnecessary noise into the denoising network.

CODEJIN commented 1 year ago

Dear @Autonomof ,

Hi, most of the code in question is indeed taken from FastPitch 1.1, so the module performs the same operations as FastPitch 1.1; I simply added a class to manage the entire ALF for convenience. I will also share some results from my recent tests: LJ_102K.zip. The two zipped files are the results of training with a single speaker (LJ) and a batch size of 16 for 102K steps. Since this is still the early stage of training, it is unclear whether the instability is due to that or a limitation of the current implementation; additional verification is needed.

Best regards,

Heejo

Autonomof commented 1 year ago

I am very pleased that there has been a significant improvement in sound quality. Thank you very much for your explanation and work.

Autonomof commented 1 year ago

@CODEJIN Hello, I noticed that layer normalization does not seem to be used in the diffusion model; only the linear attention inside it has layer normalization. I don't know if this is normal.

CODEJIN commented 1 year ago

Dear @Autonomof ,

It's difficult to call it standard either way. The current structure is a mixture of algorithms from OpenAI's guided-diffusion and DiffWave, combined in my own way. Adding layer norm to each conv might lead to some performance improvement, but I need to run experiments to confirm its effectiveness. (A small sketch of what that could look like is below.)
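Purely as an illustration of the idea (not the repository's actual block), a minimal 1D conv block with an optional LayerNorm over the channel dimension; the channel count, kernel size, and activation are assumptions.

```python
import torch
from torch import nn

class ConvBlock(nn.Module):
    """1D conv block with an optional LayerNorm applied over the channel dimension."""
    def __init__(self, channels, kernel_size=3, use_layer_norm=True):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.norm = nn.LayerNorm(channels) if use_layer_norm else nn.Identity()
        self.activation = nn.SiLU()

    def forward(self, x):   # x: [batch, channels, time]
        x = self.conv(x)
        x = self.norm(x.transpose(1, 2)).transpose(1, 2)   # normalize over channels
        return self.activation(x)

y = ConvBlock(64)(torch.randn(2, 64, 100))   # quick shape check
```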

Best regards,

Heejo