CODEJIN / NaturalSpeech2

MIT License

Info about implementation #1

Open rishikksh20 opened 1 year ago

rishikksh20 commented 1 year ago

Hi @CODEJIN ,

Hope you are doing well! Were you able to successfully train this implementation on any dataset? If so, approximately how much time and how many GPUs are required to train this model on the VCTK or LJSpeech dataset? Have you trained a model as large as the one mentioned in the paper?

You can also refer to https://github.com/yangdongchao/AcademiCodec for SoundStream and other codec PyTorch code.

CODEJIN commented 1 year ago

Dear @rishikksh20,

Hello. This repository is currently a work in progress, and improvements are being made. As stated in the README, the resources available to me for this open-source project are very limited, so training, testing, and other processes take a considerable amount of time. Once verification is complete, I will add the requested information to the README. Also, thank you for the link to the audio codec; after testing EnCodec, I will also look into the repository you sent.

Best regards

Heejo

rishikksh20 commented 1 year ago

@CODEJIN Is this code trainable? I will train it on Multilingual LibriSpeech data with whatever resources I have. I checked the whole code and the logic seems fine to me, so I hope this repo is trainable.

CODEJIN commented 1 year ago

Dear @rishikksh20,

Yes, although the pattern generator is not yet compatible with ML LibriSpeech, this code is designed to be trainable and is currently undergoing testing. However, I encountered a NaN loss problem during my testing, so I have been making modifications to address that issue. I just made some additional changes and I am currently conducting further testing.

Best regards,

Heejo

rishikksh20 commented 1 year ago

I have been implementing NaturalSpeech 2 with slightly different architectures for some of the modules. Let's see how that works out.

Thanks

rishikksh20 commented 1 year ago

I am finding some inconsistency in how the diffusion loss is calculated. In the paper, the authors predict z0 (the denoised output) rather than the score/epsilon:

https://github.com/CODEJIN/NaturalSpeech2/blob/5e816eb1069995ed8a7b1be5fe0cb3c9e8d94ace/Modules/Diffusion.py#L170
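
For reference, a minimal sketch of the difference I mean, assuming a standard DDPM-style forward process (the function and tensor names below are placeholders, not this repository's actual identifiers):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(denoiser, z_0, t, alphas_cumprod, predict_z0=True):
    """Sketch of the z0-prediction vs. epsilon-prediction objectives, assuming
    a standard DDPM forward process: z_t = sqrt(a_bar) * z_0 + sqrt(1 - a_bar) * eps."""
    eps = torch.randn_like(z_0)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (z_0.dim() - 1)))  # broadcast over feature dims
    z_t = a_bar.sqrt() * z_0 + (1.0 - a_bar).sqrt() * eps

    pred = denoiser(z_t, t)
    if predict_z0:
        # Paper-style data loss: the network regresses the clean latent directly.
        return F.mse_loss(pred, z_0)
    # Epsilon/score-style loss: the network regresses the injected noise instead.
    return F.mse_loss(pred, eps)
```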

rishikksh20 commented 1 year ago

Also, I need to clarify something regarding the second loss term. Since, per the authors, the model predicts z0 rather than the score, we need to derive the score for the second loss term. As per the paper, the formula is the following:

pred score = $\lambda^{-1} (\hat{z}_0 - z_t)$

$\hat{z}_0$ is the output of the denoiser model and $z_t$ is the noisy input, but we also need to compute $\lambda$, which is the variance of the $p(z_t \mid z_0)$ distribution. So, as per your code, $\lambda$ = 1 - alphas_cumprod. Am I calculating $\lambda$ correctly, or am I missing something?
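
In code, my reading would be roughly the following (tensor names are placeholders; the DDPM identity in the comment is only for comparison):

```python
def score_from_z0(z0_hat, z_t, t, alphas_cumprod):
    """Sketch: derive the score for the second loss term from the predicted z0,
    taking lambda as the variance of p(z_t | z_0), i.e. 1 - alpha_bar_t."""
    lam = (1.0 - alphas_cumprod[t]).view(-1, *([1] * (z_t.dim() - 1)))
    # As written above: pred score = lambda^{-1} * (z0_hat - z_t).
    # (For comparison, the exact DDPM identity would be
    #  (sqrt(alpha_bar_t) * z0_hat - z_t) / (1 - alpha_bar_t).)
    return (z0_hat - z_t) / lam
```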

CODEJIN commented 1 year ago

Dear @rishikksh20,

Hello. Thank you for pointing that out! After your clarification, I re-read that section. I interpreted the first term as a comparison between the predicted z_0_hat generated from z_t and the actual z_0. In that sense, the first term would require denoiser calculations over multiple steps, and it would be challenging to incorporate the full gradient with the limited computational resources I have. Regarding the second term, I think it is the same as the loss term this repository currently uses.

If I misunderstood any of the above points, please let me know.

Best regards,

Heejo

rishikksh20 commented 1 year ago

Yes, you are correct. I don't think it is necessary to train on both losses; even a single loss can perform well. https://twitter.com/ai_rishikesh/status/1658190660652916746

CODEJIN commented 1 year ago

@rishikksh20 I have reviewed the discussions regarding the mentioned loss term on Twitter and on lucidrains' GitHub, based on the link you provided. Thank you for the valuable information! I have also conducted some experiments with modified loss terms:

  1. The epsilon-based implementation (unchanged).
  2. The z_0 prediction-based implementation from lucidrains' version.
  3. The z_0 prediction plus modified CE-RVQ from lucidrains' version.

Based on my checks so far, I believe the convergence speed increases in the order 1 < 2 < 3. However, I have not yet been able to verify a full training run to completion; I have only checked up to the point where the voice becomes audible. I plan to continue training using approach 3 for now.

I also have some doubts about the CE-RVQ approach. In section 3.2, it is mentioned: "Then we calculate the L2 distance between the residual vector with each codebook embedding in the quantizer." Reading this sentence, it seems that the value obtained from z_0_hat takes the residuals into account, while the comparison target, the codebook embedding, does not. Based on my understanding, it seems more appropriate to subtract all the quantized values from e_0 to e_R (excluding e_i) from z_0_hat. I also find it puzzling to apply softmax directly to the distance, considering that the L2 distance decreases as similarity increases. Currently, I handle this by multiplying the distance by -1.0, but I'm unsure if this approach is correct. Am I misunderstanding something? What are your thoughts on this?
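
To make my current reading concrete, here is a rough sketch of the CE-RVQ term for a single quantizer level under my interpretation (all names are illustrative, and this is not necessarily what the paper intends):

```python
import torch
import torch.nn.functional as F

def ce_rvq_loss(z0_hat, codebook, target_indices, other_levels_quantized):
    """CE-RVQ sketch for one RVQ level.

    z0_hat:                 [B, T, D] predicted latent
    codebook:               [K, D] embeddings of the current quantizer level
    target_indices:         [B, T] ground-truth code indices at this level
    other_levels_quantized: [B, T, D] sum of quantized vectors of the other levels
    """
    # Residual that the current level should explain (subtracting e_0..e_R except e_i).
    residual = z0_hat - other_levels_quantized                          # [B, T, D]
    # Squared L2 distance from the residual to every codebook entry.
    expanded = codebook.unsqueeze(0).expand(residual.size(0), -1, -1)   # [B, K, D]
    dists = torch.cdist(residual, expanded) ** 2                        # [B, T, K]
    # Smaller distance means more similar, so negate before using as logits.
    logits = -dists
    return F.cross_entropy(logits.transpose(1, 2), target_indices)
```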

rishikksh20 commented 1 year ago

Hi @CODEJIN, can you push your latest code? I have one machine on which I can start a training run for testing.

I have only checked until the point where the voice becomes audible.

This is a good sign :thumbsup: .

CODEJIN commented 1 year ago

I pushed the code. If there is any problem, please let me know.

rishikksh20 commented 1 year ago

@CODEJIN check this : https://github.com/lucidrains/naturalspeech2-pytorch/issues/11#issuecomment-1569448545

rishikksh20 commented 1 year ago

Also, are you able to get good quality audio?

CODEJIN commented 1 year ago

Dear @rishikksh20,

I have checked the link you provided. It seems that a change in the loss term is necessary. Thank you for sharing the information! The audio quality is still limited. Please refer to this link for updates on my progress: https://github.com/CODEJIN/NaturalSpeech2/issues/4#issuecomment-1568225436 How about your progress? Please let me know.

Best regards,

Heejo

rishikksh20 commented 1 year ago

Thanks for the update. I haven't had time to pre-process the dataset as I have been working on a SoundStorm implementation recently, but now I am working on NS2 full-time and hope to start training before this weekend. I am planning to use the recently released LibriTTS-R dataset for this job first, then move to Libri-Light.

manmay-nakhashi commented 1 year ago

@rishikksh20 @CODEJIN Any update on the results?

CODEJIN commented 1 year ago

Dear @manmay-nakhashi ,

Hi, I have been making some modifications and attempting training, but there haven't been significant improvements. Most importantly, due to resource limitations, the training for validation is taking too long. I will let you know if any improvements arise in the future.

Best regards,

Heejo

rishikksh20 commented 1 year ago

I have trained NS2, but on my first training run I was only able to get noise. My code might have some minor diffusion-related bug, which I am trying to identify.

manmay-nakhashi commented 1 year ago

@CODEJIN Regarding loss_dict['Data'] = self.criterion_dict['MSE'](...).mean(): shouldn't you be using diffusion_predictions? Why diffusion_starts?

CODEJIN commented 1 year ago

Dear @manmay-nakhashi ,

NaturalSpeech 2 uses three loss terms for the diffusion module: 1. the data loss, 2. the epsilon loss, and 3. the CE-RVQ loss. The loss term you mentioned is the data loss term; you can refer to equation 6 in section 3.2 of the paper for more information.

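For illustration, a minimal sketch of how the three terms fit together (the variable names and the 0.1 weight here are illustrative, not the exact ones used in this repository):

```python
import torch.nn.functional as F

def diffusion_losses(z0_hat, z0, eps_hat, eps, ce_rvq_loss, ce_rvq_weight=0.1):
    """Sketch of the three diffusion loss terms in NaturalSpeech 2."""
    data_loss = F.mse_loss(z0_hat, z0)   # 1. data loss: predicted clean latent vs. target latent
    eps_loss = F.mse_loss(eps_hat, eps)  # 2. epsilon loss: predicted noise vs. sampled noise
    # 3. CE-RVQ loss, scaled by a small weight (the value is illustrative).
    return data_loss + eps_loss + ce_rvq_weight * ce_rvq_loss
```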

Best regards,

Heejo

manmay-nakhashi commented 1 year ago

@CODEJIN Understood, thank you for the explanation. I get it now; I was thinking about this in a different way.

manmay-nakhashi commented 1 year ago

@CODEJIN After how many steps are you able to hear the audio?

hippotabek commented 1 year ago

@CODEJIN So the latest README says you were able to test it on LJSpeech and VCTK. Does that mean the training code is working now?

CODEJIN commented 1 year ago

@CODEJIN After how many steps are you able to hear the audio?

Dear @manmay-nakhashi ,

Sorry for the late reply. There have been many changes to the code over the past month. The current code in the master branch produces audible sound from around 30K steps. However, the high-pitched parts are distorted and the pitch is not properly reflected. Applying CE-RVQ and training for longer may yield different results, but due to time constraints in testing, the outcome is still unknown.

Best regards,

Heejo

CODEJIN commented 1 year ago

@CODEJIN So the latest README says you were able to test it on LJSpeech and VCTK. Does that mean the training code is working now?

Dear @hippotabek ,

I have been frequently updating the code, and although testing is still ongoing, I have confirmed that audio is being generated when trained on the LJ dataset. I am continuously fixing bugs and updating the code, so I can't say it's fully "working" yet, but training is possible and the resulting sound is clearer than before. However, if by "working" you mean whether the model is complete, it is still a work in progress (WIP).

Best regards,

Heejo

sourcesur commented 1 year ago

@CODEJIN Hi, great work! Have you encountered a case where the model generates just noise? The predictions look like this: [prediction screenshots omitted]

CODEJIN commented 1 year ago

Dear @sourcesur ,

Hi. While it wasn't exactly the same, there was a time when I used the EnCodec latent and the output came out in a similar form. In particular, when learning the durations through the NVIDIA alignment framework and constructing the target feature with the EnCodec latent rather than the mel spectrogram, I remember the duration learning itself failed.

Best regards,

Heejo

sourcesur commented 1 year ago

@CODEJIN, thanks for the quick reply! I noticed that you don't remove the sliced speech prompts from the latent, and that you use different speech prompts for the pitch/duration predictors and the diffusion model. You also added a frame prior network, which is not present in the paper. Were these modifications made to improve speech quality?

CODEJIN commented 1 year ago

Dear @sourcesur ,

Hi, thank you for your interest in this repository.

  1. During training, I have confirmed that the speech prompt currently sends different values to the pitch/duration predictors and to the diffusion side, as intended. You can check this here. At the inference stage it is meaningless to feed different values to the two, and as far as I know the paper uses the same speech prompt for both at inference.

  2. The Frame Prior Network (FPN) is not a module from the paper; it is a supplementary module I added for quality improvement. The original goal was to improve training speed by adding the linear prediction values generated by the FPN to the diffusion context. This worked as expected in single-speaker settings, but I found that when the linear prediction was added in multi-speaker settings, prompt formation failed and quality dropped, so I removed the linear prediction. I am still experimenting, and if I determine that the FPN itself does not provide a significant benefit, I will remove it as well (a simplified sketch of the kind of module I mean follows below).
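
Roughly, assuming a small residual Conv1d stack over the length-regulated encoder features (this is an illustrative sketch, not the exact module in this repository):

```python
import torch
from torch import nn

class FramePriorNetwork(nn.Module):
    """Sketch of a supplementary frame prior network: residual Conv1d blocks over
    the length-regulated encoder features, plus a linear prediction of the latent."""

    def __init__(self, channels: int, latent_dim: int, num_layers: int = 4, kernel_size: int = 5):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
                nn.GELU(),
            )
            for _ in range(num_layers)
        ])
        # Linear prediction head; its output could be added to the diffusion context,
        # which helped in single-speaker runs but hurt multi-speaker prompt formation.
        self.linear_prediction = nn.Conv1d(channels, latent_dim, kernel_size=1)

    def forward(self, x):  # x: [batch, channels, time]
        for block in self.blocks:
            x = x + block(x)  # residual refinement of the frame-level prior
        return x, self.linear_prediction(x)
```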

Best regards,

Heejo