BUTSpeechFIT / EEND


Questions about training. #6

Closed wngh1187 closed 11 months ago

wngh1187 commented 1 year ago

Hi. I am getting a lot of help from your amazing research. Thanks again for making your code public.

By the way, I am now facing difficulties in reproducing the EEND-EDA system. When training with ns2_beta2 data, I get a dev DER of 28.62% at 100 epochs.

This is far from the performance reported in the original paper (2.69%). Could you please share the pre-trained model weights?

In my experiment, the training standard loss and training attractor loss at 100 epochs are 0.404 and 0.106, respectively. Are these loss values different from your experimental results?

msh9184 commented 1 year ago

Hi, I have the same question. In my case, I obtained a DER of 11.48% at 100 epochs (I used the dscore repository for the DER calculation).
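(As a side note on scoring: dscore computes DER from RTTM files, with optimal reference-system speaker mapping and an optional collar. A toy frame-level sketch of the quantity it reports — assuming binary speaker-activity matrices whose speakers are already optimally mapped, and no collar; the function name is just illustrative:)

```python
import numpy as np

def frame_der(ref, sys):
    """Toy frame-level DER: (miss + false alarm + confusion) / total reference speech.

    ref, sys: (frames, speakers) binary activity matrices, with speakers
    already optimally mapped between reference and system; no collar applied.
    """
    n_ref = ref.sum(axis=1)                       # active reference speakers per frame
    n_sys = sys.sum(axis=1)                       # active system speakers per frame
    n_correct = np.minimum(ref, sys).sum(axis=1)  # correctly attributed speakers

    miss = np.maximum(n_ref - n_sys, 0).sum()     # reference speech the system missed
    falarm = np.maximum(n_sys - n_ref, 0).sum()   # system speech with no reference match
    confusion = (np.minimum(n_ref, n_sys) - n_correct).sum()
    return (miss + falarm + confusion) / n_ref.sum()
```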

Here are my results and the checkpoint at 100 epochs. I trained the EEND-EDA model on a single NVIDIA RTX 3090 GPU with the default settings.

Could there be something I've missed?

fnlandini commented 1 year ago

Hi, thanks for the kind words, and glad it helps.

That is indeed far from what I would expect. I have not run many experiments with the simulated mixtures that you are using. I only ran 5 of them to create the "SM-P" blue bar in Figure 1 here https://arxiv.org/pdf/2204.00890.pdf, as a baseline for our approach for generating simulated conversations. Out of those 5 trainings, I only evaluated swb_sre_cv_ns2_beta2_500 (from Hitachi) for one of them and obtained 3.88 DER. Even if I score with collar 0, I get 9.56 DER, so it is not that. Unfortunately, I don't have those data anymore, and regenerating them is quite some work, so I cannot evaluate other models.

However, in terms of training loss, all 5 runs get to the 0.1-0.12 ballpark for the BCE ("standard") loss and to the 1e-7 to 1e-6 ballpark for the attractor loss, as in the picture below. So I think all of them would have similar results on that dataset. Are you using the exact same parameters for training and inference as we shared?

[image: training loss curves]

The attractor loss usually gets to virtually 0, so perhaps the differences are related to that?
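(For reference, the two losses being compared are, roughly: the permutation-invariant BCE between frame-wise posteriors and speaker labels, i.e. the "standard" loss, and a BCE on the attractor existence probabilities. A minimal PyTorch sketch of both, following the EEND-EDA paper rather than this repo's exact code:)

```python
import itertools
import torch
import torch.nn.functional as F

def eda_losses(logits, labels, exist_logits):
    """Sketch of the two EEND-EDA training losses (illustrative only).

    logits:       (T, S) float diarization logits for S attractors over T frames
    labels:       (T, S) float binary speaker-activity labels
    exist_logits: (S + 1,) float attractor existence logits
    """
    T, S = labels.shape

    # Permutation-invariant BCE ("standard" loss): try every speaker
    # ordering of the labels and keep the cheapest one.
    losses = []
    for perm in itertools.permutations(range(S)):
        perm_labels = labels[:, list(perm)]
        losses.append(F.binary_cross_entropy_with_logits(logits, perm_labels))
    standard_loss = torch.stack(losses).min()

    # Attractor existence loss: the first S attractors should exist (label 1),
    # the (S+1)-th should not (label 0).
    exist_labels = torch.cat([torch.ones(S), torch.zeros(1)])
    attractor_loss = F.binary_cross_entropy_with_logits(exist_logits, exist_labels)
    return standard_loss, attractor_loss
```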

wngh1187 commented 1 year ago

Hello. Thank you for your quick reply.

I had used a batch size of 600 for a quick result check. However, it seems that this was a poor choice.

I retrained with the same parameters and got results quite similar to the trend in the loss graph you showed. However, I still got 18.21% DER at collar 0. In my experiment, I noticed that the standard loss increases over steps 60-130 and then converges again. Is this trend normal?

[images: three training plots]

fnlandini commented 1 year ago

I don't see that increase in any of my 5 trainings with these data. Does your learning rate look like this?

[image: learning rate curve]

It makes sense that you got different results with a much larger batch. In this case, the learning rate scheduler reaches its maximum after 100k steps (with batch size 64), and if you use larger batches, the same number of steps corresponds to a much later stage of training. In this learning rate curve you can see that the maximum LR (a bit below 2e-4) is reached after 100k steps, which is a bit less than one third of the training. I also tried batch size 32 with 200k steps in other experiments, and the performance I obtained for the final model was similar.
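(For concreteness, the schedule described here is the Transformer "Noam" schedule: linear warmup followed by inverse-square-root decay, stepped per optimizer update rather than per epoch. A minimal sketch, assuming d_model=256 and 100k warmup steps, which gives the peak just below 2e-4 mentioned above:)

```python
def noam_lr(step, d_model=256, warmup_steps=100_000):
    """Noam learning-rate schedule: linear warmup, then inverse-square-root
    decay. With d_model=256 and warmup_steps=100k, the peak LR is
    256**-0.5 * 100000**-0.5 ~= 1.98e-4, i.e. "a bit below 2e-4".
    """
    step = max(step, 1)  # avoid 0**-0.5 at the very first update
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

Since the schedule counts optimizer steps, a batch of 600 takes roughly 600/64 ≈ 9× fewer steps per epoch than batch 64, so after 100 epochs such a run sits much earlier in the schedule.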

wngh1187 commented 1 year ago

The learning rate graph in my experiment looks the same. Maybe there's an error somewhere that I haven't found yet. I'll check the whole process again. Once again, thank you very much for your kind response.

fnlandini commented 11 months ago

Closing due to inactivity. Feel free to reopen if you see fit.