fjiang9 / NKF-AEC

Acoustic Echo Cancellation with Neural Kalman Filtering

Details in network training #5

Closed tomato18463 closed 1 year ago

tomato18463 commented 1 year ago

Hi,

I am trying to implement the training process described in your arXiv paper. I wonder if you could kindly provide some more details about your training setup:

  1. I note that the farend talk is randomly sampled to 1 second, and the nearend talk to between 0.5 and 1 second. When you mixed them to produce double talk, did you randomly select a position within the echo to insert the nearend talk?

  2. What optimizer did you use? What were the settings of its other hyper-parameters apart from the learning rate (gradient clipping, momentum, etc.)?

  3. The echo generated through full convolution is longer than the farend signal, since its length is the farend length + RIR length - 1. Did you do any clipping on the echo, or did you just keep the raw convolution result? (A small sketch of this length relation follows below.)

  4. Did you apply any random scaling to the signals in the synthetic fold of the AEC Challenge before sampling them to create the nearend and farend talks?

Thank you very much!
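As a side note on question 3, the length relation can be checked directly with NumPy/SciPy; the sample rate, RIR length, and the truncation shown at the end are placeholder assumptions, not the authors' settings:

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 16000                              # assumed sample rate
farend = np.random.randn(fs)            # placeholder 1 s farend signal
rir = np.random.randn(512)              # placeholder 512-tap RIR

echo_full = fftconvolve(farend, rir, mode="full")
assert len(echo_full) == len(farend) + len(rir) - 1

# One possible choice (whether the authors did this is exactly what the
# question asks) is to truncate the tail so the echo aligns with the farend.
echo_trunc = echo_full[:len(farend)]
```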

tomato18463 commented 1 year ago

A further question: after you mix the nearend talk and echo, did you add noise to the mixed signals?

Thanks!

fjiang9 commented 1 year ago

@tomato18463

  1. Yes, the insert position is random.
  2. We used Adam with default settings. We used gradient clipping with a clipping value of 1.
  3. Clipped to 1 s, as described in the paper.
  4. No random scaling is applied to the speech signals.
  5. We did not add any noise to the mixed signals.

Finally, note that these may not be the optimal settings for training NKF-AEC.
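Putting these answers together, a minimal sketch of one way the double-talk mixing could look is given below; the sample rate, the nearend handling, and the function name are assumptions on my part, not confirmed details of the authors' pipeline:

```python
import numpy as np
from scipy.signal import fftconvolve

def make_double_talk(farend, nearend, rir, fs=16000, rng=None):
    """Hedged sketch of the mixing described above: echo via full convolution,
    clipped to 1 s, nearend inserted at a random position, no extra scaling
    and no added noise (answers 1 and 3-5)."""
    rng = rng or np.random.default_rng()
    echo = fftconvolve(farend, rir, mode="full")[:fs]       # clip echo to 1 s
    mic = echo.copy()
    start = rng.integers(0, len(mic) - len(nearend) + 1)    # random insert position
    mic[start:start + len(nearend)] += nearend
    return mic, echo
```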

tomato18463 commented 1 year ago

I see. Thank you very much for your reply! I have run into some other problems, and it would be appreciated if you could give some further help:

  1. The loss function in Eq. 17 is a sum over all taps and frequency bins. In your implementation, did you divide it by a factor to get the mean, or did you simply use the sum?

  2. What was the batch size of training?

  3. As mentioned in the paper, there is a 50% chance that the echo path, RNN hidden state and echo path innovation are initialized using white Gaussian noise (WGN). How did you set the variance of these Gaussians? When initializing the echo path innovation with WGN, did you generate an old and a new echo path and take their difference, or did you use some other method? (A small sketch of this initialization follows below.)

  4. I find it is easy to get NaNs with the default weight initialization of PyTorch. What was the network weight initialization setting in your experiment?

Thank you again!
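For reference, here is a minimal sketch of what a sum-reduced loss and the 50%-probability WGN initialization asked about above might look like; the standard deviation, tensor shapes, the zero-initialization alternative, and the assumption that the loss compares estimated and target echo-path coefficients are all mine, not the paper's:

```python
import torch

def loss_sum(est, target):
    # Sum (not mean) over all taps and frequency bins, as in question 1;
    # treating est/target as complex echo-path estimates is an assumption.
    return torch.sum(torch.abs(est - target) ** 2)

def init_states(num_freq, num_taps, hidden_size, wgn_std=0.01):
    """With 50% probability, draw the echo path, RNN hidden state and
    echo-path innovation from white Gaussian noise, otherwise use zeros.
    wgn_std and the zero alternative are assumed, not from the paper."""
    if torch.rand(1).item() < 0.5:
        w = wgn_std * torch.randn(num_freq, num_taps, dtype=torch.cfloat)
        h = wgn_std * torch.randn(1, 1, hidden_size)
        dw = wgn_std * torch.randn(num_freq, num_taps, dtype=torch.cfloat)
    else:
        w = torch.zeros(num_freq, num_taps, dtype=torch.cfloat)
        h = torch.zeros(1, 1, hidden_size)
        dw = torch.zeros(num_freq, num_taps, dtype=torch.cfloat)
    return w, h, dw
```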

tomato18463 commented 1 year ago

@fjiang9 A further question:

We used Adam with default settings. We used gradient clipping with a clipping value of 1.

Was the gradient clipping done by torch.nn.utils.clip_grad_value_ or torch.nn.utils.clip_grad_norm_?

Thank you!

fjiang9 commented 1 year ago

torch.nn.utils.clip_grad_norm_ is used.
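For anyone following along, here is a minimal training-step sketch combining the details confirmed in this thread (Adam with default settings, gradient clipping by norm with a value of 1); the model and its compute_loss method are placeholders, not the actual NKF-AEC code:

```python
import torch

def train_step(model, batch, optimizer):
    """One optimization step: Adam defaults plus clip_grad_norm_ with max_norm=1."""
    optimizer.zero_grad()
    loss = model.compute_loss(batch)   # placeholder for however the loss is computed
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters())   # default lr=1e-3, betas=(0.9, 0.999)
```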

shenbuguanni commented 1 year ago

@tomato18463 Have you solved the problem of NaNs appearing? I also find that the MSE loss (sum) between real_echo and est_echo is very large.

tomato18463 commented 1 year ago

I found I could alleviate the NaN problem with a combination of the following tricks, under a certain data generation setting:

  1. Use a larger batch size
  2. Use gradient clipping
  3. Use shorter training sequences for the first few epochs of training
  4. Initialize the network weights to smaller values
  5. I might also have used learning rate warmup (i.e. starting from a small learning rate and growing it gradually to a larger one, before dropping it by 0.5 every few epochs); see the sketch below.
  6. I might have used other tricks but I can't remember them - the experiment was quite a few months ago. I might be able to find some time to check in a couple of days.

Also, I cannot guarantee that they work with different data generation settings.
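To illustrate trick 5, one way to combine a linear warmup with a 0.5 step decay in PyTorch is sketched below; the base learning rate, warmup length, and decay period are arbitrary placeholders, not values from the experiments above:

```python
import torch

# Placeholder model and base learning rate (both arbitrary choices).
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Linear warmup over the first 2 epochs, then halve the learning rate every
# 4 epochs; the warmup length and decay period are assumptions.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=2)
decay = torch.optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.5)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, decay], milestones=[2]
)

for epoch in range(20):
    # ... run one training epoch here ...
    scheduler.step()
```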

shenbuguanni commented 1 year ago

May I ask whether echo_hat or e should be used to calculate the loss against the real echo? I compute the MSE loss between echo_hat and real_echo and the value is very large, i.e. 170446899.20, so I think there is something obvious that I have not set up correctly. I hope to get your help, and I would also like to ask how well the model you trained yourself actually performs. Thanks!

shenbuguanni commented 1 year ago

Did you use white Gaussian noise as the RIR? Why not use the image method to generate the RIR?
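For context, a white-Gaussian-noise "RIR" is often shaped with an exponential decay envelope to roughly mimic reverberation; here is a minimal sketch of that idea (the length, RT60 and normalization are arbitrary choices, and whether the authors used this or image-method RIRs is exactly what is being asked):

```python
import numpy as np

def wgn_rir(length=512, fs=16000, rt60=0.3, rng=None):
    """Illustrative only: white Gaussian noise under an exponential decay
    whose rate gives ~60 dB of attenuation over an assumed RT60."""
    rng = rng or np.random.default_rng()
    t = np.arange(length) / fs
    decay = np.exp(-np.log(1000.0) * t / rt60)
    rir = rng.standard_normal(length) * decay
    return rir / np.max(np.abs(rir))    # normalize the peak to 1
```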

shenbuguanni commented 1 year ago

3. How did you set the variance of these Gaussians

May I ask how you set the variance of this Gaussian noise?