burchim / AVEC

[WACV 2023] Audio-Visual Efficient Conformer (AVEC) for Robust Speech Recognition
https://openaccess.thecvf.com/content/WACV2023/html/Burchi_Audio-Visual_Efficient_Conformer_for_Robust_Speech_Recognition_WACV_2023_paper.html
Apache License 2.0

AO Training Issue #3

Closed park-ing closed 1 year ago

park-ing commented 1 year ago

Hello. Thank you for sharing this great material. While following the instructions, I observed that Visual Only training converges well, but Audio Only training does not converge and the CTC loss returns NaN. What could be the problem?

burchim commented 1 year ago

Hi,

A similar problem was observed for the Efficient Conformer in this issue.

Using the AdamW optimizer instead of regular Adam here, with the default weight decay of 0.01, might solve your issue:

```python
optimizer = optimizers.AdamW(params=self.parameters(), lr=lr, betas=(0.9, 0.98), eps=1e-9, weight_decay=0.01)
```

Does the NaN loss appear after a certain number of epochs, or directly at the start of training?

park-ing commented 1 year ago

Thank you.

I used the AdamW optimizer, but the same problem occurred.

Is there a problem with the data preprocessing?

In AV training, the NaN loss occurs after a certain number of epochs, but in AO training it occurs directly at the start of training.

yochaiye commented 1 year ago

From my observation, it appears as soon as the spectrograms pass through the network, even before the loss is computed. Puzzled here too.

burchim commented 1 year ago

Hi,

The issue may be caused by attention masking with the value -1e9, which generates inf values in float16 training. Does changing these two lines in attention.py solve the problem?

I also encountered this problem with the latest PyTorch version, but only for the Patch Attention layers.

Did you try using AdamW for AV training?

park-ing commented 1 year ago

I also tried AV training, but the audio encoder didn't seem to work properly. I'm trying a different approach.

alial7621 commented 1 year ago

Hey, did you solve the problem?

yochaiye commented 1 year ago

I changed the precision to fp32 in the config file and it resolved the problem.

park-ing commented 1 year ago

I wonder whether the original code was changed to float32.

I still haven't solved this problem.

burchim commented 1 year ago

Hi,

So did the solution of modifying the two lines in attention.py work for any of you in the case of AO training?

@park-ing All the paper experiments were done in float16 with the same code. The behaviour may change between PyTorch versions, however. Using float32 should solve the problem, but will of course result in slower training.

In the case of AV training, after how many gradient steps do you start getting NaN loss? Do you use the same hyperparameters (GPUs, batch size, etc.) as in the paper?

You may want to log the range of the model's hidden activations during training to see where and when they go out of the fp16 range.
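The logging suggestion above can be sketched with forward hooks that record the min/max of each module's output, making it easy to spot which layer first leaves the float16 range (roughly ±65504). This is a minimal illustration on a toy model, not code from the repository.

```python
import torch

def attach_range_hooks(model: torch.nn.Module, stats: dict) -> None:
    """Register forward hooks that record each module's output range."""
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                stats[name] = (output.min().item(), output.max().item())
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Usage on a toy model; with the real model, inspect `stats` every N steps
# and flag any layer whose range approaches the fp16 limits.
stats = {}
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
attach_range_hooks(model, stats)
model(torch.randn(2, 8))
```

After a forward pass, `stats` maps module names to `(min, max)` tuples of their outputs.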

Best, Maxime

park-ing commented 1 year ago

Modifying the two lines of code you pointed out worked in both the AO and AV settings (FP16), and training converges well even with the Adam optimizer.

I am testing on an NVIDIA RTX 3090, PyTorch version "1.13.1+cu117".

Thank you for your helpful guidance.

white1-doggy commented 2 weeks ago

> The modification of the two lines of code you told me about in the AO and AV environment worked. (FP16) And this can be converged well even with Adam optimizer.
> I am testing in an NVIDIA RTX 3090 environment, PyTorch version "1.13.1+cu117".

Hi, I'd like to know your training details on the RTX 3090, because I can't train this model even on 2 A100 80GB GPUs.