Sense-X / UniFormer

[ICLR2022] official implementation of UniFormer
Apache License 2.0
812 stars 111 forks

Loss & error explosion #106

Closed SKBL5694 closed 1 year ago

SKBL5694 commented 1 year ago

Hi, it's me again. It took me a while to organize my data, and I started training about three hours ago. I chose 12 action classes from different datasets such as Charades and NTU action recognition. There are about 28,000 training videos (the number per category may not be balanced), and I resized all the videos so that the short edge is 320. Unfortunately, the loss exploded after 10 epochs: the top-1 error first went down for several epochs (the lowest was around 30+), then got bigger and bigger, and it is now 40+. By the way, I don't think my task is a difficult one, so this problem confuses me. I will let it keep training overnight and check the results tomorrow, but for now I think something has gone wrong. I watched a video of you talking about this paper, and I remember you mentioned that you also ran into loss explosion during your training. Can you give me some advice for this situation based on your experience? Thanks a lot.
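
(In case it helps to reproduce my setup: a short-edge resize to 320 can be done with torchvision. This is only an illustrative sketch, and the file name is a placeholder, not my actual preprocessing script.)

```python
# Illustrative only: decode a clip and resize its frames so the shorter edge is 320.
from torchvision.io import read_video
from torchvision.transforms.functional import resize

frames, _, _ = read_video("example_clip.mp4", pts_unit="sec")  # [T, H, W, C], uint8
frames = frames.permute(0, 3, 1, 2)                            # -> [T, C, H, W] for resize
frames = resize(frames, 320)          # int size: the shorter edge is matched to 320
```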

Andy1621 commented 1 year ago

For your problem, I think it's loss NaN, which is a fairly common phenomenon. To avoid it, I have some suggestions from my experience (see the sketch after the list):

  1. Disable mixed-precision training: straightforward, but costly.
  2. Use a lower learning rate: 1/2, 1/4, or smaller (a large learning rate often leads to loss NaN).
  3. Use weaker data augmentation, e.g., a weaker RandAugment.
  4. Add more warmup epochs...
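
For concreteness, here is a rough PyTorch sketch of points 1, 2, and 4 in a generic training loop (point 3 is only noted in a comment). It is not UniFormer's actual training code; the model, data, and hyperparameter values below are placeholders for illustration.

```python
# Sketch only: a generic loop showing the AMP toggle, a reduced LR, and longer warmup.
# The model, data, and hyperparameters are placeholders, not UniFormer's real config.
import math
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

USE_AMP = False      # 1. disable mixed precision (slower, more memory, but avoids fp16 overflow)
BASE_LR = 1e-4 / 4   # 2. e.g. 1/4 of the original learning rate
WARMUP_EPOCHS = 10   # 4. longer linear warmup before the main schedule
TOTAL_EPOCHS = 50
# 3. a weaker RandAugment would be set in the data pipeline (lower magnitude / fewer ops)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 56 * 56, 12)).to(device)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=BASE_LR, weight_decay=0.05)
scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP)  # becomes a no-op when AMP is off
criterion = nn.CrossEntropyLoss()

# dummy clips: [batch, channels, frames, height, width]
loader = DataLoader(
    TensorDataset(torch.randn(64, 3, 8, 56, 56), torch.randint(0, 12, (64,))),
    batch_size=8, shuffle=True)

def lr_at(epoch):
    """Linear warmup for WARMUP_EPOCHS epochs, then cosine decay."""
    if epoch < WARMUP_EPOCHS:
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / max(1, TOTAL_EPOCHS - WARMUP_EPOCHS)
    return 0.5 * BASE_LR * (1 + math.cos(math.pi * progress))

for epoch in range(TOTAL_EPOCHS):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(epoch)
    for clips, labels in loader:
        clips, labels = clips.to(device), labels.to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=USE_AMP):
            loss = criterion(model(clips), labels)
        if not torch.isfinite(loss):   # simple guard: skip a batch whose loss is already NaN/Inf
            print(f"epoch {epoch}: non-finite loss, skipping batch")
            continue
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```

In practice you would change the corresponding options in the repo's training config (learning rate, warmup epochs, mixed precision, and the RandAugment setting) rather than hand-writing a loop like this.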