Smerity / sha-rnn

Single Headed Attention RNN - "Stop thinking with your head"

Gradient overflows #7

Open stefan-it opened 4 years ago

stefan-it commented 4 years ago

Hi @Smerity ,

thanks for open-sourcing the code for this great project :heart:

I trained a character-based model for German on ~1GB of text (mainly from OPUS). It worked well for two epochs, but then the following error message was thrown:

| epoch   1 | 121090/129094 batches | lr 0.00200 | ms/batch 216.22 | loss  0.83 | ppl     2.28 | bpc    1.191
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 256.0
| epoch   1 | 121100/129094 batches | lr 0.00200 | ms/batch 190.10 | loss   nan | ppl      nan | bpc      nan
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 128.0

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0
Traceback (most recent call last):
  File "main.py", line 379, in <module>
    train(epoch - 1)
  File "main.py", line 302, in train
    scaled_loss.backward()
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_process_optimizer.py", line 241, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_process_optimizer.py", line 127, in post_backward_models_are_masters
    scale_override=(grads_have_scale, stashed_have_scale, out_scale))
  File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/scaler.py", line 176, in unscale_with_stashed
    out_scale/grads_have_scale,   # 1./scale,
ZeroDivisionError: float division by zero
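
If I'm reading the traceback right, the crash is the dynamic loss scaler bottoming out: every overflow halves the scale, and once it reaches 0.0 the unscale step does a 1./scale and divides by zero. Roughly like this simplified sketch of the general technique (illustrative only, not apex's actual code):

# Simplified, illustrative sketch of dynamic loss scaling -- not apex's actual implementation.
import torch

class ToyDynamicLossScaler:
    def __init__(self, init_scale=2.**15, growth_interval=2000):
        self.scale = init_scale
        self.good_steps = 0
        self.growth_interval = growth_interval

    def step(self, optimizer, params):
        grads = [p.grad for p in params if p.grad is not None]
        # Overflow check: any non-finite fp16 gradient means the scale was too large.
        if any(not torch.isfinite(g).all() for g in grads):
            self.scale /= 2          # "Gradient overflow. Skipping step, ... reducing loss scale"
            self.good_steps = 0
            return False             # step skipped
        inv_scale = 1. / self.scale  # apex does the equivalent 1./scale here; once the scale
                                     # has decayed to 0.0 this raises ZeroDivisionError
        for g in grads:
            g.mul_(inv_scale)        # unscale the gradients back to their true magnitude
        optimizer.step()
        self.good_steps += 1
        if self.good_steps % self.growth_interval == 0:
            self.scale *= 2          # probe a larger scale again after a stable stretch
        return True

So the real question is why the gradients keep overflowing in the first place.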

Then I resumed training with a lower learning rate (otherwise pretty much the same parameters as stated in the main readme), and the same error was thrown after one epoch.

Do you know how this can be prevented? 🤔
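
One thing I was thinking of trying (not sure it addresses the root cause): apex's amp.initialize accepts a min_loss_scale argument that puts a floor under the dynamic scale, and the apex docs show gradient clipping via amp.master_params. An untested sketch of what I mean, with a toy model standing in for the SHA-RNN and made-up hyperparameters:

# Untested sketch: keep apex's dynamic loss scale from collapsing and clip the master gradients.
import torch
import torch.nn as nn
from apex import amp

model = nn.Linear(10, 10).cuda()                 # stand-in for the SHA-RNN model
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)
criterion = nn.MSELoss()

model, optimizer = amp.initialize(
    model, optimizer,
    opt_level='O1',          # assumption: whichever opt_level main.py uses
    min_loss_scale=128.0,    # never let the dynamic loss scaler drop below this floor
)

for _ in range(100):                             # stand-in for the real batch loop
    data = torch.randn(32, 10).cuda()
    targets = torch.randn(32, 10).cuda()
    optimizer.zero_grad()
    loss = criterion(model(data), targets)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    # Clip the gradients apex actually steps on (pattern from the apex docs).
    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), 0.25)
    optimizer.step()

That should at least keep the scaler from ever reaching 0.0, though if the gradients themselves have gone NaN the steps would just keep getting skipped.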

However, the generated text is very interesting 😅

||||D i e _ P r o t o k o l l e , _ d i e _ a u f _ B e r e c h n u n g e n _ d e r _ F a m i l i e _ P a a r u n g _ u n d _ d e r _ E n t s c h e i d u n g e n _ f ü r _ d i e _ R e g u l i e r u n g _ d e r _ U n i v e r s i t ä t _ v o r g e b r a c h t _ w e r d e n , _ w i r d _ d e m n a
c h _ m i t _ n i e d r i g e n _ G e w i n n n i v e a u s _ d i s k u t i e r t _ w e r d e n .
S i e _ k ö n n e n _ a u c h _ z w i s c h e n _ d e n _ v e r s c h i e d e n e n _ K o n z e p t i o n e n _ v o n _ F a m i l i e n _ u n d _ K i n d e r n _ i n t e r e s s i e r e n : _ b e i s p i e l s w e i s e : _ B i o g r a p h i e , _ M a g i e , _ G e s c h i c h t e , _ C a p t a
i n _ S l a v i a - S t i l , _ A n s i c h t e n _ u n d _ V i d e o s .
D i e s e r _ S c h a l l p e g e l _ l ä u f t _ i n _ e i n _ H ö h e n v e r s t e l l u n g s g e f ä ß _ d e s _ a l l g e m e i n e n _ G e r ä t e s _ b e i _ d e r _ D i c h t h e i t .
||||A u f _ d e r _ W e s t s e i t e _ d e r _ A u t o b a h n _ A 1 , _ n a h e _ L a _ G o m e r a _ b e f i n d e n _ s i c h _ z w e i _ S t r a ß e n v e r b i n d u n g e n _ z w i s c h e n _ d e n _ B e r g e n _ u n d _ d e r _ S e h e n s w ü r d i g k e i t .
Z u _ d e n _ f o l g e n d e n _ D i e n s t l e i s t u n g e n _ g e h ö r e n _ T e l e f o n , _ k o s t e n l o s e _ P a r k p l ä t z e , _ e i n _ B ü g e l e i s e n / - b r e t t _ ( 2 4 - S t u n d e n - R e z e p t i o n ) . 
Smerity commented 4 years ago

Thanks for running this on German @stefan-it! I haven't done experiments on other languages yet and it's great to see the model at least hold up!

Unfortunately I have run into similar issues in the past re: NaNs at around 15 or 20 epochs on the enwik8 data, which works out to roughly the same amount of training as 2 epochs on your larger German dataset.

I still haven't tracked down exactly what it might be but I do know the random seed can impact it. My guess is that there might be an issue with dropout over the attention window or similar.
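
If you want to keep digging, PyTorch's anomaly detection plus a quick finiteness check on the gradients might at least show where the first NaN appears; a rough sketch of what I mean (check_grads is just a hypothetical helper, not something in this repo):

import torch

# Slow, debugging only: makes autograd report the backward op that produced the first NaN/inf.
torch.autograd.set_detect_anomaly(True)

def check_grads(model, step):
    """Hypothetical helper: report the first parameter whose gradient went non-finite."""
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"step {step}: non-finite gradient in {name}")
            return False
    return True

Calling check_grads(model, step) right after scaled_loss.backward() should say which parameter goes non-finite first. Anomaly detection slows training down a lot though, so I'd only enable it near the step counts where the NaNs tend to show up.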

I'll be making a new, cleaner codebase, and ensuring that issues like this don't occur will be a top priority. As a temporary fix, if you're curious to continue investigating, you could save the model every N iterations as long as the loss hasn't NaN'ed out, then restart from that checkpoint with a different random seed, as you've done. That's admittedly not a great solution, however; a rough sketch of the checkpointing part is below.
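
For the checkpoint-and-restart part, something along these lines in the training loop would do it (a rough sketch; maybe_checkpoint is a hypothetical helper, not something in main.py):

import torch

def maybe_checkpoint(model, optimizer, loss, step, every_n=1000, path='backup.pt'):
    """Hypothetical helper: keep a rolling checkpoint, but only while the loss is still finite."""
    if step % every_n == 0 and torch.isfinite(loss).all():
        torch.save({'model': model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'step': step}, path)

# In the training loop, after optimizer.step():
#   maybe_checkpoint(model, optimizer, loss, step)
# If the loss NaNs out: reload 'backup.pt', call torch.manual_seed() with a new seed,
# and resume from the saved step.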

I'm glad you're enjoying the generated text! I ran it through Google Translate and it at least produces something I can read lol. I'll note that for this model the more context you can seed it with the better!