clab / dynet

DyNet: The Dynamic Neural Network Toolkit
Apache License 2.0

Gradient Clipping #568

Open talbaumel opened 7 years ago

talbaumel commented 7 years ago

Tested the new auto-batching on a seq2seq model and got: RuntimeError: Magnitude of gradient is bad: inf

neubig commented 7 years ago

Thanks for the report! Are you using the latest code? I fixed a bug yesterday that might have been causing a problem like this on GPU.

If you're using the latest pull, could you share the code for a minimal reproducible example?

pmichel31415 commented 7 years ago

I had the same problem with a similar model (LSTM-VAE) on this commit 55266dc7fed7b3fc90dab0b2d2a7223b41c541db ("Fixed some doc")

As for a minimal reproducible example... I'll see what I can do

talbaumel commented 7 years ago

I got the error after a few hours of training, so I'm not sure about getting a minimal example. Looking over the commit history, I can see that I installed it before yesterday's bug fix. I'll reinstall, test it, and report back on whether training succeeds.

Thanks!

talbaumel commented 7 years ago

The error reappeared, this time after 20 hours.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
in ()
----> 1 train(model, train_set, 2)

in train(model, train_set, epochs)
     19 batch_loss.npvalue() # this calls forward on the batch
     20 batch_loss.backward()
---> 21 trainer.update()
     22
     23

_gdynet.pyx in _gdynet.Trainer.update (_gdynet.cpp:59135)()
_gdynet.pyx in _gdynet.Trainer.update (_gdynet.cpp:59039)()

RuntimeError: Magnitude of gradient is bad: inf

By the way, auto-batching seems to shave about 10 hours off a 35-hour run (compared to the same run without auto-batching). Great job!
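
For context, the loop in that traceback follows the usual DyNet auto-batching pattern: build the losses for every example in the minibatch on one computation graph, sum them, force the forward pass, then call backward() and update(). A minimal sketch (the helper compute_loss and the names model.pc and train_set are placeholders, not the actual code from this run):

import dynet as dy

def train(model, train_set, epochs, batch_size=32):
    trainer = dy.AdamTrainer(model.pc)   # model.pc: the ParameterCollection (placeholder name)
    for epoch in range(epochs):
        for start in range(0, len(train_set), batch_size):
            dy.renew_cg()                # fresh computation graph per minibatch
            losses = [compute_loss(model, example)   # compute_loss: model-specific (placeholder)
                      for example in train_set[start:start + batch_size]]
            batch_loss = dy.esum(losses)
            batch_loss.npvalue()         # this calls forward on the batch
            batch_loss.backward()
            trainer.update()             # the line that raises "Magnitude of gradient is bad: inf"
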
neubig commented 7 years ago

Hmm, and just to confirm: this error doesn't occur when you set --dynet-autobatch 0 with the other settings exactly the same, correct? It won't be possible to debug without some way to reproduce the problem, but if we can get code/data that causes the problem to occur consistently (hopefully in a much shorter time than 20 hours), that would really help.

pmichel31415 commented 7 years ago

This issue seems to be fixed for me... maybe the lookup parameter fix resolved it as well. Honestly, I'm not sure where it came from.

It also saves a lot of time, which is really cool

talbaumel commented 7 years ago

Managed to run through my entire dataset with a lower batch size.

22 hours with auto-batching, 35 without!

talbaumel commented 6 years ago

Hi, sorry to reopen this issue. Is there any way to avoid RuntimeError: Magnitude of gradient is bad: inf?

Background: it happens during hyper-parameter search, which means there are several models in memory at the same time, and only the third model yields this error. I am using trainer.set_clip_threshold(1).
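
One application-level workaround (a rough sketch, assuming the standard DyNet Python API; batch_loss is a placeholder) is to clip aggressively and simply skip any minibatch whose update still blows up:

trainer.set_clip_threshold(1)   # clip the global gradient norm at 1

batch_loss.backward()
try:
    trainer.update()
except RuntimeError as err:
    # "Magnitude of gradient is bad: inf/nan": drop this minibatch and continue.
    # Caveat: update() raises before it clears the per-parameter gradients, so the
    # bad gradients can leak into the next update; this is a best-effort recovery.
    print("skipping bad update:", err)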

dumitrescustefan commented 6 years ago

Just a short comment: I'm getting this error too.

 File "/work/incidents/nn_network3.py", line 189, in run_cg
    self.trainer.update()
  File "_dynet.pyx", line 5904, in _dynet.Trainer.update
  File "_dynet.pyx", line 5909, in _dynet.Trainer.update
RuntimeError: Magnitude of gradient is bad: inf

I'm also using auto-batching, running on CPU. The error is not fixed: it pops up sometimes at the 2nd epoch, sometimes at the 10th, and each epoch takes ~8 hours. My current way around this was to save the model at the end of each epoch and then restart fresh; I guess resetting the trainer parameters helps get past it. However, for some reason I'm now getting the error within the same epoch, sometimes at 30% through it, just now at ~85%. Using set_clip_threshold doesn't help (same behavior). I'm using AdamTrainer, and the training loss does not increase; it currently wobbles, changing by less than ±5% versus the previous epoch. The network is a series of LSTMs that all eventually feed into a softmax.
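
For reference, that checkpoint-and-restart workaround looks roughly like this (a sketch assuming the DyNet 2.x Python serialization API; build_network, run_one_epoch, num_epochs, and the checkpoint path are placeholders):

import dynet as dy

pc = dy.ParameterCollection()
network = build_network(pc)      # application-specific model construction (placeholder)
trainer = dy.AdamTrainer(pc)

for epoch in range(num_epochs):
    try:
        run_one_epoch(network, trainer)   # the usual renew_cg / forward / backward / update loop (placeholder)
        pc.save("model.ckpt")             # checkpoint the weights at the end of every epoch
    except RuntimeError:
        # "Magnitude of gradient is bad": reload the last good checkpoint and
        # recreate the trainer so Adam's moment estimates start from scratch.
        pc.populate("model.ckpt")
        trainer = dy.AdamTrainer(pc)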

kwalcock commented 4 years ago

This is still a problem in 2020. From Java, in the middle of a long training run, I'm seeing:

03:14:38.980 [run-main-0] DEBUG org.clulab.ctxemb.Flair - Processed 153000 sentences. Cummulative loss: 0.6944733244576842.
[error] (run-main-0) java.lang.RuntimeException: unknown exception
[error] java.lang.RuntimeException: unknown exception

The exception details are unknown because SWIG is not configured to deal with the exception in dynet_swig.i

%catches(std::invalid_argument, ...);

but I believe (and am working on confirming) that it comes from this code:

float Trainer::clip_gradients() {
  float gscale = 1;
  if (clipping_enabled) {
    float gg = model->gradient_l2_norm();
    if (isnan(gg) || isinf(gg)) {
      ostringstream oss; oss << "Magnitude of gradient is bad: " << gg;
      throw std::runtime_error(oss.str());
    }
    if (gg > clip_threshold) {
      ++clips;
      ++clips_since_status;
      gscale = clip_threshold / gg;
    }
  }
  return gscale;
}

What can be done about it? If there is no standard procedure to follow in the C++ code, can anyone suggest the best thing to do when the exception is caught in application code?

@MihaiSurdeanu, you may want to follow this.

Thanks, all.

kwalcock commented 4 years ago

P.S. Is any of the code following the throw critical for an abandoned update? This includes the remainder of the code in Trainer::update() and whatever calls it. The updates variable won't be incremented, the weights won't decay, etc.

// this calls the rule-specific updates over all updated parameters
void Trainer::update() {
  const auto & params = model->parameters_list();
  const auto & lparams = model->lookup_parameters_list();

  // Allocate if necessary
  if(aux_allocated < params.size()) {
    aux_allocated = alloc_impl();
  }
  if(aux_allocated_lookup < lparams.size()) {
    aux_allocated_lookup = alloc_lookup_impl();
  }

  // Perform gradient clipping and cycle through parameters
  const float gscale = clip_gradients(); // <---- If this throws, is everything still in a consistent state?
  for(size_t i = 0; i < params.size(); ++i) {
    if(params[i]->updated) {
      update_params(gscale, i);
      params[i]->clear();
    }
  }
  for(size_t i = 0; i < lparams.size(); ++i) {
    auto &p = lparams[i];
    if (p->updated) {
      if(sparse_updates_enabled && !p->all_updated) {
        for (auto j : p->non_zero_grads)
          update_lookup_params(gscale, i, j);
      } else {
        update_lookup_params(gscale, i);
      }
      p->clear();
    }
  }
  ++updates;
  ++updates_since_status;

  L2WeightDecay & wd = model->get_weight_decay();
  wd.update_weight_decay(); // update global weight scale
  if (wd.parameters_need_rescaled())
    rescale_and_reset_weight_decay();  // if wdscale is getting to small multiply all weights by wdscale, and set wdscale to 1
}

kwalcock commented 4 years ago

The exception we're seeing is "Magnitude of gradient is bad: nan". I wonder if this code

https://github.com/clab/dynet/blob/7dfa70e418eef6504ac4ac5d622874886951f560/dynet/model.cc#L846

could be trying to take the sqrt of a negative number (or Inf or NaN). Is this sum guaranteed to be positive?

neubig commented 4 years ago

I think it should be guaranteed to be non-negative, which should be good enough for sqrt. If you can confirm that the value is negative before getting a crash and have some way to reproduce it we may be able to take a look.
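
For intuition, a tiny NumPy sketch (illustrative only, not DyNet code): the sum of squared gradient entries is indeed non-negative, so sqrt itself is safe, but a single inf or nan anywhere in the gradients propagates straight into the norm, which is exactly what the isnan/isinf check in Trainer::clip_gradients() catches.

import numpy as np

g = np.array([0.5, -2.0, 3.0])
print(np.sqrt(np.sum(g ** 2)))        # ~3.64: a well-behaved L2 norm

g_inf = np.array([0.5, np.inf, 3.0])
print(np.sqrt(np.sum(g_inf ** 2)))    # inf -> "Magnitude of gradient is bad: inf"

g_nan = np.array([0.5, np.nan, 3.0])
print(np.sqrt(np.sum(g_nan ** 2)))    # nan -> "Magnitude of gradient is bad: nan"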