talbaumel opened this issue 7 years ago
Thanks for the report! Are you using the latest code? I fixed a bug yesterday that might have been causing a problem like this on GPU.
If you're using the latest pull, could you share the code for a minimal reproducible example?
I had the same problem with a similar model (LSTM-VAE) on this commit 55266dc7fed7b3fc90dab0b2d2a7223b41c541db ("Fixed some doc")
As for a minimal reproducible example... I'll see what I can do
I got the error after a few hours of training, so I'm not sure about producing a minimal example. Looking over the commit history I can see that I installed it before yesterday's bug fix. I'll reinstall, test it, and report back whether the training succeeds.
Thanks!
The error reappeared, this time after 20 hours:
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Hmm, and just to confirm: this error doesn't occur when you set --dynet-autobatch 0
with the other settings exactly the same, correct? It won't be possible to debug without some way to reproduce the problem, but if we can get code/data that causes the problem to occur consistently (hopefully in a much shorter time than 20 hours), that would really help.
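Just to spell it out, the flag is read at DyNet startup, so the two runs only differ in how the process is launched: either pass --dynet-autobatch 0 or 1 on the training script's command line, or set it programmatically before dynet is imported. A sketch of the latter (assuming the dynet_config keyword is spelled autobatch):

import dynet_config
dynet_config.set(autobatch=True)   # assumed keyword name; use autobatch=False for the control run
import dynet as dy                 # the config must be set before this import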
This issue seems to be fixed for me... Maybe the lookup parameter fix fixed that as well... Not sure honestly where this came from.
It also saves a lot of time, which is really cool
Managed to run through my entire dataset with a lower batch size: 22 hours with auto-batching, 35 without!
Hi,
Sorry to reopen this issue,
Is there any way to avoid RuntimeError: Magnitude of gradient is bad: inf?
Background:
It happens during hyper-parameter search, which means there are several models of the same size in memory, and only the third model yields this error.
I am using trainer.set_clip_threshold(1)
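The best I've come up with so far is to keep clipping on and treat the failed step as a bad batch: wrap trainer.update() in a try/except and skip it. A rough sketch (batches and compute_loss are placeholders for my real data and model code); is there a cleaner way?

import dynet as dy

pc = dy.ParameterCollection()
trainer = dy.AdamTrainer(pc)
trainer.set_clip_threshold(1)

for batch in batches:                  # placeholder: iterable of training batches
    dy.renew_cg()
    loss = compute_loss(batch)         # placeholder: builds the loss expression
    loss.backward()
    try:
        trainer.update()
    except RuntimeError as e:          # "Magnitude of gradient is bad: inf"
        print("skipping bad update:", e)
        # Caveat: I'm not sure whether the gradients from the failed step are
        # discarded or whether they still contaminate the next update.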
Just a short comment: I'm getting this error too.
File "/work/incidents/nn_network3.py", line 189, in run_cg
self.trainer.update()
File "_dynet.pyx", line 5904, in _dynet.Trainer.update
File "_dynet.pyx", line 5909, in _dynet.Trainer.update
RuntimeError: Magnitude of gradient is bad: inf
I'm also using auto-batching, running on CPU. The error does not occur at a fixed point: it pops up sometimes in the 2nd epoch, sometimes in the 10th, and each epoch takes ~8 hours. My way around this was to save the model at the end of each epoch and then restart fresh; I guess resetting the trainer parameters helps get over it. However, for some reason I'm now getting the error within the same epoch, sometimes at 30% of the way through, just now at ~85%. Using set_clip_threshold doesn't help (same behavior). I'm using AdamTrainer; the training loss does not increase, it just wobbles, changing by less than ±5% versus the previous epoch. The network is a series of LSTMs all eventually feeding into a softmax.
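In case it helps anyone, the checkpoint-and-restart workaround looks roughly like this (a sketch only; build_model and train_one_epoch stand in for my real code, and I'm assuming the ParameterCollection save/populate API and Trainer.restart()):

import dynet as dy

num_epochs = 20                            # placeholder
pc = dy.ParameterCollection()
model = build_model(pc)                    # placeholder: builds the LSTM stack + softmax
trainer = dy.AdamTrainer(pc)

pc.save("checkpoint.model")                # initial checkpoint so there is always something to reload
for epoch in range(num_epochs):
    try:
        train_one_epoch(model, trainer)    # placeholder: one pass over the data
    except RuntimeError as e:              # "Magnitude of gradient is bad: ..."
        print("epoch %d failed (%s); reloading last checkpoint" % (epoch, e))
        pc.populate("checkpoint.model")    # roll back to the last good weights
        trainer.restart()                  # assumed to clear the optimizer's internal state
        continue
    pc.save("checkpoint.model")            # checkpoint after each clean epoch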
This is still a problem in 2020. From Java, in the middle of a long training run, I'm seeing:
03:14:38.980 [run-main-0] DEBUG org.clulab.ctxemb.Flair - Processed 153000 sentences. Cummulative loss: 0.6944733244576842.
[error] (run-main-0) java.lang.RuntimeException: unknown exception
[error] java.lang.RuntimeException: unknown exception
The exception details are unknown because SWIG is not configured to handle this exception type in dynet_swig.i:
%catches(std::invalid_argument, ...);
but I believe, and am working on confirming, that it comes from this code:
float Trainer::clip_gradients() {
  float gscale = 1;
  if (clipping_enabled) {
    float gg = model->gradient_l2_norm();
    if (isnan(gg) || isinf(gg)) {
      ostringstream oss; oss << "Magnitude of gradient is bad: " << gg;
      throw std::runtime_error(oss.str());
    }
    if (gg > clip_threshold) {
      ++clips;
      ++clips_since_status;
      gscale = clip_threshold / gg;
    }
  }
  return gscale;
}
What can be done about it? If no standard procedure can be followed in the C++ code, can anyone suggest what is best done in application code when the exception is caught?
@MihaiSurdeanu, you may want to follow this.
Thanks, all.
P.S. Is any of the code following the throw critical for an abandoned update? This includes the remainder of the code in the Trainer and whatever calls that. The updates variable won't be incremented, the weights won't decay, etc.
// this calls the rule-specific updates over all updated parameters
void Trainer::update() {
  const auto & params = model->parameters_list();
  const auto & lparams = model->lookup_parameters_list();
  // Allocate if necessary
  if(aux_allocated < params.size()) {
    aux_allocated = alloc_impl();
  }
  if(aux_allocated_lookup < lparams.size()) {
    aux_allocated_lookup = alloc_lookup_impl();
  }
  // Perform gradient clipping and cycle through parameters
  const float gscale = clip_gradients(); // <---- If this throws, is everything still in a consistent state?
  for(size_t i = 0; i < params.size(); ++i) {
    if(params[i]->updated) {
      update_params(gscale, i);
      params[i]->clear();
    }
  }
  for(size_t i = 0; i < lparams.size(); ++i) {
    auto &p = lparams[i];
    if (p->updated) {
      if(sparse_updates_enabled && !p->all_updated) {
        for (auto j : p->non_zero_grads)
          update_lookup_params(gscale, i, j);
      } else {
        update_lookup_params(gscale, i);
      }
      p->clear();
    }
  }
  ++updates;
  ++updates_since_status;
  L2WeightDecay & wd = model->get_weight_decay();
  wd.update_weight_decay(); // update global weight scale
  if (wd.parameters_need_rescaled())
    rescale_and_reset_weight_decay(); // if wdscale is getting too small, multiply all weights by wdscale and set wdscale to 1
}
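Answering part of my own P.S.: as the code above shows, clip_gradients() throws before any params[i]->clear() call runs, so the gradients accumulated for the failed step (presumably including the inf/NaN values) are left in place, and the updates counter and weight decay are simply skipped. The only application-level guard I can think of is to check the loss for non-finite values before calling backward(), so that at least a blown-up forward pass never contributes gradients at all. Sketched here against the Python API (compute_loss is a placeholder); the same idea should carry over to the Java/SWIG bindings:

import math
import dynet as dy

def safe_step(trainer, batch):
    dy.renew_cg()
    loss = compute_loss(batch)          # placeholder: builds the scalar loss expression
    if not math.isfinite(loss.value()):
        return None                     # forward pass already inf/NaN: skip before backward()
    loss.backward()
    trainer.update()                    # can still throw if the gradient itself overflows
    return loss.value()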
The exception we're seeing is "Magnitude of gradient is bad: nan". I wonder if this code
https://github.com/clab/dynet/blob/7dfa70e418eef6504ac4ac5d622874886951f560/dynet/model.cc#L846
could be trying to take the sqrt of a negative number (or Inf or NaN). Is this sum guaranteed to be positive?
I think it should be guaranteed to be non-negative, which should be good enough for sqrt. If you can confirm that the value is negative before getting a crash and have some way to reproduce it we may be able to take a look.
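For reference, a sum of squares is non-negative by construction, so sqrt itself shouldn't be the problem; the norm comes out inf or NaN only when some gradient component is already non-finite, or when the sum overflows the floating-point range, and sqrt just propagates that. A tiny standalone illustration (Python doubles standing in for the C++ floats, same IEEE rules):

import math

print(math.sqrt(sum(g * g for g in [1.0, float("inf")])))    # inf -> "Magnitude of gradient is bad: inf"
print(math.sqrt(sum(g * g for g in [1.0, float("nan")])))    # nan -> the "bad: nan" variant
print(math.sqrt(sum(g * g for g in [1e200, 1e200])))         # inf via overflow, even though every g is finite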
Tested the new auto-batching on a seq2seq model and got:
RuntimeError: Magnitude of gradient is bad: inf