I have a similar issue, but the learning rate does NOT change even after changing the configuration. In my case, I changed lr and lrcrit from 0.2 to 0.02 in the config file:
--lr=0.02
--lrcrit=0.02
So I can see that the 00X_config file has the new value, 0.02, but the log still prints 0.2, like this:
epoch: 181 | nupdates: 123000 | lr: 0.200000 | lrcriterion: 0.200000 | runtime: 00:05:06 | bch(ms): 306.27 | smp(ms): 0.85 | fwd(ms): 36.51 | crit-fwd(ms): 0.50 | bwd(ms): 250.96 | optim(ms): 15.03 | loss: 2.07057 | train-TER: 14.47 | train-WER: 25.12 | dev-loss: 0.10144 | dev-TER: 1.71 | dev-WER: 3.55 | vali-loss: 0.24045 | vali-TER: 2.42 | vali-WER: 5.27 | avg-isz: 274 | avg-tsz: 018 | max-tsz: 072 | hrs: 24.38 | thrpt(sec/sec): 286.61
I think "continue" mode has still some bug. Looking through it..
I found that this worked in my case: in Train.cpp, change this
if (runStatus == kTrainMode || runStatus == kForkMode) {
netoptim = initOptimizer(
{network}, FLAGS_netoptim, FLAGS_lr, FLAGS_momentum, FLAGS_weightdecay);
critoptim =
initOptimizer({criterion}, FLAGS_critoptim, FLAGS_lrcrit, 0.0, 0.0);
}
to
if (runStatus == kTrainMode || runStatus == kForkMode || runStatus == kContinueMode) {
netoptim = initOptimizer(
{network}, FLAGS_netoptim, FLAGS_lr, FLAGS_momentum, FLAGS_weightdecay);
critoptim =
initOptimizer({criterion}, FLAGS_critoptim, FLAGS_lrcrit, 0.0, 0.0);
}
And rebuild.
$ cd build;make -j 8
Could you try this?
Glad it fixed your problem; unfortunately, it didn't fix mine.
Now the log for epoch 201 is as follows:
epoch: 201 | nupdates: 1023795 | lr: 0.000154 | lrcriterion: 0.000154 | runtime: 00:04:28 | bch(ms): 52.64 | smp(ms): 0.23 | fwd(ms): 19.23 | crit-fwd(ms): 2.45 | bwd(ms): 26.86 | optim(ms): 5.80 | loss: 75.34230 | train-LER: 32.27 | train-WER: 38.92 | atccvalid-loss: 187.23566 | atccvalid-LER: 79.39 | atccvalid-WER: 89.88 | atcosimvalid-loss: 15.91293 | atcosimvalid-LER: 18.58 | atcosimvalid-WER: 24.52 | avg-isz: 892 | avg-tsz: 144 | max-tsz: 5996 | hrs: 25.25 | thrpt(sec/sec): 338.95
Good to hear that :) It would be appreciated if you could clarify what the problem was in your case.
I'm afraid you misread my comment; the problem described in my opening comment still persists after trying your fix.
Oh, I misread ;) Have you tried fork mode instead of continue mode?
Yes, as I mentioned in my initial post, I did. See the quote below.
When I use fork, the learning rate is correct; however, I think using continue mode would be better, as it is intended for exactly this purpose, as stated in the wiki:
Continue training a saved model. This can be used for example to fine-tune with a smaller learning rate. The continue option makes a best effort to resume training from the most recent checkpoint of a given model as if there were no interruptions.
@TijsRozenbroek,
Try adding the following after this line: https://github.com/facebookresearch/wav2letter/blob/master/Train.cpp#L250
netoptim->setLr(FLAGS_lr);
However, for this case we implemented it so that you use fork, because then it is simpler to reproduce which model was trained, for how long, and with which lr (without parsing the logs and checking the lr there), and also because one may have changed other parameters too (a different lr/momentum schedule, etc.).
Hi, thanks for coming to help out.
When I try your fix by adding that line to Train.cpp and rebuilding, the 201st epoch log is as follows:
epoch: 201 | nupdates: 1023795 | lr: 0.000154 | lrcriterion: 0.000182 | runtime: 00:04:30 | bch(ms): 53.17 | smp(ms): 0.19 | fwd(ms): 19.43 | crit-fwd(ms): 2.47 | bwd(ms): 27.03 | optim(ms): 5.97 | loss: 75.33963 | train-LER: 32.27 | train-WER: 38.92 | atccvalid-loss: 187.23625 | atccvalid-LER: 79.39 | atccvalid-WER: 89.88 | atcosimvalid-loss: 15.93031 | atcosimvalid-LER: 18.59 | atcosimvalid-WER: 24.54 | avg-isz: 892 | avg-tsz: 144 | max-tsz: 5996 | hrs: 25.25 | thrpt(sec/sec): 335.61
This is equal to the learning rate after trying the fix suggested by @deepspiking, and thus still not correct. (As a side note, you can also see that lrcriterion is different; I suppose critoptim->setLr(FLAGS_lrcrit); should be added to fix that.)
So unfortunately the issue still persists. I will continue using fork for the time being.
Yep, with critoptim->setLr(FLAGS_lrcrit); you should be able to change the criterion learning rate too (I forgot that you have the s2s criterion, not CTC; for CTC there are no parameters to learn).
@vineelpratap, any idea why the two fixes above don't help?
Hi @TijsRozenbroek, I see how this is confusing. I think continue was not really designed for changing configs, and we should update our wiki accordingly.
For your case, I think the problem is that we're correctly setting the initial learning rate, but we're still using the learning rate schedule to calculate the lr for this particular update (1023795).
epoch: 201 | nupdates: 1023795 | lr: 0.000154 | lrcriterion: 0.000182 | runtime: 00:04:30 | bch(ms): 53.17 | smp(ms): 0.19 | fwd(ms): 19.43 | crit-fwd(ms): 2.47 | bwd(ms): 27.03 | optim(ms): 5.97 | loss: 75.33963 | train-LER: 32.27 | train-WER: 38.92 | atccvalid-loss: 187.23625 | atccvalid-LER: 79.39 | atccvalid-WER: 89.88 | atcosimvalid-loss: 15.93031 | atcosimvalid-LER: 18.59 | atcosimvalid-WER: 24.54 | avg-isz: 892 | avg-tsz: 144 | max-tsz: 5996 | hrs: 25.25 | thrpt(sec/sec): 335.61
lr = lr_gamma ^ (num_updates / lr_step_size) * init_lr = 0.5 ^ (1023795 / 203740) * 0.005 = 0.000154
So the initial learning rate is set correctly, but we're applying a learning rate schedule as if we were on update 1023795.
For your case, I think fork is more appropriate for what you are trying to do. However, if you really want to use continue, I suppose you could also set lr_gamma to 1, so the learning rate does not decay, or set the initial learning rate so that the scheduled learning rate equals 0.005 at update 1023795.
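For illustration, here is a small standalone sketch of the arithmetic above, using the numbers from this thread; the names follow the formula above, and treating the exponent as continuous (rather than as an integer number of decay steps) is an assumption on my part.
#include <cmath>
#include <cstdio>

int main() {
  const double lrGamma = 0.5;        // lr_gamma from the formula above
  const double lrStepSize = 203740;  // lr_step_size, in updates
  const double nUpdates = 1023795;   // update count at the resumed epoch

  // Forward: the lr the schedule produces when resuming with init_lr = 0.005.
  const double initLr = 0.005;
  const double scheduledLr = std::pow(lrGamma, nUpdates / lrStepSize) * initLr;
  std::printf("scheduled lr   ~ %.6f\n", scheduledLr);   // ~0.000154, as in the log

  // Inverse: the init_lr needed so that the scheduled lr equals 0.005 at this update.
  const double targetLr = 0.005;
  const double neededInitLr = targetLr / std::pow(lrGamma, nUpdates / lrStepSize);
  std::printf("needed init lr ~ %.3f\n", neededInitLr);  // ~0.163
  return 0;
}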
Hi @padentomasello, thanks for your answer! It really cleared up my confusion.
I'll close this issue now. It would indeed be great if the wiki could be updated accordingly to prevent confusion for others in the future.
I have trained the provided seq2seq TDS model (from https://github.com/facebookresearch/wav2letter/tree/master/recipes/models/seq2seq_tds) on my own data for 200 epochs with a learning rate of 0.2 using the following config parameters (left out irrelevant rundir parameters etc.):
The last epoch is logged as follows:
After this I want to train for another 100 epochs with a learning rate starting at 0.005, using the following config parameters (again left out irrelevant rundir parameters etc.):
However, when running the command
wav2letter/build/Train continue atc/run/seq2seq_tds_distributed_atc_lr02it200/ --flagsfile /home/tijs/atc/config/train.cfg
and inspecting the training log, it becomes clear that the learning rate is incorrect, as it starts at 0.000182; see the log below. I also attempted to pass the learning rate as parameters on the command line using --lr=0.005 and --lrcrit=0.005, to no avail.
When I use fork, the learning rate is correct; however, I think using continue mode would be better, as it is intended for exactly this purpose, as stated in the wiki:
Could you please tell me whether I am overlooking something, or if something else is wrong? Thanks in advance!