IBM / pytorch-seq2seq

An open source framework for seq2seq models in PyTorch.
https://ibm.github.io/pytorch-seq2seq/public/index.html
Apache License 2.0

Dev branch: toy training stops after 2 epochs #174

Closed: me2beats closed this issue 6 years ago

me2beats commented 6 years ago

I'm testing the toy example on Google Colab.

On the master branch with the CPU, everything works fine.

But when I try to use the GPU and run:

TRAIN_PATH='data/toy_reverse/train/data.txt'
DEV_PATH='data/toy_reverse/dev/data.txt'
# Start training
!python examples/sample.py --train_path $TRAIN_PATH --dev_path $DEV_PATH

then I get:

/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py:52: UserWarning: size_average and reduce args will be deprecated, please use reduction='elementwise_mean' instead.
  warnings.warn(warning.format(ret))
2018-10-28 15:59:34,815 root         INFO     Namespace(dev_path='data/toy_reverse/dev/data.txt', expt_dir='./experiment', load_checkpoint=None, log_level='info', resume=False, train_path='data/toy_reverse/train/data.txt')
/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py:52: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
/usr/local/lib/python2.7/dist-packages/torch/nn/modules/rnn.py:38: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.2 and num_layers=1
  "num_layers={}".format(dropout, num_layers))
2018-10-28 15:59:37,915 seq2seq.trainer.supervised_trainer INFO     Optimizer: Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    lr: 0.001
    weight_decay: 0
), Scheduler: None
Traceback (most recent call last):
  File "examples/sample.py", line 129, in <module>
    resume=opt.resume)
  File "/usr/local/lib/python2.7/dist-packages/seq2seq/trainer/supervised_trainer.py", line 186, in train
    teacher_forcing_ratio=teacher_forcing_ratio)
  File "/usr/local/lib/python2.7/dist-packages/seq2seq/trainer/supervised_trainer.py", line 103, in _train_epoches
    loss = self._train_batch(input_variables, input_lengths.tolist(), target_variables, model, teacher_forcing_ratio)
  File "/usr/local/lib/python2.7/dist-packages/seq2seq/trainer/supervised_trainer.py", line 55, in _train_batch
    teacher_forcing_ratio=teacher_forcing_ratio)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/seq2seq/models/seq2seq.py", line 48, in forward
    encoder_outputs, encoder_hidden = self.encoder(input_variable, input_lengths)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/seq2seq/models/EncoderRNN.py", line 68, in forward
    embedded = self.embedding(input_var)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/sparse.py", line 110, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py", line 1110, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected object of type torch.cuda.LongTensor but found type torch.LongTensor for argument #3 'index'
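If I read the error correctly, the model weights are on the GPU while the input batch is still a CPU tensor. A minimal sketch of the usual fix in PyTorch 0.4+ (illustrative only, not the library's actual code; model, input_var and target_var stand for the trainer's model and batch tensors):

import torch

# pick the GPU when it is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = model.to(device)            # move encoder/decoder weights to the GPU
input_var = input_var.to(device)    # token index batch must live on the same device
target_var = target_var.to(device)  # same for the target indices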

Then I tested the dev branch. That error is gone, but training stops after 2 epochs, and as a result the output sequences are wrong. Is this expected?

That is, I run:

!scripts/toy.sh

TRAIN_SOURCE='data/toy_reverse/train/src.txt'
TRAIN_TARGET='data/toy_reverse/train/tgt.txt'
DEV_SOURCE='data/toy_reverse/dev/src.txt'
DEV_TARGET='data/toy_reverse/dev/tgt.txt'

# Start training
!python examples/sample.py $TRAIN_SOURCE $TRAIN_TARGET $DEV_SOURCE $DEV_TARGET

And I get the following output:

2018-10-28 19:41:05,614:root:INFO: train_source: data/toy_reverse/train/src.txt
2018-10-28 19:41:05,615:root:INFO: train_target: data/toy_reverse/train/tgt.txt
2018-10-28 19:41:05,615:root:INFO: dev_source: data/toy_reverse/dev/src.txt
2018-10-28 19:41:05,615:root:INFO: dev_target: data/toy_reverse/dev/tgt.txt
2018-10-28 19:41:05,615:root:INFO: experiment_directory: ./experiment
2018-10-28 19:41:09,111:seq2seq.trainer.supervised_trainer:INFO: Optimizer: Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    initial_lr: 0.001
    lr: 0.001
    weight_decay: 0
), Scheduler: <torch.optim.lr_scheduler.StepLR object at 0x7f83459c6ef0>
Train Perplexity: 22.6210: 100% 20/20 [00:01<00:00, 15.11it/s]
2018-10-28 19:41:10,528:seq2seq.trainer.supervised_trainer:INFO: Finished epoch 1: Train Perplexity: 11.3105, Dev Perplexity: 14.0211, Accuracy: 0.2420
Train Perplexity: 8.4440: 100% 20/20 [00:01<00:00, 13.06it/s]
2018-10-28 19:41:12,152:seq2seq.trainer.supervised_trainer:INFO: Finished epoch 2: Train Perplexity: 9.3128, Dev Perplexity: 7.3888, Accuracy: 0.3452
2018-10-28 19:41:12,153:root:INFO: Training time: 3.00s
Type in a source sequence: 1 2 3 4 5
['5', '5', '1', '1', '<eos>']
Type in a source sequence: 5 4 3 2 1
['1', '1', '1', '1', '<eos>']
Type in a source sequence: 
Diego999 commented 6 years ago

For the dev branch, you should change the parameters ;-) I ran into the same problem a couple of months ago. The batch size and number of epochs in master were 32 and 6, but in the dev branch they are 512 and 2. If you change them back, it will work.
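Roughly, the change in examples/sample.py would look like the sketch below. The keyword argument names follow master's SupervisedTrainer API and may not match the dev branch exactly; the surrounding variables (loss, opt, seq2seq, train, dev, optimizer) are the ones already defined in that script.

# Illustrative fragment of examples/sample.py; argument names may differ on dev.
batch_size = 32   # dev branch default is 512
n_epochs = 6      # dev branch default is 2

t = SupervisedTrainer(loss=loss, batch_size=batch_size,
                      checkpoint_every=50, print_every=10,
                      expt_dir=opt.expt_dir)
seq2seq = t.train(seq2seq, train, num_epochs=n_epochs, dev_data=dev,
                  optimizer=optimizer, teacher_forcing_ratio=0.5)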

me2beats commented 6 years ago

Working! Thanks. In sample.py:

batch_size=512 ---> batch_size=32
n_epochs=2 ---> n_epochs=6