facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

RuntimeError: invalid argument 2: sizes do not match when running train.py #74

Closed by playma 6 years ago

playma commented 6 years ago

I ran train.py as follows:

python3 train.py $DATA_DIR \
      --lr 0.5 --clip-norm 0.1 --dropout 0 --max-tokens 8000 \
      --arch fconv_iwslt_de_en \
      --save-dir $MODEL_DIR \
      --max-epoch 100 \
      > $MODEL_DIR/log.txt

And I got this error:

Exception ignored in: <module 'threading' from '/usr/lib/python3.5/threading.py'>
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 1279, in _shutdown
    tlock = _main_thread._tstate_lock
KeyboardInterrupt
Traceback (most recent call last):
  File "train.py", line 269, in <module>
    main()
  File "train.py", line 87, in main
    extra_state = trainer.load_checkpoint(checkpoint_path)
  File "/home/playma/Research/origin/fairseq-py/fairseq/multiprocessing_trainer.py", line 131, in load_checkpoint
    for rank in range(self.num_replicas)
  File "/home/playma/Research/origin/fairseq-py/fairseq/multiprocessing_event_loop.py", line 162, in gen_list
    return [g.gen() for g in gens]
  File "/home/playma/Research/origin/fairseq-py/fairseq/multiprocessing_event_loop.py", line 162, in <listcomp>
    return [g.gen() for g in gens]
  File "/home/playma/Research/origin/fairseq-py/fairseq/multiprocessing_event_loop.py", line 158, in gen
    return next(self.generator)
  File "/home/playma/Research/origin/fairseq-py/fairseq/multiprocessing_event_loop.py", line 37, in result_generator
    yield self.return_pipes[rank].recv()
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "/home/playma/Research/origin/fairseq-py/fairseq/multiprocessing_event_loop.py", line 91, in _signal_handler
    raise Exception(msg)
Exception:

-- Tracebacks above this line can probably be ignored --

Traceback (most recent call last):
  File "/home/playma/Research/origin/fairseq-py/fairseq/multiprocessing_event_loop.py", line 134, in _process_event_loop
    return_pipe.send(action_fn(rank, device_id, **kwargs))
  File "/home/playma/Research/origin/fairseq-py/fairseq/multiprocessing_trainer.py", line 139, in _async_load_checkpoint
    self.lr_scheduler, cuda_device=device_id)
  File "/home/playma/Research/origin/fairseq-py/fairseq/utils.py", line 99, in load_state
    model.load_state_dict(state['model'])
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 388, in load_state_dict
    own_state[name].copy_(param)
RuntimeError: invalid argument 2: sizes do not match at /home/playma/Research/pytorch/torch/lib/THC/THCTensorCopy.cu:31
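
The last frame is the actual failure: load_state_dict copies each saved tensor into the corresponding parameter of the freshly built model, and copy_ requires identical shapes. A minimal sketch of the same failure mode outside fairseq, assuming the checkpoint was saved from a model built with a different dictionary size (the layer and sizes below are illustrative only):

    import torch.nn as nn

    # Model as it looked when the checkpoint was saved (illustrative sizes).
    saved_model = nn.Embedding(num_embeddings=8000, embedding_dim=256)
    state = saved_model.state_dict()

    # Model rebuilt later with a different dictionary size.
    new_model = nn.Embedding(num_embeddings=10000, embedding_dim=256)

    try:
        # copy_ inside load_state_dict needs matching shapes, so this raises
        # a RuntimeError about mismatched sizes.
        new_model.load_state_dict(state)
    except RuntimeError as err:
        print(err)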
playma commented 6 years ago

This is my preprocess command:

    python3 preprocess.py --source-lang source --target-lang target \
      --trainpref $ORIGIN_DIR/train --validpref $ORIGIN_DIR/valid --testpref $ORIGIN_DIR/test \
      --destdir $DATA_DIR \
      --nwordstgt 8000 \
      --nwordssrc 8000
playma commented 6 years ago

I solved the problem: train.py automatically reads the previously saved model from the save directory instead of overwriting it, so it was trying to load an old checkpoint into the newly built model.
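
For anyone hitting the same thing: if a checkpoint is already present in --save-dir, training resumes from it rather than starting fresh, so weights saved with an old dictionary clash with the newly constructed model. A rough sketch of that resume logic, not the exact fairseq code (the checkpoint filename here is an assumption):

    import os

    def maybe_resume(trainer, save_dir, checkpoint_name="checkpoint_last.pt"):
        """Load an existing checkpoint if one is found in save_dir (sketch only)."""
        checkpoint_path = os.path.join(save_dir, checkpoint_name)  # filename is assumed
        if os.path.isfile(checkpoint_path):
            # Resumes previously saved weights; fails with a size mismatch if the
            # model was rebuilt with a different dictionary/vocabulary size.
            return trainer.load_checkpoint(checkpoint_path)
        return None  # nothing to resume: train from scratch

The practical fix is simply to point --save-dir at an empty directory (or remove the stale checkpoint) whenever the dictionaries change.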