Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

Continue training/fine tuning OPUS models leads to embedding size mismatch #47

Closed. alaneckhardt closed this issue 3 years ago.

alaneckhardt commented 3 years ago

Bug description

I want to fine-tune Marian models trained on OPUS; in particular, I'm currently working with https://huggingface.co/Helsinki-NLP/opus-mt-en-cs. I have two text files with sources and targets, and I want to fine-tune the model on this data. When I run the training command below, I get the following error:

[2021-01-28 17:03:34] Initialize model weights with the pre-trained model /home/alan/data/46898-custom-mt-benefit-estimate-train-custom/opus-models/encs-train//opus.spm32k-spm32k.transformer-align.model1.npz.best-perplexity.npz
[2021-01-28 17:03:34] Loading model from /home/alan/data/46898-custom-mt-benefit-estimate-train-custom/opus-models/encs-train//opus.spm32k-spm32k.transformer-align.model1.npz.best-perplexity.npz
[2021-01-28 17:03:35] Training started
[2021-01-28 17:03:35] [data] Shuffling data
[2021-01-28 17:03:35] [data] Done reading 5000 sentences
[2021-01-28 17:03:35] [data] Done shuffling 5000 sentences to temp files
[2021-01-28 17:03:35] Error: Requested shape shape=1x32000 size=32000 for existing parameter 'decoder_ff_logit_out_b' does not match original shape shape=1x58100 size=58100
[2021-01-28 17:03:35] Error: Aborted from marian::Expr marian::ExpressionGraph::param(const string&, const marian::Shape&, marian::Ptr<marian::inits::NodeInitializer>&, marian::Type, bool, bool) in /root/marian/src/graph/expression_graph.h:317

[CALL STACK]
[0x55e8ccbe50ef]    marian::ExpressionGraph::  param  (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&,  marian::Shape const&,  std::shared_ptr<marian::inits::NodeInitializer> const&,  marian::Type,  bool,  bool) + 0xf2f
[0x55e8cce5054d]    marian::mlp::Output::  lazyConstruct  (int)        + 0x24d
[0x55e8cce5a6cc]    marian::mlp::Output::  applyAsLogits  (IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>) + 0x6c
[0x55e8ccf33dc7]    marian::DecoderTransformer::  step  (std::shared_ptr<marian::DecoderState>) + 0x1987
[0x55e8ccf36fad]    marian::DecoderTransformer::  step  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::DecoderState>) + 0x3fd
[0x55e8ccf4f5ed]    marian::EncoderDecoder::  stepAll  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::CorpusBatch>,  bool) + 0x21d
[0x55e8ccf402f4]    marian::models::EncoderDecoderCECost::  apply  (std::shared_ptr<marian::models::IModel>,  std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::Batch>,  bool) + 0xd4
[0x55e8ccb54890]    marian::models::Trainer::  build  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::Batch>,  bool) + 0xa0
[0x55e8ccfad805]    marian::SingletonGraph::  execute  (std::shared_ptr<marian::data::Batch>) + 0x95
[0x55e8ccc04bbc]    marian::Train<marian::SingletonGraph>::  run  ()   + 0x8ac
[0x55e8ccb323a2]    mainTrainer  (int,  char**)                        + 0x8a2
[0x55e8ccb10535]    main                                               + 0x35
[0x7f5fd86fe0b3]    __libc_start_main                                  + 0xf3
[0x55e8ccb3006a]    _start                                             + 0x2a

The vocabulary should be 32k, but the model apparently has 58100 output units. I also tried the en-de model, and its embedding size is 65000. Passing --dim-vocabs 58100 58100 did not help either, since the accompanying .spm models really do have a 32k vocabulary.
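A quick way to see where the two numbers come from (a sketch, assuming the released en-cs package layout and the sentencepiece Python bindings; adjust paths to your model directory) is to compare the piece count of the .spm model with the number of entries in the released vocab.yml:

# Count SentencePiece pieces in the released subword model (expected: 32000)
python3 -c "import sentencepiece as spm; \
  sp = spm.SentencePieceProcessor(model_file='source.spm'); \
  print(sp.get_piece_size())"

# Count entries in the YAML vocabulary the Marian model was trained with
# (one "token: id" pair per line; should match the 58100 reported in the error
# if this is the vocabulary that belongs to the checkpoint)
wc -l < opus.spm32k-spm32k.vocab.yml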

How to reproduce

Here is a minimal example. It requires two text files, sources.txt and targets.txt.

MODEL_DIR="my_model_dir"
marian \
  --model ${MODEL_DIR}/model.npz --type transformer \
  --pretrained-model ${MODEL_DIR}/opus.spm32k-spm32k.transformer-align.model1.npz.best-perplexity.npz \
  --train-sets sources.txt targets.txt \
  --vocabs ${MODEL_DIR}/source.spm ${MODEL_DIR}/target.spm \
  --mini-batch 2  --maxi-batch 10 
alaneckhardt commented 3 years ago

Sorry, my mistake. I passed source.spm as the vocabulary; it should have been opus.spm32k-spm32k.vocab.yml. Also, the train and dev texts should be encoded with the preprocess.sh script shipped with the model. Closing, and sorry for the confusion.
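For reference, a sketch of the corrected invocation under those assumptions: the same opus.spm32k-spm32k.vocab.yml is passed for both --vocabs slots (the OPUS models use a joint vocabulary), and sources.txt/targets.txt are assumed to already be SentencePiece-encoded with the preprocess.sh shipped in the model package.

MODEL_DIR="my_model_dir"
# sources.txt and targets.txt must already be run through preprocess.sh,
# so that their tokenization matches the pretrained model's subword units.
marian \
  --model ${MODEL_DIR}/model.npz --type transformer \
  --pretrained-model ${MODEL_DIR}/opus.spm32k-spm32k.transformer-align.model1.npz.best-perplexity.npz \
  --train-sets sources.txt targets.txt \
  --vocabs ${MODEL_DIR}/opus.spm32k-spm32k.vocab.yml ${MODEL_DIR}/opus.spm32k-spm32k.vocab.yml \
  --mini-batch 2 --maxi-batch 10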