Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

Incremental ENSK training failed using the provided assets #11

Closed stribizhev closed 4 years ago

stribizhev commented 4 years ago

Problem: incremental training with Marian 1.9 failed when using additional bilingual data together with the EN-SK model and the SPM models downloaded from this repo.

Exception stack:

[2020-05-13 12:56:00] Error: Requested shape shape=1x32000 size=32000 for existing parameter 'decoder_ff_logit_out_b' does not match original shape shape=1x60024 size=60024
[2020-05-13 12:56:00] Error: Aborted from marian::Expr marian::ExpressionGraph::param(const string&, const marian::Shape&, marian::Ptr<marian::inits::NodeInitializer>&, marian::Type, bool, bool) in /marian/src/graph/expression_graph.h:317

[CALL STACK]
[0x56519d56f34f]    marian::ExpressionGraph::  param  (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&,  marian::Shape const&,  std::shared_ptr<marian::inits::NodeInitializer> const&,  marian::Type,  bool,  bool) + 0xf3f
[0x56519d7f414e]    marian::mlp::Output::  lazyConstruct  (int)        + 0x24e
[0x56519d7fe7ac]    marian::mlp::Output::  applyAsLogits  (IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>) + 0x6c
[0x56519d8d9b15]    marian::DecoderTransformer::  step  (std::shared_ptr<marian::DecoderState>) + 0x1b15
[0x56519d8dcdde]    marian::DecoderTransformer::  step  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::DecoderState>) + 0x3ee
[0x56519d8f59a5]    marian::EncoderDecoder::  stepAll  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::CorpusBatch>,  bool) + 0x225
[0x56519d8e6603]    marian::models::EncoderDecoderCECost::  apply  (std::shared_ptr<marian::models::IModel>,  std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::Batch>,  bool) + 0xf3
[0x56519d4d6742]    marian::models::Trainer::  build  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::Batch>,  bool) + 0xa2
[0x56519d94623d]                                                       + 0x70f23d
[0x56519d9d8383]    marian::NCCLCommunicator::  foreach  (std::function<void (unsigned long,unsigned long,unsigned long)> const&,  bool) const + 0x763
[0x56519d942b61]    marian::SyncGraphGroup::  initialize  (std::shared_ptr<marian::data::Batch> const&) + 0x61
[0x56519d94a72e]    marian::SyncGraphGroup::  update  (std::vector<std::shared_ptr<marian::data::Batch>,std::allocator<std::shared_ptr<marian::data::Batch>>>,  unsigned long) + 0x15e
[0x56519d94cd73]    marian::SyncGraphGroup::  update  (std::shared_ptr<marian::data::Batch>) + 0x283
[0x56519d596dcf]    marian::Train<marian::SyncGraphGroup>::  run  ()   + 0x6ff
[0x56519d4b41b1]    mainTrainer  (int,  char**)                        + 0x221
[0x56519d475e35]    main                                               + 0x35
[0x7fa3a5b9eb97]    __libc_start_main                                  + 0xe7
[0x56519d4b256a]    _start                                             + 0x2a

Steps:

  1. Obtained the zip file for the EN-SK engine and extracted it in the working directory.
  2. Attempted to continue training on the EN-SK corpus, using source.spm and target.spm as the vocabulary files, with the following command: `"${marian_path}/marian" -c "${cfg_file}" --model "${fldr}/opus.spm32k-spm32k.transformer-align.model1.npz.best-perplexity.npz" --no-restore-corpus`
  3. This produced the following error: Error: Requested shape shape=32000x512 size=16384000 for existing parameter 'Wemb' does not match original shape shape=60024x512 size=30732288
  4. Upon examining the .spm files, it turned out that `wc -l source.spm` returned 69880 and `wc -l target.spm` returned 69359. There is also a file opus.spm32k-spm32k.vocab.yml, for which `wc -l` returns 60023.
  5. Retried the training with the same command, but with opus.spm32k-spm32k.vocab.yml as both the source and target vocabulary; this produced a different error: Error: Detokenizing BLEU validator expects the target vocabulary to be SentencePieceVocab or FactoredVocab. Current vocabulary type is DefaultVocab

I suspect the SPM models shipped with the EN-SK engine were trained on a different corpus.

Thank you for any hints on how to solve this problem.

jorgtied commented 4 years ago

Our models are not trained with Marian's internal SentencePiece segmentation; instead, we train the SentencePiece model before starting the MT training procedure. That means the data you train on must be segmented with that model before you can start MarianNMT. So, run spm_encode with the provided model and continue training with the MarianNMT vocabulary file (opus.spm32k-spm32k.vocab.yml). Do not add the spm models to the call. That should work. Let me know if it doesn't.
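The suggested workflow can be sketched as a small shell function. Everything here is illustrative: the file names `train.en`/`train.sk` are placeholders for the new bilingual data, and the variables `marian_path`, `cfg_file`, and `fldr` follow the command from the original report.

```shell
# Sketch of the suggested workflow; file names and the marian_path /
# cfg_file / fldr variables are placeholders, not part of the repo.
continue_training() {
    src="$1"; trg="$2"   # raw parallel files, e.g. train.en and train.sk

    # Segment the new data with the SentencePiece models shipped with the engine.
    spm_encode --model source.spm < "$src" > "$src.sp"
    spm_encode --model target.spm < "$trg" > "$trg.sp"

    # Continue training with the shipped Marian vocabulary on BOTH sides;
    # the .spm files themselves are never passed to Marian.
    "${marian_path}/marian" -c "${cfg_file}" \
        --model "${fldr}/opus.spm32k-spm32k.transformer-align.model1.npz.best-perplexity.npz" \
        --train-sets "$src.sp" "$trg.sp" \
        --vocabs opus.spm32k-spm32k.vocab.yml opus.spm32k-spm32k.vocab.yml \
        --no-restore-corpus
}
```

With the real EN-SK package extracted in the working directory, `continue_training train.en train.sk` should avoid the shape mismatch above, since both sides now use the vocabulary the checkpoint was trained with.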

jorgtied commented 4 years ago

Is this still an issue?

stribizhev commented 4 years ago

Yes, thank you, the issue is resolved.

zdposter commented 11 months ago

@stribizhev : Have you been successful in improving the EN-SK model? (I am sorry for this, but I did not find any other way to contact stribizhev...)

stribizhev commented 11 months ago

> @stribizhev : Have you been successful with improving ENSK model? (I am sorry for this, but I did not find other way how to contact stribizhev...)

I have not used these models in ages; I am now involved in fine-tuning our own models. As for the issue in this ticket, the problem was that I did not know the Marian vocabulary file could be used; I had always been passing the SPM files.