facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Train on top of WMT19 model #2553

Closed: scheiblr closed this issue 4 years ago

scheiblr commented 4 years ago

Hey guys,

in the translation documentation there is a download for the WMT19 en-de model which contains 4 model files. On torch hub there is transformer.wmt19.en-de.single_model, which consists of a single model file. I prepared some data on which I wanted to continue training the model. Training from scratch on that data alone worked, but I couldn't get it running with either of the models above: with the 4-file model I don't know how to set it up at all, and with the single-file model I got dimension mismatches. I put the model and everything into checkpoints and started training as follows:

DEST=data/extracted/data/bin
CUDA_VISIBLE_DEVICES=0 fairseq-train \
    $DEST \
    --arch transformer_wmt_en_de --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --bpe fastbpe \
    --save-dir checkpoints \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric

I get this error:

RuntimeError: Error(s) in loading state_dict for TransformerModel:
        size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([42024, 1024]) from checkpoint, the shape in current model is torch.Size([4088, 512]).
        size mismatch for encoder.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).
[...]
        size mismatch for encoder.layers.0.fc1.weight: copying a param with shape torch.Size([8192, 1024]) from checkpoint, the shape in current model is torch.Size([2048, 512]).
        size mismatch for encoder.layers.0.fc1.bias: copying a param with shape torch.Size([8192]) from checkpoint, the shape in current model is torch.Size([2048]).
        size mismatch for encoder.layers.0.fc2.weight: copying a param with shape torch.Size([1024, 8192]) from checkpoint, the shape in current model is torch.Size([512, 2048]).
        size mismatch for encoder.layers.0.fc2.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([512]).
[...]
Exception: Cannot load model parameters from checkpoint checkpoints/checkpoint_last.pt; please ensure that the architectures match.

I also tried transformer_wmt_en_de_big, transformer_wmt_en_de_big_align, transformer_wmt_en_de_big_t2t and transformer.
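
Instead of guessing --arch values, the architecture a checkpoint expects can be read straight out of the downloaded file (a minimal sketch; fairseq checkpoints of this era store the training Namespace under an 'args' key):

import torch

# Peek at the architecture recorded in one of the downloaded model files.
state = torch.load('checkpoints/model1.pt', map_location='cpu')
print(state['args'].arch)  # e.g. transformer_wmt_en_de_big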

So my questions are: how do I set up training on top of the 4-file model, and which architecture do I have to pass so that the pretrained checkpoint loads without dimension mismatches?

Thanks in advance!

lematt1991 commented 4 years ago

You can instantiate the model this way:

import torch

# Load all four WMT19 en-de checkpoints as an ensemble via torch.hub
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de',
                       checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
                       tokenizer='moses', bpe='fastbpe')

Furthermore, you can inspect the arguments the model was trained with by printing en2de.args. Printing en2de.args.arch shows transformer_wmt_en_de_big as the architecture being used; see the sketch below. Hope this helps!
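
A minimal sketch of that inspection (en2de is the hub interface loaded above; fairseq of this vintage exposes the training Namespace as .args):

# Print the hyperparameters the checkpoint was trained with, so
# fairseq-train can be launched with matching flags.
print(en2de.args.arch)                              # transformer_wmt_en_de_big
print(en2de.args.encoder_embed_dim)                 # 1024
print(en2de.args.share_decoder_input_output_embed)  # expected True here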

scheiblr commented 4 years ago

Thanks a lot. That helped me resolve all but one issue. When I start training now, it loads everything but then crashes with the following error message:

2020-09-11 17:08:42 | INFO | fairseq_cli.train | task: translation (TranslationTask)
2020-09-11 17:08:42 | INFO | fairseq_cli.train | model: transformer_wmt_en_de_big (TransformerModel)
2020-09-11 17:08:42 | INFO | fairseq_cli.train | criterion: label_smoothed_cross_entropy (LabelSmoothedCrossEntropyCriterion)
2020-09-11 17:08:42 | INFO | fairseq_cli.train | num. model params: 355811328 (num. trained: 355811328)
[..]
2020-09-11 17:08:43 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2020-09-11 17:08:43 | INFO | fairseq_cli.train | max tokens per GPU = 4096 and max sentences per GPU = None
Traceback (most recent call last):
  File "/home/scheiblr/anaconda3/envs/translation/lib/python3.7/site-packages/fairseq/trainer.py", line 279, in load_checkpoint
    state["model"], strict=True, args=self.args
  File "/home/scheiblr/anaconda3/envs/translation/lib/python3.7/site-packages/fairseq/models/fairseq_model.py", line 93, in load_state_dict
    return super().load_state_dict(new_state_dict, strict)
  File "/home/scheiblr/anaconda3/envs/translation/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1045, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for TransformerModel:
        Missing key(s) in state_dict: "decoder.output_projection.weight". 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/scheiblr/anaconda3/envs/translation/bin/fairseq-train", line 8, in <module>
    sys.exit(cli_main())
  File "/home/scheiblr/anaconda3/envs/translation/lib/python3.7/site-packages/fairseq_cli/train.py", line 350, in cli_main
    distributed_utils.call_main(args, main)
  File "/home/scheiblr/anaconda3/envs/translation/lib/python3.7/site-packages/fairseq/distributed_utils.py", line 254, in call_main
    main(args, **kwargs)
  File "/home/scheiblr/anaconda3/envs/translation/lib/python3.7/site-packages/fairseq_cli/train.py", line 114, in main
    disable_iterator_cache=task.has_sharded_data("train"),
  File "/home/scheiblr/anaconda3/envs/translation/lib/python3.7/site-packages/fairseq/checkpoint_utils.py", line 173, in load_checkpoint
    reset_meters=reset_meters,
  File "/home/scheiblr/anaconda3/envs/translation/lib/python3.7/site-packages/fairseq/trainer.py", line 288, in load_checkpoint
    "please ensure that the architectures match.".format(filename)
Exception: Cannot load model parameters from checkpoint ./checkpoints/checkpoint_last.pt; please ensure that the architectures match.

Is it possible that this is due to tokens replaced by <unk> when binarizing the data? The preprocessing log is below, followed by a way to check the dictionaries:

2020-09-11 17:11:45 | INFO | fairseq_cli.preprocess | [en] Dictionary: 42024 types
2020-09-11 17:11:46 | INFO | fairseq_cli.preprocess | [en] data/train.en: 36140 sents, 402235 tokens, 0.00224% replaced by <unk>
2020-09-11 17:11:46 | INFO | fairseq_cli.preprocess | [en] Dictionary: 42024 types
2020-09-11 17:11:46 | INFO | fairseq_cli.preprocess | [en] data/valid.en: 4517 sents, 50159 tokens, 0.00199% replaced by <unk>
2020-09-11 17:11:46 | INFO | fairseq_cli.preprocess | [en] Dictionary: 42024 types
2020-09-11 17:11:47 | INFO | fairseq_cli.preprocess | [en] data/test.en: 4518 sents, 50652 tokens, 0.00395% replaced by <unk>
2020-09-11 17:11:47 | INFO | fairseq_cli.preprocess | [de] Dictionary: 42024 types
2020-09-11 17:11:48 | INFO | fairseq_cli.preprocess | [de] data/train.de: 36140 sents, 500162 tokens, 0.0032% replaced by <unk>
2020-09-11 17:11:48 | INFO | fairseq_cli.preprocess | [de] Dictionary: 42024 types
2020-09-11 17:11:48 | INFO | fairseq_cli.preprocess | [de] data/valid.de: 4517 sents, 62528 tokens, 0.0064% replaced by <unk>
2020-09-11 17:11:48 | INFO | fairseq_cli.preprocess | [de] Dictionary: 42024 types
2020-09-11 17:11:48 | INFO | fairseq_cli.preprocess | [de] data/test.de: 4518 sents, 62872 tokens, 0.00318% replaced by <unk>
2020-09-11 17:11:48 | INFO | fairseq_cli.preprocess | Wrote preprocessed data to data/bin
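
One way to check is to compare the dictionary fairseq-preprocess used against the one shipped alongside the pretrained checkpoints; a minimal sketch (the paths are placeholders for wherever the two dict files live):

from fairseq.data import Dictionary

# Both dictionaries should contain the same 42024 types in the same order;
# otherwise the embedding rows will not line up with the checkpoint.
pretrained = Dictionary.load('checkpoints/dict.en.txt')  # from the model download
binarized = Dictionary.load('data/bin/dict.en.txt')      # written by fairseq-preprocess
assert len(pretrained) == len(binarized)
assert all(pretrained[i] == binarized[i] for i in range(len(pretrained)))
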
lematt1991 commented 4 years ago

Can you provide the full command that you ran?

lematt1991 commented 4 years ago

I think you may be missing --share-decoder-input-output-embed
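
That flag ties the decoder input embeddings to the output projection, which is why the checkpoint has no separate decoder.output_projection.weight to load. Schematically (a sketch of the idea, not fairseq's actual code):

import torch.nn as nn

# With tied weights the output projection reuses the decoder embedding
# matrix, so no decoder.output_projection.weight is stored in the checkpoint.
embed_tokens = nn.Embedding(42024, 1024)               # vocab size x model dim
output_projection = nn.Linear(1024, 42024, bias=False)
output_projection.weight = embed_tokens.weight         # weight tying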

scheiblr commented 4 years ago

Thank you, that got me a step further. I still get an error, though:

RuntimeError: The size of tensor a (269746176) must match the size of tensor b (312778752) at non-singleton dimension 0

This time I ran everything with the following settings:

2020-09-14 09:49:39 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0.0, all_gather_list_size=16384, arch='transformer_wmt_en_de_big', attention_dropout=0.1, best_checkpoint_metric='loss', bf16=False, bpe='fastbpe', bpe_codes='./checkpoints/bpecodes', broadcast_buffers=False, bucket_cap_mb=25, checkpoint_suffix='', clip_norm=0.0, cpu=False, criterion='label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='data/bin', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layerdrop=0.0, decoder_layers=6, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=1024, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=1, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', dropout=0.2, empty_cache_freq=0, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=8192, encoder_layerdrop=0.0, encoder_layers=6, encoder_layers_to_keep=None, encoder_learned_pos=False, encoder_normalize_before=False, eval_bleu=False, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=None, eval_tokenized_bleu=False, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, layernorm_embedding=False, left_pad_source='True', left_pad_target='False', load_alignments=False, localsgd_frequency=3, log_format='simple', log_interval=100, lr=[0.0007], lr_scheduler='inverse_sqrt', max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=3584, max_tokens_valid=3584, max_update=201800, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, model_parallel_size=1, moses_no_dash_splits=False, moses_no_escape=False, moses_source_lang='en', moses_target_lang='de', no_cross_attention=False, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=True, no_save=False, no_save_optimizer_state=False, no_scale_embedding=False, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=1, num_batch_buckets=0, num_workers=0, optimizer='adam', optimizer_overrides='{}', patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=None, pipeline_devices=None, pipeline_model_parallel=False, profile=False, quant_noise_pq=0.0, quant_noise_pq_block_size=8, quant_noise_scalar=0.0, quantization_config_path=None, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='./checkpoints', save_interval=1, save_interval_updates=200, scoring='bleu', seed=2, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=True, 
skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, source_lang='en', stop_time_hours=0, target_lang='de', task='translation', tensorboard_logdir='./logdir', threshold_loss_scale=None, tie_adaptive_weights=False, tokenizer='moses', tpu=False, train_subset='train', truncate_source=False, update_freq=[1], upsample_primary=1, use_bmuf=False, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0, zero_sharding='none')

Edit: the CLI invocation was:

CUDA_VISIBLE_DEVICES=0 fairseq-train \
data/bin \
--no-epoch-checkpoints \
--activation-dropout 0.0 --activation-fn relu \
--adam-betas '(0.9,0.98)' --adam-eps 1e-08 \
--adaptive-softmax-dropout 0 --arch transformer_wmt_en_de_big \
--attention-dropout 0.1 --bpe fastbpe \
--bpe-codes ./checkpoints/bpecodes --bucket-cap-mb 25 \
--clip-norm 0.0 --criterion label_smoothed_cross_entropy \
--ddp-backend c10d --decoder-attention-heads 16 \
--decoder-embed-dim 1024 --decoder-ffn-embed-dim 4096 \
--decoder-layerdrop 0 --decoder-layers 6 --decoder-output-dim 1024 \
--distributed-backend nccl --distributed-port -1 --distributed-rank 0 \
--distributed-world-size 1 --dropout 0.2 --encoder-attention-heads 16 \
--encoder-embed-dim 1024 --encoder-ffn-embed-dim 8192 \
--encoder-layerdrop 0 --encoder-layers 6 --eval-bleu-detok space \
--fp16 --fp16-init-scale 128 --fp16-scale-tolerance 0.0 --keep-interval-updates -1 \
--keep-last-epochs -1 --label-smoothing 0.1 --log-format simple \
--log-interval 100 --lr 0.0007 --lr-scheduler inverse_sqrt \
--max-epoch 1 --max-source-positions 1024 --max-target-positions 1024 \
--max-tokens 3584 --min-loss-scale 0.0001 --min-lr 1e-09 --moses-source-lang en --moses-target-lang de \
--no-progress-bar --num-batch-buckets 0 --num-workers 0 --optimizer adam --optimizer-overrides {} \
--quant-noise-pq 0 --quant-noise-pq-block-size 8 --quant-noise-scalar 0 --relu-dropout 0.0 \
--restore-file checkpoint_last.pt --save-dir ./checkpoints --save-interval 1 \
--save-interval-updates 200 --seed 2 --source-lang en --target-lang de \
--task translation --tokenizer moses --train-subset train --upsample-primary 1 \
--valid-subset valid --validate-interval 1 --warmup-init-lr 1e-07 --warmup-updates 4000 \
--weight-decay 0.0 --share-decoder-input-output-embed
lematt1991 commented 4 years ago

Can you provide a full traceback running with CUDA_LAUNCH_BLOCKING=1?

scheiblr commented 4 years ago

2020-09-15 13:28:15 | INFO | fairseq_cli.train | task: translation (TranslationTask)
2020-09-15 13:28:15 | INFO | fairseq_cli.train | model: transformer_wmt_en_de_big (TransformerModel)
2020-09-15 13:28:15 | INFO | fairseq_cli.train | criterion: label_smoothed_cross_entropy (LabelSmoothedCrossEntropyCriterion)
2020-09-15 13:28:15 | INFO | fairseq_cli.train | num. model params: 312778752 (num. trained: 312778752)
2020-09-15 13:28:18 | INFO | fairseq.trainer | detected shared parameter: decoder.embed_tokens.weight <- decoder.output_projection.weight
2020-09-15 13:28:18 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2020-09-15 13:28:18 | INFO | fairseq.utils | rank   0: capabilities =  7.5  ; total memory = 23.650 GB ; name = TITAN RTX                               
2020-09-15 13:28:18 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2020-09-15 13:28:18 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2020-09-15 13:28:18 | INFO | fairseq_cli.train | max tokens per GPU = 3584 and max sentences per GPU = None
2020-09-15 13:28:29 | INFO | fairseq.trainer | loaded checkpoint ./checkpoints/checkpoint_last.pt (epoch 20 @ 201800 updates)
2020-09-15 13:28:29 | INFO | fairseq.trainer | loading train data for epoch 20
2020-09-15 13:28:29 | INFO | fairseq.data.data_utils | loaded 36140 examples from: data/bin/train.en-de.en
2020-09-15 13:28:29 | INFO | fairseq.data.data_utils | loaded 36140 examples from: data/bin/train.en-de.de
2020-09-15 13:28:29 | INFO | fairseq.tasks.translation | data/bin train en-de 36140 examples
2020-09-15 13:28:29 | INFO | fairseq.trainer | begin training epoch 21
Traceback (most recent call last):
  File "/home/scheiblr/anaconda3/envs/translation/bin/fairseq-train", line 8, in <module>
    sys.exit(cli_main())
  File "/home/scheiblr/anaconda3/envs/translation/lib/python3.7/site-packages/fairseq_cli/train.py", line 350, in cli_main
    distributed_utils.call_main(args, main)
  File "/home/scheiblr/anaconda3/envs/translation/lib/python3.7/site-packages/fairseq/distributed_utils.py", line 254, in call_main
    main(args, **kwargs)
  File "/home/scheiblr/anaconda3/envs/translation/lib/python3.7/site-packages/fairseq_cli/train.py", line 125, in main
    valid_losses, should_stop = train(args, trainer, task, epoch_itr)
  File "/home/scheiblr/anaconda3/envs/translation/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/home/scheiblr/anaconda3/envs/translation/lib/python3.7/site-packages/fairseq_cli/train.py", line 207, in train
    log_output = trainer.train_step(samples)
  File "/home/scheiblr/anaconda3/envs/translation/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/home/scheiblr/anaconda3/envs/translation/lib/python3.7/site-packages/fairseq/trainer.py", line 576, in train_step
    raise e
  File "/home/scheiblr/anaconda3/envs/translation/lib/python3.7/site-packages/fairseq/trainer.py", line 557, in train_step
    self.optimizer.step()
  File "/home/scheiblr/anaconda3/envs/translation/lib/python3.7/site-packages/fairseq/optim/fp16_optimizer.py", line 180, in step
    self.fp32_optimizer.step(closure)
  File "/home/scheiblr/anaconda3/envs/translation/lib/python3.7/site-packages/fairseq/optim/fairseq_optimizer.py", line 111, in step
    self.optimizer.step(closure)
  File "/home/scheiblr/anaconda3/envs/translation/lib/python3.7/site-packages/fairseq/optim/adam.py", line 187, in step
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
RuntimeError: The size of tensor a (269746176) must match the size of tensor b (312778752) at non-singleton dimension 0

scheiblr commented 4 years ago

It runs now. Thanks a lot!