valid_best_loss not actually the best

🐛 Bug

"valid_best_loss" reported in the logs is not the best but rather the value of the first epoch.

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

      fairseq-train \
         $data_dir \
         --seed 582 \
         --arch transformer \
         --optimizer adam \
         --adam-betas '(0.9, 0.98)' \
         --clip-norm 0.0\
         --lr 5e-4 \
         --lr-scheduler inverse_sqrt \
         --min-lr '1e-09' \
         --warmup-updates 4000\
         --warmup-init-lr '1e-07' \
         --dropout 0.3 \
         --weight-decay 0.0001\
         --criterion label_smoothed_cross_entropy \
         --label-smoothing 0.1 \
         --max-tokens 4096 \
         --max-update 300000 \
         --update-freq 4 \
         --tensorboard-logdir model/logs/tensorboard \
         --save-dir model \
         --tensorboard-logdir model \
         --log-format json \
         --keep-last-epochs 10 \
         --maximize-best-checkpoint-metric

Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer', attention_dropout=0.0, best_checkpoint_metric='loss', bpe=None, bucket_cap_mb=25, clip_norm=0.0, cpu=False, criterion='label_smoothed_cross_entropy', cross_self_attention=False, curriculum=0, data='data', dataset_impl=None, ddp_backend='c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layerdrop=0, decoder_layers=6, decoder_layers_to_keep=None, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:19541', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=4, dropout=0.3, empty_cache_freq=0, encoder_attention_heads=8, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=2048, encoder_layerdrop=0, encoder_layers=6, encoder_layers_to_keep=None, encoder_learned_pos=False, encoder_normalize_before=False, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=10, label_smoothing=0.1, layer_wise_attention=False, layernorm_embedding=False, lazy_load=False, left_pad_source='True', left_pad_target='False', load_alignments=False, log_format='json', log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=4096, max_tokens_valid=4096, max_update=300000, maximize_best_checkpoint_metric=True, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, no_cross_attention=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_scale_embedding=False, no_token_positional_embeddings=False, num_workers=1, optimizer='adam', optimizer_overrides='{}', raw_text=False, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='model', save_interval=1, save_interval_updates=0, seed=582, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation', tensorboard_logdir='model', threshold_loss_scale=None, tokenizer=None, train_subset='train', truncate_source=False, update_freq=[4], upsample_primary=1, use_bmuf=False, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0001)

| model transformer, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 51720192 (num. trained: 51720192)
| training on 4 GPUs
| max tokens per GPU = 4096 and max sentences per GPU = None
| no existing checkpoint found model/checkpoint_last.pt
| loading train data for epoch 0
| loaded 1293348 examples from: data/train.iu-en.iu
| loaded 1293348 examples from: data/train.iu-en.en
| data train iu-en 1293348 examples
| NOTICE: your device may support faster training with --fp16
{"epoch": 1, "train_loss": "9.433", "train_nll_loss": "8.973", "train_ppl": "502.42", "train_wps": "92565", "train_ups": "2", "train_wpb": "58247.305", "train_bsz": "3086.749", "train_num_updates": "419", "train_lr": "5.24645e-05", "train_gnorm": "1.646", "train_clip": "0.000", "train_oom": "0.000", "train_wall": "279", "train_train_wall": "249"}
{"epoch": 1, "valid_loss": "8.451", "valid_nll_loss": "7.769", "valid_ppl": "218.09", "valid_num_updates": "419"}
| saved checkpoint model/checkpoint1.pt (epoch 1 @ 419 updates) (writing took 2.014841318130493 seconds)
...
{"epoch": 401, "train_loss": "2.898", "train_nll_loss": "1.428", "train_ppl": "2.69", "train_wps": "93509", "train_ups": "2", "train_wpb": "58247.305", "train_bsz": "3086.749", "train_num_updates": "168019", "train_lr": "7.71473e-05", "train_gnorm": "0.359", "train_clip": "0.000", "train_oom": "0.000", "train_wall": "107362", "train_train_wall": "98269"}
{"epoch": 401, "valid_loss": "3.580", "valid_nll_loss": "2.113", "valid_ppl": "4.33", "valid_num_updates": "168019", "valid_best_loss": "8.45141"}
| saved checkpoint model/checkpoint401.pt (epoch 401 @ 168019 updates) (writing took 1.8081376552581787 seconds)

Note that the valid_best_lost at epoch 401 is the same as valid_loss at epoch 1 and stays like that through out training. Looks like it should be better when getting smaller.

Code sample

Expected behavior

valid_best_lost should go down when the training criterion is label_smoothed_cross_entropy as such, it should not be stuck reflecting the value at epoch 1. In the example above, it should at least be 3.580

Environment

Installation procedure

conda \
   create \
      --yes \
      --name fairseq-0.9.0 \
      --channel conda-forge \
      --channel pytorch \
      python=3 \
      cudatoolkit=10 \
      cudnn \
      nvidia-apex \
      pytorch=1.2 \
      tensorboard

conda activate fairseq-0.9.0

pip \
    install \
        fairseq==0.9.0 \
        cython \
        regex \
        typing \
        portalocker \
        sacrebleu \
        tensorboardX

Resulting in

name: fairseq-0.9.0
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=1_llvm
  - absl-py=0.9.0=py37_0
  - asn1crypto=1.3.0=py37_0
  - attrs=19.3.0=py_0
  - blinker=1.4=py37_0
  - c-ares=1.15.0=h7b6447c_1001
  - ca-certificates=2020.1.1=0
  - cachetools=3.1.1=py_0
  - certifi=2019.11.28=py37_0
  - cffi=1.13.2=py37h8022711_0
  - chardet=3.0.4=py37_1003
  - click=7.0=py37_0
  - cryptography=2.8=py37h1ba5d50_0
  - cudatoolkit=10.0.130=0
  - cudnn=7.6.5=cuda10.0_0
  - cxxfilt=0.2.0=py37he1b5a44_0
  - google-auth=1.11.2=py_0
  - google-auth-oauthlib=0.4.1=py_2
  - grpcio=1.27.2=py37hf8bcb03_0
  - idna=2.8=py37_0
  - importlib_metadata=1.5.0=py37_0
  - ld_impl_linux-64=2.33.1=h53a641e_8
  - libblas=3.8.0=15_openblas
  - libcblas=3.8.0=15_openblas
  - libffi=3.2.1=he1b5a44_1006
  - libgcc-ng=9.2.0=h24d8f2e_2
  - libgfortran-ng=7.3.0=hdf63c60_5
  - liblapack=3.8.0=15_openblas
  - libopenblas=0.3.8=h5ec1e0e_0
  - libprotobuf=3.11.4=hd408876_0
  - libstdcxx-ng=9.2.0=hdf63c60_2
  - llvm-openmp=9.0.1=hc9558a2_2
  - markdown=3.1.1=py37_0
  - mkl=2020.0=166
  - more-itertools=8.2.0=py_0
  - ncurses=6.1=hf484d3e_1002
  - ninja=1.10.0=hc9558a2_0
  - numpy=1.18.1=py37h95a1406_0
  - nvidia-apex=0.1=py37h8d9616a_1
  - oauthlib=3.1.0=py_0
  - openssl=1.1.1d=h7b6447c_4
  - packaging=20.1=py_0
  - pip=20.0.2=py_2
  - pluggy=0.12.0=py_0
  - protobuf=3.11.4=py37he6710b0_0
  - py=1.8.1=py_0
  - pyasn1=0.4.8=py_0
  - pyasn1-modules=0.2.7=py_0
  - pycparser=2.19=py_2
  - pyjwt=1.7.1=py37_0
  - pyopenssl=19.1.0=py37_0
  - pyparsing=2.4.6=py_0
  - pysocks=1.7.1=py37_0
  - pytest=5.3.5=py37_1
  - python=3.7.6=h357f687_2
  - pytorch=1.2.0=py3.7_cuda10.0.130_cudnn7.6.2_0
  - pyyaml=5.3=py37h516909a_0
  - readline=8.0=hf8c457e_0
  - requests=2.23.0=py37_0
  - requests-oauthlib=1.3.0=py_0
  - rsa=4.0=py_0
  - setuptools=45.2.0=py37_0
  - six=1.14.0=py37_0
  - sqlite=3.30.1=hcee41ef_0
  - tensorboard=2.1.0=py3_0
  - tk=8.6.10=hed695b0_0
  - tqdm=4.43.0=py_0
  - urllib3=1.25.8=py37_0
  - wcwidth=0.1.8=py_0
  - werkzeug=1.0.0=py_0
  - wheel=0.34.2=py_1
  - xz=5.2.4=h14c3975_1001
  - yaml=0.2.2=h516909a_1
  - zipp=3.0.0=py_0
  - zlib=1.2.11=h516909a_1006
  - pip:
    - cython==0.29.15
    - fairseq==0.9.0
    - portalocker==1.5.2
    - regex==2020.2.20
    - sacrebleu==1.4.3
    - tensorboardx==2.0
    - typing==3.7.4.1
prefix: /space/group/nrc_ict/project/pkgs/ubuntu18/gcc-7.3.0/miniconda3/envs/fairseq-0.9.0

Additional context

Also, the best checkpoint is not the right one since the best checkpoint is the checkpoint produced at epoch 1.

ll -rt model/
total 7274048
-rw-r----- 1 sam037 nrc_ict 620776205 Mar 24 10:42 checkpoint_best.pt
-rw-r----- 1 sam037 nrc_ict 620776722 Mar 25 15:56 checkpoint394.pt
-rw-r----- 1 sam037 nrc_ict 620776722 Mar 25 16:00 checkpoint395.pt
-rw-r----- 1 sam037 nrc_ict 620776722 Mar 25 16:05 checkpoint396.pt
-rw-r----- 1 sam037 nrc_ict 620776722 Mar 25 16:09 checkpoint397.pt
-rw-r----- 1 sam037 nrc_ict 620776722 Mar 25 16:13 checkpoint398.pt
-rw-r----- 1 sam037 nrc_ict 620776722 Mar 25 16:18 checkpoint399.pt
-rw-r----- 1 sam037 nrc_ict 620776722 Mar 25 16:22 checkpoint400.pt
-rw-r----- 1 sam037 nrc_ict 620776722 Mar 25 16:27 checkpoint401.pt
-rw-r----- 1 sam037 nrc_ict 620776722 Mar 25 16:31 checkpoint402.pt
drwxr-x--- 2 sam037 nrc_ict     49152 Mar 25 16:36 valid
-rw-r----- 1 sam037 nrc_ict 620776722 Mar 25 16:36 checkpoint403.pt
-rw-r----- 1 sam037 nrc_ict 620776722 Mar 25 16:36 checkpoint_last.pt
drwxr-x--- 2 sam037 nrc_ict     49152 Mar 25 16:36 train

facebookresearch / fairseq