Open zhentingqi opened 4 months ago
Hi, I was training the 345M GPT-2 model using your example script examples/pretrain_gpt.sh. The validation loss and PPL, however, keep going up, while the training loss decreases as expected. My hyperparameters are shown here:
examples/pretrain_gpt.sh

```shell
GPT_ARGS="--num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --micro-batch-size 2 \
    --global-batch-size 4 \
    --lr 3.0e-4 \
    --train-iters 300000 \
    --lr-decay-iters 320000 \
    --lr-decay-style cosine \
    --min-lr 1.0e-5 \
    --weight-decay 1e-2 \
    --lr-warmup-fraction .01 \
    --clip-grad 1.0 \
    --fp16"

DATA_ARGS="--data-path $DATA_PATH \
    --vocab-file $VOCAB_FILE \
    --merge-file $MERGE_FILE \
    --data-impl mmap \
    --split 700,200,100"

OUTPUT_ARGS="--log-interval 100 \
    --save-interval 50000 \
    --eval-interval 1000 \
    --eval-iters 10"
```
Can anyone please tell me what is wrong? Shouldn't the PPL decrease? Thanks!
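For context: the reported perplexity is just the exponential of the mean per-token cross-entropy loss, so a rising validation loss and a rising validation PPL are the same symptom, not two separate problems. A minimal sketch of the relationship (the function name is illustrative, not from Megatron-LM):

```python
import math

def perplexity(cross_entropy_loss: float) -> float:
    """Perplexity is exp() of the mean per-token cross-entropy loss."""
    return math.exp(cross_entropy_loss)

# A small increase in validation loss compounds into a large PPL increase:
print(perplexity(3.0))  # ~20.09
print(perplexity(3.5))  # ~33.12
```

So if training loss falls while validation loss/PPL rises, the model is fitting the training split at the expense of generalization (classic overfitting behavior), rather than the PPL computation being wrong.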
Marking as stale. No activity in 60 days.