There is an argument behaviour change in the latest Megatron-LM repo, as shown below:
In Megatron-LM-23*, the QK layer-scaling factor is enabled by default, which helps keep training stable, especially in the fp16 case.
group.add_argument('--no-query-key-layer-scaling', action='store_false',
                   help='Do not scale Q * K^T by 1 / layer-number.',
                   dest='apply_query_key_layer_scaling')
group.add_argument('--attention-softmax-in-fp32', action='store_true',
                   help='Run attention masking and softmax in fp32. '
                        'This flag is ignored unless '
                        '--no-query-key-layer-scaling is specified.')
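For reference, with action='store_false' and dest='apply_query_key_layer_scaling', the attribute defaults to True when the flag is omitted. A minimal standalone check (illustrative only, not part of the patch):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--no-query-key-layer-scaling', action='store_false',
                    dest='apply_query_key_layer_scaling')
# Parsing an empty command line shows the 23* default.
print(parser.parse_args([]).apply_query_key_layer_scaling)  # True: scaling on by default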
However, in Megatron-LM-24*, the argument is disabled by default and must be enabled explicitly, which is recommended for fp16 training.
group.add_argument('--apply-query-key-layer-scaling', action='store_true',
                   help='Scale Q * K^T by 1 / layer-number. '
                        'Useful for fp16 training.')
group.add_argument('--attention-softmax-in-fp32', action='store_true',
                   help='Run attention masking and softmax in fp32. '
                        'This flag is ignored unless '
                        '--no-query-key-layer-scaling is specified.')
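With action='store_true' the default flips to False, so the same standalone check against the Megatron-LM-24* definition (again, illustrative only) prints the opposite result:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--apply-query-key-layer-scaling', action='store_true')
# Parsing an empty command line shows the 24* default.
print(parser.parse_args([]).apply_query_key_layer_scaling)  # False: scaling off unless requested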
Further evidence can be found in the upstream Megatron-LM test case:
Megatron-LM/tests/functional_tests/test_scripts/gpt3/pretrain_gpt3_distributed_test.sh
--${TRAINING_DTYPE}"
if [[ "${TRAINING_DTYPE}" == "fp16" ]]; then
torch_run_cmd+=" --apply-query-key-layer-scaling"
fi
These different defaults therefore lead to different training results. This patch fixes that behaviour: when fp16 training is launched, the scale factor takes effect so as to avoid training instability.
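A minimal sketch of the intended fix (the helper name and argument fields are illustrative, not the actual patch code):

def restore_qk_layer_scaling(args):
    # Illustrative only: re-enable scaling Q * K^T by 1 / layer-number
    # for fp16 runs, matching the Megatron-LM-23* default behaviour.
    if getattr(args, 'fp16', False):
        args.apply_query_key_layer_scaling = True
    return args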