There is an argument behaviour change in the latest Megatron-LM repo, as shown below:
In Megatron-LM-23*, the QK layer-scaling factor is enabled by default, which helps keep training stable, especially in the fp16 case.
group.add_argument('--no-query-key-layer-scaling', action='store_false',
                   help='Do not scale Q * K^T by 1 / layer-number.',
                   dest='apply_query_key_layer_scaling')
group.add_argument('--attention-softmax-in-fp32', action='store_true',
                   help='Run attention masking and softmax in fp32. '
                        'This flag is ignored unless '
                        '--no-query-key-layer-scaling is specified.')
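For reference, with action='store_false' and dest='apply_query_key_layer_scaling', the attribute defaults to True when the flag is omitted. A minimal standalone check (illustrative only, not part of the patch):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--no-query-key-layer-scaling', action='store_false',
                    dest='apply_query_key_layer_scaling')
# Parsing an empty command line shows the 23* default.
print(parser.parse_args([]).apply_query_key_layer_scaling)  # True: scaling on by default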
However, in Megatron-LM-24*, the argument is disabled by default and must be enabled explicitly, which is recommended for fp16 training.
group.add_argument('--apply-query-key-layer-scaling', action='store_true',
                   help='Scale Q * K^T by 1 / layer-number. '
                        'Useful for fp16 training.')
group.add_argument('--attention-softmax-in-fp32', action='store_true',
                   help='Run attention masking and softmax in fp32. '
                        'This flag is ignored unless '
                        '--no-query-key-layer-scaling is specified.')
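With action='store_true' the default flips to False, so the same standalone check against the Megatron-LM-24* definition (again, illustrative only) prints the opposite result:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--apply-query-key-layer-scaling', action='store_true')
# Parsing an empty command line shows the 24* default.
print(parser.parse_args([]).apply_query_key_layer_scaling)  # False: scaling off unless requested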
Further evidence can be found in the upstream Megatron-LM test case:
Megatron-LM/tests/functional_tests/test_scripts/gpt3/pretrain_gpt3_distributed_test.sh
--${TRAINING_DTYPE}"
if [[ "${TRAINING_DTYPE}" == "fp16" ]]; then
torch_run_cmd+=" --apply-query-key-layer-scaling"
fi
These different defaults therefore lead to different training results. This patch fixes that behaviour: when fp16 training is launched, the scale factor takes effect so as to avoid training instability.
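A minimal sketch of the intended fix (the helper name and argument fields are illustrative, not the actual patch code):

def restore_qk_layer_scaling(args):
    # Illustrative only: re-enable scaling Q * K^T by 1 / layer-number
    # for fp16 runs, matching the Megatron-LM-23* default behaviour.
    if getattr(args, 'fp16', False):
        args.apply_query_key_layer_scaling = True
    return args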