Describe the bug
The loss is not going down when training with RPE (relative position embeddings). The fp16 dynamic loss scale is stuck at 1 and every step is skipped due to gradient overflow:
```
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 50
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 50
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 51
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 51
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
[2021-05-04 15:45:15,698] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 52
[2021-05-04 15:45:15,698] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:15,699] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
[2021-05-04 15:45:15,699] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 52
[2021-05-04 15:45:15,699] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:15,699] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
```
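For context (my reading of the log, not part of the original report): a dynamic loss scaler typically halves the scale on each overflow but will not drop below a configured minimum, so once the scale reaches that floor, every further overflow "reduces" it from 1 to 1 and the step is skipped, exactly as the log shows. A minimal Python sketch of that update rule (the names `scale` and `min_scale` are illustrative, not DeepSpeed's actual internals):

```python
def update_scale(scale: float, overflow: bool, min_scale: float = 1.0) -> float:
    """Sketch of a dynamic loss-scale update: halve the scale on
    overflow, but never let it fall below min_scale."""
    if overflow:
        return max(scale / 2.0, min_scale)
    return scale

# Once the scale has bottomed out at min_scale, repeated overflows
# leave it unchanged at 1, matching the "reducing from 1 to 1" lines.
print(update_scale(1.0, overflow=True))   # stays at 1.0
```

If the scale is already at its floor and gradients still overflow, the optimizer can never take a step, which would explain the flat loss.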
To Reproduce
Steps to reproduce the behavior: