EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
https://www.eleuther.ai/
Apache License 2.0

loss stuck in overflow for RPE position embedding together with sparse attention #292

Open sweinbach opened 3 years ago

sweinbach commented 3 years ago

Describe the bug
Loss for the RPE position embedding is not going down.

```
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 50
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 51
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
[2021-05-04 15:45:15,698] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 52
[2021-05-04 15:45:15,698] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:15,699] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
```
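The dynamic loss scaler is already at its floor (scale 1), so every step is skipped and the weights never update. For reference, loss scaling is controlled by the fp16 section of the training config; below is a minimal sketch using the standard DeepSpeed-style fields (the values shown are illustrative, not necessarily the ones in small.yml):

```yaml
# Illustrative fp16 block (DeepSpeed-style keys); the actual values in small.yml may differ.
"fp16": {
  "enabled": true,
  "loss_scale": 0,            # 0 selects dynamic loss scaling
  "initial_scale_power": 16,  # starting scale of 2**16
  "loss_scale_window": 1000,
  "hysteresis": 2,
  "min_loss_scale": 1         # the floor the scaler is stuck at in the log above
},
```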

To Reproduce
Steps to reproduce the behavior:

  1. Use small.yml and sparse.yml
  2. Change the position embedding to "rpe" (see the sketch below)
  3. Train using deepy.py
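A minimal sketch of step 2, assuming the hyphenated NeoX-style key name used in the repo's YAML configs (check your checkout's configs/small.yml for the exact spelling):

```yaml
# Hypothetical override in configs/small.yml (key name assumed; verify against your version):
{
  # switch from the default position embedding to relative position embeddings
  "pos-emb": "rpe",
}
```

Training is then launched with deepy.py, passing both small.yml and sparse.yml, as described in the repository README.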
StellaAthena commented 3 years ago

Note that this doesn't occur when running with dense attention, as shown here.