Vindicator645 opened this issue 1 month ago
I suspect the `loss = loss / gradient_accumulation_steps` and `acc = acc / gradient_accumulation_steps` lines in deepspeed_utils should be removed.
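If that reading is right, the bug is a double scaling: DeepSpeed's `model_engine.backward()` already divides the loss by `gradient_accumulation_steps` internally, so an extra division in user code shrinks the reported loss by the same factor again (8 / 10 ≈ 0.8, matching the numbers reported below). A rough sketch of what the suspect lines presumably look like; the surrounding loop is an assumption, not the actual deepspeed_utils code:

```python
# Hypothetical reconstruction of the suspect train step in deepspeed_utils:
for step, batch in enumerate(train_dataloader):
    loss, acc = model_engine(batch)            # forward pass (shape assumed)
    loss = loss / gradient_accumulation_steps  # <- suspected extra division
    acc = acc / gradient_accumulation_steps    # <- same for the logged accuracy
    model_engine.backward(loss)                # DeepSpeed divides by gas again here
    model_engine.step()
```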
try this:

```python
model_engine.backward(loss)
if (step + 1) % model_engine.gradient_accumulation_steps() == 0:
    model_engine.step()
    model_engine.zero_grad()
```
sorry, removing the division by gradient_accumulation_steps is enough
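For concreteness, a minimal sketch of the corrected step under that fix (the loop shape is assumed, mirroring the snippet above). DeepSpeed's engine scales the loss internally, and `model_engine.step()` only runs the optimizer at an accumulation boundary, so neither the manual division nor the modulo check is needed:

```python
# Corrected train step (hypothetical loop): let DeepSpeed handle accumulation.
for step, batch in enumerate(train_dataloader):
    loss, acc = model_engine(batch)  # forward pass; no division by gas
    model_engine.backward(loss)      # scales by 1/gradient_accumulation_steps internally
    model_engine.step()              # no-op until the accumulation boundary
```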
System Info
NVIDIA A100
Information
🐛 Describe the bug
When training a model with the asr_librispeech script, I get a loss of around 8 initially. With DDP I also get around 8, including with gradient accumulation. But with DeepSpeed, gradient_accumulation_steps=1 gives an initial loss of 8 while gradient_accumulation_steps=10 gives 0.8. Separately, setting gradient accumulation in ds_config appears to do nothing: gradient_accumulation_steps=10000 takes the same time as gradient_accumulation_steps=1.
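For reference, this is roughly where gradient_accumulation_steps is normally set when initializing DeepSpeed from a config dict (values illustrative; whether the asr_librispeech script actually reads it from ds_config or overrides it from command-line args is an assumption worth checking, since an override would explain the setting being ignored):

```python
import deepspeed

# Illustrative config; keys are standard DeepSpeed config options.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 10,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,                          # `model` assumed defined elsewhere
    model_parameters=model.parameters(),
    config=ds_config,
)
```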
Error logs
loss=8 for gradient_accumulation_steps=1 and loss=0.8 for gradient_accumulation_steps=10
Expected behavior
The loss should be of the same magnitude regardless of gradient_accumulation_steps.
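One quick way to confirm the double scaling (a hypothetical logging snippet, assuming access to the loop above): print the raw loss before calling backward; with the extra division removed it should sit near 8 for any gradient_accumulation_steps value:

```python
# Hypothetical check: log the unscaled loss so values are comparable
# across gradient_accumulation_steps settings.
if step % 10 == 0:
    gas = model_engine.gradient_accumulation_steps()
    print(f"step {step}: raw loss {loss.item():.3f} (gas={gas})")
model_engine.backward(loss)
model_engine.step()
```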