aehrc / cxrmate

CXRMate: Longitudinal Data and a Semantic Similarity Reward for Chest X-Ray Report Generation
https://huggingface.co/aehrc/cxrmate
Apache License 2.0

Validation hangs after first validation sample when using deepspeed stage 3 with multi-gpu #15

Open AceMcAwesome77 opened 2 months ago

AceMcAwesome77 commented 2 months ago

Hi, thanks again for this excellent repo. The model trains fine for me on a multi-GPU system using DeepSpeed stage 2, with or without offload, and I see the expected training/validation time reduction. With DeepSpeed stage 3, however, the training_step progresses fine, but the model hangs indefinitely right after the validation step starts and one validation sample is completed. By adding logging, I determined that the GPUs generate tokens one-by-one simultaneously, as expected, but as soon as one GPU finishes generating the sequence for its validation sample, all the other GPUs hang. This makes me suspect a synchronization issue across the GPUs: once one GPU has finished, the others are not allowed to continue through the calls to the forward function that generates tokens one-by-one during validation. However, I don't understand why this issue would manifest only during stage 3 and not also during stage 2.

Has anyone else successfully trained this model, including the validation step, using deepspeed stage 3? I've tried adjusting some of the deepspeed_config parameters, but no success yet. Thanks!
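A minimal sketch of the suspected mechanism (pure Python, no DeepSpeed; the per-rank EOS positions are hypothetical): under ZeRO-3 the parameters are sharded, so every forward pass involves a collective all-gather across ranks, and every rank must therefore make the same number of forward calls. Greedy decoding stops at EOS, so ranks whose samples emit EOS earlier make fewer calls:

```python
# Sketch: count how many "collective" forward calls each rank would make
# if each rank greedily decodes until its own sample's EOS. Under ZeRO-3
# a forward pass all-gathers parameter shards from every rank, so a
# mismatch in call counts means the slower ranks block on a collective
# that the finished rank never joins -- the observed hang.

def decode_steps(eos_position: int, max_length: int) -> int:
    """Token-by-token forward calls a rank performs if its sample
    emits EOS at step `eos_position` (hypothetical value)."""
    return min(eos_position, max_length)

# Hypothetical EOS positions for one validation sample per rank.
eos_positions = {0: 12, 1: 25, 2: 25, 3: 40}
calls = {rank: decode_steps(pos, max_length=60)
         for rank, pos in eos_positions.items()}

# Rank 0 leaves the generation loop after 12 collective calls;
# ranks 1-3 then wait forever on the 13th all-gather.
assert len(set(calls.values())) > 1
```

This also suggests why stage 2 is unaffected: with full parameter replicas on each rank, a lone forward pass needs no collective, so ranks can finish decoding at different times.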

AceMcAwesome77 commented 2 months ago

I think this thread may be relevant to the problem:

https://github.com/microsoft/DeepSpeed/issues/860

I thought I could potentially solve this by setting eos_token_id=None in the self.encoder_decoder.generate() call and letting all GPUs generate the same number of tokens during validation_step, but then I got a new error along the lines of "disagreement between rank0 and rank4" that I don't know how to fix. I do think stas00 is probably correct about the cause in that thread.
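For what it's worth, Hugging Face transformers handles this case with the `synced_gpus` option to `generate()`: ranks that have hit EOS keep executing dummy forward passes until an all-reduce of the finished flags reports that every rank is done, so all ranks issue the same number of collective calls. A minimal sketch of that loop shape (pure Python, no DeepSpeed; `eos_positions` and the all-reduce are simulated stand-ins):

```python
# Sketch of the lockstep decoding pattern behind
# `generate(..., synced_gpus=True)`: every rank keeps stepping (a dummy
# forward once its own sequence is finished) until ALL ranks report
# done, so each rank issues the same number of collective calls.

def lockstep_decode(eos_positions: dict, max_length: int) -> dict:
    """Simulate decoding on several ranks; return forward-call counts.

    `eos_positions` maps rank -> step at which that rank's sample emits
    EOS (hypothetical). The all-reduce over the per-rank `finished`
    flags is simulated with `all()`.
    """
    finished = {rank: False for rank in eos_positions}
    calls = {rank: 0 for rank in eos_positions}
    for step in range(1, max_length + 1):
        for rank in eos_positions:
            # Real or dummy forward pass: still a collective under ZeRO-3.
            calls[rank] += 1
            if step >= eos_positions[rank]:
                finished[rank] = True
        if all(finished.values()):  # simulated all-reduce across ranks
            break
    return calls

calls = lockstep_decode({0: 12, 1: 25, 2: 25, 3: 40}, max_length=60)
assert len(set(calls.values())) == 1  # every rank made 40 calls
```

If self.encoder_decoder.generate() ultimately dispatches to the transformers generation loop, passing synced_gpus=True there may be a cleaner fix than disabling eos_token_id, since it pads with dummy forwards instead of forcing every rank to max_length.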