Closed roddar92 closed 1 year ago
Unfortunately we don't have experience with deepspeed. Are you experiencing the same error without it?
Unfortunately, I get an OOM exception with the other sharded option (ddp_sharded) as well.
@roddar92 Could you provide a few more details? Assuming it's a CUDA OOM, can it be solved by using smaller batch sizes? Do you see the same error without ddp_sharded (i.e. using the default ddp)?
Well, if I decrease the batch size, I don't see this error; I use the same workaround on 1 GPU. On the other hand, the quality of my trained models is poor. In particular, top-5 accuracy is no higher than 10%.
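A common way to shrink per-GPU memory use without shrinking the effective batch (which can hurt accuracy, as described above) is to lower the per-GPU batch size and compensate with gradient accumulation (e.g. `accumulate_grad_batches` in PyTorch Lightning). A minimal sketch of the bookkeeping, with purely illustrative numbers since the thread doesn't include the actual configuration:

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int,
                         accumulate_grad_batches: int = 1) -> int:
    """Global batch size seen by the optimizer per update step."""
    return per_gpu_batch * num_gpus * accumulate_grad_batches


def scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear learning-rate scaling heuristic: lr grows in proportion
    to the effective batch size relative to the original setup."""
    return base_lr * new_batch / base_batch


# Hypothetical example: batch 64 OOMs per GPU, so drop to 16 per GPU on
# 4 GPUs; the effective batch stays 64, so the original lr still applies.
assert effective_batch_size(16, 4, 1) == 64
assert scaled_lr(1e-3, 64, effective_batch_size(16, 4, 1)) == 1e-3

# If the effective batch does change (e.g. doubles), scale lr with it.
assert abs(scaled_lr(1e-3, 64, 128) - 2e-3) < 1e-12
```

Note the linear-scaling rule is only a heuristic; whether it recovers the lost accuracy here depends on the model and optimizer, so it is worth verifying against the single-GPU baseline.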
Dear colleagues,
During the training process on several GPUs, I get an exception like this:
How can I fix this error at the validation step?
My current hyperparameters for multi-training are:
Thanks in advance.