Open MengqingCao opened 4 months ago
The issue could be due to a mistake or old code. There have been DeepSpeed fixes recently. If you're still interested, would you be able to try them out?
OK, I'll try to fix it
Please check that this issue hasn't been reported before.
Expected Behavior
Only the main process should run on card 0.
Current behaviour
When performing distributed training on a single machine with multiple cards (e.g., 2 cards), 2 processes are spawned on card 0, which frequently causes OOM errors.
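For anyone trying to narrow this down: one common cause of this symptom in multi-process training, not confirmed for this issue, is a worker that touches the accelerator before pinning itself to its own device, so every rank also creates a context on card 0. Below is a minimal sketch of the usual guard, assuming the launcher exports `LOCAL_RANK` (as `torchrun` does). The helper name `device_index_for_rank` and the `visible_devices` parameter are hypothetical and only for illustration.

```python
import os


def device_index_for_rank(env=None, visible_devices=2):
    """Return the device index this process should pin to.

    Hypothetical helper: assumes the launcher sets LOCAL_RANK per process.
    Falls back to 0 when the variable is missing (single-process run).
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", "0"))
    if not 0 <= local_rank < visible_devices:
        raise ValueError(
            f"LOCAL_RANK={local_rank} is outside 0..{visible_devices - 1}"
        )
    return local_rank


if __name__ == "__main__":
    index = device_index_for_rank()
    # With PyTorch, one would typically call torch.cuda.set_device(index)
    # here, before any other device call, so that this rank does not
    # allocate memory on card 0 by default.
    print(f"pid={os.getpid()} should use device {index}")
```

Printing the pid next to the chosen index makes it easy to compare against the process list reported by the device monitoring tool and see which rank is the extra one on card 0.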
Steps to reproduce
Config yaml