Open tayton42 opened 1 year ago
Try launching the training script with torchrun.
Thank you for your answer! I am not familiar with torchrun, though. Can you tell me how I should modify the RVM code? Thanks anyway!
Hi. I have the same problem. Have you found a solution yet? It would help me a great deal if you could share your experience here. Thank you!
Thank you for your research. I have a question about single-machine multi-GPU training. The problem appears when my code starts to run the line below: ![image](https://user-images.githubusercontent.com/52126085/231719604-a1608cc7-d7d8-4e98-bc41-c9d46a3e33d4.png)
self.model_ddp = DDP(self.model, device_ids=[self.rank], broadcast_buffers=False, find_unused_parameters=True)
The processes for the other GPUs also show up on GPU 0 with the same PIDs, which makes GPU 0 run out of memory. I can't find the cause or a solution. Please help me. Thanks!
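
For anyone landing here with the same symptom, below is a minimal sketch of what the torchrun suggestion might look like. It is not the RVM authors' code. One commonly reported cause of every rank showing up on GPU 0 is that a CUDA context is created on the default device before the process is pinned to its own GPU (for example by a `torch.load` call without `map_location`); whether that is the cause in this case is an assumption. The helper names `read_torchrun_env` and `setup_ddp` are hypothetical. The environment variables `RANK`, `LOCAL_RANK` and `WORLD_SIZE` are the ones torchrun sets for each worker.

```python
import os


def read_torchrun_env(env=os.environ):
    """Read the per-process variables that torchrun exports (hypothetical helper)."""
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


def setup_ddp(model):
    """Pin this process to its own GPU, then wrap the model in DDP (a sketch)."""
    # Imported here so the env helper above can be used without PyTorch installed.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    cfg = read_torchrun_env()
    # Select the device BEFORE any other CUDA call, so that this process
    # does not create a context on GPU 0 by default.
    torch.cuda.set_device(cfg["local_rank"])
    # With torchrun, rank, world size and master address come from the environment.
    dist.init_process_group(backend="nccl")
    model = model.to(cfg["local_rank"])
    # Same DDP arguments as in the line quoted above, with local_rank as the device.
    return DDP(
        model,
        device_ids=[cfg["local_rank"]],
        broadcast_buffers=False,
        find_unused_parameters=True,
    )
```

Assuming the entry point is a script called `train.py` (a placeholder name) that calls `setup_ddp` instead of spawning processes itself, it could be launched on four GPUs with `torchrun --nproc_per_node=4 train.py`. If the script loads a checkpoint, passing `map_location=f"cuda:{local_rank}"` to `torch.load` keeps the loaded tensors off GPU 0 as well.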