NAse12 opened this issue 2 years ago (status: Open)
When it got stuck, did the GPU memory usage (as reported by nvidia-smi) differ across the devices?
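For example, something like this could be run on each machine while it is stuck, to collect comparable numbers (just a quick sketch that wraps nvidia-smi, not part of YOLOX):

```python
# Snapshot per-GPU memory and utilization while the training job is stuck.
# Assumes nvidia-smi is on PATH; the output can be compared between machines.
import subprocess

out = subprocess.check_output(
    [
        "nvidia-smi",
        "--query-gpu=index,name,memory.used,memory.total,utilization.gpu",
        "--format=csv,noheader",
    ],
    text=True,
)
for line in out.strip().splitlines():
    print(line)  # e.g. "0, Quadro RTX 6000, 1234 MiB, 24576 MiB, 0 %"
```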
Sorry, I accidentally closed the issue.
The nvidia-smi output is the same on all of the devices that cannot run multi-GPU training.
It seems that your model is waiting for data. What about training on a single device?
If only one GPU is used on the same machine, training proceeds normally.
The test datasets are all the same.
What is the last log line: init prefetcher, or just Rank 1 initialization finished?
If you mean the last log line during multi-GPU training, it is Rank 1 initialization finished.
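Since the last line is Rank 1 initialization finished, the hang happens before the prefetcher is even created, so it may help to run a bare NCCL test that is independent of YOLOX. Below is a minimal sketch, assuming 2 GPUs on one machine and a free port 29500; if this also hangs, the problem is in NCCL/torch.distributed communication on that machine rather than in the data loading code:

```python
# Minimal torch.distributed (NCCL) smoke test, independent of YOLOX.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # all_reduce forces real GPU-to-GPU communication between the ranks.
    t = torch.ones(1, device=f"cuda:{rank}") * (rank + 1)
    dist.all_reduce(t)
    print(f"rank {rank}: all_reduce result = {t.item()}")  # expect 3.0 with 2 GPUs
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```

If this script finishes on the machines where multi-GPU training works but hangs on the others, that would point at NCCL or the platform rather than at YOLOX.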
@NAse12 Any update on this issue? I am also facing the same problem with an A5000.
Hello. Please understand that I am using a translator, as I am not good at English.
I have a problem with multi-GPU training: training with multiple GPUs does not work on some GPUs.
The GPUs with the problem are the Quadro RTX 6000 and the RTX A6000. However, on the machine with Quadro RTX 8000 GPUs, multi-GPU training works. The only difference between the computers is the CPU: Intel (Quadro RTX 8000) versus AMD (Quadro RTX 6000, RTX A6000).
When training with multiple GPUs, it stops at the following console message and the computer starts to freeze.
yolox.core.launch:_distributed_worker:119 - Rank 0 initialization finished.
yolox.core.launch:_distributed_worker:119 - Rank 1 initialization finished.
I tested the following versions and they all showed the same symptoms: PyTorch 1.7.1, 1.8.0, 1.10.0; CUDA 10.2, 11.0, 11.1.
Please answer these questions. Thank you.
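One thing that might be worth trying, given that the difference between the machines is the AMD vs. Intel CPU (this is only a guess, not a confirmed fix): run with extra NCCL logging and with GPU peer-to-peer transfers disabled, since P2P has reportedly caused NCCL hangs on some AMD platforms (e.g. with IOMMU enabled). Something like:

```python
# Diagnostic settings only (an assumption, not a confirmed fix).
# NCCL_DEBUG=INFO makes NCCL print its transport/setup logs.
# NCCL_P2P_DISABLE=1 tells NCCL to avoid direct GPU peer-to-peer copies,
# which have reportedly hung on some AMD platforms with IOMMU enabled.
import os

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_P2P_DISABLE"] = "1"

# ...then start the usual multi-GPU training from this same process
# so that the spawned workers inherit these environment variables.
```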