Multi GPU train - GitHub Issues

NAse12 commented 2 years ago

Hello.

Please understand that I am using a translator as I am not good at English.

I have a problem training with multiple GPUs. Multi-GPU training does not work on some GPUs.

The GPU models that currently show the problem are the Quadro RTX 6000 and the RTX A6000. On the Quadro RTX 8000, however, multi-GPU training works. The only difference between the two computers is the CPU: Intel (Quadro RTX 8000) versus AMD (Quadro RTX 6000, RTX A6000).

When training with multiple GPUs, the run stops after the following console messages and my computer starts to freeze.

yolox.core.launch:_distributed_worker:119 - Rank 0 initialization finished.
yolox.core.launch:_distributed_worker:119 - Rank 1 initialization finished.

I tested the following versions and they all showed the same symptoms: PyTorch 1.7.1, 1.8.0, 1.10.0 with CUDA 10.2, 11.0, 11.1.

Please help me with this question. Thank you.

FateScript commented 2 years ago

When it got stuck, did the GPU memory usage (as reported by nvidia-smi) differ between devices?
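For anyone who wants to capture that comparison as text rather than a screenshot, here is a minimal sketch. It assumes `nvidia-smi` is on the PATH and uses its CSV query output. The helper names `parse_gpu_memory` and `memory_differs` are made up for this example and are not part of YOLOX.

```python
import shutil
import subprocess

# nvidia-smi CSV query: one line per GPU, values in MiB, no header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(text):
    """Turn lines like '0, 1234, 24576' into (index, used_mib, total_mib) tuples."""
    rows = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def memory_differs(rows):
    """Return True if the GPUs do not all report the same used memory."""
    return len({used for _, used, _ in rows}) > 1


# Only query the driver when the tool is actually installed.
if __name__ == "__main__" and shutil.which("nvidia-smi"):
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    rows = parse_gpu_memory(out)
    for index, used, total in rows:
        print(f"GPU {index}: {used} / {total} MiB")
    print("memory differs across GPUs:", memory_differs(rows))
```

Running this while the training job is stuck gives one line per GPU that can be pasted into the issue.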

NAse12 commented 2 years ago

Sorry, I accidentally closed the issue.

[Screenshot, 2021-11-30 16-11-42]

All of the devices that cannot use multi-GPU training show the same nvidia-smi information.

FateScript commented 2 years ago

It seems that your model is waiting for data. What happens if you train on a single device?

NAse12 commented 2 years ago

If only one gpu is used on the same device, training proceeds normally.

The test datasets are all the same.

FateScript commented 2 years ago

What's the last log message? Is it "init prefetcher" or just "Rank 1 initialization finished"?

NAse12 commented 2 years ago

If you mean the last log message in the multi-GPU run, it is "Rank 1 initialization finished".
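A hang right after rank initialization that appears only on the AMD machines is sometimes associated with NCCL peer-to-peer transfers. That is an assumption here and is not confirmed anywhere in this thread. The sketch below is a small wrapper that reruns a training command with two real NCCL environment variables set: `NCCL_DEBUG=INFO` for more logging, and `NCCL_P2P_DISABLE=1` to turn off peer-to-peer. The wrapper and the `build_debug_env` helper are hypothetical; the training command is whatever you normally run, passed on the command line.

```python
import os
import subprocess
import sys


def build_debug_env(base_env, disable_p2p=True):
    """Copy an environment and add NCCL diagnostic settings to the copy."""
    env = dict(base_env)
    # Ask NCCL to log its initialization and transport choices.
    env["NCCL_DEBUG"] = "INFO"
    if disable_p2p:
        # Assumption being tested: GPU peer-to-peer traffic causes the hang.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


# Usage: python nccl_debug_run.py <your usual multi-GPU training command>
if __name__ == "__main__" and len(sys.argv) > 1:
    result = subprocess.run(sys.argv[1:], env=build_debug_env(os.environ))
    print("training command exited with code", result.returncode)
```

If the run gets past "Rank 1 initialization finished" with peer-to-peer disabled, that points at the GPU interconnect path rather than the data loader. If it still hangs, the extra NCCL log lines are worth posting here.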

debapriyamaji commented 2 years ago

@NAse12 Any update on this issue? I am facing the same problem with an A5000.

Megvii-BaseDetection / YOLOX

Multi Gpu train #955