Closed rabbiahassan closed 3 years ago
Hi,
Since our cuda_kernel uses a multi-threaded parallel strategy, only 2 GPUs may not provide enough parallel threads.
To solve this, you can try running on more GPUs, or try a smaller batch_size (this will probably cost some performance because the batch_size is no longer optimal, but it needs fewer threads).
Thanks for the response. I reduced the batch size all the way to the minimum, but it still doesn't work. I don't think this issue is related to the batch size or to heavy computation. I am attaching the GPU memory status as well. I think it gets stuck somewhere but doesn't show any error.
OK, if this is the first time you are running the classification code, please wait for some time (about 1-2 minutes, depending on the hardware) while the CUDA op compiles.
Also, once compilation has finished, if it gets stuck again at loss.backward because of the limited threads, please solve it by reducing the batch_size or using more GPUs.
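One more thing that may be worth ruling out while waiting on the compile step: if the CUDA op is JIT-built through `torch.utils.cpp_extension.load` (an assumption, this thread does not confirm how the op is built), an interrupted earlier build can leave a file named `lock` in the build directory, and later runs then wait on it indefinitely without printing an error. The sketch below only lists such files. The default root used here (`~/.cache/torch_extensions`, or the `TORCH_EXTENSIONS_DIR` environment variable if set) is also an assumption, since the location varies by PyTorch version and platform, and `find_stale_locks` is a hypothetical helper name.

```python
import os
from pathlib import Path


def find_stale_locks(root=None):
    """Return every file named 'lock' under the extension build root.

    Assumption: the op is built by torch.utils.cpp_extension.load, which
    guards each build directory with a file called 'lock'.
    """
    if root is None:
        # Assumed default; older PyTorch versions used a temp directory.
        root = os.environ.get(
            "TORCH_EXTENSIONS_DIR",
            Path.home() / ".cache" / "torch_extensions",
        )
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("lock") if p.is_file())


if __name__ == "__main__":
    for lock in find_stale_locks():
        print("possible stale lock:", lock)
```

If a lock file shows up while no build is actually running, deleting that build directory and starting the training again forces a clean rebuild.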
Thanks for your response again. I have reduced the batch size all the way down to 4, but it still doesn't work. I am attaching a screenshot of the GPU usage as well. I think it gets stuck even earlier, because it is not even using the GPU to its full capacity.
Does it stay stuck? Have you waited for more than 2 minutes?
Yes, it does. I have waited for five hours. It just doesn't proceed an inch.
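For anyone else who hits a silent hang like this, a general way to see where the process is stuck is Python's standard `faulthandler` module, which can periodically dump the stack of every thread. This is a minimal sketch and not part of this repository; `watch_for_hang` and `stop_watching` are hypothetical helper names, and the 120-second interval is an arbitrary choice.

```python
import faulthandler
import sys


def watch_for_hang(timeout_s=120, stream=sys.stderr):
    """Dump the Python stack of all threads every `timeout_s` seconds.

    `stream` must be a real file (it needs a file descriptor) and must
    stay open until stop_watching() is called.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)


def stop_watching():
    """Cancel the periodic dump started by watch_for_hang()."""
    faulthandler.cancel_dump_traceback_later()


# Typical use, at the very top of the training script:
#   watch_for_hang()
#   ... build the model, run the training loop ...
#   stop_watching()
```

The dump shows Python frames only, so a hang inside a CUDA kernel appears as the Python line that is waiting on it. Because CUDA launches are asynchronous, that line can be a later synchronization point; running with the environment variable `CUDA_LAUNCH_BLOCKING=1` makes the reported line match the kernel that is actually stuck.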
OK, I have just run the code, and the program runs normally at normal speed on either 4 3090Ti GPUs or 4 2080Ti GPUs with the original batch_size.
From your picture, I can confirm that the code is OK and that you have compiled the CUDA lib.
So, as I mentioned before, this is caused by the very limited number of threads your GPUs provide (this depends not only on the number of GPUs but also on their type).
What you can do now is run on more GPUs or better GPUs so that they can support our cuda_kernel.
Excuse me, I also encountered this problem. Is it solved now?
For me, using the pointnet option worked!
Hello! Your work is very interesting. When I tried to train the classification model, it didn't show any error, but it gets stuck here and doesn't proceed. Could you please tell me what the issue is? Thanks for your time.