Closed · chowkamlee81 closed this issue 4 years ago
Hi, are you using 8 GPUs? The world size is hardcoded to 8, so unless 8 processes join, the distributed initialization will never finish. You can either change the world size (line 58) or use 8 GPUs. If the problem persists, could you set the environment variable NCCL_DEBUG=INFO and share the logs?
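For reference, here is a minimal sketch of why the training hangs and what the fix looks like. This is illustrative, not the actual code from train_kitti.py; the function name, defaults, and addresses are assumptions:

```python
# Minimal sketch (NOT the repo's actual code): init_process_group blocks
# until `world_size` processes have joined the group, so with world_size
# hardcoded to 8 and only 2 processes launched, execution never gets past it.
import os
import torch.distributed as dist

def init_distributed(rank: int, world_size: int = 2):  # e.g. 2 for two GPUs
    # Surfaces NCCL diagnostics; equivalent to `NCCL_DEBUG=INFO` in the shell.
    os.environ.setdefault("NCCL_DEBUG", "INFO")
    # Rendezvous address/port are illustrative single-machine defaults.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # Blocks here until all `world_size` processes have called this.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
```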
I am using 2 GPUs.
I am able to train, but I have 11 GB GPUs and cannot train with batch_size=2. Will I get the same accuracy if I train with batch_size=1?
Actually, I cannot even train with batch_size=1 on my 2 GeForce 1080 Ti GPUs. Am I correct?
Hi, to reproduce the results from the paper you will need eight 16 GB GPUs. To get reasonable results, you could change the backbone to something like ResNet-50 so that your total batch size is at least 12, although we haven't really experimented with that and cannot say how well it will work. You will also have to tune things like the learning rate and the OHEM loss threshold.
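In case it helps: OHEM (online hard example mining) computes the per-pixel loss and averages only over the hardest pixels, and the threshold mentioned above controls what counts as hard. A generic PyTorch sketch of the idea, not kprnet's actual implementation (the function name and default values here are illustrative):

```python
import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, targets, thresh=0.7, min_kept=100_000,
                       ignore_index=255):
    """Cross-entropy averaged only over 'hard' pixels (generic OHEM sketch)."""
    # Per-pixel loss with no reduction, so pixels can be ranked by difficulty.
    pixel_losses = F.cross_entropy(
        logits, targets, ignore_index=ignore_index, reduction="none"
    ).flatten()
    # Keep pixels whose loss exceeds the threshold (the hard examples) ...
    hard = pixel_losses[pixel_losses > thresh]
    # ... but never fewer than min_kept, to keep the gradient signal stable.
    if hard.numel() < min_kept:
        hard, _ = pixel_losses.topk(min(min_kept, pixel_losses.numel()))
    return hard.mean()
```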
As I understand it, the training now works, so I will close the issue.
I started training using the commands given in the README.
But when I start training, it never reaches line https://github.com/DeyvidKochanov-TomTom/kprnet/blob/master/train_kitti.py#L62
It's strange that the model is not getting trained. Kindly suggest a fix.