Closed · chowkamlee81 closed this issue 4 years ago
Hi, are you using 8 GPUs? The world size is hardcoded to 8, so unless 8 processes join, the distributed initialization will never finish. You can either change the world size (line 58) or use 8 GPUs. If the problem persists, could you set the environment variable NCCL_DEBUG=INFO and share the logs?
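For reference, here is a minimal sketch of why the training hangs and what the fix looks like. This is illustrative, not the actual code from train_kitti.py; the function name, defaults, and addresses are assumptions:

```python
# Minimal sketch (NOT the repo's actual code): init_process_group blocks
# until `world_size` processes have joined the group, so with world_size
# hardcoded to 8 and only 2 processes launched, execution never gets past it.
import os
import torch.distributed as dist

def init_distributed(rank: int, world_size: int = 2):  # e.g. 2 for two GPUs
    # Surfaces NCCL diagnostics; equivalent to `NCCL_DEBUG=INFO` in the shell.
    os.environ.setdefault("NCCL_DEBUG", "INFO")
    # Rendezvous address/port are illustrative single-machine defaults.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # Blocks here until all `world_size` processes have called this.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
```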
I am using 2 GPUs.
I am able to train, but I have 11 GB GPUs and cannot train with batch_size=2. Will I get the same accuracy if I train with batch_size=1?
Actually, I cannot even train with batch_size=1 on my 2 GeForce 1080 Ti GPUs. Am I correct?
Hi, to reproduce the results from the paper you will need eight 16 GB GPUs. To get reasonable results, you could change the backbone to something like ResNet-50 so that your total batch size is at least 12, although we haven't really experimented with that and cannot say how well it will work. You will also have to tune things like the learning rate and the OHEM loss threshold.
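In case it helps: OHEM (online hard example mining) computes the per-pixel loss and averages only over the hardest pixels, and the threshold mentioned above controls what counts as hard. A generic PyTorch sketch of the idea, not kprnet's actual implementation (the function name and default values here are illustrative):

```python
import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, targets, thresh=0.7, min_kept=100_000,
                       ignore_index=255):
    """Cross-entropy averaged only over 'hard' pixels (generic OHEM sketch)."""
    # Per-pixel loss with no reduction, so pixels can be ranked by difficulty.
    pixel_losses = F.cross_entropy(
        logits, targets, ignore_index=ignore_index, reduction="none"
    ).flatten()
    # Keep pixels whose loss exceeds the threshold (the hard examples) ...
    hard = pixel_losses[pixel_losses > thresh]
    # ... but never fewer than min_kept, to keep the gradient signal stable.
    if hard.numel() < min_kept:
        hard, _ = pixel_losses.topk(min(min_kept, pixel_losses.numel()))
    return hard.mean()
```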
As I understand it, the training now works, so I will close the issue.
I started training using the commands given in the README.
But when I start training, it never reaches line https://github.com/DeyvidKochanov-TomTom/kprnet/blob/master/train_kitti.py#L62
It's strange that the model is not getting trained. Kindly suggest a fix.