NVIDIA / ContrastiveLosses4VRD

Implementation for the CVPR 2019 paper "Graphical Contrastive Losses for Scene Graph Generation"

Training time and multi-GPU training #11

Closed cao-nv closed 4 years ago

cao-nv commented 4 years ago

Hi, first of all, thank you for sharing this great repository. I have two questions about training with this code.

  1. If I set the batch size to 1 and train on a single GeForce Titan GPU, how long would it take to train a model with a VGG16 backbone on Visual Genome? I tried it, and the estimated training time printed was about 2 days.
  2. When I try to train the same model on 3 GPUs, the training appears to hang. Thank you.
jz462 commented 4 years ago

Hi @cao-nv,

  1. The default batch size is 1 (one image per GPU per iteration), and the default number of GPUs is 8. If you set the number of GPUs to 1, you need to manually set the number of iterations to 8 times the current value, since it is not scaled automatically; that is probably why you still see an estimated training time of "about 2 days" (a sketch of this scaling rule follows below).

  2. I can't diagnose the multi-GPU hang without more information about the issue.
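
For concreteness, here is a minimal sketch of the scaling rule described in point 1. The names are illustrative, not taken from the repository's config system; the only assumption is that the provided configs were tuned for 8 GPUs with one image per GPU:

```python
# Illustrative sketch of the iteration-scaling rule described above.
# Assumption: the provided configs were tuned for 8 GPUs with one image
# per GPU, so total images seen = max_iter * num_gpus.

DEFAULT_NUM_GPUS = 8      # what the provided configs assume
DEFAULT_MAX_ITER = 62723  # iteration count in the provided VG config

def scaled_max_iter(num_gpus: int) -> int:
    """Keep the total number of images seen constant across GPU counts."""
    return DEFAULT_MAX_ITER * DEFAULT_NUM_GPUS // num_gpus

print(scaled_max_iter(1))  # 501784 == 62723 * 8
print(scaled_max_iter(3))  # 167261 (rounded down)
```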

cao-nv commented 4 years ago

Thanks for your response. So the number of iterations should be 62723*8, right? After setting the number of GPUs to 1 and the number of iterations to that value, the estimated training time is now 14 days. How long did training take with your configuration?

jz462 commented 4 years ago

@cao-nv It took me about two and a half days to train.

cao-nv commented 4 years ago

Thank you for the quick reply. The training time is quite long indeed. Thanks again (y)

sandeep-ipk commented 4 years ago

Hey @cao-nv @jz462

Do you mean that by increasing the number of GPUs to N, we train the model for 62723*N iterations? That is, would we be running N epochs simultaneously? Or would we complete one pass over the 62723 images in 62723/N iterations? Please help me, I'm a bit confused here. Was the provided pre-trained model trained for 62723 iterations or 62723*8 iterations?

Thank you.

cao-nv commented 4 years ago

> Hey @cao-nv @jz462
>
> Do you mean that by increasing the number of GPUs to N, we train the model for 62723*N iterations? That is, would we be running N epochs simultaneously? Or would we complete one pass over the 62723 images in 62723/N iterations? Please help me, I'm a bit confused here. Was the provided pre-trained model trained for 62723 iterations or 62723*8 iterations?
>
> Thank you.

As I understand it, the number of iterations in the config files assumes a mini-batch size of 8, i.e., training on 8 GPUs. Hence, if you use 1 GPU, you should change the number of iterations to 62723*8; in other words, you still train the model for 8 epochs either way. Hope this helps!
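
As a quick sanity check on the epoch arithmetic (assuming the Visual Genome training split has roughly 62723 images, which is what the default iteration count suggests):

```python
# Sanity check: both schedules make the same ~8 passes over the data.
NUM_IMAGES = 62723  # VG training-set size implied by the default config

# 8 GPUs, 1 image per GPU -> effective batch size 8, 62723 iterations
epochs_8gpu = 62723 * 8 / NUM_IMAGES        # 8.0 epochs

# 1 GPU, batch size 1 -> 62723 * 8 iterations
epochs_1gpu = (62723 * 8) * 1 / NUM_IMAGES  # 8.0 epochs

assert epochs_8gpu == epochs_1gpu == 8.0
```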

sandeep-ipk commented 4 years ago

Thank you @cao-nv @jz462. I get that with 1 GPU the number of iterations should be set to 62723*8, while with 8 GPUs the effective mini-batch size is 8. But setting the number of iterations to 62723*8 with 1 GPU is different from running 62723 iterations on 8 GPUs with mini-batch size 8, right?

In case 1 (62723*8 iterations): we use a batch size of 1, so each gradient estimate is based on a single example, but we complete 8 times as many iterations.

In case 2 (62723 iterations on 8 GPUs): we use a batch size of 8, so the gradient estimate averages over 8 examples, but we only complete 62723 iterations.

So case 1 and case 2 are different, right? Then how can we claim they are equivalent? Can you please explain? I'm very confused about DataParallel in PyTorch. And what about SOLVER.STEPS: [0, 41815, 55754], which was used for 8 GPUs? How should it change when training for 62723*8 iterations?

Thank you!

cao-nv commented 4 years ago

> Thank you @cao-nv @jz462. I get that with 1 GPU the number of iterations should be set to 62723*8, while with 8 GPUs the effective mini-batch size is 8. But setting the number of iterations to 62723*8 with 1 GPU is different from running 62723 iterations on 8 GPUs with mini-batch size 8, right?
>
> In case 1 (62723*8 iterations): we use a batch size of 1, so each gradient estimate is based on a single example, but we complete 8 times as many iterations.
>
> In case 2 (62723 iterations on 8 GPUs): we use a batch size of 8, so the gradient estimate averages over 8 examples, but we only complete 62723 iterations.
>
> So case 1 and case 2 are different, right? Then how can we claim they are equivalent? Can you please explain? I'm very confused about DataParallel in PyTorch. And what about SOLVER.STEPS: [0, 41815, 55754], which was used for 8 GPUs? How should it change when training for 62723*8 iterations?
>
> Thank you!

Yes, the two cases are different: 62723*8 updates with batch size 1 are not statistically identical to 62723 updates with batch size 8, even though both see the same number of images. To get results comparable to the paper, you had better follow the provided settings.
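
Regarding the DataParallel confusion: under PyTorch's DataParallel, each GPU processes its share of the mini-batch and the per-GPU gradients are combined into a single update, so 8 GPUs with one image each behave like one batch of 8, not like 8 independent training runs. As for SOLVER.STEPS, a hedged sketch of the common linear-scaling convention follows; the STEPS and MAX_ITER values are the ones quoted in this thread, but the authors did not confirm this scaling:

```python
# Hypothetical sketch: scaling a Detectron-style schedule from 8 GPUs
# (effective batch size 8) down to 1 GPU (batch size 1). Scaling STEPS
# linearly with the iteration count is a common convention, not a
# setting confirmed by the authors.

GPU_SCALE = 8

steps_8gpu = [0, 41815, 55754]
max_iter_8gpu = 62723

steps_1gpu = [s * GPU_SCALE for s in steps_8gpu]  # [0, 334520, 446032]
max_iter_1gpu = max_iter_8gpu * GPU_SCALE         # 501784

# Note: with an 8x smaller batch, the linear scaling rule (Goyal et al.,
# 2017) would also suggest dividing the base learning rate by 8; the
# thread does not discuss this, so treat it as an assumption.
print(steps_1gpu, max_iter_1gpu)
```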