WeijingShi / Point-GNN

Point-GNN: Graph Neural Network for 3D Object Detection in a Point Cloud, CVPR 2020.
MIT License

About training time #8

Closed · iris0329 closed this issue 4 years ago

iris0329 commented 4 years ago

Thanks for sharing your work.

When training the KITTI Car model, I follow the settings in configs/car_auto_T3_train_config and configs/car_auto_T2_train_config.

I use 6 TITAN XP GPUs, so I changed the batch size to 24 (the default setting is batch size 4 on 1 GPU).

But the time cost is about 240 seconds per epoch, so at first it seemed training would take over a year according to the defaults "max_epoch": 1718 and "max_steps": 1400000.

It has taken 8 hours to train 100 epochs, so with max_epoch = 1718 the whole training process should take about 1700*8/100/24 = 5.7 days.

But counting 135 steps per epoch at 4 minutes per epoch, max_steps = 1400000 gives 4*1400000/135/60/24 ≈ 27 days.

5.7 days or 27 days? I am confused. [screenshot attached]

Can you tell me if something is wrong here? How could I speed up training? And how many days would it take with the default setting (1 GPU, batch size 4)? Thank you.
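For reference, the two conflicting estimates can be reproduced with a quick back-of-the-envelope script (all numbers taken from the question; "240 per epoch" read as 240 seconds; note the second expression actually evaluates to about 28.8 days rather than 27):

```python
# Reproducing the two estimates from the question above.
hours_per_100_epochs = 8
max_epoch = 1718
days_by_epoch = max_epoch * hours_per_100_epochs / 100 / 24
print(round(days_by_epoch, 1))   # -> 5.7

minutes_per_epoch = 4
steps_per_epoch = 135            # ~3260 samples / batch size 24
max_steps = 1_400_000
days_by_steps = minutes_per_epoch * max_steps / steps_per_epoch / 60 / 24
print(round(days_by_steps, 1))   # -> 28.8
```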

WeijingShi commented 4 years ago

The default training config is 2 GPUs and a batch size of 4. Training stops when it reaches either max_epoch or max_steps. In the default setting, training reaches max_steps first (1400000*4/3260 = 1717.8 < 1718).
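The stopping rule above can be sketched as follows (values taken from this thread; a rough check, not the repo's actual code):

```python
# Training halts at whichever of max_epoch / max_steps is reached first.
num_samples = 3260       # size of the train_car data split (from this thread)
max_epoch = 1718
max_steps = 1_400_000    # a "step" is one gradient update, i.e. one batch

def epochs_until_max_steps(batch_size):
    steps_per_epoch = num_samples / batch_size
    return max_steps / steps_per_epoch

# Default config (batch size 4): max_steps is hit just before epoch 1718,
# so max_steps is the binding limit.
print(round(epochs_until_max_steps(4), 1))    # -> 1717.8

# Batch size 24: far fewer steps per epoch, so max_epoch (1718) is reached
# long before max_steps.
print(round(epochs_until_max_steps(24), 1))   # -> 10306.7 (never reached)
```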

A STEP here means one gradient update, i.e. one batch. In your calculation, reaching max_steps costs 1400000/135/60/24 = 7.2 days.

Using the setting car_auto_T3_train_config on 2 GTX 1080 Ti GPUs, training takes about 0.43 s per step on our machine. That's about 7 days.
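As a sanity check on the ~7-day figure (assuming the 0.43 s/step timing quoted above; this is hardware dependent):

```python
# Rough wall-clock estimate from per-step timing (0.43 s/step on 2x GTX 1080 Ti).
sec_per_step = 0.43
max_steps = 1_400_000
days = sec_per_step * max_steps / 3600 / 24
print(round(days, 1))   # -> 7.0
```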

I am not sure why you didn't get more speedup, though (note you are training the lighter T2 model). Can you check your GPU utilization and see whether the GPUs stay idle a lot? If so, your system may be bottlenecking the GPUs. Thanks.

WeijingShi commented 4 years ago

It's the total number of data samples within the train_car data split.

On Sun, Mar 22, 2020 at 10:29 AM Iris1234 notifications@github.com wrote:

> Thank you, but what does the 3260 in 1400000*4/3260 = 1717.8 mean?


phoebedddd commented 1 year ago

(quotes the original question above in Chinese translation)

Hello, I have met the same problem. Could you tell me how you solved it?