Cc-Hy / CMKD

Cross-Modality Knowledge Distillation Network for Monocular 3D Object Detection (ECCV 2022 Oral)
Apache License 2.0

About the speed of multi-gpu training #9

Closed LiewFeng closed 1 year ago

LiewFeng commented 1 year ago

Hi, @Cc-Hy. When I train the model on the KITTI train split, 2 GPUs take more time than 1 GPU, which is really strange. Did you encounter this phenomenon?

sunnyHelen commented 1 year ago

Hi, can I ask how much GPU memory training this model requires? I need to check whether my GPU memory is enough to try it.

sunnyHelen commented 1 year ago

@LiewFeng

LiewFeng commented 1 year ago

@sunnyHelen ~18 GB.

sunnyHelen commented 1 year ago

Ok. Thanks a lot~

Cc-Hy commented 1 year ago

@LiewFeng That's very strange. Can you provide more details of your training? For example, what command do you run, the batch size, and how much time is spent in each case.

LiewFeng commented 1 year ago

Hi, @Cc-Hy. Sorry for the late reply. The command is the same as the one provided in GETTING_STARTED.md, and I didn't modify the batch size. In the 1-GPU setting, the first epoch takes about 10 minutes, so 60 epochs should take about 10 hours; however, the whole training only takes 5 hours, which is really strange. In the 2-GPU setting, the first epoch takes about 6 minutes, so 60 epochs should take about 6 hours, and the training indeed takes 6 hours, which is normal. Another observation is that CPU utilization is high in the 1-GPU setting but really low in the 2-GPU setting.

LiewFeng commented 1 year ago

Experiments are conducted on the KITTI train split.

Cc-Hy commented 1 year ago

@LiewFeng Hi, your 2-GPU training time seems close to mine: each epoch takes ~6 minutes on 2 NVIDIA GeForce RTX 3090 GPUs, and ~12 minutes per epoch when I use a single GPU.

So I think your 2-GPU training time is normal. But if your GPUs really are running at very low utilization, you may want to check your CPU status. I once ran into a situation where the CPU was the bottleneck and the GPUs could not be fully utilized.
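When the CPU-side data pipeline is the bottleneck, a common first step is to raise the number of `DataLoader` workers and enable pinned memory. A minimal sketch below, using a toy `TensorDataset` and illustrative values rather than this repo's actual configs:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the real dataset, just to show the DataLoader knobs.
dataset = TensorDataset(
    torch.randn(64, 3, 32, 32),
    torch.randint(0, 10, (64,)),
)

loader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    num_workers=2,    # more CPU workers to keep the GPUs fed (tune per machine)
    pin_memory=True,  # page-locked host memory for faster host-to-device copies
)

images, labels = next(iter(loader))
# In a real training loop, tensors would then be moved with
# .to(device, non_blocking=True) to overlap the copy with compute.
```

Watching `htop` alongside `nvidia-smi` while tuning `num_workers` usually makes the bottleneck obvious.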

LiewFeng commented 1 year ago

Hi, @Cc-Hy. I figured it out. The reason is the PyTorch version. When I ran the experiment with 1 GPU, the PyTorch version was 1.10. When I tried to run with 2 GPUs on 1.10, training got stuck. I then switched to PyTorch 1.8, which works but is 2x slower. I am using an A100, which is about 2x faster than a 3090, and with PyTorch 1.10 I still get stuck with 2 GPUs. A similar issue seems to have been solved in OpenPCDet, but sadly that fix doesn't work for me.
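Multi-GPU hangs like this are often NCCL transport issues rather than model code. A hedged diagnostic sketch, assuming a standard `torch.distributed` NCCL setup (these are generic NCCL environment variables, not settings from this repo):

```python
import os

# Must be set before torch initializes NCCL (i.e., before init_process_group).
# NCCL_DEBUG=INFO prints communicator setup logs, which help locate where
# the hang occurs; NCCL_P2P_DISABLE=1 turns off GPU peer-to-peer transport,
# a common workaround when DDP hangs on certain driver/PyTorch combinations.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_P2P_DISABLE", "1")
```

If training runs with P2P disabled, the hang is likely in the GPU interconnect path rather than in the training code itself.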

LiewFeng commented 1 year ago

The problem of getting stuck is fixed here, and it works for me.