Pointcept / PointTransformerV3

[CVPR'24 Oral] Official repository of Point Transformer V3 (PTv3)
MIT License

Batch time varies severely across training time #33

Open warriorsniu opened 2 months ago

warriorsniu commented 2 months ago

I ran the semseg-pt-v3m1-0-rpe experiment on the S3DIS dataset with batch size 4 on four 3090 24 GB GPUs, and num_worker set to 12. The training batch time is relatively stable and fast at the beginning, but it can then stall for a few seconds, sometimes 40 seconds or even more. I also checked the time spent in the different steps of run_step() and found that the backward step could be the bottleneck. Is this common? Could you give me some suggestions about this problem? Thanks a lot! [timing screenshots attached]
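(Editor's note: a minimal sketch of how a per-stage breakdown like the one described can be measured; the class and names are illustrative, not part of Pointcept. One caveat worth knowing: CUDA kernels run asynchronously, so without a device synchronize around each timestamp, the cost of queued forward kernels gets billed to whichever later stage first blocks on the GPU, which is often backward().)

```python
import time


class StageTimer:
    """Record wall-clock seconds per named stage of a training step.

    With CUDA, kernels are launched asynchronously, so pass a sync
    callable (e.g. torch.cuda.synchronize) to flush queued work before
    and after each stage; otherwise earlier async work is misattributed
    to whichever later stage first blocks on the GPU.
    """

    def __init__(self, sync=None):
        self.sync = sync   # e.g. torch.cuda.synchronize on GPU
        self.times = {}    # stage name -> elapsed seconds

    def measure(self, name, fn):
        if self.sync:
            self.sync()            # drain previously queued work
        start = time.perf_counter()
        result = fn()
        if self.sync:
            self.sync()            # make sure this stage finished
        self.times[name] = time.perf_counter() - start
        return result
```

Usage would look like `timer.measure("backward", lambda: loss.backward())`, after which `timer.times` holds the per-stage breakdown.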

Gofinge commented 2 months ago

Hi, we noticed that S3DIS contains two huge scenes in Area_2, which cause a large amount of data-loading time. Splitting those two scenes into multiple chunks might be a good solution.
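(Editor's note: Pointcept's actual preprocessing is not reproduced here; the following is a minimal sketch, assuming the scene is an (N, 3) coordinate array, of splitting it into bounded-size, spatially compact chunks by cutting at the median of the longest axis.)

```python
import numpy as np


def split_scene(coords, max_points=1_000_000):
    """Split a point cloud (N, 3) into index chunks of at most
    max_points, recursively cutting at the median of the longest
    spatial axis so each chunk stays spatially compact."""
    stack = [np.arange(len(coords))]
    chunks = []
    while stack:
        part = stack.pop()
        if len(part) <= max_points:
            chunks.append(part)
            continue
        pts = coords[part]
        # cut along the axis with the largest spatial extent
        axis = int(np.argmax(pts.max(axis=0) - pts.min(axis=0)))
        order = np.argsort(pts[:, axis])
        mid = len(order) // 2
        stack.append(part[order[:mid]])
        stack.append(part[order[mid:]])
    return chunks
```

Each returned chunk is an index array into the original scene, so labels, colors, and other per-point attributes can be sliced with the same indices.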

warriorsniu commented 2 weeks ago

Thanks for your reply. However, as shown in the screenshot, the most time-consuming step is the backward step (stalled for nearly one minute). What could be the reason for that?