dvlab-research / Stratified-Transformer

Stratified Transformer for 3D Point Cloud Segmentation (CVPR 2022)

Low training efficiency and warning "batch_size shortened from 8 to 1, points from 640000 to 80000." #11

Closed maosuli closed 2 years ago

maosuli commented 2 years ago

Hello!

Thanks for your great work. I tried to run your code on my workstation.

I found that the max_batch_points and voxel_max were set to 140000 and 80000 in s3dis_stratified_transformer.yaml, respectively.

And batch_size was set to 8. In this case, the number of points in one batch can easily exceed 140,000; in the worst case it reaches 80,000 × 8 = 640,000.

Maybe that's why I often get warnings like this:

"WARNING [main-logger]: batch_size shortened from 8 to 1, points from 640000 to 80000."

If I set the batch size to 1, the warning disappears, but a smaller batch size means more steps per epoch. Since I observed a low training speed per step, the overall training efficiency per epoch on the full S3DIS dataset was poor. Do you have any suggestions to improve it?

I have trained other models such as KPConv on the same workstation, and their training speed was higher and acceptable.

May I ask whether transformer models are generally less efficient to train? I am a layman in ML.

Cheers,

Eric.

X-Lai commented 2 years ago

Thanks for your interest in our work.

First, max_batch_points can be adjusted according to your hardware. This argument exists to avoid running out of GPU memory. If your GPUs have more than 11 GB of memory, you can set max_batch_points to a larger value that suits your case, as long as you have enough GPU memory.
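As a quick sanity check of how these settings interact (plain arithmetic using the values mentioned in this thread; the variable names simply mirror the keys in s3dis_stratified_transformer.yaml):

```python
batch_size = 8
voxel_max = 80000          # max points kept per scene
max_batch_points = 140000  # default cap on total points per batch

worst_case = batch_size * voxel_max
print(worst_case)                      # 640000
print(worst_case > max_batch_points)   # True -> batches get shortened

# If GPU memory allows, setting max_batch_points >= batch_size * voxel_max
# (640000 here) keeps the full batch size and silences the warning.
```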

Second, the self-attention operator is slower to train than its convolutional counterpart. But with our model you can reduce the training epochs to 50 for quicker training (the performance gap compared with 100 epochs should be within 1.0% mIoU).

Hope this helps.

maosuli commented 2 years ago

Thanks for the useful suggestions. I will give them a try.

yuchenlichuck commented 2 years ago

Here are the training warnings on one A100 GPU (80 GB memory):

WARNING [04/20 14:41:37 main-logger]: batch_size shortened from 16 to 1, points from 966974 to 80000

WARNING [04/20 14:41:38 main-logger]: batch_size shortened from 16 to 2, points from 1033624 to 127033

X-Lai commented 2 years ago

@yuchenlichuck So you can change the argument max_batch_points in the .yaml file to a much larger value.
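For instance, going by the warnings above (and assuming batch_size 16 and voxel_max 80000 are kept), a value of at least batch_size * voxel_max would cover the worst case:

```python
batch_size = 16
voxel_max = 80000
print(batch_size * voxel_max)  # 1280000 -> a safe lower bound for max_batch_points
```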