Pointcept / PointTransformerV3

[CVPR'24 Oral] Official repository of Point Transformer V3 (PTv3)
MIT License

I logged gradient values during training and observed that exploding gradients are always present; as long as no NaN values appear, this does not seem to hurt training. However, vanishing gradients affect more and more layers over time, and eventually the network stops learning anything useful. #54

Open yueyangwen opened 1 month ago

yueyangwen commented 1 month ago

Hello, I trained on my two-class semantic segmentation task. The first few batches of data train well, but performance becomes very poor later on. I logged the gradient information and found that, as training progresses, more and more layers suffer from vanishing gradients, but I cannot add a BN layer to the network. Do you have any suggestions?
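
For context, a minimal sketch of this kind of per-layer gradient check (not the exact code used here; the helper name and thresholds are illustrative and task-dependent):

```python
import torch

def log_grad_norms(model: torch.nn.Module,
                   vanish_thresh: float = 1e-7,
                   explode_thresh: float = 1e3) -> None:
    """Print parameters whose gradient norm looks vanishing or exploding.

    Call after loss.backward(); the thresholds are illustrative only.
    """
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        g = param.grad.norm().item()
        if g < vanish_thresh:
            print(f"[vanishing] {name}: {g:.3e}")
        elif g > explode_thresh:
            print(f"[exploding] {name}: {g:.3e}")

# Usage in the training loop:
#   loss.backward()
#   log_grad_norms(model)
#   optimizer.step()
```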

Gofinge commented 1 month ago

Maybe you can try adding normalization to the QKV projection, as in PTv2?
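
For example, a rough sketch of what that could look like, assuming a standard multi-head attention block with a single QKV linear projection (illustrative only; the module and attribute names below do not necessarily match the PTv3 or PTv2 code):

```python
import torch
import torch.nn as nn

class NormalizedQKVAttention(nn.Module):
    """Self-attention with an extra normalization on the QKV projection.

    Illustrative sketch only; not taken from the PTv3 repository.
    """
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Linear(channels, channels * 3, bias=True)
        # Extra normalization on the projected q/k/v features, in the spirit of PTv2.
        self.qkv_norm = nn.LayerNorm(channels * 3)
        self.proj = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C) point features within one attention group/patch
        n, c = x.shape
        head_dim = c // self.num_heads
        qkv = self.qkv_norm(self.qkv(x))  # normalize before splitting into q, k, v
        q, k, v = qkv.reshape(n, 3, self.num_heads, head_dim).permute(1, 2, 0, 3)
        attn = (q @ k.transpose(-2, -1)) * head_dim ** -0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(0, 1).reshape(n, c)
        return self.proj(out)
```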

JamesMcCullochDickens commented 6 days ago

I also notice gradient explosions giving me NaNs and Infs. I suspect some additional normalization might help, but I'm not sure where or for which layer. For now I am using gradient clipping at a value of 1.0 (torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip_val)).
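
To be concrete, the clipping call sits between loss.backward() and optimizer.step(); a minimal toy example of where it goes (the model, optimizer, and data here are placeholders, not the PTv3 setup):

```python
import torch
import torch.nn as nn

# Toy model/optimizer just to show where the clipping call sits.
model = nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
grad_clip_val = 1.0  # maximum global gradient norm

for step in range(10):
    x = torch.randn(8, 16)
    y = torch.randint(0, 2, (8,))
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    # Clip the global gradient norm after backward() and before step()
    # (note the trailing underscore: clip_grad_norm_).
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip_val)
    optimizer.step()
```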

For my use case it seems to be related more to the 3D sparse convolutions than to the attention layers. I'm curious whether you have tried regular batch norm here instead of layer norm? In my mind layer norm makes more sense, but I do see spconv examples that follow 3D convolutions with batch norm. I'll give it a shot and report back at some point; a sketch of the two variants I mean is below.
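
For reference, a minimal sketch of the batch-norm and layer-norm variants for a sparse conv block (illustrative only, not taken from the PTv3 code; it assumes spconv.SparseSequential applies plain nn modules to the (N, C) .features tensor, which is what the spconv examples rely on):

```python
import torch.nn as nn
import spconv.pytorch as spconv

# Submanifold sparse conv followed by BatchNorm, mirroring the spconv examples.
bn_block = spconv.SparseSequential(
    spconv.SubMConv3d(32, 64, kernel_size=3, padding=1, bias=False, indice_key="subm_bn"),
    nn.BatchNorm1d(64),  # statistics over all active voxels in the batch
    nn.ReLU(),
)

# Same conv, but with LayerNorm over the channel dimension instead.
ln_block = spconv.SparseSequential(
    spconv.SubMConv3d(32, 64, kernel_size=3, padding=1, bias=False, indice_key="subm_ln"),
    nn.LayerNorm(64),    # normalizes each voxel's feature vector independently
    nn.ReLU(),
)
```

Since the features of a sparse tensor are stored as a flat (N, C) matrix over active voxels, BatchNorm1d averages statistics over all voxels in the batch, while LayerNorm normalizes each voxel independently and so is insensitive to how many voxels end up in a batch.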

SpeedyGonzales949 commented 18 hours ago

> Maybe you can try adding normalization to the QKV projection, as in PTv2?

Do you mean at the start or at the end of the attention layer?