yueyangwen opened this issue 1 month ago
Maybe you can try adding normalization to the QKV projection, as in PTv2?
I also notice gradient explosions giving me NaNs and Infs. I suspect some additional normalization might help, but I'm not sure where or for which layer. For now I am using gradient clipping at a value of 1.0 (`torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip_val)`).
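A minimal sketch of that clipping pattern, for anyone who wants to reproduce it. Note the correct function name is `clip_grad_norm_` (with the trailing underscore); `model`, `optimizer`, and the toy data below are placeholders standing in for the actual pipeline:

```python
import torch
import torch.nn as nn

# Placeholder model/optimizer/data; swap in the real segmentation pipeline.
model = nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))

for _ in range(10):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # Clip the global gradient norm at 1.0 before the optimizer step;
    # this caps spikes that would otherwise produce NaNs/Infs, but it
    # treats the symptom rather than the root cause.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```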
For my use case it seems to be related more to the 3D sparse convolutions than to the attention layers. I'm curious whether you have tried regular batch norm here instead of layer norm? Layer norm makes more sense to me, but I do see examples in spconv using 3D convolutions followed by batch norm. I'll give it a shot and report back at some point.
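For reference, here is the conv-then-batch-norm pattern from spconv's examples (assuming spconv 2.x; channel sizes and spatial shape below are arbitrary). Batch norm on a sparse tensor runs over the `[N, C]` feature matrix of all active voxels, i.e. it normalizes per channel across the batch rather than per voxel:

```python
import torch.nn as nn
import spconv.pytorch as spconv  # assumes spconv 2.x

# Submanifold 3D conv followed by batch norm, as in spconv's examples.
# BatchNorm1d operates on the sparse tensor's [N, C] feature matrix.
block = spconv.SparseSequential(
    spconv.SubMConv3d(4, 32, kernel_size=3, indice_key="subm0"),
    nn.BatchNorm1d(32),
    nn.ReLU(),
)

# To run it, build a SparseConvTensor from CUDA tensors (spconv kernels are CUDA-only):
#   feats:  [N, 4] float32 features on GPU
#   coords: [N, 4] int32 (batch_idx, z, y, x) on GPU
#   x = spconv.SparseConvTensor(feats, coords, spatial_shape=[64, 64, 64], batch_size=1)
#   out = block(x)   # out.features has shape [N, 32]
```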
> Maybe you can try adding normalization to the QKV projection, as in PTv2?
Do you mean at the start or at the end of the attention layer?
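For illustration, here is one way to read the suggestion: LayerNorm applied immediately after the Q/K/V projections, i.e. at the start of the attention layer. This is a hypothetical sketch of the general idea, not necessarily PTv2's exact formulation:

```python
import torch
import torch.nn as nn

class NormalizedQKVAttention(nn.Module):
    """Sketch: LayerNorm right after the Q/K/V projections.
    Normalizing q/k/v bounds the scale of the attention logits,
    which helps against NaN/Inf blow-ups."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_k = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = self.norm_q(q), self.norm_k(k), self.norm_v(v)
        out, _ = self.attn(q, k, v, need_weights=False)
        return out

# Example: batch of 2 point sets, 128 points, 64-dim features
layer = NormalizedQKVAttention(64)
y = layer(torch.randn(2, 128, 64))
```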
Hello, I trained this on my two-class semantic segmentation task. I found that the first few batches of data trained well, but performance became very poor later. When I printed the gradient information to inspect it, I found that as the network trains, more and more layers suffer from vanishing gradients, but I cannot add a BN layer to the network. Do you have any suggestions?
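In case it helps others reproduce the gradient inspection described above, here is a simple way to dump per-layer gradient norms after `loss.backward()` (the helper name and threshold are made up for illustration):

```python
import torch

def log_grad_norms(model: torch.nn.Module, eps: float = 1e-8) -> None:
    """Print per-parameter gradient norms after loss.backward(); a quick
    way to spot which layers are vanishing (norm -> 0) or exploding."""
    for name, p in model.named_parameters():
        if p.grad is not None:
            norm = p.grad.detach().norm().item()
            flag = "VANISHING?" if norm < eps else ""
            print(f"{name:60s} grad_norm={norm:.3e} {flag}")

# Usage inside the training loop, between backward() and optimizer.step():
#   loss.backward()
#   log_grad_norms(model)
#   optimizer.step()
```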