Liaoqing-up opened this issue 1 year ago
I used pre-norm inside each layer. https://github.com/TuSimple/centerformer/blob/96aa37503dc900d1aebeb7c1086c33bbd0c01d26/det3d/models/utils/transformer.py#L218-L238
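For readers of the thread, here is a minimal sketch of what a pre-norm residual block looks like. This is not the linked implementation; the class name, dimensions, and sub-layers are illustrative only:

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm: the LayerNorm sits inside each residual branch, so the skip
    connection carries the raw (un-normalized) input forward."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):  # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual = raw x
        x = x + self.ff(self.norm2(x))                      # residual = raw x again
        return x
```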
I see, but I wonder whether you have tried Add&Norm after each layer, which would mean the residual skip-connection input is a feature that has already passed through the Norm. Is it possible that the results of these two structures do not differ much?
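To make the question concrete, here is a minimal Add&Norm (post-norm) sketch for contrast; again the names and shapes are illustrative, not taken from the repository:

```python
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Post-norm ("Add & Norm"): add and norm act as one unit, so the input to
    the next sub-layer (and to the next skip connection) is already normalized."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, tokens, dim)
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])  # Add & Norm
        x = self.norm2(x + self.ff(x))                                 # Add & Norm
        return x
```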
Sorry, I haven't tried Add&Norm after each layer. Do you have prior experience with this, and would the results be better with that implementation?
https://github.com/TuSimple/centerformer/blob/96aa37503dc900d1aebeb7c1086c33bbd0c01d26/det3d/models/utils/transformer.py#L267-L279 In this code, the residual path of the transformer only carries the result of the add and never passes through the norm layer, so add and norm are not taken as a single unit. This differs from the typical transformer structure, where the output of add-and-norm in series becomes the input to the next level. Is there any special consideration behind this design?
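To restate the two dataflows being compared, a schematic sketch with hypothetical callables (`attn`, `ff`, `norm1`, `norm2` stand in for arbitrary sub-layers and LayerNorms; this is not the repository code):

```python
def pre_norm_layer(x, attn, ff, norm1, norm2):
    x = x + attn(norm1(x))  # skip path carries the raw x
    x = x + ff(norm2(x))
    return x                # next layer's skip input never passed through a norm

def post_norm_layer(x, attn, ff, norm1, norm2):
    x = norm1(x + attn(x))  # add and norm taken as one unit
    x = norm2(x + ff(x))
    return x                # next layer's skip input has passed through a norm
```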