My interpretation of get_custom_L2 is that L2 decay is applied not to the individual weights being trained, but instead to the deploy equivalent weights.
If that is the motivation, shouldn't `eq_kernel` also include the identity term from the skip connection when `self.rbr_identity is not None`? At the moment `get_custom_L2` builds `eq_kernel` without any contribution from `rbr_identity`. Was this intentional? Is there a reference or ablation that explains why it is excluded?
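To make the question concrete, here is a minimal NumPy sketch of what I mean by adding the identity branch to the center of the equivalent kernel. It is not the repository's code: `bn_scale`, `center_eq_kernel`, and `t_id` are hypothetical names, and I am assuming `groups=1`, equal input and output channel counts, and that each branch's BatchNorm folds into a per-output-channel scale `gamma / sqrt(running_var + eps)`.

```python
import numpy as np


def bn_scale(gamma, running_var, eps=1e-5):
    """Per-output-channel BN scale, shaped (C, 1, 1, 1) for broadcasting."""
    return (gamma / np.sqrt(running_var + eps)).reshape(-1, 1, 1, 1)


def center_eq_kernel(k3, k1, t3, t1, t_id=None):
    """Center tap of the deploy-equivalent 3x3 kernel (hypothetical helper).

    k3: (C, C, 3, 3) weights of the 3x3 branch
    k1: (C, C, 1, 1) weights of the 1x1 branch
    t3, t1, t_id: (C, 1, 1, 1) BN scales of the 3x3, 1x1 and identity branches
    """
    # 3x3 center tap plus the 1x1 kernel, each scaled by its own BN
    eq = k3[:, :, 1:2, 1:2] * t3 + k1 * t1
    if t_id is not None:
        # Assumption: the identity branch acts as a 1x1 kernel equal to the
        # identity matrix over channels, scaled by its BN (groups=1 only).
        c = k3.shape[0]
        identity = np.eye(c, dtype=k3.dtype).reshape(c, c, 1, 1)
        eq = eq + identity * t_id
    return eq
```

With `t_id=None` this gives only the 3x3 and 1x1 contributions, which is how I read the current `eq_kernel`; passing `t_id` adds the diagonal term I would have expected. How the penalty should then be normalized is a separate question that this sketch does not try to answer.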