rtaylor-rx-m closed 1 week ago
When the MLP uses the 1-2-2-1 structure, the additional depth means the double-length residual could lead to unstable gradients. It's best to just use the traditional residual structure.
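To make the two options concrete, here is a rough sketch of how they might differ. It rests on assumptions that the thread does not confirm: "1-2-2-1" is read as layer widths d -> 2d -> 2d -> d, the "traditional" residual is taken to be one skip connection per MLP block, and the "double-length" residual is taken to be one skip connection spanning two consecutive blocks. All function names are hypothetical and the code is plain Python for illustration only, not the code under review.

```python
import random

def linear(x, w):
    """Multiply vector x (length n_in) by weight matrix w (n_in rows, n_out columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def relu(x):
    return [max(0.0, v) for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def init(n_in, n_out, rng, scale=0.1):
    """Small random weight matrix with n_in rows and n_out columns."""
    return [[rng.gauss(0.0, scale) for _ in range(n_out)] for _ in range(n_in)]

def make_mlp(d, rng):
    # Assumed reading of "1-2-2-1": widths d -> 2d -> 2d -> d (three linear layers).
    return [init(d, 2 * d, rng), init(2 * d, 2 * d, rng), init(2 * d, d, rng)]

def mlp(x, weights):
    """Apply the linear layers in order, with ReLU between them but not after the last."""
    h = x
    for i, w in enumerate(weights):
        h = linear(h, w)
        if i < len(weights) - 1:
            h = relu(h)
    return h

def traditional_residual(x, blocks):
    # Assumption: one skip connection per MLP block, x <- x + mlp(x).
    for weights in blocks:
        x = add(x, mlp(x, weights))
    return x

def double_length_residual(x, blocks):
    # Assumption: one skip connection spanning two consecutive blocks,
    # x <- x + mlp_b(mlp_a(x)), so six linear layers sit between skips.
    for i in range(0, len(blocks), 2):
        h = x
        for weights in blocks[i:i + 2]:
            h = mlp(h, weights)
        x = add(x, h)
    return x

rng = random.Random(0)
blocks = [make_mlp(4, rng) for _ in range(2)]
x = [1.0, 2.0, 3.0, 4.0]
y_traditional = traditional_residual(x, blocks)
y_double = double_length_residual(x, blocks)
```

Under these assumptions, the traditional form puts three linear layers between skip connections and the double-length form puts six, which is the extra depth the comment is concerned about.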
Fixed.