goutamyg opened this issue 1 year ago
For a downstream task, I see better training convergence when normalizing both `x` and `x_prev` during the computation of cross-attention here: https://github.com/apple/ml-cvnets/blob/main/cvnets/modules/transformer.py#L258
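For reference, here is a minimal sketch of the proposed change, assuming a standard pre-norm cross-attention block. The module and attribute names (`CrossAttnBlock`, `norm_q`, `norm_kv`) are illustrative, not the actual ml-cvnets identifiers; the point is simply that the key/value stream `x_prev` gets its own normalization before the attention call, rather than only the query stream `x`:

```python
import torch
from torch import nn

class CrossAttnBlock(nn.Module):
    """Simplified pre-norm cross-attention block (hypothetical sketch)."""

    def __init__(self, embed_dim: int, num_heads: int) -> None:
        super().__init__()
        self.norm_q = nn.LayerNorm(embed_dim)
        # Proposed addition: normalize the key/value input as well.
        self.norm_kv = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, x_prev: torch.Tensor) -> torch.Tensor:
        res = x
        q = self.norm_q(x)
        # Currently x_prev is passed to the attention unnormalized;
        # the proposal is to normalize it first, as done here.
        kv = self.norm_kv(x_prev)
        out, _ = self.attn(q, kv, kv, need_weights=False)
        return res + out
```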
Currently, I am training the model with and without the proposed normalization of `x_prev` and will share the results for both cases. In the meantime, if this change makes sense, kindly include it. Let me know if you need any related info.