goutamyg opened this issue 1 year ago
For a downstream task, I see better training convergence when normalizing both `x` and `x_prev` during the computation of cross-attention here: https://github.com/apple/ml-cvnets/blob/main/cvnets/modules/transformer.py#L258
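For reference, here is a minimal sketch of the proposed change, assuming a standard pre-norm cross-attention block. The module and attribute names (`CrossAttnBlock`, `norm_q`, `norm_kv`) are illustrative, not the actual ml-cvnets identifiers; the point is simply that the key/value stream `x_prev` gets its own normalization before the attention call, rather than only the query stream `x`:

```python
import torch
from torch import nn

class CrossAttnBlock(nn.Module):
    """Simplified pre-norm cross-attention block (hypothetical sketch)."""

    def __init__(self, embed_dim: int, num_heads: int) -> None:
        super().__init__()
        self.norm_q = nn.LayerNorm(embed_dim)
        # Proposed addition: normalize the key/value input as well.
        self.norm_kv = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, x_prev: torch.Tensor) -> torch.Tensor:
        res = x
        q = self.norm_q(x)
        # Currently x_prev is passed to the attention unnormalized;
        # the proposal is to normalize it first, as done here.
        kv = self.norm_kv(x_prev)
        out, _ = self.attn(q, kv, kv, need_weights=False)
        return res + out
```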
Currently, I am training the model with and without the proposed normalization of `x_prev` and will share the results for both cases. In the meantime, if this change makes sense, kindly include it. Let me know if you need any related info.