Closed vuhongai closed 1 year ago
Dear Ai,
First, my apologies for the slow response (I missed the notification)!
Unfortunately, the MultiHeadAttention layer implemented in our TPU model isn't the same as the one in TensorFlow 2 (which can be found here); I don't believe the 'official' TF2 implementation was available at the time our model was originally trained.
The MultiHeadAttention implementation we used can be found here.
Since an 'official' TF2 implementation is now available, I would definitely recommend using it directly if you are training a new model from scratch.
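For reference, a minimal sketch of calling the built-in TF2 layer directly (the layer parameters here, e.g. num_heads=8, are illustrative, not the values from our trained model; the (110, 64) shape is taken from the thread):

```python
import tensorflow as tf

# The built-in tf.keras.layers.MultiHeadAttention preserves the query's
# sequence dimension, so a (batch, 110, 64) input yields a (batch, 110, 64)
# output. num_heads/key_dim below are placeholder values.
inputs = tf.keras.Input(shape=(110, 64))
mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)
outputs = mha(inputs, inputs)  # self-attention: query = value
model = tf.keras.Model(inputs, outputs)
print(model.output_shape)  # (None, 110, 64)
```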
Good luck!
Best, Eeshit
Dear authors,
When I run your attention-based model on Colab, there is a "bug" that I can't explain, and I hope you can help. If the input shape of the MultiHeadAttention layer is (None, 110, 64) (the example from Fig S12), the output shape becomes (None, None, 64). The model otherwise works fine, because the output can still be broadcast to the input shape:
TensorShape([Dimension(None), Dimension(None), Dimension(64)])
You can reproduce the result here in this Colab notebook.
I also wonder whether the MultiHeadAttention layer implemented in your TPU model is the same as the pre-built MultiHeadAttention in TensorFlow 2 (both are from the "Attention Is All You Need" paper). By the way, the MHA from TF2 gives the expected output shape, (None, 110, 64).
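That (None, 110, 64) shape is what standard attention should produce: the output always keeps the query's sequence length. A minimal NumPy-only sketch of single-head scaled dot-product attention (shapes taken from the thread; this is not the authors' implementation):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Single-head attention; the output keeps the query's sequence length."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, Lq, Lk)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ v                                 # (batch, Lq, d_v)

x = np.random.rand(2, 110, 64)  # stand-in for the (None, 110, 64) input
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (2, 110, 64)
```

So a (None, None, 64) static output shape points to lost shape inference in the layer, not to the attention math itself.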
Thank you for your help. Ai