Closed · mainpyp closed this 1 year ago
Hey, I was wondering if you have tested the effect of the hidden dimension on training, and if so, what were your findings?

Hi, yes, we've tested $h=128$ and $h=768$, and we found that a larger $h$ does not guarantee better performance. There are many factors involved, but so far this hyperparameter does not have much effect in these settings. With more training samples, more complex tasks, or more Transformer layers, it may be a different story.

Thank you for your reply! :)
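As a rough companion to the exchange above, here is a minimal sketch that builds a small Transformer encoder at each hidden size and compares parameter counts. PyTorch is an assumption about the project's framework, and the layer count, head count, and feed-forward width are illustrative choices, not the repository's actual configuration:

```python
import torch.nn as nn

def count_params(h: int, n_layers: int = 2, n_heads: int = 4) -> int:
    """Build a small Transformer encoder with hidden size h and return its parameter count."""
    layer = nn.TransformerEncoderLayer(d_model=h, nhead=n_heads, dim_feedforward=4 * h)
    encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
    return sum(p.numel() for p in encoder.parameters())

# Compare the two settings discussed above: h = 128 vs. h = 768.
for h in (128, 768):
    print(f"h={h}: {count_params(h):,} parameters")
```

Running this shows the large capacity gap between the two settings, which puts the observation that the extra capacity did not improve performance in these settings into perspective.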