dfsbora / latex-math-model


Refine the Universal Transformer model to match the new LSTM model #15

Open ich-spiegelmann opened 2 months ago


Now that the LSTM has finally reached the level of the char-RNN baseline from Karpathy's blog (http://karpathy.github.io/2015/05/21/rnn-effectiveness), the Universal Transformer (UT) model needs to be improved to match it as well.

As an option, it would be interesting to try the standard Hugging Face Trainer together with a specialized LaTeX tokenizer, e.g. the one from MathBERTa.
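To illustrate what a LaTeX-aware tokenizer buys over a character-level one, here is a minimal stdlib sketch (not the MathBERTa tokenizer itself, just a hypothetical stand-in): it keeps LaTeX commands such as `\frac` and `\alpha` as single tokens instead of splitting them into characters.

```python
import re

# One token per LaTeX command (backslash + letters), per escaped symbol,
# or per remaining non-whitespace character. Whitespace is dropped.
_LATEX_TOKEN = re.compile(r"\\[A-Za-z]+|\\.|\S")

def tokenize_latex(text: str) -> list[str]:
    """Split a LaTeX string into command-aware tokens."""
    return _LATEX_TOKEN.findall(text)

print(tokenize_latex(r"\frac{a}{b} + \alpha^2"))
# ['\\frac', '{', 'a', '}', '{', 'b', '}', '+', '\\alpha', '^', '2']
```

A pretrained tokenizer like MathBERTa's would go further (subword merges learned on math corpora), but even this simple scheme shortens sequences and gives the model whole commands as atomic units.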