EliasHasle opened this issue 1 year ago
Thanks @EliasHasle for the comment. We also had what you describe implemented in the code here: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/research/universal_transformer_util.py#L1162
... and we ran experiments with all the different cases for UT: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/research/universal_transformer_util.py#L1045
OK. I did not know that. Thanks. 😄 The different ACT modes are not mentioned in the version of the paper I read from arXiv. Did you discuss them in another version, or in your thesis?
I had all the results of this ablation written in an internal doc and I'm not able to find it :/ But I remember that the version that made it to the paper/thesis was the best one across the board.
https://github.com/MostafaDehghani/Thesis/blob/7033ec5471584f1f422e33693dcdb51583a5eedb/04-part-03/chapter-06/figs_and_tables/alg_ut_with_act.tex#L91
Graves suggests approximating the final output (of a stochastically halting model) as a weighted average of states according to the stepwise halting probabilities.
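For concreteness, here is a minimal sketch of what I mean, assuming the per-step states and halting probabilities have already been collected (all names here, `graves_act_readout`, `states`, `halt_probs`, `eps`, are hypothetical and not taken from any of the linked codebases):

```python
import numpy as np

def graves_act_readout(states, halt_probs, eps=0.01):
    """Graves-style ACT readout: the final output is a weighted average
    over ALL intermediate states, each weighted by its own per-step
    halting probability, with the leftover probability mass (the
    remainder) assigned to the step at which the model halts."""
    output = np.zeros_like(states[0])
    cumulative = 0.0
    for s, h in zip(states, halt_probs):
        if cumulative + h >= 1.0 - eps:       # halting condition reached
            output += (1.0 - cumulative) * s  # remainder R = 1 - sum of earlier h
            break
        output += h * s
        cumulative += h
    return output
```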
Your code instead sequentially updates the output (and possibly even the internal state?) as a weighted average of the previous state and the newly generated one, with the split given by the current step's halting probability.
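A minimal sketch of that incremental rule, under the same hypothetical names (this is my reading of the linked code, not a verbatim excerpt):

```python
import numpy as np

def sequential_act_readout(states, halt_probs, eps=0.01):
    """Incremental variant: at every step the running output is replaced
    by a convex combination of the previous output and the new state,
    split by the CURRENT step's halting probability.  Each later step
    rescales all earlier contributions by (1 - h), so the effective
    weight of step t ends up being h_t * prod_{k>t}(1 - h_k), not h_t."""
    output = np.zeros_like(states[0])
    cumulative = 0.0
    for s, h in zip(states, halt_probs):
        if cumulative >= 1.0 - eps:           # soft halt: stop updating
            break
        output = h * s + (1.0 - h) * output   # EMA-style convex mix
        cumulative += h
    return output
```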
The two are not at all equivalent. Was this code used for the experiments on the Universal Transformers?
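To make the non-equivalence concrete, here is a two-step toy check using the sketches above (the numbers are purely illustrative):

```python
import numpy as np

s1, s2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
states, halt_probs = [s1, s2], [0.5, 0.5]

print(graves_act_readout(states, halt_probs))      # [0.5  0.5] -> 0.5*s1 + 0.5*s2
print(sequential_act_readout(states, halt_probs))  # [0.25 0.5] -> 0.25*s1 + 0.5*s2
```

Note that the sequential weights (0.25, 0.5) do not even sum to 1, whereas the remainder construction in Graves always yields a normalized average.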
Note that the update in this independent implementation of UT matches Graves: https://github.com/hncshp/universal-transformers-git/blob/master/universal_transformers.py#L753
Meanwhile, this (way more popular) one seems to match your paper: https://github.com/andreamad8/Universal-Transformer-Pytorch/blob/master/models/UTransformer.py#L286