This repository formed part of the work "Evaluating Transformers as Memory Systems in Reinforcement Learning", which investigated several Transformer variants and compared them to an LSTM baseline. The scaling properties of these models were examined along two specific aspects of memory: length and size. Memory length refers to retaining information over increasing spans of time, while memory size refers to the quantity of information that must be retained. The full report is available here.
Memory is an important component of effective learning systems and is crucial in non-Markovian as well as partially observable environments. In recent years, Long Short-Term Memory (LSTM) networks have been the dominant mechanism for providing memory in reinforcement learning; however, the success of Transformers in natural language processing has highlighted a promising and viable alternative. Memory in reinforcement learning is particularly difficult because rewards are often sparse and distributed over many time steps. Early research into Transformers as memory mechanisms for reinforcement learning indicated that the canonical model is not suitable, and that additional gated recurrent units and architectural modifications are necessary to stabilize these models. Several further improvements to the canonical model have extended its capabilities, such as increasing the attention span, dynamically selecting the number of per-symbol processing steps, and accelerating convergence. It remains unclear, however, whether combining these improvements yields meaningful performance gains overall. This dissertation examines several extensions to the canonical Transformer as memory mechanisms in reinforcement learning and empirically studies their combination, which we term the Integrated Transformer. Our findings support prior work suggesting that gated variants of the Transformer architecture may outperform LSTMs as memory networks in reinforcement learning. However, our results indicate that while gated variants may be able to model dependencies over a longer temporal horizon, they do not necessarily outperform LSTMs when tasked with retaining increasing quantities of information.
Build the Docker image:
foo@bar:~$ make build
Run the Docker image:
Without GPU:
foo@bar:~$ make up
With GPU:
foo@bar:~$ make up USE_GPU=True
Specify the experiment you would like to run by editing the run_experiments.sh
file. Then run the experiment using the following make command:
foo@bar:~$ make run
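For reference, run_experiments.sh is expected to hold the experiment invocation(s). A minimal hypothetical sketch is shown here; the override values are illustrative only and mirror the command-line interface described in this README:

```shell
#!/bin/bash
# Hypothetical sketch of run_experiments.sh -- edit the overrides to
# select the experiment; the values below are illustrative only.
python experiment.py env.name="memory_length" memory="lstm"
python experiment.py env.name="memory_size" memory="gtrxl" memory.num_layers=10
```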
The configuration for each experiment can be specified by writing or editing a configuration file (e.g. configs/experiment.yaml) or with flags on the command line. The configuration files are managed with Hydra, and the following options are available:
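As a hypothetical illustration, an experiment configuration file might look like the following; the key names (env.name, memory, experiment_info.device) are assumptions inferred from the command-line overrides used in this README, not a confirmed schema:

```yaml
# Hypothetical sketch of a Hydra experiment config -- key names are
# inferred from the CLI overrides shown in this README.
defaults:
  - memory: gtrxl          # selects the memory model config group

env:
  name: memory_size        # which memory experiment to run
experiment_info:
  device: cuda             # "cuda" or "cpu"
```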
The hyperparameters for the model for each experiment can also be specified by adding them to the command line. For example:
python experiment.py env.name="memory_size" memory="gtrxl" memory.num_layers=10 experiment_info.device="cuda"
The Gated Transformer-XL, the LSTM and the Integrated Transformer were the top three performing models on the memory length experiments, followed by the Universal Transformer, ReZero and finally the Transformer-XL.
The Gated Transformer-XL, the LSTM and the Integrated Transformer were also the top three performing models on the memory size experiments, while the Universal Transformer, ReZero and the Transformer-XL performed very poorly.