Syrlander / numpad-gym


Could you offer some example code for me? #3

Open · hydro-man opened 1 year ago

hydro-man commented 1 year ago

When I use PPO to train my agent in the Numpad env. (2x2 size), I find that the reward always stays at a low level. So I want to know how to set the hyperparameters (such as the learning rate) to train my model.

Emil2468 commented 1 year ago

In this repo you'll find the code that we used to train our models: https://github.com/Syrlander/transformers-in-RL.

Specifically, you can find a PPO config that we used here https://github.com/Syrlander/transformers-in-RL/blob/main/rl_thesis/models/PPO/config.py and here https://github.com/Syrlander/transformers-in-RL/blob/main/configs/numpad_discrete/ppo_config.json (the properties in the JSON file overwrite the ones in the Python file). We tried out quite a few different policy and model configurations, and I am unsure which gave the best results, but I believe that the ones you'll find in the two files referenced above, combined with an MLP policy, gave surprisingly good results, considering that to really shine in this task the model needs some form of long-term memory.

We mostly trained in the 3x3 environment, but I suppose that the 2x2 would be simpler to solve. What policy are you using?
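For reference, a minimal sketch of what training PPO with an MLP policy on the Numpad environment could look like, using stable-baselines3. The environment id and the hyperparameter values below are placeholders, not the exact settings from the linked config files:

```python
# Minimal PPO + MLP-policy training sketch (hypothetical setup).
# The env id below is a placeholder -- use whatever constructor or
# registration name numpad-gym actually exposes in your install.
import gym
from stable_baselines3 import PPO

# Assumption: the 2x2 discrete Numpad env follows the standard Gym API.
env = gym.make("Numpad2x2-v0")  # placeholder env id

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,   # placeholder values; compare against the
    n_steps=2048,         # configs linked above for real experiments
    batch_size=64,
    gamma=0.99,
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
model.save("ppo_numpad_mlp")
```

If the reward curve stays flat, the learning rate and the total number of timesteps are the usual first knobs to try, alongside the values in the linked configs.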

hydro-man commented 1 year ago

Thank you for your reply!! I got the code from here: https://github.com/Syrlander/transformers-in-RL/tree/main/rl_thesis/gated_transformer and I am using it to learn the Numpad env. with GTrXL and VMPO/PPO. I just replaced the deepmind_lab env. with your Numpad env. and fed the 0/1 observation matrix directly into the GTrXL. I want to know if the hyperparameters here https://github.com/Syrlander/transformers-in-RL/blob/main/rl_thesis/models/GTrXL/config.py could work for my task.
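As an illustration of the kind of glue code described above (feeding the 0/1 observation matrix to the policy), a sketch of a Gym observation wrapper that flattens the matrix might look like this; the Numpad constructor name in the usage comment is a placeholder:

```python
import gym
import numpy as np

class FlattenNumpadObs(gym.ObservationWrapper):
    """Flatten the 2D 0/1 Numpad observation into a 1D float vector
    so it can be fed directly to a transformer or MLP policy."""

    def __init__(self, env):
        super().__init__(env)
        size = int(np.prod(env.observation_space.shape))
        self.observation_space = gym.spaces.Box(
            low=0.0, high=1.0, shape=(size,), dtype=np.float32
        )

    def observation(self, obs):
        return np.asarray(obs, dtype=np.float32).ravel()

# Usage (placeholder constructor name for the Numpad env):
# env = FlattenNumpadObs(NumpadDiscrete(size=2))
```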

Emil2468 commented 1 year ago

We did not have any luck with the GTrXL. We think the issue may have been the "short" training time: we trained our model for a week, but calculated that to get even close to the number of training steps used in the paper we compared to (https://arxiv.org/abs/1910.06764), we would have to train for 2-4 times as long with our setup (a single GPU and 4-8 CPUs).

So unfortunately we don't know what hyperparameters are good for this model.

hydro-man commented 1 year ago

Thank you for your help. I also noticed the problem with the number of training steps: in that paper, the 2x2 Numpad uses 1e8 steps. I will try to use PPO to train this env. Could you tell me which graphics card you used and how much time you spent training the 3x3 Numpad env.? Thank you again for your answer.

Emil2468 commented 1 year ago

We used a number of different GPUs depending on their availability on the cluster we had access to; it was either a Quadro RTX 6000 or a Titan X/Xp/V/RTX. With that, we trained the GTrXL with VMPO for about a week, but without the model getting better-than-random results.

We also tried training a PPO model with a simpler MLP policy, which actually did get better-than-random results, and that only took about 10 minutes.
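For context, "better than random" can be checked with a simple evaluation loop like the following sketch (again assuming a standard Gym interface; the env id and saved model name are placeholders carried over from the earlier sketch):

```python
import gym
import numpy as np
from stable_baselines3 import PPO

def average_return(env, policy_fn, episodes=20):
    """Run `episodes` rollouts and return the mean episode reward."""
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy_fn(obs))
            total += reward
        returns.append(total)
    return float(np.mean(returns))

env = gym.make("Numpad2x2-v0")            # placeholder env id
model = PPO.load("ppo_numpad_mlp")        # placeholder checkpoint name

random_score = average_return(env, lambda _: env.action_space.sample())
ppo_score = average_return(env, lambda o: model.predict(o, deterministic=True)[0])
print(f"random: {random_score:.2f}  ppo: {ppo_score:.2f}")
```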

Syrlander commented 1 year ago

As a side note, @hydro-man, if you intend to implement the larger GTrXL models, I would strongly recommend taking a look at the training setup (Appendix B.1) of https://arxiv.org/pdf/1910.06764.pdf. They utilize a distributed setup with multiple TPUs, based on IMPALA (https://arxiv.org/abs/1802.01561), which is why they can achieve billions of training steps within a reasonable amount of wall-clock time. If you are using PyTorch, there is an implementation by Meta at https://github.com/facebookresearch/torchbeast.