Mario_PPO_RND
Playing Super Mario Bros with Proximal Policy Optimization (PPO) and Random Network Distillation (RND)
Introduction
My PyTorch implementation of Proximal Policy Optimization (PPO) combined with Random Network Distillation (RND) for playing Super Mario Bros. See the PPO paper and the RND paper for the underlying algorithms.
Results
Motivation
I first tried both A2C and PPO, but neither could complete the hardest stage (8-4); PPO helped Mario complete only 31/32 stages. When training PPO on stage 8-4, I ran into three problems:
- The reward system is very bad for this stage: Mario keeps earning rewards while moving in a loop, so he never learns to complete it. I solved this similarly to how I handled stages 4-4 and 7-4.
- The coordinate system is poorly implemented, and some x positions are duplicated:
  - The first correct pipe has duplicate coordinates, and its x position is smaller than that of the road ahead.
  - The underwater section has its x coordinates reset.
  - Because of this, I had to hard-code the reward system: I mark each duplicated x-coordinate range as a looped segment, and when Mario enters one I set done = True and give a negative reward (see the sketch after this list).
- Mario needs to find a hidden brick to complete this stage:
  - There is a pipe that Mario must jump onto a hidden brick before entering.
  - If Mario just keeps going right past that pipe, he enters a looped path.
  - The map runs a long way before Mario is forced to discover this secret.
  - If we simply prevent Mario from going right (avoiding the repeating path) as usual, he learns that standing still is the best strategy instead of searching for the hidden brick.
- I tried three reward-shaping strategies with plain PPO, but none was effective enough, so I looked for a method that lets the agent explore better and combined RND with PPO to solve stage 8-4:
  - Give a +50 reward when Mario finds the hidden brick.
  - Give a +50 reward when Mario goes down the correct pipe.
  - Add a larger death penalty near the hidden brick (-100 instead of the -50 used elsewhere on this map).
- In my code, I always use the last strategy because it seems the fairest: the other two don't actually encourage Mario to explore the environment for the brick, they just force him onto the correct path. If you want the extra reward for going down the correct pipe (option 2), set config.additional_bonus_state_8_4_option = "right_pipe".
- Note: Mario can actually learn a double jump to get on top of the pipe without finding the brick, but this is very difficult, requires a lot of luck, and is hard to reproduce if you train again.
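Below is a minimal sketch of the loop-penalty idea described above, written as a Gym-style wrapper. The segment boundaries in LOOP_SEGMENTS are illustrative placeholders, not the values used in my code:

```python
import gym

# Hypothetical x-coordinate ranges of looped paths on stage 8-4 (illustrative values only).
LOOP_SEGMENTS = [(1200, 1400), (2600, 2800)]

class LoopPenaltyWrapper(gym.Wrapper):
    """End the episode with a negative reward when Mario enters a looped segment."""
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        x = info.get("x_pos", 0)
        if any(lo <= x <= hi for lo, hi in LOOP_SEGMENTS):
            # Mario walked into a repeated segment: terminate and penalize.
            reward, done = -50, True
        return obs, reward, done, info
```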
How to use it
You can use my notebook to train and test the agent very easily:
- Train your model by running all cells before the test session.
- Test your trained model by running all cells except agent.train(); just pass your model path to agent.load_model(model_path).
Or you can use train.py and test.py if you don't want to use the notebook:
- Train your model by running train.py. For example, to train stage 1-4: `python train.py --world 1 --stage 4 --num_envs 8`
- Test your trained model by running test.py. For example, to test stage 1-4: `python test.py --world 1 --stage 4 --pretrained_model best_model.pth --num_envs 2`
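If you want to interact with the raw environment outside these scripts, here is a minimal random-agent sketch using the pinned versions from Requirements; it mirrors standard gym-super-mario-bros usage and is not my training code:

```python
import gym_super_mario_bros
from nes_py.wrappers import JoypadSpace
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT

# Stage 1-4 environment (the same stage as the example commands above).
env = gym_super_mario_bros.make("SuperMarioBros-1-4-v0")
env = JoypadSpace(env, SIMPLE_MOVEMENT)  # reduce the NES action space

done = True
for _ in range(500):
    if done:
        state = env.reset()
    # Swap the random action for a trained policy's action to evaluate a model.
    state, reward, done, info = env.step(env.action_space.sample())
env.close()
```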
Trained models
You can find the trained models in the trained_model folder.
Hyperparameters
How I found hyperparameters for each stage:
- First, I looked for optimal hyperparameters on stage 8-4 (I did this project mainly to complete this stage, because plain PPO already wins the other 31/32 stages):
  - A larger number of environments works better because more is explored in parallel. I set the number of environments to 32 (when testing plain PPO, more than 16 environments had no effect).
  - As I saw with PPO, the rollout length (learn steps) needs to be greater than the episode length for stable training (except on easy stages), so that the model sees a correct return. So I set the learning steps to 756.
  - Because I couldn't make a batch size of 64 work (it requires a smaller update_proportion, as I discovered later), I set the batch size to 256.
  - I tuned gamma and gamma_int over (0.9, 0.95, 0.99, 0.999) and found that gamma = 0.99 and gamma_int = 0.99 work best.
  - I set update_proportion to 0.25, following jcwleo's RND implementation. While training stage 8-4 I didn't change this parameter and didn't yet realize the correlation between update_proportion and batch size (a smaller batch size requires a smaller update_proportion; I discuss this below).
  - I tuned int_adv_coef and ext_adv_coef and found that int_adv_coef = 1 and ext_adv_coef = 2 work best.
  - I tuned entropy_coef between 0.01 and 0.05 and found that 0.05 works better (because we need more exploration).
  - I didn't change epoch = 10, lambda = 0.95, learning_rate = 7e-5, target_kl = 0.05, clip_param = 0.2, max_grad_norm = 0.5, norm_adv = False, V_coef = 0.5, or loss_type = 'mse' (I believe they are best from tuning plain PPO; admittedly a biased experiment).
- After finding optimal hyperparameters for stage 8-4, I used them as the default and won almost all stages (changing only the number of environments to 8 or 16).
- Some stages couldn't be completed with the default hyperparameters, and I noticed that RND mostly just slows training down (compared with plain PPO), so I set int_adv_coef = 0.1 and ext_adv_coef = 1. These values helped me complete more difficult stages (you could revert to plain PPO, but since I want to test RND I don't disable it).
- That left only a few difficult stages. I suspected that with a batch size of 64, RND updates more frequently, making intrinsic rewards ineffective. I tried update_proportion = 0.05 with batch size 64, and the algorithm worked, completing more stages (a sketch of how update_proportion gates the RND update follows this list).
- Finally, I set loss_type to 'huber' and completed all stages.
- I tuned entropy_coef between 0.01 and 0.05 somewhat at random (I don't have enough evidence about the effect of this hyperparameter).
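For reference, a minimal sketch of how update_proportion is usually applied in RND implementations such as jcwleo's (referenced above): the predictor loss is masked so that only a random fraction of each batch updates the predictor network. Tensor names here are illustrative:

```python
import torch
import torch.nn.functional as F

def rnd_predictor_loss(predict_feat, target_feat, update_proportion=0.25):
    """MSE between the RND predictor and the frozen target network, masked so
    that only a random update_proportion of samples contribute gradients."""
    # Per-sample prediction error.
    loss = F.mse_loss(predict_feat, target_feat.detach(), reduction="none").mean(dim=-1)
    # Randomly keep roughly update_proportion of the batch.
    mask = (torch.rand(loss.size(0), device=loss.device) < update_proportion).float()
    # Average over kept samples only (guard against an empty mask).
    return (loss * mask).sum() / torch.clamp(mask.sum(), min=1.0)
```

This is why batch size and update_proportion interact: with batch size 64 there are four times as many minibatch updates per rollout as with 256, so the predictor fits the target faster and intrinsic rewards collapse; lowering update_proportion compensates. PPO then optimizes the weighted advantage ext_adv_coef * A_ext + int_adv_coef * A_int, with intrinsic returns discounted by gamma_int.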
- Note:
- RL is very sensitive to hyperparameters, and some hyperparameters work for certain stages but not for others. Therefore, difficult stages like 5-3, 7-2, 8-1, and 8-4 need custom hyperparameters. I didn't have enough time and resources to find one optimal set that completes all stages.
- I use min-max scaling for the intrinsic reward because running mean-std normalization didn't work for me (see the sketch below).
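A minimal sketch of that min-max scaling, assuming the intrinsic rewards are RND prediction errors collected over one rollout (the exact placement in my code may differ):

```python
import torch

def minmax_scale_intrinsic(int_rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale one rollout's intrinsic rewards into [0, 1] instead of dividing
    by a running standard deviation."""
    lo, hi = int_rewards.min(), int_rewards.max()
    return (int_rewards - lo) / (hi - lo + eps)
```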
| World | Stage | num_envs | learn_step | batch_size | epoch | lambda | gamma | gamma_int | learning_rate | target_kl | clip_param | max_grad_norm | update_proportion | norm_adv | int_adv_coef | ext_adv_coef | V_coef | entropy_coef | loss_type | training_step | training_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| default | | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 435990 | 5:30:41 |
| 1 | 1 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 435990 | 5:30:41 |
| 1 | 2 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 467440 | 7:43:46 |
| 1 | 3 | 16 | 512 | 64 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.05 | FALSE | 0.1 | 1 | 0.5 | 0.05 | mse | 444917 | 15:32:29 |
| 1 | 4 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 64982 | 0:42:23 |
| 2 | 1 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 1202627 | 14:45:26 |
| 2 | 2 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 0.1 | 1 | 0.5 | 0.01 | mse | 1876990 | 1 day, 18:37:18 |
| 2 | 3 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 392697 | 6:06:10 |
| 2 | 4 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 145339 | 2:16:03 |
| 3 | 1 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 193534 | 4:51:21 |
| 3 | 2 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 195001 | 3:15:37 |
| 3 | 3 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 843736 | 16:24:42 |
| 3 | 4 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 118255 | 1:39:47 |
| 4 | 1 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 219639 | 2:39:24 |
| 4 | 2 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 417239 | 10:34:17 |
| 4 | 3 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 0.1 | 1 | 0.5 | 0.01 | mse | 211948 | 5:19:50 |
| 4 | 4 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 111064 | 2:48:40 |
| 5 | 1 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 268275 | 3:11:07 |
| 5 | 2 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 1891820 | 1 day, 0:59:46 |
| 5 | 3 | 16 | 512 | 64 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.05 | FALSE | 0.1 | 1 | 0.5 | 0.05 | huber | 1739262 | 2 days, 2:44:45 |
| 5 | 4 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 370659 | 9:21:28 |
| 6 | 1 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 244212 | 3:03:29 |
| 6 | 2 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 535523 | 11:57:35 |
| 6 | 3 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.01 | mse | 153598 | 2:46:04 |
| 6 | 4 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 498686 | 5:42:43 |
| 7 | 1 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 500220 | 6:06:49 |
| 7 | 2 | 16 | 512 | 64 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.05 | FALSE | 0.1 | 1 | 0.5 | 0.05 | mse | 3218417 | 2 days, 23:13:37 |
| 7 | 3 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 398336 | 4:39:01 |
| 7 | 4 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 201684 | 2:56:41 |
| 8 | 1 | 16 | 512 | 64 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.05 | FALSE | 0.1 | 1 | 0.5 | 0.05 | huber | 3058672 | 3 days, 22:58:34 |
| 8 | 2 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 723940 | 17:21:15 |
| 8 | 3 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 593399 | 13:29:12 |
| 8 | 4 | 32 | 756 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 985820 | 1 day, 7:48:34 |
Questions
- Is this code guaranteed to complete the stages if you train with it?
  - These hyperparameters don't guarantee that you will complete a stage, but I am confident you can win with them unless you have an unlucky day (because of randomness, it may take 2-3 runs to win).
- How long do you train agents?
  - From a few hours to more than a day. Time depends on hardware; I used many different machines, so the times are not precise.
- How can you improve this code?
  - You could separate the test-agent part into its own thread or process. I'm not good at multi-threaded programming, so I didn't do this (a sketch follows below).
  - You could tune the hyperparameters further.
  - You could apply new network architectures (like attention); maybe they would work.
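A minimal sketch of that first suggestion, assuming a hypothetical evaluate_checkpoint helper (nothing here comes from this repo; it only shows evaluation running in a separate process while training continues):

```python
import torch.multiprocessing as mp

def evaluate_checkpoint(model_path: str, world: int, stage: int) -> None:
    """Hypothetical helper: load model_path and run evaluation episodes on (world, stage)."""
    ...

if __name__ == "__main__":
    # Evaluate the latest checkpoint concurrently instead of blocking the training loop.
    proc = mp.Process(target=evaluate_checkpoint, args=("best_model.pth", 8, 4))
    proc.start()
    # ... training continues here ...
    proc.join()
```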
- What is the importance of RND?
  - RND mainly helps complete stage 8-4, which requires more exploration.
  - Personally, I feel it doesn't help the other stages and slows down training.
  - RND adds many hyperparameters, which makes choosing them harder. And we all know that hyperparameters greatly affect RL.
Requirements
- python 3 (> 3.6)
- gym==0.25.2
- gym-super-mario-bros==7.4.0
- imageio
- imageio-ffmpeg
- opencv-python (cv2)
- pytorch
- numpy
Acknowledgements
With this code, I completed all 32/32 stages of Super Mario Bros. It includes a new custom reward system (for stage 8-4) and PPO+RND for agent training.
Reference