CVHvn / Mario_PPO_RND

Playing Super Mario Bros with Proximal Policy Optimization (PPO) and Random Network Distillation (RND)
MIT License

Mario_PPO_RND

Introduction

This is my PyTorch implementation of Proximal Policy Optimization (PPO) combined with Random Network Distillation (RND) for playing Super Mario Bros. See the PPO paper and the RND paper for details.
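To make the RND idea concrete, here is a minimal, framework-free sketch of the intrinsic reward. It is not the code from this repository: the linear "networks", the helper names (`make_linear`, `forward`, `intrinsic_reward`, `train_predictor`), and the toy observations are all assumptions chosen for illustration. A fixed random target network is never trained, a predictor is trained to match it, and the prediction error serves as an exploration bonus that shrinks for observations the agent has already seen.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Random weight matrix (out_dim x in_dim) for a bias-free linear map."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

def forward(weights, obs):
    """Apply the linear map to one observation vector."""
    return [sum(w * x for w, x in zip(row, obs)) for row in weights]

def intrinsic_reward(target, predictor, obs):
    """RND bonus: mean squared error between predictor and target features."""
    t = forward(target, obs)
    p = forward(predictor, obs)
    return sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(t)

def train_predictor(target, predictor, obs, lr=0.05):
    """One gradient step on the predictor only; the target stays frozen."""
    t = forward(target, obs)
    p = forward(predictor, obs)
    n = len(t)
    for i, row in enumerate(predictor):
        grad_scale = 2.0 * (p[i] - t[i]) / n
        for j in range(len(row)):
            row[j] -= lr * grad_scale * obs[j]

rng = random.Random(0)
target = make_linear(4, 3, rng)      # fixed, randomly initialized
predictor = make_linear(4, 3, rng)   # trained to imitate the target

seen = [1.0, 0.5, -0.5, 0.0]         # hypothetical frequently visited state
novel = [0.0, 0.0, 0.0, 1.0]         # hypothetical unvisited state

before = intrinsic_reward(target, predictor, seen)
for _ in range(200):
    train_predictor(target, predictor, seen)
after = intrinsic_reward(target, predictor, seen)
novel_bonus = intrinsic_reward(target, predictor, novel)
```

After training on `seen`, its bonus is close to zero while the bonus for `novel` stays at its initial value, which is what pushes the agent toward unexplored parts of a stage. A real implementation would use convolutional networks on normalized frames instead of these toy linear maps.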

Results

Motivation

I tried both A2C and PPO, but neither algorithm could complete the hardest stage (stage 8-4). PPO helped Mario complete only 31 of the 32 stages. When I tried stage 8-4 with PPO, I ran into three problems:

How to use it

You can use my notebook to train and test the agent easily:

Alternatively, you can use train.py and test.py if you don't want to use the notebook:

Trained models

You can find the trained models in the trained_model folder.

Hyperparameters

How I found the hyperparameters for each stage:

| World | Stage | num_envs | learn_step | batchsize | epoch | lambda | gamma | gamma_int | learning_rate | target_kl | clip_param | max_grad_norm | update_proportion | norm_adv | int_adv_coef | ext_adv_coef | V_coef | entropy_coef | loss_type | training_step | training_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| default | | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 435990 | 5:30:41 |
| 1 | 1 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 435990 | 5:30:41 |
| 1 | 2 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 467440 | 7:43:46 |
| 1 | 3 | 16 | 512 | 64 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.05 | FALSE | 0.1 | 1 | 0.5 | 0.05 | mse | 444917 | 15:32:29 |
| 1 | 4 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 64982 | 0:42:23 |
| 2 | 1 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 1202627 | 14:45:26 |
| 2 | 2 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 0.1 | 1 | 0.5 | 0.01 | mse | 1876990 | 1 day, 18:37:18 |
| 2 | 3 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 392697 | 6:06:10 |
| 2 | 4 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 145339 | 2:16:03 |
| 3 | 1 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 193534 | 4:51:21 |
| 3 | 2 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 195001 | 3:15:37 |
| 3 | 3 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 843736 | 16:24:42 |
| 3 | 4 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 118255 | 1:39:47 |
| 4 | 1 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 219639 | 2:39:24 |
| 4 | 2 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 417239 | 10:34:17 |
| 4 | 3 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 0.1 | 1 | 0.5 | 0.01 | mse | 211948 | 5:19:50 |
| 4 | 4 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 111064 | 2:48:40 |
| 5 | 1 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 268275 | 3:11:07 |
| 5 | 2 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 1891820 | 1 day, 0:59:46 |
| 5 | 3 | 16 | 512 | 64 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.05 | FALSE | 0.1 | 1 | 0.5 | 0.05 | huber | 1739262 | 2 days, 2:44:45 |
| 5 | 4 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 370659 | 9:21:28 |
| 6 | 1 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 244212 | 3:03:29 |
| 6 | 2 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 535523 | 11:57:35 |
| 6 | 3 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.01 | mse | 153598 | 2:46:04 |
| 6 | 4 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 498686 | 5:42:43 |
| 7 | 1 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 500220 | 6:06:49 |
| 7 | 2 | 16 | 512 | 64 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.05 | FALSE | 0.1 | 1 | 0.5 | 0.05 | mse | 3218417 | 2 days, 23:13:37 |
| 7 | 3 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 398336 | 4:39:01 |
| 7 | 4 | 8 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 201684 | 2:56:41 |
| 8 | 1 | 16 | 512 | 64 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.05 | FALSE | 0.1 | 1 | 0.5 | 0.05 | huber | 3058672 | 3 days, 22:58:34 |
| 8 | 2 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 723940 | 17:21:15 |
| 8 | 3 | 16 | 512 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 593399 | 13:29:12 |
| 8 | 4 | 32 | 756 | 256 | 10 | 0.95 | 0.99 | 0.99 | 7e-5 | 0.05 | 0.2 | 0.5 | 0.25 | FALSE | 1 | 2 | 0.5 | 0.05 | mse | 985820 | 1 day, 7:48:34 |
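
To show how several columns in the table above typically fit together, here is a small sketch in plain Python. It is an illustration under assumptions, not the repository's training code. It assumes that `lambda`, `gamma`, and `gamma_int` drive Generalized Advantage Estimation for the extrinsic and intrinsic reward streams, that `ext_adv_coef` and `int_adv_coef` weight the two advantages as in the RND paper, and that `clip_param` is the PPO clipping range. The function names (`gae`, `combine_advantages`, `clipped_surrogate`) and the sample numbers are hypothetical, and episode boundaries are ignored for brevity.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout (no episode ends)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

def combine_advantages(ext_adv, int_adv, ext_adv_coef=2.0, int_adv_coef=1.0):
    """Weighted sum of extrinsic and intrinsic advantages (assumed usage)."""
    return [ext_adv_coef * e + int_adv_coef * i for e, i in zip(ext_adv, int_adv)]

def clipped_surrogate(ratio, advantage, clip_param=0.2):
    """PPO clipped objective for one sample (to be maximized)."""
    clipped = max(1.0 - clip_param, min(ratio, 1.0 + clip_param))
    return min(ratio * advantage, clipped * advantage)

# Toy rollout of three steps with made-up rewards and zero value estimates.
ext_adv = gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=0.99, lam=0.95)
int_adv = gae([0.1, 0.1, 0.1], [0.0, 0.0, 0.0], 0.0, gamma=0.99, lam=0.95)
total_adv = combine_advantages(ext_adv, int_adv)

# With a positive advantage, a ratio above 1 + clip_param earns no extra credit.
capped = clipped_surrogate(1.5, 1.0)
# With a negative advantage, the unclipped (more pessimistic) term is kept.
uncapped = clipped_surrogate(1.5, -1.0)
```

The defaults in the sketch mirror the `default` row of the table (`lambda` 0.95, `gamma` 0.99, `clip_param` 0.2, `ext_adv_coef` 2, `int_adv_coef` 1). Stages that use `int_adv_coef` 0.1 and `ext_adv_coef` 1 would, under this reading, weight exploration less heavily relative to the game reward.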

Questions

Requirements

Acknowledgements

With my code, I completed all 32 of the 32 stages of Super Mario Bros. The code includes a new custom reward system (for stage 8-4) and uses PPO+RND for agent training.

Reference