I have implemented the E3B intrinsic reward proposed here. I have also added the SuperMarioBros environment, which I used to validate the E3B implementation, and fixed the pretraining mode for on-policy agents:
Before: the intrinsic rewards were simply added to the extrinsic returns and advantages.
Now: in pretraining mode, the intrinsic returns and intrinsic advantages are computed from the intrinsic rewards alone. When using intrinsic + extrinsic rewards, the behavior is unchanged.
This has significantly improved the performance of intrinsic reward algorithms in pretraining mode.
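The fix described above can be sketched roughly as follows (an illustrative GAE computation, not this PR's actual code; the function signature and variable names are assumptions):

```python
import numpy as np

def compute_returns_and_advantages(ext_rewards, int_rewards, values,
                                   last_value, dones, gamma=0.99,
                                   gae_lambda=0.95, pretraining=False):
    """GAE over intrinsic-only rewards (pretraining) or the combined rewards."""
    # Pretraining: optimize the intrinsic rewards alone; otherwise sum both streams.
    rewards = int_rewards if pretraining else ext_rewards + int_rewards
    n = len(rewards)
    advantages = np.zeros(n)
    gae = 0.0
    for t in reversed(range(n)):
        next_value = last_value if t == n - 1 else values[t + 1]
        next_non_terminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * next_non_terminal - values[t]
        gae = delta + gamma * gae_lambda * next_non_terminal * gae
        advantages[t] = gae
    returns = advantages + values
    return returns, advantages
```

With `pretraining=True` the policy gradient is driven purely by the intrinsic signal, instead of the intrinsic rewards merely being folded into the extrinsic returns.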
This is the performance of PPO+E3B in pretraining mode on the SuperMarioBros-1-1-v3 environment (i.e., without access to task rewards!):
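For reviewers unfamiliar with E3B, the elliptical episodic bonus at its core can be sketched as follows (a minimal NumPy sketch under my own assumptions; the class name, the ridge value, and how features are extracted are illustrative, not this PR's implementation):

```python
import numpy as np

class E3BBonus:
    """Elliptical episodic bonus: b(s) = phi(s)^T C^{-1} phi(s)."""

    def __init__(self, feature_dim: int, ridge: float = 0.1):
        self.feature_dim = feature_dim
        self.ridge = ridge
        self.reset()

    def reset(self):
        # C starts as ridge * I at every episode boundary,
        # so we track its inverse directly: C^{-1} = I / ridge.
        self.inv_cov = np.eye(self.feature_dim) / self.ridge

    def bonus(self, phi: np.ndarray) -> float:
        # Bonus for the current feature vector, before adding it to C.
        u = self.inv_cov @ phi
        b = float(phi @ u)
        # Rank-1 Sherman-Morrison update of C^{-1} after C += phi phi^T,
        # avoiding an O(d^3) matrix inversion per step.
        self.inv_cov -= np.outer(u, u) / (1.0 + b)
        return b
```

Repeated visits to similar features shrink the bonus, which is what drives exploration: in the sketch above, feeding the same feature vector twice yields a strictly smaller bonus the second time.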
Motivation and Context
1) E3B is a recent algorithm that achieves SOTA results in complex environments, so it's a valuable contribution.
2) During the pretraining phase, the intrinsic rewards were not being optimized properly.
3) Added the SuperMarioBros environment because it is cool and helps evaluate the performance of exploration algorithms: in Mario, good exploratory agents achieve high task rewards.
Types of changes
[x] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)
Checklist
[ ] `make format` (required)
[ ] `make check-codestyle` and `make lint` (required)
[ ] `make pytest` and `make type` both pass (required)
[ ] `make doc` (required)