EdanToledo / Stoix

🏛️ A research-friendly codebase for fast experimentation of single-agent reinforcement learning in JAX • End-to-End JAX RL
Apache License 2.0

[FEATURE] Implement Rainbow DQN #81

Closed: RPegoud closed this issue 2 months ago

RPegoud commented 2 months ago

Feature: Implement Rainbow

Rainbow (paper) is a combination of several DQN variations, each of which needs to be supported:

Checklist:

- [ ] Double DQN (double Q-learning)
- [ ] Prioritised experience replay
- [ ] Dueling networks
- [ ] Multi-step bootstrap targets
- [ ] Distributional RL (C51)
- [ ] Noisy networks for exploration

RPegoud commented 2 months ago

So I guess the first step for Noisy DQN would be to define new linear layers as:

$$
\begin{align}
y &= (\mu^w + \sigma^w \odot \epsilon^w) \cdot x + \mu^b + \sigma^b \odot \epsilon^b \\
&\begin{cases}
\text{weight} &= \mu^w + \sigma^w \odot \epsilon^w \\
\text{bias} &= \mu^b + \sigma^b \odot \epsilon^b \\
\mu^w, \mu^b, \sigma^w, \sigma^b &: \text{learnable parameters} \\
\epsilon^w, \epsilon^b &: \text{noise random variables}
\end{cases}
\end{align}
$$

Any recommendations to get started with this?

EdanToledo commented 2 months ago

> So I guess the first step for Noisy DQN would be to define new linear layers as:
>
> $$y = (\mu^w + \sigma^w \odot \epsilon^w) \cdot x + \mu^b + \sigma^b \odot \epsilon^b$$
>
> Any recommendations to get started with this?

Hey, so yeah, basically I imagine something like that. Structurally, I think we can create a new file in the networks folder called layers.py and put a noisy linear layer in there, then add a new network in torso.py that is a noisy MLP torso.
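For reference, here is a minimal sketch of what such a layer could look like in Flax, assuming the factorised-Gaussian noise scheme from the NoisyNet paper (Fortunato et al., 2017). `NoisyLinear`, `sigma_zero`, and the `"noise"` RNG stream are illustrative names, not existing Stoix API:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


def _scale_noise(x: jnp.ndarray) -> jnp.ndarray:
    # f(x) = sign(x) * sqrt(|x|), as used for factorised Gaussian noise.
    return jnp.sign(x) * jnp.sqrt(jnp.abs(x))


class NoisyLinear(nn.Module):
    """Linear layer y = (mu_w + sigma_w * eps_w) x + mu_b + sigma_b * eps_b."""

    features: int
    sigma_zero: float = 0.5  # sigma_0 hyperparameter from the paper

    @nn.compact
    def __call__(self, x: jnp.ndarray) -> jnp.ndarray:
        in_features = x.shape[-1]
        bound = 1.0 / jnp.sqrt(in_features)
        sigma_init = self.sigma_zero / jnp.sqrt(in_features)

        def mu_init(key, shape, dtype=jnp.float32):
            # mu ~ U[-1/sqrt(p), 1/sqrt(p)] with p = number of inputs.
            return jax.random.uniform(key, shape, dtype, -bound, bound)

        # Learnable parameters: mu and sigma for both weight and bias.
        mu_w = self.param("mu_w", mu_init, (in_features, self.features))
        sigma_w = self.param(
            "sigma_w", nn.initializers.constant(sigma_init), (in_features, self.features)
        )
        mu_b = self.param("mu_b", mu_init, (self.features,))
        sigma_b = self.param("sigma_b", nn.initializers.constant(sigma_init), (self.features,))

        # Factorised noise: one eps vector per input dim, one per output dim.
        in_key, out_key = jax.random.split(self.make_rng("noise"))
        eps_in = _scale_noise(jax.random.normal(in_key, (in_features,)))
        eps_out = _scale_noise(jax.random.normal(out_key, (self.features,)))
        eps_w = jnp.outer(eps_in, eps_out)
        eps_b = eps_out

        return x @ (mu_w + sigma_w * eps_w) + (mu_b + sigma_b * eps_b)
```

At apply time the caller would provide a fresh noise key, e.g. `module.apply(params, x, rngs={"noise": noise_key})` (and `{"params": ..., "noise": ...}` at init), so noise is resampled each forward pass; a noisy MLP torso could then just stack these layers in place of `nn.Dense`.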

Additionally, regarding "DQN with multi-step bootstrap targets (is this currently supported?)": the easiest short-term solution is to use a trajectory buffer instead of an item buffer. That makes it very easy to construct the n-step targets, and it is most likely how I would do it if I had to code it right now.
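To make that concrete, here is a hedged sketch of folding an n-step Bellman backup over one sampled trajectory slice. The shapes and the `n_step_target` helper are illustrative assumptions, not existing Stoix code; rlax also ships an `n_step_bootstrapped_returns` helper that could be used instead of hand-rolling this:

```python
import jax
import jax.numpy as jnp


def n_step_target(
    r_t: jnp.ndarray,              # rewards over the slice, shape [n]
    discount_t: jnp.ndarray,       # gamma * (1 - done) over the slice, shape [n]
    bootstrap_value: jnp.ndarray,  # e.g. max_a Q_target(s_{t+n}, a), scalar
) -> jnp.ndarray:
    """Backwards Bellman fold: G = r_0 + d_0 * (r_1 + d_1 * (... + d_{n-1} * bootstrap))."""

    def backup(carry, reward_and_discount):
        r, d = reward_and_discount
        return r + d * carry, ()

    # reverse=True folds from the end of the slice back to its first step.
    target, _ = jax.lax.scan(backup, bootstrap_value, (r_t, discount_t), reverse=True)
    return target


# vmap over a batch of slices sampled from the trajectory buffer.
batched_n_step_target = jax.vmap(n_step_target)
```

With a trajectory buffer (e.g. Flashbax's), each sampled item is already a contiguous slice of n+1 consecutive steps, so the rewards, discounts, and the bootstrap value fall straight out of the sample.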