Kaixhin / Atari

Persistent advantage learning dueling double DQN for the Arcade Learning Environment
MIT License

Hierarchical DQN #9

Closed: lake4790k closed this issue 8 years ago

lake4790k commented 8 years ago

Interesting paper http://arxiv.org/abs/1604.06057 that tackles Montezuma's Revenge, where DQN doesn't work at all because of the delayed, sparse rewards: a long strategy has to be followed without any external reward. They solve this by setting intrinsic goals, which can be defined in a general way by auto-detecting shapes and rewarding proximity to them. Their mapping from shapes to goals is handcrafted for this game, but it could probably be done in a general way across games.
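
Roughly, the two-level control loop they describe would look something like the sketch below, assuming an rlenvs-style start/step interface; `metaController`, `controller`, `detectGoals` and `reachedGoal` are just hypothetical placeholders, not anything from this repo or their code:

```lua
-- Rough sketch of the paper's two-level control loop (not repo code).
-- Assumes an rlenvs-style interface: obs = env:start();
-- reward, obs, terminal = env:step(action).
local state = env:start()
local terminal = false

while not terminal do
  -- Meta-controller picks an intrinsic goal, e.g. one of the detected shapes
  local goalState = state
  local goal = metaController:selectGoal(goalState, detectGoals(goalState))
  local extrinsicReturn = 0
  local goalReached = false

  -- Controller acts until the goal is reached (or the episode ends),
  -- learning from intrinsic reward only (reaching/approaching the goal)
  repeat
    local action = controller:selectAction(state, goal)
    local reward, nextState, term = env:step(action)
    terminal = term
    extrinsicReturn = extrinsicReturn + reward
    goalReached = reachedGoal(nextState, goal)
    controller:learn(state, goal, action, goalReached and 1 or 0, nextState)
    state = nextState
  until goalReached or terminal

  -- Meta-controller learns over goals from the accumulated external reward
  metaController:learn(goalState, goal, extrinsicReturn, state, terminal)
end
```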

Not a straightforward implementation issue though...

Kaixhin commented 8 years ago

Yep, fantastic paper, but the handcrafted intrinsic goal system is an issue. I'm sure something can be done for Atari games/Catch in general, but I would wait until a proper method has been tested.

lake4790k commented 8 years ago

Another hierarchical paper http://arxiv.org/abs/1604.07255 for Minecraft. Demonstrated on a simple setup similar to Labyrinth.

Kaixhin commented 8 years ago

This one requires learning "Deep Skill Networks" in advance, which is even more handcrafted. I'm sure we'll be seeing more hierarchical DQN papers, but until these stop being tuned very specifically towards a certain game/problem I don't think there's much point trying to add one. Hopefully this repo can be a useful base to work from though!

lake4790k commented 8 years ago

Congrats on your paper!

Kaixhin commented 8 years ago

Thanks - as I said, we'll be seeing more papers combining deep reinforcement learning and hierarchical reinforcement learning. Forking this repo to use was pretty convenient - hopefully we can speed up everyone's research with async methods 👍

mryellow commented 8 years ago

If I were to hack in a DSN-style implementation today, I'd go for a proxy environment. Since selected actions can each map to a subsequent DQN with a separate reward signal, it would make sense to implement these in the environment, where they can be passed the state whenever their action is selected. That fits with the "handcrafted" nature, and you could then train them separately from there.
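
Very roughly, and only as a sketch: a wrapper following the rlenvs-style start/step interface, where any action index beyond the real action set dispatches to a (separately trained, here entirely hypothetical) skill net. None of the names below exist in the repo:

```lua
require 'torch'
require 'nn'

-- Hypothetical proxy environment wrapping the real one (not repo code).
-- Assumes obs = env:start() and reward, obs, terminal = env:step(action).
local ProxyEnv = {}
ProxyEnv.__index = ProxyEnv

function ProxyEnv.new(realEnv, skillNets, nRealActions)
  local self = setmetatable({}, ProxyEnv)
  self.env = realEnv
  self.skillNets = skillNets        -- table of separately trained "skill" nn modules
  self.nRealActions = nRealActions  -- number of primitive actions in realEnv
  return self
end

function ProxyEnv:start()
  self.state = self.env:start()
  return self.state
end

function ProxyEnv:step(action)
  local primitive
  if action <= self.nRealActions then
    -- Primitive action chosen by the agent: pass it straight through
    primitive = action
  else
    -- "Skill" chosen: let that skill net pick the primitive action greedily
    local skill = self.skillNets[action - self.nRealActions]
    local q = skill:forward(self.state)
    local _, idx = torch.max(q, 1)
    primitive = idx[1]
  end
  local reward, observation, terminal = self.env:step(primitive)
  self.state = observation
  return reward, observation, terminal
end

return ProxyEnv
```

The agent side would then just see nRealActions + #skillNets actions and train as usual.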

Something I'm musing about, though, would be a system where each net's forward pass runs up to (but not including) the final layers that ultimately select actions, and the last FC layers of the "skill" nets are then used as input to the final DQN. My math-noob brain imagines there must be some way to concatenate the outputs from each, then pass separate rewards from the environment back through each net in a vectorised way (while sharing experience memory regardless of which "skill" is being trained). It would be interesting to see the final deciding DQN adjusting the weights in those intermediate FC layers along with each individual reward, or the gradient could be cut off at that point. This would suit domains where you're taking the "advice" of each "skill" net and deciding on an action for the best overall result given the current state as seen by those nets, mixing them rather than executing a long-running skill.
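
As a very rough Torch sketch of that wiring (the trunk sizes and number of skills are made up, and this isn't claiming to be the right way to train it):

```lua
require 'nn'

-- Hypothetical sketch: feed one state through several "skill" trunks (their
-- action output layers removed), join the FC features and let a final head
-- pick the action. Sizes are arbitrary; nothing here is repo code.
local stateSize, hiddenSize, numActions = 64, 32, 4

local function skillTrunk()
  return nn.Sequential()
    :add(nn.Linear(stateSize, hiddenSize))
    :add(nn.ReLU())
end

-- ConcatTable applies every trunk to the same input state
local trunks = nn.ConcatTable()
for i = 1, 3 do
  trunks:add(skillTrunk())
end

local model = nn.Sequential()
model:add(trunks)
model:add(nn.JoinTable(1, 1))                     -- concatenate the FC features
model:add(nn.Linear(3 * hiddenSize, numActions))  -- final deciding layer

-- Forward pass gives Q-values mixing the skills' "advice"
local q = model:forward(torch.rand(stateSize))
```

Whether gradients from the deciding head should keep flowing into the skill trunks, or be cut off there (e.g. by only updating the head's parameters), is exactly the question above.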

Something structural that might help with this in the long run is how models are defined: rlenvs are external and can be tweaked without touching your codebase, but models aren't. I've done a lot of local tweaks to models, and the whole ALE-vs-Catch param thing gets a bit messy when adding your own extra models; it's easy to diverge from your code into something less likely to merge cleanly. I keep wondering if models could be defined in JSON or something, but coding them is just so much more precise and flexible.
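
Just to illustrate (purely hypothetical, not a proposal for an exact format): a declarative spec as a Lua table, which could just as easily live in a JSON file decoded with something like lua-cjson, plus a tiny builder:

```lua
require 'nn'

-- Hypothetical declarative model spec; the same structure could be stored as
-- JSON and decoded (e.g. with lua-cjson). Not part of this repo.
local spec = {
  {layer = 'Linear', inputs = 64, outputs = 128},
  {layer = 'ReLU'},
  {layer = 'Linear', inputs = 128, outputs = 4}
}

-- Minimal builder mapping spec entries onto nn modules
local function buildModel(spec)
  local model = nn.Sequential()
  for _, l in ipairs(spec) do
    if l.layer == 'Linear' then
      model:add(nn.Linear(l.inputs, l.outputs))
    elseif l.layer == 'ReLU' then
      model:add(nn.ReLU())
    else
      error('Unknown layer type: ' .. tostring(l.layer))
    end
  end
  return model
end

local model = buildModel(spec)
```

The downside is the one I mentioned: anything non-trivial (dueling streams, shared weights) quickly wants real code again.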

Kaixhin commented 8 years ago

@mryellow If I get what you're suggesting, you'd like the two environments to be more decoupled from the code? I don't think that's going to be fully possible with the current structure, but there are probably some things that can be done. I'd wait until @lake4790k has finished merging the async branch into master, and then we can see what's possible. In fact, I'll make a new issue for it.

Going to close this issue for now with regards to the various approaches to hierarchical DQNs.

mryellow commented 8 years ago

> If I get what you're suggesting, you'd like the two environments to be more decoupled from the code?

Not so much something I'd like, but if I were to forge ahead and hack something in for my own benefit, I'd likely do it in a proxy environment without touching your codebase. Your end would just select actions, and the proxy environment would either pass them straight to the real environment or execute another NN for a "skill".
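
Wiring-wise, purely as a sketch (with a dummy environment standing in for the real one, and assuming the hypothetical ProxyEnv wrapper from my earlier comment is saved as ProxyEnv.lua):

```lua
require 'torch'
require 'nn'
local ProxyEnv = require 'ProxyEnv'  -- the hypothetical wrapper sketched above

-- Dummy stand-in for the real environment, just following the assumed
-- rlenvs-style interface
local dummyEnv = {}
function dummyEnv:start() return torch.zeros(64) end
function dummyEnv:step(action) return 0, torch.zeros(64), false end

-- One toy "skill" net mapping state features to primitive action values
local skillNets = { nn.Sequential():add(nn.Linear(64, 18)) }

local env = ProxyEnv.new(dummyEnv, skillNets, 18)  -- 18 = full ALE action set

-- Your end selects among 18 + #skillNets actions as usual; action 19 here
-- gets dispatched to skill net 1 inside the proxy
local obs = env:start()
local reward, nextObs, terminal = env:step(19)
```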