Curiosity-driven Exploration by Self-supervised Prediction

msrks commented 7 years ago

https://pathak22.github.io/noreward-rl/

In many real-world scenarios, rewards extrinsic to the agent are extremely sparse, or absent altogether. In such cases, curiosity can serve as an intrinsic reward signal to enable the agent to explore its environment and learn skills that might be useful later in its life. We formulate curiosity as the error in an agent’s ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model. Our formulation scales to high-dimensional continuous state spaces like images, bypasses the difficulties of directly predicting pixels, and, critically, ignores the aspects of the environment that cannot affect the agent. The proposed approach is evaluated in two environments: VizDoom and Super Mario Bros. Three broad settings are investigated: 1) sparse extrinsic reward, where curiosity allows for far fewer interactions with the environment to reach the goal; 2) exploration with no extrinsic reward, where curiosity pushes the agent to explore more efficiently; and 3) generalization to unseen scenarios (e.g. new levels of the same game) where the knowledge gained from earlier experience helps the agent explore new places much faster than starting from scratch.

msrks commented 7 years ago

By introducing the notion of curiosity, the paper shows that efficient exploration can be achieved even in environments that provide only sparse rewards.

The agent is trained to maximize the sum of the curiosity (intrinsic) reward and the ordinary (extrinsic) reward.
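Written out, the objective looks roughly like this. The symbols are borrowed from the linked paper rather than from this thread: $r^i_t$ is the curiosity (intrinsic) reward, $r^e_t$ is the ordinary (extrinsic) reward, and $\theta_P$ are the policy parameters.

$$
r_t = r^i_t + r^e_t, \qquad \max_{\theta_P} \; \mathbb{E}_{\pi(s_t;\,\theta_P)}\Big[\sum_t r_t\Big]
$$

In the sparse-reward setting $r^e_t$ is zero on almost every step, so $r^i_t$ is what drives exploration most of the time.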

How, then, should curiosity be defined so that it leads to efficient exploration?

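The paper divides the sources of observations into the following three cases and argues that observations arising from (1) and (2) are the ones that matter to the agent.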

let us divide all sources that can modify the agent’s observations into three cases: (1) things that can be controlled by the agent; (2) things that the agent cannot control but that can affect the agent (e.g. a vehicle driven by another agent), and (3) things out of the agent’s control and not affecting the agent (e.g. moving leaves). A good feature space for curiosity should model (1) and (2) and be unaffected by (3). This latter is because, if there is a source of variation that is inconsequential for the agent, then the agent has no incentive to know about it

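Accordingly, a "curiosity-satisfying action" is defined as an action for which it is hard to predict what observation will follow. "Observation" here means the parts covered by (1) and (2); it does not matter if the parts covered by (3) are hard to predict.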

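The ICM (Intrinsic Curiosity Module) is introduced as the module that realizes this. In the ICM on the right of the figure, (right) the features of the observation are learned so that only what is important for predicting the action is kept, and (left) actions for which those features are hard to predict are treated as curiosity-satisfying actions (they receive a large reward).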
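The two halves of the ICM described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the single-layer linear models, the helper names (`encode`, `forward_model`, `inverse_model`, `icm_step`), and the default values of `eta` and `beta` are all assumptions made for the example. The squared-error form of the curiosity reward follows the linked paper.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def one_hot(action, n_actions):
    """Encode a discrete action index as a one-hot vector."""
    return [1.0 if i == action else 0.0 for i in range(n_actions)]

def encode(obs, W_enc):
    """Feature encoder phi(s). A single tanh layer stands in for the conv net (assumption)."""
    return [math.tanh(z) for z in matvec(W_enc, obs)]

def forward_model(feat, action, W_fwd, n_actions):
    """Forward model: predict phi(s_{t+1}) from phi(s_t) and a_t."""
    return matvec(W_fwd, feat + one_hot(action, n_actions))

def inverse_model(feat, next_feat, W_inv):
    """Inverse model: predict a distribution over a_t from phi(s_t) and phi(s_{t+1})."""
    logits = matvec(W_inv, feat + next_feat)
    m = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_loss(pred_next_feat, next_feat):
    """L_F = 1/2 * ||phi_hat(s_{t+1}) - phi(s_{t+1})||^2."""
    return 0.5 * sum((p - q) ** 2 for p, q in zip(pred_next_feat, next_feat))

def icm_step(obs, action, next_obs, params, n_actions, eta=0.01, beta=0.2):
    """Return (curiosity reward, ICM training loss) for one transition.

    eta and beta are placeholder values chosen for this sketch.
    """
    feat = encode(obs, params["enc"])
    next_feat = encode(next_obs, params["enc"])
    # Forward prediction error in feature space: this is the curiosity signal.
    l_fwd = forward_loss(forward_model(feat, action, params["fwd"], n_actions), next_feat)
    # Inverse-model loss: cross-entropy of the action that was actually taken.
    # Training the encoder on this keeps only action-relevant features.
    probs = inverse_model(feat, next_feat, params["inv"])
    l_inv = -math.log(probs[action])
    r_int = eta * l_fwd
    loss = (1.0 - beta) * l_inv + beta * l_fwd
    return r_int, loss

# Toy usage: 3-dim observations, 2-dim features, 2 actions (all sizes hypothetical).
params = {
    "enc": [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]],
    "fwd": [[0.2, 0.0, 0.1, -0.1], [0.0, 0.3, -0.2, 0.1]],
    "inv": [[0.1, 0.2, -0.1, 0.0], [-0.2, 0.1, 0.3, 0.1]],
}
r_int, loss = icm_step([1.0, 0.0, -1.0], 1, [0.5, 0.5, 0.0], params, n_actions=2)
```

Because the encoder is trained only through the inverse-model loss, nothing pushes it to represent variation that carries no information about the action, which is how case (3) drops out of the feature space in this formulation.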

The reward is maximized by reinforcement learning with A3C.

msrks commented 7 years ago

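In a nutshell: taking actions that satisfy curiosity (that is, actions whose outcomes are hard to predict) makes exploration more efficient, so reflecting that in the reward gave good results. That is a fine way to think about it.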

furukawa-ai / deeplearning_papers

Curiosity-driven Exploration by Self-supervised Prediction #28