This paper was published by Ubisoft in November 2020. It applies RL to in-game navigation as a replacement for the traditional NavMesh approach and achieves better results.
[Innovation]
Navigation in Robotics and RL
-- SLAM: The classical approach to navigation in robotics is simultaneous localization and
mapping (SLAM) [Leonard and Durrant-Whyte, 1991, Durrant-Whyte and Bailey, 2006], which
builds a high-level map (usually a top-down view) of the world from experience and locates the
agent within this map. (Yu: this is also used in autonomous driving.)
Issues: Similar to the NavMesh, SLAM-based approaches struggle to integrate complex
navigation abilities into the mapping. Moreover, in video games we already have exact
localization and mapping, so there is no need to rely on an estimate.
-- Model-based RL
Like SLAM and NavMesh-based approaches, model-based RL approaches could theoretically handle
complex navigation actions and potentially be used to plan shortest paths. However, there are two
notable drawbacks to using model-based approaches for planning. First, like SLAM and NavMesh-based
approaches, they are expensive to run at inference/mapping time, since determining the best path
with complex navigation abilities requires computing many forward passes of the model.
Second, when the model is estimated from data, it tends to suffer from compounding error due
to model imperfections, which makes planning through the trained dynamics model challenging. (Yu: could the simulator itself be seen as a model?)
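The compounding-error point can be illustrated with a toy example. The dynamics below and the 1% multiplicative model bias are made-up numbers for illustration, not anything from the paper: a learned model whose one-step error is tiny still drifts far from the true trajectory when it is rolled out on its own predictions.

```python
def true_step(x):
    return 0.99 * x + 1.0          # "real" environment dynamics (illustrative)

def model_step(x):
    return 0.99 * 1.01 * x + 1.0   # learned model with a 1% multiplicative error

x_true, x_model = 0.0, 0.0
gaps = []
for t in range(100):
    x_true = true_step(x_true)
    x_model = model_step(x_model)
    gaps.append(abs(x_model - x_true))

# The one-step error is tiny, but each prediction feeds back into the model's
# next input, so the gap keeps growing: this is why long-horizon planning
# through a learned dynamics model is hard.
```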
-- Model-free RL
As navigation in a visually complex environment is usually modeled as a partially observable Markov
decision process (POMDP), the importance of using memory has been previously acknowledged. The use of auxiliary tasks to accelerate the learning of challenging goal-based RL problems has also been studied.
Approach
States: The state is composed of local perception in the form of a 3D occupancy map and a 2D
depth map, as well as scalar information about physical attributes of the agent and its goal (velocity,
relative goal position, absolute goal position, and previous action).
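As a concrete sketch, the observation above could be packaged as follows. The array shapes here are illustrative assumptions for this note, not the paper's exact dimensions:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    """Hypothetical container for the state described in the paper;
    shapes in the comments are assumptions, not the paper's values."""
    occupancy_3d: np.ndarray   # local 3D occupancy map, e.g. (16, 16, 16)
    depth_2d: np.ndarray       # local 2D depth map, e.g. (64, 64)
    velocity: np.ndarray       # agent velocity, shape (3,)
    goal_relative: np.ndarray  # goal position relative to the agent, shape (3,)
    goal_absolute: np.ndarray  # absolute goal position, shape (3,)
    prev_action: np.ndarray    # previous action vector, shape (4,)

obs = Observation(
    occupancy_3d=np.zeros((16, 16, 16)),
    depth_2d=np.zeros((64, 64)),
    velocity=np.zeros(3),
    goal_relative=np.array([5.0, 0.0, 2.0]),
    goal_absolute=np.array([100.0, 20.0, 30.0]),
    prev_action=np.zeros(4),
)
```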
Actions: The actions are continuous and correspond to jump, forward, strafe, and rotate. The jump is
treated as a continuous action on the algorithmic side and binarized in the environment.
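The continuous-to-binarized jump handling might look like the sketch below. The function name, the [-1, 1] action range, and the 0.0 threshold are assumptions; the paper does not give these details:

```python
import numpy as np

def to_env_action(raw_action):
    """Map a continuous policy output to the environment's action format.

    raw_action: 4 floats in [-1, 1] -> (jump, forward, strafe, rotate).
    The jump component stays continuous on the algorithm side; it is
    binarized here before being sent to the game. The 0.0 cutoff is an
    assumption for illustration.
    """
    jump, forward, strafe, rotate = np.clip(raw_action, -1.0, 1.0)
    return {
        "jump": jump > 0.0,        # binarized: press jump iff positive
        "forward": float(forward),
        "strafe": float(strafe),
        "rotate": float(rotate),
    }

a = to_env_action(np.array([0.7, 0.5, -0.2, 0.1]))
```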
Rewards: To avoid complications associated with long-term credit assignment when using a
sparse reward, we densify the reward signal to be

$$R_t = \max\Big(\min_{i \in [0,\, t-1]} D_i(\text{agent}, \text{goal}) - D_t(\text{agent}, \text{goal}),\; 0\Big) + \alpha + \mathbb{1}_{D_t(\text{agent},\, \text{goal}) \le \epsilon}$$

where $D_t$ is the Euclidean distance between the positions of its arguments at time $t$, $\alpha$ is a
penalty given at each step, and $\epsilon$ is the distance below which the agent is considered to have
reached its goal. Intuitively, this reward signal encourages the agent to get closer to its goal and
reach it as fast as possible.
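The reward formula can be sketched directly in code. The values of α and ε below are illustrative assumptions (the paper does not state them here), and the handling of the first step, where the min over earlier distances is empty, is my own choice:

```python
def dense_reward(dist_history, alpha=-0.01, eps=1.0):
    """Dense navigation reward; alpha (per-step penalty) and eps (goal-reached
    threshold) are illustrative values. dist_history is [D_0, ..., D_t], the
    agent-goal Euclidean distances so far. The best (minimum) distance achieved
    before time t serves as a baseline, so only *new* progress is rewarded."""
    d_t = dist_history[-1]
    best_so_far = min(dist_history[:-1]) if len(dist_history) > 1 else d_t
    progress = max(best_so_far - d_t, 0.0)   # reward only new progress
    reached = 1.0 if d_t <= eps else 0.0     # indicator bonus for reaching the goal
    return progress + alpha + reached

# Agent starts 10 m away, advances, briefly backtracks, then reaches the goal.
rewards, history = [], []
for d in [10.0, 8.0, 9.0, 3.0, 0.5]:
    history.append(d)
    rewards.append(dense_reward(history))
```

Note that backtracking (step 3 above) earns no progress reward rather than a negative one, because of the max with 0; the per-step penalty α is what still pushes the agent to be fast.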
Training procedure: Because running a game engine is costly, and in order to be more sample efficient,
we use an off-policy RL algorithm, Soft Actor-Critic [Haarnoja et al., 2018a], modified so
that the entropy coefficient is learned and there is no state value network [Haarnoja et al., 2018b].
The critic and policy networks share layers that are tasked with extracting an embedding from local
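The learned entropy coefficient mentioned above (from Haarnoja et al., 2018b) can be sketched as follows. The learning rate, the target entropy, and the fixed sampled log-probability are illustrative assumptions, and real SAC implementations update this jointly with the actor and critics:

```python
import math

# Auto-tuned SAC entropy coefficient: optimize log_alpha so the policy's
# entropy is pushed toward a target entropy (often -dim(action_space);
# here -4 for the four continuous actions: jump, forward, strafe, rotate).
log_alpha = 0.0
lr = 1e-2
target_entropy = -4.0

def alpha_loss_grad(log_alpha, log_prob):
    # Gradient w.r.t. log_alpha of  J = -exp(log_alpha) * (log_prob + target_entropy)
    return -math.exp(log_alpha) * (log_prob + target_entropy)

# With log_prob = -1, the policy's entropy (~1) already exceeds the target (-4),
# so gradient descent shrinks alpha, weakening the entropy bonus; in the
# opposite case alpha would grow, encouraging more exploration.
sampled_log_prob = -1.0
for _ in range(50):
    log_alpha -= lr * alpha_loss_grad(log_alpha, sampled_log_prob)
alpha = math.exp(log_alpha)
```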
Link: https://arxiv.org/pdf/2011.04764.pdf
![image](https://user-images.githubusercontent.com/4425199/102571616-e0bc5880-4124-11eb-9691-ec7cedc222be.png)