Components of an MDP
MDP: the next state depends only on the state and action of the current time step (the Markov property).
states => environment
actions => agent
current (state, action) => (next state, reward)
In each round, the agent applies some action (A_t) to the environment, which is in some state (S_t). After this move, the environment transitions to a new state and the agent receives a reward.
(S_t, A_t) => (S_{t+1}, R_{t+1})
Transition probabilities: the sets of states S and rewards R are finite, and the next state and reward follow some probability distribution conditioned on the current state and action.
Expected Return:
The cumulative reward from time step t to T, where T is the final time step. The agent's goal is to maximize the expected return.
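Concretely, the (undiscounted) return is:
G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T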
The agent-environment interaction naturally breaks into subsequences, which are called episodes (rounds).
Tasks:
If we use the expected return directly for a continuing task, where T is infinite, we get an infinite return. Therefore, we need to refine the definition of the return.
For continuing tasks, instead of maximizing the final expected return, we maximize the expected discounted return.
To define the discounted return, we first define the discount rate \gamma \in (0, 1].
In the discounted-return setting, the agent cares more about the current reward than about future rewards: the more immediate a reward is, the more impact it has.
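With the discount rate, the discounted return is:
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}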
When the reward at each time step is a constant 1 and \gamma is less than 1:
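In that special case the discounted return reduces to a geometric series:
G_t = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1 - \gamma}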
Policy and value functions:
Policy: the probability with which an agent selects each action in a specific state.
It is a map from a given state to (a distribution over) the possible actions. If an agent follows policy \pi at time t, then \pi(a|s) is the probability that action a is taken in state s.
Value Function: estimates the goodness of a state or a state-action pair.
Expected return: the result of executing the state-action pair (and following the policy afterwards). The value function measures the goodness of a (state, action) pair as its expected return, which in turn shapes the way the agent acts, i.e. its policy.
A value function is a function of states or of (state, action) pairs.
State-value function: tells how good a state is, measured as the expected return when starting from state s and following the policy.
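In symbols:
v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]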
Action-value function:
The Q-function, which gives Q-values. Q stands for quality: by evaluating the quality of (state, action) pairs, we can make good decisions.
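In symbols:
q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]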
Rethink our task: decide which action to take in a given state.
So at every step, we select the action with the maximum Q-value in the current state.
The Q value stands for the long-term goodness of taking action a in state s.
It can be formalized as:
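Following the definitions above, this is:
Q(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s, A_t = a \right]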
But we don't know what will happen in the future (R_{t+2} and beyond are unknown), so the equation above cannot be computed directly. That leaves us with a problem: how do we estimate the Q value?
At every step, we observe a transition <s, a, r, s'>.
The Q value is then just the immediate reward plus the (discounted) largest Q value of the next state s'.
So we get:
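That is, the Bellman equation for the optimal Q function:
Q(s, a) = r + \gamma \max_{a'} Q(s', a')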
\gamma can be manually set.
The Q value of each state-action pair can be calculated in an iterative way:
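A standard form of this iterative (Q-learning) update, with learning rate \alpha, is:
Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)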
Simple and fun. :-)
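As a concrete illustration (not from the original post), here is a minimal tabular Q-learning sketch. It assumes a small discrete environment `env` with the classic gym-style reset()/step() interface; the environment, episode count, and hyperparameters are all placeholders.

```python
import numpy as np


def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning on a small discrete environment."""
    Q = np.zeros((n_states, n_actions))  # the Q table
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (exploration vs. exploitation)
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # Iterative update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            best_next = 0.0 if done else np.max(Q[s_next])
            Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])
            s = s_next
    return Q
```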
Let's elaborate on this by talking about Breakout.
For a screen with N gray-scale pixels, there are 256^N possible states... This is far too large to build a Q table (N_states x N_actions).
Hence, we can use a DNN to represent and approximate the Q function.
Also, you may not want POOLING in RL tasks, because pooling throws away position information, which matters here.
What does our NN look like?
So, what is the loss function?
Q(s, a) is our prediction.
For each step, we get a transition: s, a, r, s'.
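Using the Bellman target from above, the squared-error loss is:
L = \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)^2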
Q: If we train our NN on transitions in the order they arrive, the NN will be overly sensitive to recent episodes (consecutive samples are strongly correlated, even with random action selection).
A: We can store past <s, a, r, s'> transitions in a replay memory, and sample older transitions to train the DNN in mini-batches.
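A minimal sketch of such a replay memory, close in spirit to the one in the PyTorch DQN tutorial (the exact class in the linked code base may differ):

```python
import random
from collections import deque, namedtuple

Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))


class ReplayMemory:
    def __init__(self, capacity):
        # Bounded buffer: the oldest transitions are dropped automatically.
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        """Save a transition <s, a, s', r>."""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        """Sample a random mini-batch of stored transitions."""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```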
Let's talk about exploration first. In Q-learning, the Q table is randomly initialized, so in the early stage the action selection is effectively pure exploration. However, once the iteration has run for a while, the amount of exploration goes down.
To tackle this issue, we can use ε-greedy exploration.
At each step, when we are about to select an action, we instead select a random (possibly non-optimal) action with probability ε.
In DeepMind's paper, ε begins at 1 and gradually decreases to 0.1.
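A sketch of what ε-greedy selection with a decaying ε can look like; the decay schedule and constants here are my assumptions, and `policy_net` is the DQN used in the training loop below:

```python
import math
import random
import torch

EPS_START, EPS_END, EPS_DECAY = 1.0, 0.1, 1000  # anneal epsilon from 1.0 down to 0.1
steps_done = 0


def select_action(state):
    global steps_done
    eps = EPS_END + (EPS_START - EPS_END) * math.exp(-steps_done / EPS_DECAY)
    steps_done += 1
    if random.random() < eps:
        # Explore: pick a random action (here: 2 actions, left / right).
        return torch.tensor([[random.randrange(2)]], dtype=torch.long)
    with torch.no_grad():
        # Exploit: pick the action with the largest predicted Q value.
        return policy_net(state).max(1)[1].view(1, 1)
```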
Finally, we have:
Based on this code base
```python
# Definition of a Q Network
import torch.nn as nn
import torch.nn.functional as F


class DQN(nn.Module):
    def __init__(self, h, w, outputs):
        super(DQN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=5, stride=2)
        self.bn1 = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5, stride=2)
        self.bn2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(32, 32, kernel_size=5, stride=2)
        self.bn3 = nn.BatchNorm2d(32)

        # Number of Linear input connections depends on output of conv2d layers
        # and therefore the input image size, so compute it.
        def conv2d_size_out(size, kernel_size=5, stride=2):
            return (size - (kernel_size - 1) - 1) // stride + 1

        convw = conv2d_size_out(conv2d_size_out(conv2d_size_out(w)))
        convh = conv2d_size_out(conv2d_size_out(conv2d_size_out(h)))
        linear_input_size = convw * convh * 32
        self.head = nn.Linear(linear_input_size, outputs)

    # Called with either one element to determine next action, or a batch
    # during optimization. Returns tensor([[left0exp,right0exp]...]).
    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        return self.head(x.view(x.shape[0], -1))
```
The head outputs the Q value for each action, and here we have 2 actions (left, right).
Let's define what the <state, action> pair is here.
The training loop:
```python
from itertools import count

# Helpers such as get_screen(), select_action(), optimize_model(), memory,
# policy_net, target_net, and TARGET_UPDATE are defined in the linked code base.
for i_episode in range(num_episodes):
    # Initialize the environment and state in the new episode.
    env.reset()
    last_screen = get_screen()
    current_screen = get_screen()
    state = current_screen - last_screen
    for t in count():
        # Select and perform an action
        action = select_action(state)
        _, reward, done, _ = env.step(action.item())
        reward = torch.tensor([reward], device=device)

        # Observe new state
        last_screen = current_screen
        current_screen = get_screen()
        if not done:
            next_state = current_screen - last_screen
        else:
            next_state = None

        # Store the transition in memory
        memory.push(state, action, next_state, reward)

        # Move to the next state
        state = next_state

        # Perform one step of the optimization (on the policy network)
        optimize_model()
        if done:
            episode_durations.append(t + 1)
            plot_durations()
            break

    # Update the target network, copying all weights and biases in DQN
    if i_episode % TARGET_UPDATE == 0:
        target_net.load_state_dict(policy_net.state_dict())
```
Why soft update:
https://greentec.github.io/reinforcement-learning-third-en/#soft-update-target-network
Q(s, a) \leftarrow Q(s, a) + \alpha \left( R + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)
In the original setting, the Q value tends to grow very high (overestimation), because the target always takes max Q, which is biased upward.
Hence, we can use two DQNs that select and evaluate Q values for each other (Double DQN).
Along with the soft update, we can have the following architecture.
https://greentec.github.io/reinforcement-learning-third-en/#double-dqn
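A minimal sketch of the soft (Polyak) target-network update, as an alternative to the hard copy done every TARGET_UPDATE episodes above; the value of TAU is an assumption:

```python
TAU = 0.005  # interpolation factor; an assumed value


def soft_update(target_net, policy_net, tau=TAU):
    # target_param <- tau * policy_param + (1 - tau) * target_param
    for t_param, p_param in zip(target_net.parameters(), policy_net.parameters()):
        t_param.data.copy_(tau * p_param.data + (1.0 - tau) * t_param.data)
```

With two networks, the Double DQN target lets the policy network choose the next action and the target network evaluate it, which reduces the overestimation described above.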
I just found this video series which seems to have very intuitive instructions for RL. As I am not able to take much time out of my working schedule, I believe this might be a good way to quickly learn the BIG idea of RL in several minutes every day.
https://deeplizard.com/learn/playlist/PLZbbT5o_s2xoWNVdDudn51XM8lOuZ_Njv