ganler / ResearchReading

Reading notes on general systems research material (not limited to papers).

[CrashCourse] DEEPLIZARD.RL #17

Closed · ganler closed this issue 4 years ago

ganler commented 4 years ago

I just found this video series, which seems to give very intuitive explanations of RL. Since I cannot take much time out of my working schedule, I believe this is a good way to quickly pick up the BIG ideas of RL in a few minutes every day.

https://deeplizard.com/learn/playlist/PLZbbT5o_s2xoWNVdDudn51XM8lOuZ_Njv

ganler commented 4 years ago

Components of an MDP

MDP: the next state depends only on the current state and action (the Markov property).

states => environment
actions => agent
current (state, action) => (next state, reward)

(figure: agent-environment interaction loop)

In each step, the agent applies some action (A_t) to the environment, which is in some state (S_t). After this move, the environment transitions to a new state and the agent receives a reward.

(S_t, A_t) => (S_{t+1}, R_{t+1})

Transition probabilities: the state set S and reward set R are finite, and the next state and reward follow a well-defined probability distribution given the current state and action.

p(s′, r | s, a) = Pr{ S_{t+1} = s′, R_{t+1} = r | S_t = s, A_t = a }
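As a concrete toy example (not from the video; the states, actions, and numbers are made up), a finite MDP's dynamics can be written as a table mapping each (state, action) pair to a distribution over (next state, reward):

import random

# A toy, hand-made MDP: dynamics[(s, a)] lists (probability, next_state, reward).
dynamics = {
    ("cold", "heat"): [(0.9, "warm", +1.0), (0.1, "cold", -1.0)],
    ("cold", "wait"): [(1.0, "cold", -1.0)],
    ("warm", "heat"): [(0.8, "hot", -5.0), (0.2, "warm", +1.0)],
    ("warm", "wait"): [(1.0, "warm", +1.0)],
}

def sample_step(state, action, rng=random.Random(0)):
    """Sample (next_state, reward) from p(s', r | s, a)."""
    entries = dynamics[(state, action)]
    probs = [p for p, _, _ in entries]
    outcomes = [(s2, r) for _, s2, r in entries]
    return rng.choices(outcomes, weights=probs, k=1)[0]

print(sample_step("cold", "heat"))  # e.g. ('warm', 1.0)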

ganler commented 4 years ago

Expected Return:

The cumulative reward from t to T, where T is the final time step. The agent's goal is to maximize the expected return.

The agent-environment interaction naturally breaks into subsequences called episodes (rounds).

Tasks:

If we use the expected return directly for continuing tasks, where T is infinite, the return itself can be infinite. Therefore, we need to refine the definition of the expected return.

For continuing tasks, instead of maximizing the plain expected return, we maximize the expected discounted return.

To define the discounted return, we first define the discount rate γ ∈ (0, 1].

G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … = Σ_{k=0}^{∞} γ^k·R_{t+k+1}

In the discounted setting, the agent cares more about immediate rewards than about future rewards: the more immediate a reward is, the more impact it has.

When the reward at each time step is a constant 1 and γ is less than 1:

G_t = Σ_{k=0}^{∞} γ^k = 1 / (1 − γ)
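A quick numerical sanity check (a small sketch, not part of the course): with a constant reward of 1, the discounted return converges to 1 / (1 − γ):

def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1}."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

gamma = 0.9
approx = discounted_return([1.0] * 1000, gamma)  # long but finite horizon
print(approx, 1.0 / (1.0 - gamma))               # both are ~10.0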

ganler commented 4 years ago

Policy and value functions:

Policy: the probability that an agent selects each possible action in a given state.

It maps a given state to a distribution over possible actions. If an agent follows policy π at time t, then π(a|s) is the probability that action a is taken in state s.

Value Function: estimates how good a state, or a (state, action) pair, is.

The goodness is measured in terms of expected return, i.e. the reward the agent expects to accumulate from that point on. Since the return depends on how the agent acts, value functions are defined with respect to a policy: value function -> (state, action) -> expected return -> the way the agent acts -> policy.

Value Function is a function of states or of (state, action) pairs.

State-value functions tell how good a state is: the expected return when starting from state s and following policy π.

v_π(s) = E_π[ G_t | S_t = s ]

Action-value function:

q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]

The Q-function yields the Q-value.

Q stands for quality: by evaluating the quality of state-action pairs, we can make good decisions.
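As a small tabular sketch (with made-up numbers), the state-value is just the policy-weighted average of the action-values, i.e. v_π(s) = Σ_a π(a|s)·q_π(s, a):

# Hypothetical tabular policy and Q values for one state "s0" with two actions.
policy = {"s0": {"left": 0.3, "right": 0.7}}      # pi(a|s)
q = {("s0", "left"): 1.0, ("s0", "right"): 2.0}   # q_pi(s, a)

def state_value(s):
    """v_pi(s) = sum over a of pi(a|s) * q_pi(s, a)."""
    return sum(p * q[(s, a)] for a, p in policy[s].items())

print(state_value("s0"))  # 0.3*1.0 + 0.7*2.0 = 1.7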

ganler commented 4 years ago

Q Learning

Rethink our task: decide which action to take in a given state.

So at every step, we select the action with the maximum Q-value in the current state.

The Q value stands for the long-term "goodness" of taking action a in state s.

It can be formalized by:

Q(s, a) = E[ R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … | S_t = s, A_t = a ]

Bellman Equation

Since we don't know what will happen in the future (R_{t+2} and beyond are unknown), the equation shown above is not directly usable... So we have a problem: how do we estimate the Q value?

At every step, we observe a transition (s, a, r, s′).

The Q value is then just the immediate reward plus the (discounted) largest Q value of the next state s′.

So we get:

Q(s, a) = r + γ·max_{a′} Q(s′, a′)

γ (the discount rate) can be set manually.

The Q value of each state-action pair can be calculated in an iterative way:

Q(s, a) ← Q(s, a) + α·(r + γ·max_{a′} Q(s′, a′) − Q(s, a))

Simple and fun. :-)
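Here is a minimal tabular sketch of that iterative update. It assumes discrete, hashable states and the classic Gym-style API (reset() returns a state, step() returns (state, reward, done, info)) used later in this thread; all hyperparameter values are arbitrary:

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)                    # Q[(state, action)], defaults to 0
    actions = list(range(env.action_space.n))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy selection (exploration is discussed further below)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done, _ = env.step(a)
            target = r if done else r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q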

ganler commented 4 years ago

Deep Q Network (DQN)

Let's elaborate on this by talking about "breakout".

(figure: Atari Breakout screen)

For a screen with N gray-scale pixels, we have 256^N states... This is far too large to build a Q table (|states| × |actions|).

Hence, we can use a DNN to approximate the Q function.

Also, you may not want pooling layers in RL tasks, because pooling throws away positional information.

What does our NN look like? There are two options:

  1. Input@{State, Action} => Q value
  2. Input@{State} => Q values, one for each action

(figure: the two network designs described above)

So, what is the loss function?

L = [ r + γ·max_{a′} Q(s′, a′) − Q(s, a) ]²

Q(s, a) is our prediction.

For each step, we get a transition: s, a, r, s'

  1. NN forward(s) => Q(s, a)
  2. NN forward(s') => the maximum Q value of the next state, max_{a′} Q(s′, a′)
  3. Update the state & execute the action.
  4. Backpropagate the loss (a short sketch follows below).
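A rough PyTorch sketch of steps 1, 2, and 4 for a single transition, assuming a network that follows option 2 above (state in, one Q value per action out). The function name and tensor shapes are illustrative:

import torch
import torch.nn.functional as F

def dqn_loss(dqn, s, a, r, s2, gamma=0.99):
    """TD loss for one transition: (r + gamma * max_a' Q(s', a') - Q(s, a))^2.

    s, s2: state tensors with batch dimension 1; a: int action index; r: float reward.
    """
    q_sa = dqn(s)[0, a]                    # 1. forward(s)  -> Q(s, a)
    with torch.no_grad():                  # 2. forward(s') -> max_a' Q(s', a'), no gradient
        q_next = dqn(s2).max(dim=1).values[0]
    target = r + gamma * q_next
    return F.mse_loss(q_sa, target)        # 4. call .backward() on this to backpropagate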
ganler commented 4 years ago

Explore-exploit dilemma

Experience replay

Q: If we train our NN on transitions in their original order, the NN becomes biased toward recent episodes (consecutive samples are highly correlated). A: We can store past <s, a, r, s'> transitions in a replay memory, and sample random past transitions to train the DNN in mini-batches.
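A minimal replay-memory sketch: a bounded buffer of transitions plus uniform random sampling. It mirrors the idea (and the push(state, action, next_state, reward) call in the training loop later), not any particular library's API:

import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ("state", "action", "next_state", "reward"))

class ReplayMemory:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall out automatically

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size):
        # uniform sampling breaks the correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)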

ε-greedy exploration

Let's talk about exploration first. In Q-learning, the Q table is randomly initialized, so in the early stage action selection is essentially pure "exploration". However, after the iterations have run for a while, exploration goes down because the agent keeps exploiting the values it has already learned.

To tackle this issue, we can use ε-greedy exploration.

At each step, when we are about to select an action, we have probability ε of selecting a random (possibly non-optimal) action instead of the greedy one.

In DeepMind's paper, ε begins at 1 and gradually decreases to 0.1.
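A sketch of ε-greedy selection with a decaying ε. The 1 → 0.1 range mirrors the paper, but the linear schedule, the 100k-step horizon, and the function names are just illustrative (this is not the select_action helper used in the code below):

import random

def epsilon_by_step(step, eps_start=1.0, eps_end=0.1, decay_steps=100_000):
    """Linearly anneal epsilon from eps_start down to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, step):
    """q_values: list of Q(s, a) for every action a in the current state s."""
    if random.random() < epsilon_by_step(step):
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit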

Finally, we have:

(figure: deep Q-learning with experience replay and ε-greedy exploration)

ganler commented 4 years ago

Some code stuff... CartPole

Based on this code base

Q Network Def

# Definition of a Q Network
import torch
import torch.nn as nn
import torch.nn.functional as F

class DQN(nn.Module):

    def __init__(self, h, w, outputs):
        super(DQN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=5, stride=2)
        self.bn1 = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5, stride=2)
        self.bn2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(32, 32, kernel_size=5, stride=2)
        self.bn3 = nn.BatchNorm2d(32)

        # Number of Linear input connections depends on output of conv2d layers
        # and therefore the input image size, so compute it.
        def conv2d_size_out(size, kernel_size=5, stride=2):
            return (size - (kernel_size - 1) - 1) // stride  + 1

        convw = conv2d_size_out(conv2d_size_out(conv2d_size_out(w)))
        convh = conv2d_size_out(conv2d_size_out(conv2d_size_out(h)))

        linear_input_size = convw * convh * 32
        self.head = nn.Linear(linear_input_size, outputs)

    # Called with either one element to determine next action, or a batch
    # during optimization. Returns tensor([[left0exp,right0exp]...]).
    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        return self.head(x.view(x.shape[0], -1))

The head outputs one Q value per action, and here we have 2 actions (left, right).

Structure

(figure: model structure)

Let's define what <state, action> is here: the state is the difference between two consecutive rendered screens, and the action pushes the cart left or right.

The training loop:

from itertools import count  # per-step counter for the inner loop

# The helpers used below (get_screen, select_action, memory, optimize_model,
# plot_durations) come from the rest of the linked code base.
for i_episode in range(num_episodes):
    # Initialize the environment and state in the new episode.
    env.reset()
    last_screen = get_screen()
    current_screen = get_screen()
    state = current_screen - last_screen
    for t in count():
        # Select and perform an action
        action = select_action(state)
        _, reward, done, _ = env.step(action.item())
        reward = torch.tensor([reward], device=device)

        # Observe new state
        last_screen = current_screen
        current_screen = get_screen()
        if not done:
            next_state = current_screen - last_screen
        else:
            next_state = None

        # Store the transition in memory
        memory.push(state, action, next_state, reward)

        # Move to the next state
        state = next_state

        # Perform one step of the optimization (on the target network)
        optimize_model()
        if done:
            episode_durations.append(t + 1)
            plot_durations()
            break
    # Update the target network, copying all weights and biases in DQN
    if i_episode % TARGET_UPDATE == 0:
        target_net.load_state_dict(policy_net.state_dict())
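For reference, a simplified sketch of what an optimize_model() step can look like in this setup: a batched version of the loss above, with the bootstrap term computed by target_net. The real function in the linked code base may differ in details (loss choice, gradient clipping, device handling); the Transition fields follow the replay-memory sketch earlier, and the constants are illustrative:

import torch
import torch.nn.functional as F

BATCH_SIZE, GAMMA = 128, 0.999  # illustrative values

def optimize_model_sketch(policy_net, target_net, memory, optimizer):
    if len(memory) < BATCH_SIZE:
        return
    batch = memory.sample(BATCH_SIZE)
    # Terminal transitions have next_state == None; their target is just the reward.
    non_final_mask = torch.tensor([t.next_state is not None for t in batch])
    non_final_next = torch.cat([t.next_state for t in batch if t.next_state is not None])
    states = torch.cat([t.state for t in batch])
    actions = torch.cat([t.action for t in batch])   # shape [B, 1], long
    rewards = torch.cat([t.reward for t in batch])   # shape [B]

    q_sa = policy_net(states).gather(1, actions).squeeze(1)  # Q(s, a) of the taken actions
    next_q = torch.zeros(BATCH_SIZE)
    with torch.no_grad():
        next_q[non_final_mask] = target_net(non_final_next).max(1).values
    target = rewards + GAMMA * next_q

    loss = F.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()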
ganler commented 4 years ago

Improvements to DQN

Target Network (soft update)

(figure: target-network soft update)

Why soft update:

https://greentec.github.io/reinforcement-learning-third-en/#soft-update-target-network
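A sketch contrasting the two update styles for the target network: the hard update is what the training loop above does every TARGET_UPDATE episodes, while the soft update blends in a small fraction of the policy weights at every step (the TAU value is illustrative):

TAU = 0.01  # soft-update coefficient (illustrative)

def hard_update(target_net, policy_net):
    # copy everything at once, every N episodes
    target_net.load_state_dict(policy_net.state_dict())

def soft_update(target_net, policy_net, tau=TAU):
    # target <- tau * policy + (1 - tau) * target, applied every step
    for t_param, p_param in zip(target_net.parameters(), policy_net.parameters()):
        t_param.data.copy_(tau * p_param.data + (1.0 - tau) * t_param.data)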

Double DQN

Q(s, a) = Q(s, a) + α·(R + γ·max_{a′} Q(s′, a′) − Q(s, a))

In the original setting, the Q value tends to grow too high (overestimation), because the max operator always picks the largest, possibly noisy, Q value.

Hence, we can use two DQNs: one network selects the action and the other evaluates its Q value.

Along with the soft update, we get the following architecture.

(figure: Double DQN + soft-update architecture)

https://greentec.github.io/reinforcement-learning-third-en/#double-dqn

(figure: Double DQN illustration from the linked article)
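A sketch of the Double DQN target computation: the online (policy) network selects the argmax action, and the target network evaluates it. Names and shapes are illustrative:

import torch

def double_dqn_target(policy_net, target_net, rewards, next_states, gamma=0.99):
    """target = r + gamma * Q_target(s', argmax_a' Q_policy(s', a'))."""
    with torch.no_grad():
        best_actions = policy_net(next_states).argmax(dim=1, keepdim=True)   # selection
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluation
    return rewards + gamma * next_q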