Components of an MDP
MDP: the next state depends only on the state and action of the current time step (the Markov property).
states => environment
actions => agent
current (state, action) => (next state, reward)
In each round, the agent applies some action (A_t) to the environment, which is in some state (S_t). After this move, the environment transitions to a new state and the agent receives a reward.
(S_t, A_t) => (S_{t+1}, R_{t+1})
Transition probabilities: the sets of states S and rewards R are finite, and the next state and reward follow some probability distribution conditioned on the current state and action.
Expected Return:
The cumulative reward from time step t to T, where T is the final time step. The agent's goal is to maximize the expected return.
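Concretely, the (undiscounted) return is:
G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T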
The agent-environment interaction naturally breaks into subsequences, which are called episodes (rounds).
Tasks:
If we use the expected return directly for a continuing task, where T is infinite, we get an infinite return. Therefore, we need to refine the definition of the return.
For continuing tasks, instead of maximizing the final expected return, we maximize the expected discounted return.
To define the discounted return, we first define the discount rate \gamma \in (0, 1].
In the discounted-return setting, the agent cares more about the current reward than about future rewards: the more immediate a reward is, the more impact it has.
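With the discount rate, the discounted return is:
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}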
When the reward at each time step is a constant 1 and \gamma is less than 1:
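In that special case the discounted return reduces to a geometric series:
G_t = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1 - \gamma}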
Policy and value functions:
Policy: the probability with which an agent selects each action in a specific state.
It is a map from a given state to (a distribution over) the possible actions. If an agent follows policy \pi at time t, then \pi(a|s) is the probability that action a is taken in state s.
Value Function: estimates the goodness of a state or a state-action pair.
Expected return: the result of executing the state-action pair (and following the policy afterwards). The value function measures the goodness of a (state, action) pair as its expected return, which in turn shapes the way the agent acts, i.e. its policy.
A value function is a function of states or of (state, action) pairs.
State-value function: tells how good a state is, measured as the expected return when starting from state s and following the policy.
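In symbols:
v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]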
Action-value function:
The Q-function, which gives Q-values. Q stands for quality: by evaluating the quality of (state, action) pairs, we can make good decisions.
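In symbols:
q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]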
Rethink our task: decide which action to take in a given state.
So at every step, we select the action with the maximum Q-value in the current state.
The Q value stands for the long-term goodness of taking action a in state s.
It can be formalized as:
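Following the definitions above, this is:
Q(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s, A_t = a \right]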
But we don't know what will happen in the future (R_{t+2} and beyond are unknown), so the equation above cannot be computed directly. That leaves us with a problem: how do we estimate the Q value?
At every step, we observe a transition <s, a, r, s'>.
The Q value is then just the immediate reward plus the (discounted) largest Q value of the next state s'.
So we get:
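That is, the Bellman equation for the optimal Q function:
Q(s, a) = r + \gamma \max_{a'} Q(s', a')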
\gamma can be manually set.
The Q value of each state-action pair can be calculated in an iterative way:
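A standard form of this iterative (Q-learning) update, with learning rate \alpha, is:
Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)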
Simple and fun. :-)
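As a concrete illustration (not from the original post), here is a minimal tabular Q-learning sketch. It assumes a small discrete environment `env` with the classic gym-style reset()/step() interface; the environment, episode count, and hyperparameters are all placeholders.

```python
import numpy as np


def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning on a small discrete environment."""
    Q = np.zeros((n_states, n_actions))  # the Q table
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (exploration vs. exploitation)
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # Iterative update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            best_next = 0.0 if done else np.max(Q[s_next])
            Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])
            s = s_next
    return Q
```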
Let's elaborate on this by talking about Breakout.
For a screen with N gray-scale pixels, there are 256^N possible states... This is far too large to build a Q table (N_states x N_actions).
Hence, we can use a DNN to represent and approximate the Q function.
Also, you may not want POOLING in RL tasks, because pooling throws away position information, which matters here.
What does our NN look like?
So, what is the loss function?
Q(s, a) is our prediction.
For each step, we get a transition: s, a, r, s'.
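Using the Bellman target from above, the squared-error loss is:
L = \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)^2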
Q: If we train our NN on transitions in the order they arrive, the NN will be overly sensitive to recent episodes (consecutive samples are strongly correlated, even with random action selection).
A: We can store past <s, a, r, s'> transitions in a replay memory, and sample older transitions to train the DNN in mini-batches.
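A minimal sketch of such a replay memory, close in spirit to the one in the PyTorch DQN tutorial (the exact class in the linked code base may differ):

```python
import random
from collections import deque, namedtuple

Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))


class ReplayMemory:
    def __init__(self, capacity):
        # Bounded buffer: the oldest transitions are dropped automatically.
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        """Save a transition <s, a, s', r>."""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        """Sample a random mini-batch of stored transitions."""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```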
Let's talk about exploration first. In Q-learning, the Q table is randomly initialized, so in the early stage the action selection is effectively pure exploration. However, once the iteration has run for a while, the amount of exploration goes down.
To tackle this issue, we can use ε-greedy exploration.
At each step, when we are about to select an action, we instead select a random (possibly non-optimal) action with probability ε.
In DeepMind's paper, ε begins at 1 and gradually decreases to 0.1.
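A sketch of what ε-greedy selection with a decaying ε can look like; the decay schedule and constants here are my assumptions, and `policy_net` is the DQN used in the training loop below:

```python
import math
import random
import torch

EPS_START, EPS_END, EPS_DECAY = 1.0, 0.1, 1000  # anneal epsilon from 1.0 down to 0.1
steps_done = 0


def select_action(state):
    global steps_done
    eps = EPS_END + (EPS_START - EPS_END) * math.exp(-steps_done / EPS_DECAY)
    steps_done += 1
    if random.random() < eps:
        # Explore: pick a random action (here: 2 actions, left / right).
        return torch.tensor([[random.randrange(2)]], dtype=torch.long)
    with torch.no_grad():
        # Exploit: pick the action with the largest predicted Q value.
        return policy_net(state).max(1)[1].view(1, 1)
```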
Finally, we have:
Based on this code base
```python
# Definition of a Q Network
import torch.nn as nn
import torch.nn.functional as F


class DQN(nn.Module):
    def __init__(self, h, w, outputs):
        super(DQN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=5, stride=2)
        self.bn1 = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5, stride=2)
        self.bn2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(32, 32, kernel_size=5, stride=2)
        self.bn3 = nn.BatchNorm2d(32)

        # Number of Linear input connections depends on output of conv2d layers
        # and therefore the input image size, so compute it.
        def conv2d_size_out(size, kernel_size=5, stride=2):
            return (size - (kernel_size - 1) - 1) // stride + 1

        convw = conv2d_size_out(conv2d_size_out(conv2d_size_out(w)))
        convh = conv2d_size_out(conv2d_size_out(conv2d_size_out(h)))
        linear_input_size = convw * convh * 32
        self.head = nn.Linear(linear_input_size, outputs)

    # Called with either one element to determine next action, or a batch
    # during optimization. Returns tensor([[left0exp,right0exp]...]).
    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        return self.head(x.view(x.shape[0], -1))
```
The head outputs the Q value for each action, and here we have 2 actions (left, right).
Let's define what the <state, action> pair is here.
The training loop:
```python
from itertools import count

# Helpers such as get_screen(), select_action(), optimize_model(), memory,
# policy_net, target_net, and TARGET_UPDATE are defined in the linked code base.
for i_episode in range(num_episodes):
    # Initialize the environment and state in the new episode.
    env.reset()
    last_screen = get_screen()
    current_screen = get_screen()
    state = current_screen - last_screen
    for t in count():
        # Select and perform an action
        action = select_action(state)
        _, reward, done, _ = env.step(action.item())
        reward = torch.tensor([reward], device=device)

        # Observe new state
        last_screen = current_screen
        current_screen = get_screen()
        if not done:
            next_state = current_screen - last_screen
        else:
            next_state = None

        # Store the transition in memory
        memory.push(state, action, next_state, reward)

        # Move to the next state
        state = next_state

        # Perform one step of the optimization (on the policy network)
        optimize_model()
        if done:
            episode_durations.append(t + 1)
            plot_durations()
            break

    # Update the target network, copying all weights and biases in DQN
    if i_episode % TARGET_UPDATE == 0:
        target_net.load_state_dict(policy_net.state_dict())
```
Why soft update:
https://greentec.github.io/reinforcement-learning-third-en/#soft-update-target-network
Q(s, a) \leftarrow Q(s, a) + \alpha \left( R + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)
In the original setting, the Q value tends to grow very high (overestimation), because the target always takes max Q, which is biased upward.
Hence, we can use two DQNs that select and evaluate Q values for each other (Double DQN).
Along with the soft update, we can have the following architecture.
https://greentec.github.io/reinforcement-learning-third-en/#double-dqn
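A minimal sketch of the soft (Polyak) target-network update, as an alternative to the hard copy done every TARGET_UPDATE episodes above; the value of TAU is an assumption:

```python
TAU = 0.005  # interpolation factor; an assumed value


def soft_update(target_net, policy_net, tau=TAU):
    # target_param <- tau * policy_param + (1 - tau) * target_param
    for t_param, p_param in zip(target_net.parameters(), policy_net.parameters()):
        t_param.data.copy_(tau * p_param.data + (1.0 - tau) * t_param.data)
```

With two networks, the Double DQN target lets the policy network choose the next action and the target network evaluate it, which reduces the overestimation described above.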
I just found this video series which seems to have very intuitive instructions for RL. As I am not able to take much time out of my working schedule, I believe this might be a good way to quickly learn the BIG idea of RL in several minutes every day.
https://deeplizard.com/learn/playlist/PLZbbT5o_s2xoWNVdDudn51XM8lOuZ_Njv