Vectorizing Policy Evaluation

hsz1992 commented 6 years ago

import numpy as np

def policy_eval(policy, env, discount_factor=1.0, theta=0.00001):
    """Evaluate a policy with the vectorized update v <- R + discount_factor * P v."""
    v = np.zeros(shape=(env.nS, 1))      # value vector indexed by state
    R = np.zeros(shape=(env.nS, 1))      # expected immediate reward per state under the policy
    P = np.zeros(shape=(env.nS, env.nS)) # state transition matrix (from, to) under the policy

    # Construct R and P by averaging over the policy's actions and the env's transitions
    for s in range(env.nS):
        for a, action_prob in enumerate(policy[s]):
            for prob, next_state, reward, done in env.P[s][a]:
                R[s] += action_prob * prob * reward
                P[s][next_state] += action_prob * prob

    # Iterate until the largest change in any state value falls below theta
    while True:
        v_prev = v
        v = R + discount_factor * np.dot(P, v)
        if np.max(np.abs(v - v_prev)) < theta:
            break
    return np.squeeze(v)

random_policy = np.ones([env.nS, env.nA]) / env.nA
%timeit v = policy_eval(random_policy, env)
# Output: 2.17 ms ± 62 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
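
For anyone who wants to sanity-check the vectorized update outside the notebook, here is a minimal sketch that compares it against a plain loop-based evaluation. It assumes only that the environment exposes `nS`, `nA`, and `P[s][a]` as a list of `(prob, next_state, reward, done)` tuples, as the code above does. The three-state `toy_env` and the helpers `build_model`, `policy_eval_vectorized`, and `policy_eval_loop` are made up for illustration and are not part of the repository.

```python
import numpy as np
from types import SimpleNamespace


def build_model(policy, env):
    """Collapse env.P and the policy into R (nS,) and P (nS, nS)."""
    R = np.zeros(env.nS)
    P = np.zeros((env.nS, env.nS))
    for s in range(env.nS):
        for a, action_prob in enumerate(policy[s]):
            for prob, next_state, reward, _done in env.P[s][a]:
                R[s] += action_prob * prob * reward
                P[s, next_state] += action_prob * prob
    return R, P


def policy_eval_vectorized(policy, env, discount_factor=1.0, theta=1e-5):
    """Synchronous sweeps: every state is updated from the previous v."""
    R, P = build_model(policy, env)
    v = np.zeros(env.nS)
    while True:
        v_new = R + discount_factor * P @ v
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new


def policy_eval_loop(policy, env, discount_factor=1.0, theta=1e-5):
    """Loop-based reference: in-place sweeps, one state at a time."""
    V = np.zeros(env.nS)
    while True:
        delta = 0.0
        for s in range(env.nS):
            new_v = 0.0
            for a, action_prob in enumerate(policy[s]):
                for prob, next_state, reward, _done in env.P[s][a]:
                    new_v += action_prob * prob * (
                        reward + discount_factor * V[next_state]
                    )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V


# Hypothetical toy environment: a 3-state chain where state 2 is terminal.
# Action 0 moves left (or stays in state 0), action 1 moves right.
# Every step from a non-terminal state costs -1; the terminal state is absorbing.
toy_env = SimpleNamespace(
    nS=3,
    nA=2,
    P={
        0: {0: [(1.0, 0, -1.0, False)], 1: [(1.0, 1, -1.0, False)]},
        1: {0: [(1.0, 0, -1.0, False)], 1: [(1.0, 2, -1.0, True)]},
        2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
    },
)

random_policy = np.ones([toy_env.nS, toy_env.nA]) / toy_env.nA
v_vec = policy_eval_vectorized(random_policy, toy_env)
v_loop = policy_eval_loop(random_policy, toy_env)

# Solving the Bellman equations by hand for this chain gives [-6, -4, 0];
# both estimates should land close to that, up to the theta stopping tolerance.
print(v_vec)
print(v_loop)
```

The two versions sweep differently (the matrix form updates all states from the previous vector, the loop updates in place), so they take different paths but should agree on the fixed point to within the stopping tolerance.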

jonahweissman commented 6 years ago

It looks like this belongs in a pull request.

dennybritz commented 6 years ago

Thanks for the code! I deliberately left the implementation unvectorized because this repository is meant as a learning tool, and I think the non-vectorized code is a bit more intuitive.

iSarCasm commented 6 years ago

@zuanzuan1992 just what I was looking for!

dennybritz / reinforcement-learning

Vectorizing Policy Evaluation #150