Well, I had a shot at it. It works, with the caveat that I can't get it to converge reliably using the learning rate given in the book. I've found $\alpha^\theta = 2^{-9}$ to be too large, which leads it to diverge about half of the time.
$\alpha^\theta = 2^{-11}$ works well.
begin
    # Assumes the notebook's imports: ReinforcementLearning (RLBase), Flux (softmax, optimisers),
    # and LinearAlgebra (dot).
    Base.@kwdef struct LinearPreferenceBaselineApproximator{F,O,App<:AbstractApproximator} <: AbstractApproximator
        weight::Vector{Float64}        # θ, the policy parameters
        feature_func::F                # x(s, a), feature vector for a state-action pair
        actions::Int                   # number of discrete actions
        opt::O                         # optimiser for the policy parameters
        state_value_approximator::App  # the baseline v̂(s, w)
    end

    # π(·|s, θ): softmax over the linear action preferences h(s, a, θ) = θ ⋅ x(s, a)
    function (A::LinearPreferenceBaselineApproximator)(s)
        h = [dot(A.weight, A.feature_func(s, a)) for a in 1:A.actions]
        softmax(h)
    end

    # REINFORCE-with-baseline update for the pair (s, a) given the return Δ
    function RLBase.update!(A::LinearPreferenceBaselineApproximator, correction::Pair)
        (s, a), Δ = correction
        δ = Δ - A.state_value_approximator(s)          # δ = G - v̂(s, w)
        update!(A.state_value_approximator, s => -δ)   # baseline update (approximators here descend the passed error)
        w, x = A.weight, A.feature_func
        # ∇ln π(a|s, θ) = x(s, a) - Σ_b π(b|s, θ) x(s, b); Flux optimisers descend, hence the leading -δ
        w̄ = -δ .* (x(s, a) .- sum(A(s) .* [x(s, b) for b in 1:A.actions]))
        Flux.Optimise.update!(A.opt, w, w̄)
    end
end
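For reference, the quantity in `w̄` is the eligibility vector for a softmax over linear action preferences, $\nabla \ln \pi(a|s,\theta) = x(s,a) - \sum_b \pi(b|s,\theta)\,x(s,b)$, as derived in Section 13.3 of the book. Below is a minimal usage sketch for the short-corridor task; the feature function, the zero initialisation, and the baseline `V` are illustrative assumptions rather than the notebook's actual setup (`V` stands for any state-value `AbstractApproximator`):

# Illustrative sketch only. Features as in Example 13.1: x(s, right) = [1, 0] and
# x(s, left) = [0, 1] for every state; `V` is an assumed state-value baseline.
x(s, a) = a == 1 ? [1.0, 0.0] : [0.0, 1.0]

policy = LinearPreferenceBaselineApproximator(
    weight = zeros(2),        # θ initialised to zero ⇒ π = [0.5, 0.5] everywhere
    feature_func = x,
    actions = 2,
    opt = Descent(2.0^-11),   # Flux's plain SGD with α^θ = 2^-11, the step size that converged reliably here
    state_value_approximator = V,
)

policy(1)                     # action probabilities π(·|s = 1, θ)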
This suffers from the same problem as Figure 13.1: the algorithm often diverges when using the learning rate given in the book.
I took a look at the original implementation, and I think the absence of this line is probably the culprit:
while (s is not None) and (t < max_timesteps):
This improvement should apply to many of the earlier graphs as well.
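In Julia terms, the fix is just the same guard on the episode loop. Here is a minimal sketch, assuming an RLBase-style environment; the names `generate_episode`, `policy`, and `max_timesteps` are illustrative, not taken from the notebook:

# Illustrative: generate one episode but stop after max_timesteps steps,
# mirroring the `(t < max_timesteps)` guard in the reference implementation.
function generate_episode(policy, env; max_timesteps = 1_000)
    episode = Tuple{Any,Any,Float64}[]   # (s, a, r) triples
    reset!(env)
    t = 0
    while !is_terminated(env) && t < max_timesteps
        s = state(env)
        a = policy(s)        # e.g. sample an action from the softmax probabilities above
        env(a)               # step the environment
        push!(episode, (s, a, reward(env)))
        t += 1
    end
    return episode
end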
Those who are interested in this issue may also take a look at the code here:
Another hidden gotcha: the original implementation starts learning from the end of the episode, which is inconsistent with the pseudocode provided in the book.
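Concretely, the book's REINFORCE-with-baseline pseudocode loops over $t = 0, 1, \ldots, T-1$ in forward order, whereas the reference implementation walks the episode backwards. A minimal sketch of the forward order, with $\gamma = 1$ for simplicity; `reinforce_update!` and `episode` are illustrative names, with `episode` holding (s, a, r) triples and `approx` being the approximator defined above:

# Illustrative: update in the forward order of the book's pseudocode (γ = 1).
function reinforce_update!(approx, episode)
    rewards = [r for (_, _, r) in episode]
    for t in 1:length(episode)
        s, a, _ = episode[t]
        G = sum(rewards[t:end])       # Monte-Carlo return G_t
        update!(approx, (s, a) => G)  # t = 1, 2, …, T: forward, not from the end
    end
end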
The Chapter13_Short_Corridor.jl notebook says:

Might as well put this as an issue in case anyone is interested in making a contribution! :slightly_smiling_face: