Well, I had a shot at it. It works, with the caveat that I can't get it to converge reliably using the learning rate given in the book. I've found $\alpha^\theta = 2^{-9}$ to be too large, which leads it to diverge about half of the time.
$\alpha^\theta = 2^{-11}$ works well.
begin
    # Assumes the notebook's imports: ReinforcementLearning (RLBase), Flux (softmax, optimisers),
    # and LinearAlgebra (dot).
    Base.@kwdef struct LinearPreferenceBaselineApproximator{F,O,App<:AbstractApproximator} <: AbstractApproximator
        weight::Vector{Float64}        # θ, the policy parameters
        feature_func::F                # x(s, a), feature vector for a state-action pair
        actions::Int                   # number of discrete actions
        opt::O                         # optimiser for the policy parameters
        state_value_approximator::App  # the baseline v̂(s, w)
    end

    # π(·|s, θ): softmax over the linear action preferences h(s, a, θ) = θ ⋅ x(s, a)
    function (A::LinearPreferenceBaselineApproximator)(s)
        h = [dot(A.weight, A.feature_func(s, a)) for a in 1:A.actions]
        softmax(h)
    end

    # REINFORCE-with-baseline update for the pair (s, a) given the return Δ
    function RLBase.update!(A::LinearPreferenceBaselineApproximator, correction::Pair)
        (s, a), Δ = correction
        δ = Δ - A.state_value_approximator(s)          # δ = G - v̂(s, w)
        update!(A.state_value_approximator, s => -δ)   # baseline update (approximators here descend the passed error)
        w, x = A.weight, A.feature_func
        # ∇ln π(a|s, θ) = x(s, a) - Σ_b π(b|s, θ) x(s, b); Flux optimisers descend, hence the leading -δ
        w̄ = -δ .* (x(s, a) .- sum(A(s) .* [x(s, b) for b in 1:A.actions]))
        Flux.Optimise.update!(A.opt, w, w̄)
    end
end
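For reference, the quantity in `w̄` is the eligibility vector for a softmax over linear action preferences, $\nabla \ln \pi(a|s,\theta) = x(s,a) - \sum_b \pi(b|s,\theta)\,x(s,b)$, as derived in Section 13.3 of the book. Below is a minimal usage sketch for the short-corridor task; the feature function, the zero initialisation, and the baseline `V` are illustrative assumptions rather than the notebook's actual setup (`V` stands for any state-value `AbstractApproximator`):

# Illustrative sketch only. Features as in Example 13.1: x(s, right) = [1, 0] and
# x(s, left) = [0, 1] for every state; `V` is an assumed state-value baseline.
x(s, a) = a == 1 ? [1.0, 0.0] : [0.0, 1.0]

policy = LinearPreferenceBaselineApproximator(
    weight = zeros(2),        # θ initialised to zero ⇒ π = [0.5, 0.5] everywhere
    feature_func = x,
    actions = 2,
    opt = Descent(2.0^-11),   # Flux's plain SGD with α^θ = 2^-11, the step size that converged reliably here
    state_value_approximator = V,
)

policy(1)                     # action probabilities π(·|s = 1, θ)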
This suffers from the same problem as Figure 13.1: the algorithm often diverges when using the learning rate given in the book.
I took a look at the original implementation, and I think the absence of this line is probably the culprit:
while (s is not None) and (t < max_timesteps):
This improvement should apply to many of the earlier graphs as well.
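In Julia terms, the fix is just the same guard on the episode loop. Here is a minimal sketch, assuming an RLBase-style environment; the names `generate_episode`, `policy`, and `max_timesteps` are illustrative, not taken from the notebook:

# Illustrative: generate one episode but stop after max_timesteps steps,
# mirroring the `(t < max_timesteps)` guard in the reference implementation.
function generate_episode(policy, env; max_timesteps = 1_000)
    episode = Tuple{Any,Any,Float64}[]   # (s, a, r) triples
    reset!(env)
    t = 0
    while !is_terminated(env) && t < max_timesteps
        s = state(env)
        a = policy(s)        # e.g. sample an action from the softmax probabilities above
        env(a)               # step the environment
        push!(episode, (s, a, reward(env)))
        t += 1
    end
    return episode
end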
Those who are interested in this issue may also take a look at the code here:
Another hidden gotcha: the original implementation starts learning from the end of the episode, which is inconsistent with the pseudocode provided in the book.
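Concretely, the book's REINFORCE-with-baseline pseudocode loops over $t = 0, 1, \ldots, T-1$ in forward order, whereas the reference implementation walks the episode backwards. A minimal sketch of the forward order, with $\gamma = 1$ for simplicity; `reinforce_update!` and `episode` are illustrative names, with `episode` holding (s, a, r) triples and `approx` being the approximator defined above:

# Illustrative: update in the forward order of the book's pseudocode (γ = 1).
function reinforce_update!(approx, episode)
    rewards = [r for (_, _, r) in episode]
    for t in 1:length(episode)
        s, a, _ = episode[t]
        G = sum(rewards[t:end])       # Monte-Carlo return G_t
        update!(approx, (s, a) => G)  # t = 1, 2, …, T: forward, not from the end
    end
end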
The Chapter13_Short_Corridor.jl notebook says:

Might as well put this as an issue in case anyone is interested in making a contribution! :slightly_smiling_face: