Van314159 opened 6 months ago
The same happened to me. I think this is caused by how the algorithm handles a terminating environment. Here is an example trace from this environment:
:state  :action  :terminal  :next_state
4       1        0          3
3       1        0          2
2       2        0          3
3       1        0          2
2       1        1          1
1       0        0          4
4       2        0          5
5       2        0          6
6       2        1          7
7       0        0          4
4       1        0          3
So you can see, the agent marks a step as :terminal if its :next_state is a terminal state (1 or 7 in this env). After this terminal step there is another step which has the actual terminal state as :state and the initial state of the next episode as :next_state, and this weird intermediate step has :action=0, which is not a valid action in this env and of course cannot be used to index the Q-table.
I don't know what the reason was to include these intermediate steps with :action=0 in the trace, but they need to be removed somehow for learning.
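For now, the only workaround I see is filtering those steps out before the update. A minimal sketch, assuming the invalid steps are exactly those with :action == 0 (true for this env, but not a general guarantee) and that the data comes as a NamedTuple of equal-length columns like the trace above:

# Hypothetical helper, not part of the package: drop the padding transitions
# (action == 0) from a NamedTuple-of-vectors trace before learning on it.
function drop_dummy_steps(trace)
    keep = findall(!=(0), trace.action)   # indices of the real transitions
    return map(col -> col[keep], trace)   # apply the same mask to every column
end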
I see, I think that's because a DummySampler is used. The "0" actions are dummy actions pushed to the replay buffer to keep the traces in sync (you have one more state than actions in an episode). These time steps should not be sampleable, as they are not meaningful. There should be an alternative to DummySampler that samples the whole buffer without the invalid time steps.
But since :next_state is part of the trace, why are those intermediate time steps necessary to keep the traces in sync? One could drop either these time steps or the :next_state column and wouldn't lose any information.
Or are you saying that, implementation-wise, state and next_state are views onto the same memory and hence cannot be dropped?
Yes, they are stored in the same memory space; that's why.
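Conceptually (a sketch only, not the actual ReinforcementLearningTrajectories code): there is one backing vector holding every state the environment visited, and :state and :next_state are offset views into it, so a row cannot be removed from one without affecting the other.

# Conceptual sketch only: one backing vector of visited states,
# with :state and :next_state as shifted views onto it.
states = [4, 3, 2, 3, 2, 1, 4, 5, 6, 7, 4, 3]   # every state visited, across episodes
state      = @view states[1:end-1]              # step t reads states[t]
next_state = @view states[2:end]                # ...and states[t+1]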
@HenriDeh Isn't the issue here?
function RLBase.optimise!(learner::TDLearner, stage::AbstractStage, trajectory::Trajectory)
    for batch in trajectory.container
        optimise!(learner, stage, batch)
    end
end
E.g. if unsampleable trajectory observations were not available in the iterate method of the trajectory, things should just work?
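One way that could look, as an untested sketch (it assumes the sampleable_inds mask of the EpisodesBuffer lines up index-for-index with the steps the container yields):

# Untested sketch: skip the padding transitions when iterating the container directly.
function RLBase.optimise!(learner::TDLearner, stage::AbstractStage, trajectory::Trajectory)
    container = trajectory.container
    for (i, step) in enumerate(container)
        container.sampleable_inds[i] || continue   # ignore steps flagged as not sampleable
        optimise!(learner, stage, step)
    end
end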
To help keep track of things, here's the full stack trace for the above example:
julia> run(agentRW, envRW, StopAfterNEpisodes(10), TotalRewardPerEpisode())
ERROR: BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1]
Stacktrace:
[1] getindex
@ ./essentials.jl:14 [inlined]
[2] maybeview
@ ./views.jl:149 [inlined]
[3] forward
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/tabular_approximator.jl:50 [inlined]
[4] Q
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:39 [inlined]
[5] bellman_update!
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:59 [inlined]
[6] _optimise!
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:75 [inlined]
[7] optimise!
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:82 [inlined]
[8] optimise!
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:92 [inlined]
[9] optimise!(learner::TDLearner{…}, stage::PostActStage, trajectory::Trajectory{…})
@ ReinforcementLearningCore /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:87
[10] optimise!
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/q_based_policy.jl:42 [inlined]
[11] optimise!
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/agent/agent_base.jl:35 [inlined]
[12] optimise!
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/agent/agent_base.jl:34 [inlined]
[13] macro expansion
@ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
[14] _run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…}, reset_condition::ResetIfEnvTerminated)
@ ReinforcementLearningCore /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/core/run.jl:61
[15] run
@ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/core/run.jl:30 [inlined]
[16] run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…})
@ ReinforcementLearningCore /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/core/run.jl:29
[17] top-level scope
@ REPL[16]:1
Some type information was truncated. Use `show(err)` to see complete types.
@HenriDeh (Just reread your RLTraj.jl issue and see you already proposed this solution). 🙈
I see, I think that's because a DummySampler is used.
I experimented with this again, but I get the same error when using BatchSampler:
using ReinforcementLearning
using ReinforcementLearningTrajectories

env = RandomWalk1D()

policy = QBasedPolicy(
    learner=TDLearner(
        TabularQApproximator(
            n_state=length(state_space(env)),
            n_action=length(action_space(env)),
        ),
        :SARS
    ),
    explorer=EpsilonGreedyExplorer(0.1)
)

trajectory = Trajectory(
    ElasticArraySARTSTraces(;
        state=Int64 => (),
        action=Int64 => (),
        reward=Float64 => (),
        terminal=Bool => (),
    ),
    BatchSampler(5),
    # DummySampler(),
    InsertSampleRatioController(),
)

agent = Agent(
    policy=policy,
    trajectory=trajectory
)

run(agent, env, StopAfterNEpisodes(10), TotalRewardPerEpisode())
This produces the following error. Is there a working way to use this package right now?
ERROR: BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1]
Stacktrace:
[1] getindex
@ ./essentials.jl:14 [inlined]
[2] maybeview
@ ./views.jl:149 [inlined]
[3] forward
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/tabular_approximator.jl:50 [inlined]
[4] Q
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:39 [inlined]
[5] bellman_update!
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:59 [inlined]
[6] _optimise!
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:75 [inlined]
[7] optimise!
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:82 [inlined]
[8] optimise!
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:92 [inlined]
[9] optimise!(learner::TDLearner{…}, stage::PostActStage, trajectory::Trajectory{…})
@ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:87
[10] optimise!
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/q_based_policy.jl:42 [inlined]
[11] optimise!
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/agent/agent_base.jl:35 [inlined]
[12] optimise!
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/agent/agent_base.jl:34 [inlined]
[13] macro expansion
@ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
[14] _run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…}, reset_condition::ResetIfEnvTerminated)
@ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/nftAp/src/core/run.jl:61
[15] run
@ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/core/run.jl:30 [inlined]
[16] run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…})
@ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/nftAp/src/core/run.jl:29
[17] top-level scope
@ ~/dev/jl/RLAlgorithms/scripts/investigate_traces.jl:34
Some type information was truncated. Use `show(err)` to see complete types.
Some more info: first, collect traces with a random policy:
agent = Agent(
    policy=RandomPolicy(),
    trajectory=trajectory
)
run(agent, env, StopAfterNEpisodes(2), TotalRewardPerEpisode())
julia> trajectory.container
EpisodesBuffer containing
Traces with 5 entries:
:state => 9-element RelativeTrace
:next_state => 9-element RelativeTrace
:action => 9-elements Trace{ElasticArrays.ElasticVector{Int64, Vector{Int64}}}
:reward => 9-elements Trace{ElasticArrays.ElasticVector{Float64, Vector{Float64}}}
:terminal => 9-elements Trace{ElasticArrays.ElasticVector{Bool, Vector{Bool}}}
julia> l = length(trajectory.container)
9
julia> traces = trajectory.container[1:l]
(state = [4, 5, 6, 7, 4, 3, 4, 3, 2], next_state = [5, 6, 7, 4, 3, 4, 3, 2, 1], action = [2, 2, 2, 0, 1, 2, 1, 1, 1], reward = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0], terminal = Bool[0, 0, 1, 0, 0, 0, 0, 0, 1])
julia> sampl = trajectory.container.sampleable_inds[1:l]
9-element BitVector:
1
1
1
0
1
1
1
1
1
julia> hcat(traces.state, traces.terminal, traces.action, traces.next_state, sampl)
9×5 ElasticArrays.ElasticMatrix{Int64, Vector{Int64}}:
4 0 2 5 1
5 0 2 6 1
6 1 2 7 1
7 0 0 4 0
4 0 1 3 1
3 0 2 4 1
4 0 1 3 1
3 0 1 2 1
2 1 1 1 1
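So sampleable_inds already carries exactly the mask needed to drop the dummy step; e.g. (untested) the valid transitions could be pulled out with:

# Untested: apply the sampleable mask to every column of the trace slice from above.
valid = map(col -> col[sampl], traces)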
Iterating over container vs iterating over trajectory:
julia> @which iterate(trajectory.container)
iterate(A::AbstractArray)
@ Base abstractarray.jl:1214
julia> for data in trajectory.container
@show data
end
data = (state = 4, next_state = 5, action = 2, reward = 0.0, terminal = false)
data = (state = 5, next_state = 6, action = 2, reward = 0.0, terminal = false)
data = (state = 6, next_state = 7, action = 2, reward = 1.0, terminal = true)
data = (state = 7, next_state = 4, action = 0, reward = 0.0, terminal = false)
data = (state = 4, next_state = 3, action = 1, reward = 0.0, terminal = false)
data = (state = 3, next_state = 4, action = 2, reward = 0.0, terminal = false)
data = (state = 4, next_state = 3, action = 1, reward = 0.0, terminal = false)
data = (state = 3, next_state = 2, action = 1, reward = 0.0, terminal = false)
data = (state = 2, next_state = 1, action = 1, reward = -1.0, terminal = true)
julia> @which iterate(trajectory)
iterate(t::Trajectory, args...)
@ ReinforcementLearningTrajectories ~/dev/jl/RLAlgorithms/dev/ReinforcementLearningTrajectories/src/trajectory.jl:132
julia> for batch in trajectory
@show batch
end
batch = (state = [4, 5, 5], next_state = [3, 6, 6], action = [1, 2, 2], reward = [0.0, 0.0, 0.0], terminal = Bool[0, 0, 0])
batch = (state = [2, 4, 6], next_state = [1, 3, 7], action = [1, 1, 2], reward = [-1.0, 0.0, 1.0], terminal = Bool[1, 0, 1])
batch = (state = [2, 4, 4], next_state = [1, 3, 3], action = [1, 1, 1], reward = [-1.0, 0.0, 0.0], terminal = Bool[1, 0, 0])
batch = (state = [3, 2, 3], next_state = [4, 1, 2], action = [2, 1, 1], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [4, 2, 3], next_state = [3, 1, 4], action = [1, 1, 2], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [5, 2, 5], next_state = [6, 1, 6], action = [2, 1, 2], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [4, 4, 2], next_state = [3, 3, 1], action = [1, 1, 1], reward = [0.0, 0.0, -1.0], terminal = Bool[0, 0, 1])
batch = (state = [4, 3, 6], next_state = [5, 4, 7], action = [2, 2, 2], reward = [0.0, 0.0, 1.0], terminal = Bool[0, 0, 1])
batch = (state = [3, 2, 4], next_state = [2, 1, 3], action = [1, 1, 1], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [4, 4, 3], next_state = [3, 3, 4], action = [1, 1, 2], reward = [0.0, 0.0, 0.0], terminal = Bool[0, 0, 0])
So when iterating over trajectory.container the dummy action 0 is part of it, whereas when iterating over the trajectory object itself, action 0 is never sampled (I also tried this with a larger buffer).
So does that mean that RLBase.optimise!(learner::TDLearner, stage::AbstractStage, trajectory::Trajectory) needs to iterate over trajectory instead of over trajectory.container, as you hinted above, @jeremiahpslewis?
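i.e. something like the following? (Untested; I have not checked whether the downstream bellman_update! accepts a batch of vectors rather than a single step.)

# Untested sketch: let the sampler, which skips the dummy steps, produce the data.
function RLBase.optimise!(learner::TDLearner, stage::AbstractStage, trajectory::Trajectory)
    for batch in trajectory
        optimise!(learner, stage, batch)
    end
end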
Apart from that, it seems a bit odd to me that iterating over a trajectory whose container has length 9, with BatchSampler(3), produces 10 batches of 3 samples each, totalling 30 examples (with repetitions). I would have expected it to produce N disjoint batches that in total cover the sampleable data without repetitions. But I have not fully understood how the sampler and controller of the trajectory work yet; maybe this behavior can be adjusted with them?
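For reference, what I had in mind is epoch-style batching, roughly like this (a sketch only, reusing l and sampleable_inds from above; I don't know whether an equivalent sampler already exists in ReinforcementLearningTrajectories):

# Sketch of disjoint, non-repeating batches over the sampleable steps only.
using Random
inds = shuffle(findall(trajectory.container.sampleable_inds[1:l]))
batches = collect(Iterators.partition(inds, 3))   # 3 = batch size; the last batch may be smaller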
It also seems inconsistent to me that length(trajectory.container) == 9: the container contains 8 sampleable steps but 10 actual states. For some reason the last dummy transition with action 0 is not considered part of the trace, while the other dummy actions are (length(trajectory.container.sampleable_inds) == 10).
I followed the RandomWalk1D() example in the tutorial and wanted to update the agent, but the run function returns BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1] if I use the TDLearner. My code is:

It returns:

The above code works if I stop the simulation early, i.e. if I specify StopAfterNSteps(3). It also works for RandomPolicy().