JuliaReinforcementLearning / ReinforcementLearning.jl

A reinforcement learning package for Julia
https://juliareinforcementlearning.org

It's not feasible to update the Q-based value agent in large steps for the RandomWalk1D() environment. #1068

Open Van314159 opened 6 months ago

Van314159 commented 6 months ago

I followed the RandomWalk1D() example in the tutorial and wanted to update the agent, but run throws BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1] when I use a TDLearner. My code is:

envRW = RandomWalk1D()
NS = length(state_space(envRW))
NA = length(action_space(envRW))
agentRW = Agent(
    policy = QBasedPolicy(
        learner = TDLearner(
            TabularQApproximator(
                n_state = NS,
                n_action = NA,
            ),
            :SARS
        ),
        explorer = EpsilonGreedyExplorer(0.1)
    ),
    trajectory = Trajectory(
        ElasticArraySARTSTraces(;
            state = Int64 => (),
            action = Int64 => (),
            reward = Float64 => (),
            terminal = Bool => (),
        ),
        DummySampler(),
        InsertSampleRatioController(),
    )
)

run(agentRW, envRW, StopAfterNEpisodes(10), TotalRewardPerEpisode())

It returns

BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1]

The above code works if I stop the simulation early, e.g. with StopAfterNSteps(3). It also works with RandomPolicy().
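
For what it's worth, the error itself is a plain indexing problem: the Q-table of the TabularQApproximator is a 2×7 matrix here (one row per action, one column per state), so a transition whose recorded action is 0 cannot index it. A minimal sketch outside the package:

Q = zeros(2, 7)   # same shape as the matrix in the error message: 2 actions × 7 states
Q[1, 4]           # fine: action 1 taken in state 4
Q[0, 1]           # BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1]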

johannes-fischer commented 6 months ago

The same happened to me; I think this is caused by how the algorithm handles episode termination. Here is an example trace from this environment:

:state, :action, :terminal, :next_state
 4  1  0  3
 3  1  0  2
 2  2  0  3
 3  1  0  2
 2  1  1  1
 1  0  0  4
 4  2  0  5
 5  2  0  6
 6  2  1  7
 7  0  0  4
 4  1  0  3

As you can see, the agent marks a step as :terminal if its :next_state is a terminal state (1 or 7 in this environment). After this terminal step there is another step that has the actual terminal state as :state and the initial state of the next episode as :next_state. This odd intermediate step has :action = 0, which is not a valid action in this environment and of course cannot be used to index the Q-table.

I don't know why these intermediate steps with :action = 0 are included in the trace, but they need to be removed somehow for learning.
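
To make "removed somehow for learning" concrete, a sketch of filtering the padding rows out of a raw trace before a tabular update could look like the following. The field names mirror the trace printed above; the update rule, α and γ are illustrative placeholders, not the package's TDLearner.

# Illustrative only: skip the dummy rows (action == 0) and apply a simple
# Q-learning-style update to the remaining transitions.
function update_from_trace!(Q, trace; α = 0.1, γ = 1.0)
    for (s, a, r, t, s′) in zip(trace.state, trace.action, trace.reward,
                                trace.terminal, trace.next_state)
        a == 0 && continue                           # dummy step after an episode ends
        target = r + (t ? 0.0 : γ * maximum(Q[:, s′]))
        Q[a, s] += α * (target - Q[a, s])
    end
    return Q
end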

HenriDeh commented 6 months ago

I see, I think that's because a DummySampler is used. The "0" actions are dummy actions pushed to the replay buffer to keep the traces in sync (an episode has one more state than it has actions). These time steps should not be sampleable, since they are not meaningful. There should be an alternative to DummySampler that samples the whole buffer while skipping the invalid time steps.
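
To spell out the sync argument: an episode with T transitions has T + 1 states but only T actions and rewards, so the shorter columns get a padding entry to keep all traces the same length (and to let state/next_state share storage). A toy illustration, not the ElasticArraySARTSTraces internals:

# One 3-step episode of RandomWalk1D, recorded naively:
states  = [4, 5, 6, 7]       # T + 1 = 4 states; state 7 is terminal
actions = [2, 2, 2]          # T = 3 actions
rewards = [0.0, 0.0, 1.0]    # T = 3 rewards
# Padding the shorter columns with a dummy entry keeps them aligned with the
# state column; these are the ":action = 0" rows seen in the trace above.
actions_padded = vcat(actions, 0)
rewards_padded = vcat(rewards, 0.0)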

johannes-fischer commented 6 months ago

But since :next_state is part of the trace, why are those intermediate time steps necessary to keep the traces in sync? One could drop either these time steps or the :next_state trace without losing any information.

johannes-fischer commented 6 months ago

Or are you saying that, implementation-wise, state and next_state are views onto the same memory and hence these steps cannot be dropped?

HenriDeh commented 6 months ago

Yes, they are stored in the same memory space; that's why.
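
Roughly, the shared-memory layout being described is something like the following (only an illustration of the argument, not the actual RelativeTrace implementation): both columns are views into one underlying state buffer, shifted by one index, so a row cannot be dropped from one without affecting the other.

all_states = [4, 5, 6, 7, 4, 3]           # one underlying buffer of visited states
state      = @view all_states[1:end-1]    # rows 1 .. n-1
next_state = @view all_states[2:end]      # rows 2 .. n, shifted by one
# state[i+1] and next_state[i] alias the same element, so deleting the
# intermediate row from one trace would also change the other.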

jeremiahpslewis commented 6 months ago

@HenriDeh Isn't the issue here?

function RLBase.optimise!(learner::TDLearner, stage::AbstractStage, trajectory::Trajectory)
    for batch in trajectory.container
        optimise!(learner, stage, batch)
    end
end

I.e., if the unsampleable trajectory observations were not exposed by the trajectory's iterate method, things should just work?
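
As a point of comparison, a hypothetical variant of that method which filters the padding rows itself might look like the sketch below. It assumes the container's sampleable_inds field (shown later in this thread) lines up with the iteration order; this is only meant to illustrate the idea, not a proposed patch.

function RLBase.optimise!(learner::TDLearner, stage::AbstractStage, trajectory::Trajectory)
    container = trajectory.container
    for (i, batch) in enumerate(container)
        # skip the dummy transitions (action == 0) inserted after episode ends
        container.sampleable_inds[i] || continue
        optimise!(learner, stage, batch)
    end
end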

jeremiahpslewis commented 6 months ago

To help keep track of things, here's the full stack trace for the above example:

julia> run(agentRW, envRW, StopAfterNEpisodes(10), TotalRewardPerEpisode())
ERROR: BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1]
Stacktrace:
  [1] getindex
    @ ./essentials.jl:14 [inlined]
  [2] maybeview
    @ ./views.jl:149 [inlined]
  [3] forward
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/tabular_approximator.jl:50 [inlined]
  [4] Q
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:39 [inlined]
  [5] bellman_update!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:59 [inlined]
  [6] _optimise!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:75 [inlined]
  [7] optimise!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:82 [inlined]
  [8] optimise!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:92 [inlined]
  [9] optimise!(learner::TDLearner{…}, stage::PostActStage, trajectory::Trajectory{…})
    @ ReinforcementLearningCore /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/learners/td_learner.jl:87
 [10] optimise!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/q_based_policy.jl:42 [inlined]
 [11] optimise!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/agent/agent_base.jl:35 [inlined]
 [12] optimise!
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/policies/agent/agent_base.jl:34 [inlined]
 [13] macro expansion
    @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
 [14] _run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…}, reset_condition::ResetIfEnvTerminated)
    @ ReinforcementLearningCore /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/core/run.jl:61
 [15] run
    @ /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/core/run.jl:30 [inlined]
 [16] run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…})
    @ ReinforcementLearningCore /workspaces/ReinforcementLearning.jl/src/ReinforcementLearningCore/src/core/run.jl:29
 [17] top-level scope
    @ REPL[16]:1
Some type information was truncated. Use `show(err)` to see complete types.
jeremiahpslewis commented 6 months ago

@HenriDeh (Just reread your RLTraj.jl issue and see you already proposed this solution). 🙈

johannes-fischer commented 3 months ago

> I see, I think that's because a DummySampler is used.

I experimented with this again, but I get the same error when using BatchSampler:

using ReinforcementLearning
using ReinforcementLearningTrajectories
env = RandomWalk1D()
policy = QBasedPolicy(
    learner=TDLearner(
        TabularQApproximator(
            n_state=length(state_space(env)),
            n_action=length(action_space(env)),
        ),
        :SARS
    ),
    explorer=EpsilonGreedyExplorer(0.1)
)
trajectory = Trajectory(
    ElasticArraySARTSTraces(;
        state=Int64 => (),
        action=Int64 => (),
        reward=Float64 => (),
        terminal=Bool => (),
    ),
    BatchSampler(5),
    # DummySampler(),
    InsertSampleRatioController(),
)
agent = Agent(
    policy=policy,
    trajectory=trajectory
)
run(agent, env, StopAfterNEpisodes(10), TotalRewardPerEpisode())

This produces the following error. Is there a working way to use this package right now?

ERROR: BoundsError: attempt to access 2×7 Matrix{Float64} at index [0, 1]
Stacktrace:
  [1] getindex
    @ ./essentials.jl:14 [inlined]
  [2] maybeview
    @ ./views.jl:149 [inlined]
  [3] forward
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/tabular_approximator.jl:50 [inlined]
  [4] Q
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:39 [inlined]
  [5] bellman_update!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:59 [inlined]
  [6] _optimise!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:75 [inlined]
  [7] optimise!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:82 [inlined]
  [8] optimise!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:92 [inlined]
  [9] optimise!(learner::TDLearner{…}, stage::PostActStage, trajectory::Trajectory{…})
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/learners/td_learner.jl:87
 [10] optimise!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/q_based_policy.jl:42 [inlined]
 [11] optimise!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/agent/agent_base.jl:35 [inlined]
 [12] optimise!
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/policies/agent/agent_base.jl:34 [inlined]
 [13] macro expansion
    @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
 [14] _run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…}, reset_condition::ResetIfEnvTerminated)
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/nftAp/src/core/run.jl:61
 [15] run
    @ ~/.julia/packages/ReinforcementLearningCore/nftAp/src/core/run.jl:30 [inlined]
 [16] run(policy::Agent{…}, env::RandomWalk1D, stop_condition::StopAfterNEpisodes{…}, hook::TotalRewardPerEpisode{…})
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/nftAp/src/core/run.jl:29
 [17] top-level scope
    @ ~/dev/jl/RLAlgorithms/scripts/investigate_traces.jl:34
Some type information was truncated. Use `show(err)` to see complete types.
johannes-fischer commented 3 months ago

Some more info. First, collect traces with a random policy:

agent = Agent(
    policy=RandomPolicy(),
    trajectory=trajectory
)
run(agent, env, StopAfterNEpisodes(2), TotalRewardPerEpisode())
julia> trajectory.container
EpisodesBuffer containing
Traces with 5 entries:
  :state => 9-element RelativeTrace
  :next_state => 9-element RelativeTrace
  :action => 9-elements Trace{ElasticArrays.ElasticVector{Int64, Vector{Int64}}}
  :reward => 9-elements Trace{ElasticArrays.ElasticVector{Float64, Vector{Float64}}}
  :terminal => 9-elements Trace{ElasticArrays.ElasticVector{Bool, Vector{Bool}}}
julia> l = length(trajectory.container)
9

julia> traces = trajectory.container[1:l]
(state = [4, 5, 6, 7, 4, 3, 4, 3, 2], next_state = [5, 6, 7, 4, 3, 4, 3, 2, 1], action = [2, 2, 2, 0, 1, 2, 1, 1, 1], reward = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0], terminal = Bool[0, 0, 1, 0, 0, 0, 0, 0, 1])

julia> sampl = trajectory.container.sampleable_inds[1:l]
9-element BitVector:
 1
 1
 1
 0
 1
 1
 1
 1
 1

julia> hcat(traces.state, traces.terminal, traces.action, traces.next_state, sampl)
9×5 ElasticArrays.ElasticMatrix{Int64, Vector{Int64}}:
 4  0  2  5  1
 5  0  2  6  1
 6  1  2  7  1
 7  0  0  4  0
 4  0  1  3  1
 3  0  2  4  1
 4  0  1  3  1
 3  0  1  2  1
 2  1  1  1  1

Iterating over container vs iterating over trajectory:

julia> @which iterate(trajectory.container)
iterate(A::AbstractArray)
     @ Base abstractarray.jl:1214

julia> for data in trajectory.container
           @show data
       end
data = (state = 4, next_state = 5, action = 2, reward = 0.0, terminal = false)
data = (state = 5, next_state = 6, action = 2, reward = 0.0, terminal = false)
data = (state = 6, next_state = 7, action = 2, reward = 1.0, terminal = true)
data = (state = 7, next_state = 4, action = 0, reward = 0.0, terminal = false)
data = (state = 4, next_state = 3, action = 1, reward = 0.0, terminal = false)
data = (state = 3, next_state = 4, action = 2, reward = 0.0, terminal = false)
data = (state = 4, next_state = 3, action = 1, reward = 0.0, terminal = false)
data = (state = 3, next_state = 2, action = 1, reward = 0.0, terminal = false)
data = (state = 2, next_state = 1, action = 1, reward = -1.0, terminal = true)

julia> @which iterate(trajectory)
iterate(t::Trajectory, args...)
     @ ReinforcementLearningTrajectories ~/dev/jl/RLAlgorithms/dev/ReinforcementLearningTrajectories/src/trajectory.jl:132

julia> for batch in trajectory
           @show batch
       end
batch = (state = [4, 5, 5], next_state = [3, 6, 6], action = [1, 2, 2], reward = [0.0, 0.0, 0.0], terminal = Bool[0, 0, 0])
batch = (state = [2, 4, 6], next_state = [1, 3, 7], action = [1, 1, 2], reward = [-1.0, 0.0, 1.0], terminal = Bool[1, 0, 1])
batch = (state = [2, 4, 4], next_state = [1, 3, 3], action = [1, 1, 1], reward = [-1.0, 0.0, 0.0], terminal = Bool[1, 0, 0])
batch = (state = [3, 2, 3], next_state = [4, 1, 2], action = [2, 1, 1], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [4, 2, 3], next_state = [3, 1, 4], action = [1, 1, 2], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [5, 2, 5], next_state = [6, 1, 6], action = [2, 1, 2], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [4, 4, 2], next_state = [3, 3, 1], action = [1, 1, 1], reward = [0.0, 0.0, -1.0], terminal = Bool[0, 0, 1])
batch = (state = [4, 3, 6], next_state = [5, 4, 7], action = [2, 2, 2], reward = [0.0, 0.0, 1.0], terminal = Bool[0, 0, 1])
batch = (state = [3, 2, 4], next_state = [2, 1, 3], action = [1, 1, 1], reward = [0.0, -1.0, 0.0], terminal = Bool[0, 1, 0])
batch = (state = [4, 4, 3], next_state = [3, 3, 4], action = [1, 1, 2], reward = [0.0, 0.0, 0.0], terminal = Bool[0, 0, 0])

So when iterating over trajectory.container, the dummy action 0 is included, whereas when iterating over the trajectory object itself, action 0 is never sampled (I also tried this with a larger buffer).

So does that mean that

function RLBase.optimise!(learner::TDLearner, stage::AbstractStage, trajectory::Trajectory)

needs to iterate over the trajectory instead of over trajectory.container, as you hinted above, @jeremiahpslewis?
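
If so, the change being discussed would look roughly like the sketch below, i.e. letting the sampler decide what the learner sees. Whether the existing batch-wise optimise! accepts the vector-valued batches produced by the sampler is part of the open question, so this is only the shape of the idea:

function RLBase.optimise!(learner::TDLearner, stage::AbstractStage, trajectory::Trajectory)
    # iterate the Trajectory (sampler-driven, never yields the action == 0 padding)
    # instead of the raw container (which includes it)
    for batch in trajectory
        optimise!(learner, stage, batch)
    end
end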


Apart from that, it seems a bit odd to me that iterating over a trajectory with container length 9 and BatchSampler(3) produces 10 batches of 3 samples each, totalling 30 examples (with repetitions). I would have expected it to produce N disjoint batches that together cover the sampleable data without repetition. But I have not fully understood how the sampler and controller of the trajectory work yet; maybe this behavior can be adjusted through them?
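
For reference, the non-repeating behaviour described here could be emulated outside the sampler by partitioning the sampleable indices directly. A plain-Julia sketch (not a ReinforcementLearningTrajectories feature), assuming scalar indexing of the container works the same way as iterating it does above:

using Random  # for shuffle

l = length(trajectory.container)
inds = findall(trajectory.container.sampleable_inds[1:l])   # drop the padding rows
for batch_inds in Iterators.partition(shuffle(inds), 3)
    # each sampleable transition appears in exactly one batch per pass
    for i in batch_inds
        transition = trajectory.container[i]
        # ... learn from `transition` ...
    end
end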


It also seems inconsistent to me that length(trajectory.container) == 9: the container holds 8 sampleable steps but 10 actual states. For some reason the last dummy transition with action 0 is not counted as part of the trace, while the other dummy actions are (length(trajectory.container.sampleable_inds) == 10).