JuliaReinforcementLearning / ReinforcementLearning.jl

A reinforcement learning package for Julia
https://juliareinforcementlearning.org

RL Core tests fail sporadically #1010

Closed: joelreymont closed this issue 9 months ago

joelreymont commented 9 months ago

I have to re-run the tests a few times to get a failure, and it's always the same test.


     Testing Running tests...
WARNING: Method definition timeit_debug_enabled() in module ReinforcementLearningCore at /Users/joelr/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:180 overwritten at /Users/joelr/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:188.
OfflineAgent: Test Failed at /Users/joelr/Work/Julia/ReinforcementLearning.jl/src/ReinforcementLearningCore/test/policies/agent.jl:69
  Expression: length(a_2.trajectory.container) == 5
   Evaluated: 6 == 5

Stacktrace:
 [1] macro expansion
   @ ~/.julia/juliaup/julia-1.10.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Test/src/Test.jl:672 [inlined]
 [2] macro expansion
   @ ~/Work/Julia/ReinforcementLearning.jl/src/ReinforcementLearningCore/test/policies/agent.jl:69 [inlined]
 [3] macro expansion
   @ ~/.julia/juliaup/julia-1.10.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
 [4] macro expansion
   @ ~/Work/Julia/ReinforcementLearning.jl/src/ReinforcementLearningCore/test/policies/agent.jl:43 [inlined]
 [5] macro expansion
   @ ~/.julia/juliaup/julia-1.10.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
 [6] top-level scope
   @ ~/Work/Julia/ReinforcementLearning.jl/src/ReinforcementLearningCore/test/policies/agent.jl:5
OfflineAgent: Test Failed at /Users/joelr/Work/Julia/ReinforcementLearning.jl/src/ReinforcementLearningCore/test/policies/agent.jl:76
  Expression: length(agent.trajectory.container) in (0, 5)
   Evaluated: 6 in (0, 5)

Stacktrace:
 [1] macro expansion
   @ ~/.julia/juliaup/julia-1.10.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Test/src/Test.jl:672 [inlined]
 [2] macro expansion
   @ ~/Work/Julia/ReinforcementLearning.jl/src/ReinforcementLearningCore/test/policies/agent.jl:76 [inlined]
 [3] macro expansion
   @ ~/.julia/juliaup/julia-1.10.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
 [4] macro expansion
   @ ~/Work/Julia/ReinforcementLearning.jl/src/ReinforcementLearningCore/test/policies/agent.jl:43 [inlined]
 [5] macro expansion
   @ ~/.julia/juliaup/julia-1.10.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
 [6] top-level scope
   @ ~/Work/Julia/ReinforcementLearning.jl/src/ReinforcementLearningCore/test/policies/agent.jl:5
[ Info: initializing tictactoe state info cache...
[ Info: finished initializing tictactoe state info cache in 0.670085458 seconds
Test Summary:                                                   | Pass  Fail  Total   Time
ReinforcementLearningCore.jl                                    |  656     5    661  53.7s
  core                                                          |    3            3   1.1s
  TotalRewardPerEpisode                                         |  105          105   0.7s
  DoEveryNStep                                                  |   68           68   0.1s
  TimePerStep                                                   |   42           42   1.0s
  StepsPerEpisode                                               |   16           16   0.1s
  RewardsPerEpisode                                             |   33           33   0.0s
  DoOnExit                                                      |    1            1   0.0s
  DoEveryNEpisode                                               |   84           84   0.1s
  StopAfterStep                                                 |    2            2   0.0s
  ComposedStopCondition                                         |    1            1   0.0s
  StopAfterEpisode                                              |    6            6   0.0s
  StopAfterNoImprovement                                        |   12           12   0.2s
  agent.jl                                                      |   20     5     25   0.6s
    Agent Tests                                                 |   12           12   0.3s
    OfflineAgent                                                |    8     5     13   0.3s
  MultiAgentPolicy                                              |    1            1   0.0s
  MultiAgentHook                                                |    1            1   0.1s
  CurrentPlayerIterator                                         |    1            1   0.0s
  Basic TicTacToeEnv (Sequential) env checks                    |   15           15   1.3s
  next_player!                                                  |    1            1   0.0s
  Basic RockPaperScissors (simultaneous) env checks             |   22           22   0.5s
  Sequential Environments correctly ended by termination signal |    1            1   0.2s
  approximators.jl                                              |   10           10   4.3s
  base                                                          |   44           44   1.3s
  device                                                        |    4            4   0.3s
  StackFrames                                                   |    5            5   0.7s
  Approximators                                                 |  136          136  32.9s
  utils/distributions                                           |   22           22   6.4s
ERROR: LoadError: Some tests did not pass: 656 passed, 5 failed, 0 errored, 0 broken.
in expression starting at /Users/joelr/Work/Julia/ReinforcementLearning.jl/src/ReinforcementLearningCore/test/runtests.jl:13
ERROR: Package ReinforcementLearningCore errored during testing
joelreymont commented 9 months ago

Git bisect, together with lots of running the tests by hand, points to commit e1d9e9e as the bad one:

❯ git bisect good
e1d9e9e21a0a3955667a1276b1140b3b72bf9d4b is the first bad commit
commit e1d9e9e21a0a3955667a1276b1140b3b72bf9d4b
Author: Henri Dehaybe <47037088+HenriDeh@users.noreply.github.com>
Date:   Thu Oct 26 10:11:22 2023 +0200

    Conservative Q-Learning (#995)

    * divide sac into functions

    * bump version

    * implement CQL

    * create OfflineAgent (does not collect online data)

    * working state

    * experiments working

    * typo

    * Tests pass

    * add finetuning

    * write doc

    * Update src/ReinforcementLearningCore/src/policies/agent/agent_base.jl

    * Update src/ReinforcementLearningZoo/src/algorithms/offline_rl/CQL_SAC.jl

    * Apply suggestions from code review

    * add review suggestions

    * remove finetuning

    * fix a ProgressMeter deprecation warning

    ---------

    Co-authored-by: Jeremiah <4462211+jeremiahpslewis@users.noreply.github.com>

 src/ReinforcementLearningCore/Project.toml         |  5 +-
 .../src/core/stop_conditions.jl                    |  4 +-
 .../src/policies/agent/agent.jl                    |  1 +
 .../src/policies/agent/agent_base.jl               | 13 +--
 .../src/policies/agent/offline_agent.jl            | 76 +++++++++++++++++
 .../test/policies/agent.jl                         | 38 +++++++++
 src/ReinforcementLearningExperiments/Project.toml  |  2 +-
 .../experiments/Offline/JuliaRL_CQLSAC_Pendulum.jl | 98 ++++++++++++++++++++++
 .../Policy Gradient/JuliaRL_SAC_Pendulum.jl        |  2 +-
 .../src/ReinforcementLearningExperiments.jl        |  1 +
 .../test/runtests.jl                               |  1 +
 src/ReinforcementLearningZoo/Project.toml          |  5 +-
 .../src/ReinforcementLearningZoo.jl                |  1 +
 .../src/algorithms/algorithms.jl                   |  2 +-
 .../src/algorithms/offline_rl/CQL_SAC.jl           | 93 ++++++++++++++++++++
 .../src/algorithms/offline_rl/offline_rl.jl        |  4 +-
 .../src/algorithms/policy_gradient/sac.jl          | 45 ++++++----
 17 files changed, 357 insertions(+), 34 deletions(-)
 create mode 100644 src/ReinforcementLearningCore/src/policies/agent/offline_agent.jl
 create mode 100644 src/ReinforcementLearningExperiments/deps/experiments/experiments/Offline/JuliaRL_CQLSAC_Pendulum.jl
 create mode 100644 src/ReinforcementLearningZoo/src/algorithms/offline_rl/CQL_SAC.jl
jeremiahpslewis commented 9 months ago

OfflineAgent seems to be the culprit...

joelreymont commented 9 months ago

I'm trying to figure this out...

joelreymont commented 9 months ago

I've spent 2-3 days digging into this already and it's time to ask for help!

I have figured out what's going on with this function, but I can't figure out why:

Base.push!(::OfflineAgent{P,T, <: OfflineBehavior{Nothing}}, ::PreExperimentStage, env::AbstractEnv) where {P,T} = nothing

# Fills the trajectory with interactions generated with the behavior_agent at the PreExperimentStage.
function Base.push!(agent::OfflineAgent{P,T, <: OfflineBehavior{<:Agent}}, ::PreExperimentStage, env::AbstractEnv) where {P,T}
    is_stop = false
    policy = agent.offline_behavior.agent
    steps = 0
    while !is_stop
        reset!(env)
        push!(policy, PreEpisodeStage(), env)
        while !agent.offline_behavior.reset_condition(policy, env) # one episode
            steps += 1
            push!(policy, PreActStage(), env)
            action = RLBase.plan!(policy, env)
            act!(env, action)
            push!(policy, PostActStage(), env, action)
            if steps >= agent.offline_behavior.steps
                is_stop = true
                break
            end
        end # end of an episode
        push!(policy, PostEpisodeStage(), env)
    end
end

If agent.offline_behavior.reset_condition is not triggered, the test completes just fine. Otherwise we get an extra item in agent.trajectory.container. The extra item appears because we reset the environment at the top of the outer loop and then call push!(policy, PreEpisodeStage(), env). On the first pass, when steps is 0, this push does not insert anything, but on the next pass it does insert an item.
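To make the pattern concrete, here is a toy that reproduces the counting I'm seeing (my own sketch, not the real Trajectory/EpisodesBuffer code): a state-only push at the start of the first episode does not change the reported length, but the same push at the start of the second episode does.

# Toy model of the counting behavior described above (illustrative only, not the
# actual Trajectory/EpisodesBuffer code): states live in one vector, and the
# container length is the number of consecutive state pairs that can be formed.
struct ToyBuffer
    states::Vector{Int}
    actions::Vector{Int}
end
ToyBuffer() = ToyBuffer(Int[], Int[])

# PreEpisodeStage: only the initial state is recorded.
record_state!(b::ToyBuffer, s) = push!(b.states, s)

# PostActStage: the action and the resulting state are recorded.
function record_step!(b::ToyBuffer, a, s_next)
    push!(b.actions, a)
    push!(b.states, s_next)
end

# A "row" is any pair of consecutive states, so the count lags the state pushes by one.
Base.length(b::ToyBuffer) = max(length(b.states) - 1, 0)

b = ToyBuffer()
record_state!(b, 4)     # episode 1 starts: length(b) == 0, nothing appears to be inserted
record_step!(b, 2, 5)   # length(b) == 1
record_step!(b, 2, 6)   # length(b) == 2
record_step!(b, 2, 7)   # terminal state reached: length(b) == 3
record_state!(b, 4)     # episode 2 starts: the fresh initial state pairs with the
                        # previous episode's last state, so length(b) jumps to 4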

I inserted printouts after each push into the trajectory container and can see this behavior clearly. I also tried to dig down into the trajectory push! method and further down. For the life of me, I can't figure out why the container length does not increase at the beginning of the first iteration!

>>> iterating with step 0 and container length 0
container length 0 after env reset

XXX nothing is inserted here  by "push!(policy, PreEpisodeStage(), env)"

starting episode loop with step 0 and container length 0
container = []
steps = 1
container before pushing PreActStage = []
container after pushing PreActStage = []
container after acting = []
container after pushing PostActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false)]
steps = 2
container before pushing PreActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false)]
container after pushing PreActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false)]
container after acting = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false)]
container after pushing PostActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false)]
steps = 3
container before pushing PreActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false)]
container after pushing PreActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false)]
container after acting = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false)]
container after pushing PostActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true)]
ending episode. steps = 3, ended = true
container length 3 after pushing PostEpisodeStage
container after episode = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true)]

>>> iterating with step 3 and container length 3
container length 3 after env reset

XXX one item is inserted here  by "push!(policy, PreEpisodeStage(), env)"

starting episode loop with step 3 and container length 4
container = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false)]
steps = 4
container before pushing PreActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false)]
container after pushing PreActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false)]
container after acting = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false)]
container after pushing PostActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false), (state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false)]
steps = 5
container before pushing PreActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false), (state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false)]
container after pushing PreActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false), (state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false)]
container after acting = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false), (state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false)]
container after pushing PostActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false), (state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 4, action = 1, reward = 0.0f0, terminal = false)]
stopping at 5 steps!
ending episode. steps = 5, ended = false
container length 6 after pushing PostEpisodeStage
container after episode = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false), (state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 4, action = 1, reward = 0.0f0, terminal = false)]
final container = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false), (state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 4, action = 1, reward = 0.0f0, terminal = false)]
jeremiahpslewis commented 9 months ago

Thanks for looking into this!!! I’ll dive into it tomorrow. :)

joelreymont commented 9 months ago

I wish I could set breakpoints in tests (I'm using VS Code), but that seems to be impossible. I read the existing Discourse threads and experimented with TestItemRunner, to no avail.

FYI, we first hit the AbstractAgent push! method and then jump over to the Trajectory push! method.
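One workaround that might be worth a try (an assumption on my part, I haven't attempted it here) is Infiltrator.jl, which can pause execution at an arbitrary line when the tests are run from a REPL:

using Infiltrator  # assumes Infiltrator.jl is added to the test environment

# Dropped inside the code under test (for example the episode loop quoted above),
# this pauses and opens an interactive prompt when reached during a REPL session.
@infiltrate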

joelreymont commented 9 months ago

I feel stupid now, but this is the classic case of solving a problem by talking to a rubber duckie; asking for help works just as well :-). I missed the EpisodesBuffer push! method, which is likely the one eating up a push. Digging deeper!

joelreymont commented 9 months ago

This seems to allow for eb.traces to be empty after an insert. How can traces be empty after an insert, though?

joelreymont commented 9 months ago

What I would like to figure out is who is making the decision to count or not count the first trace. If the first trace does not get inserted at all then how does it show up when I print out the traces at the end of the test?

jeremiahpslewis commented 9 months ago

Here, perhaps? https://github.com/JuliaReinforcementLearning/ReinforcementLearningTrajectories.jl/blob/4b112be2a29dfc22339b229f94cf82471c79d34f/src/episodes.jl#L132

joelreymont commented 9 months ago

I printed out the value of partial, and it's true on the first insert of PreEpisodeStage as well as the second (if the reset condition is triggered mid-episode).

joelreymont commented 9 months ago

There seems to be nothing wrong with the code; the test itself is buggy. Fixing the test instead.

jeremiahpslewis commented 9 months ago

Awesome. Merged. Thanks!

HenriDeh commented 9 months ago

Oof, sorry, I was on vacation; I could have helped. Brave of you to dig into all that.

joelreymont commented 9 months ago

Thank you Henri! I learned a lot about Julia in the process.

Trying to get into reinforcement learning now, pun intended!