Closed: joelreymont closed this issue 9 months ago
Git bisect, together with a lot of manual test runs, points to commit e1d9e9e as the bad one:
❯ git bisect good
e1d9e9e21a0a3955667a1276b1140b3b72bf9d4b is the first bad commit
commit e1d9e9e21a0a3955667a1276b1140b3b72bf9d4b
Author: Henri Dehaybe <47037088+HenriDeh@users.noreply.github.com>
Date: Thu Oct 26 10:11:22 2023 +0200
Conservative Q-Learning (#995)
* divide sac into functions
* bump version
* implement CQL
* create OfflineAgent (does not collect online data)
* working state
* experiments working
* typo
* Tests pass
* add finetuning
* write doc
* Update src/ReinforcementLearningCore/src/policies/agent/agent_base.jl
* Update src/ReinforcementLearningZoo/src/algorithms/offline_rl/CQL_SAC.jl
* Apply suggestions from code review
* add review suggestions
* remove finetuning
* fix a ProgressMeter deprecation warning
---------
Co-authored-by: Jeremiah <4462211+jeremiahpslewis@users.noreply.github.com>
src/ReinforcementLearningCore/Project.toml | 5 +-
.../src/core/stop_conditions.jl | 4 +-
.../src/policies/agent/agent.jl | 1 +
.../src/policies/agent/agent_base.jl | 13 +--
.../src/policies/agent/offline_agent.jl | 76 +++++++++++++++++
.../test/policies/agent.jl | 38 +++++++++
src/ReinforcementLearningExperiments/Project.toml | 2 +-
.../experiments/Offline/JuliaRL_CQLSAC_Pendulum.jl | 98 ++++++++++++++++++++++
.../Policy Gradient/JuliaRL_SAC_Pendulum.jl | 2 +-
.../src/ReinforcementLearningExperiments.jl | 1 +
.../test/runtests.jl | 1 +
src/ReinforcementLearningZoo/Project.toml | 5 +-
.../src/ReinforcementLearningZoo.jl | 1 +
.../src/algorithms/algorithms.jl | 2 +-
.../src/algorithms/offline_rl/CQL_SAC.jl | 93 ++++++++++++++++++++
.../src/algorithms/offline_rl/offline_rl.jl | 4 +-
.../src/algorithms/policy_gradient/sac.jl | 45 ++++++----
17 files changed, 357 insertions(+), 34 deletions(-)
create mode 100644 src/ReinforcementLearningCore/src/policies/agent/offline_agent.jl
create mode 100644 src/ReinforcementLearningExperiments/deps/experiments/experiments/Offline/JuliaRL_CQLSAC_Pendulum.jl
create mode 100644 src/ReinforcementLearningZoo/src/algorithms/offline_rl/CQL_SAC.jl
OfflineAgent seems to be the culprit...
I'm trying to figure this out...
I've spent 2-3 days digging into this already and it's time to ask for help!
I have figured out what's going on in this function, but I can't figure out why:
Base.push!(::OfflineAgent{P,T, <: OfflineBehavior{Nothing}}, ::PreExperimentStage, env::AbstractEnv) where {P,T} = nothing

# fills the trajectory with interactions generated with the behavior_agent at the PreExperimentStage.
function Base.push!(agent::OfflineAgent{P,T, <: OfflineBehavior{<:Agent}}, ::PreExperimentStage, env::AbstractEnv) where {P,T}
    is_stop = false
    policy = agent.offline_behavior.agent
    steps = 0
    while !is_stop
        reset!(env)
        push!(policy, PreEpisodeStage(), env)
        while !agent.offline_behavior.reset_condition(policy, env) # one episode
            steps += 1
            push!(policy, PreActStage(), env)
            action = RLBase.plan!(policy, env)
            act!(env, action)
            push!(policy, PostActStage(), env, action)
            if steps >= agent.offline_behavior.steps
                is_stop = true
                break
            end
        end # end of an episode
        push!(policy, PostEpisodeStage(), env)
    end
end
If agent.offline_behavior.reset_condition is not triggered, the test completes just fine. Otherwise, we get an extra item in agent.trajectory.container. The reason for the extra item is that we reset the environment at the top of the outer loop and then call push!(policy, PreEpisodeStage(), env). This push inserts nothing the first time through, when steps is 0, but it does insert an item on every subsequent iteration.
I inserted printouts after each push into the trajectory container and can see this behavior clearly. I also tried to dig down into the trajectory push! method and further. For the life of me I can't figure out why the container length does not increase at the beginning of the iteration!
>>> iterating with step 0 and container length 0
container length 0 after env reset
XXX nothing is inserted here by "push!(policy, PreEpisodeStage(), env)"
starting episode loop with step 0 and container length 0
container = []
steps = 1
container before pushing PreActStage = []
container after pushing PreActStage = []
container after acting = []
container after pushing PostActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false)]
steps = 2
container before pushing PreActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false)]
container after pushing PreActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false)]
container after acting = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false)]
container after pushing PostActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false)]
steps = 3
container before pushing PreActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false)]
container after pushing PreActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false)]
container after acting = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false)]
container after pushing PostActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true)]
ending episode. steps = 3, ended = true
container length 3 after pushing PostEpisodeStage
container after episode = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true)]
>>> iterating with step 3 and container length 3
container length 3 after env reset
XXX one item is inserted here by "push!(policy, PreEpisodeStage(), env)"
starting episode loop with step 3 and container length 4
container = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false)]
steps = 4
container before pushing PreActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false)]
container after pushing PreActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false)]
container after acting = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false)]
container after pushing PostActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false), (state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false)]
steps = 5
container before pushing PreActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false), (state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false)]
container after pushing PreActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false), (state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false)]
container after acting = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false), (state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false)]
container after pushing PostActStage = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false), (state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 4, action = 1, reward = 0.0f0, terminal = false)]
stopping at 5 steps!
ending episode. steps = 5, ended = false
container length 6 after pushing PostEpisodeStage
container after episode = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false), (state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 4, action = 1, reward = 0.0f0, terminal = false)]
final container = [(state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 6, action = 2, reward = 0.0f0, terminal = false), (state = 6, next_state = 7, action = 2, reward = 1.0f0, terminal = true), (state = 7, next_state = 4, action = 0, reward = 0.0f0, terminal = false), (state = 4, next_state = 5, action = 2, reward = 0.0f0, terminal = false), (state = 5, next_state = 4, action = 1, reward = 0.0f0, terminal = false)]
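For anyone else staring at the same off-by-one, here is a toy sketch I put together (my own illustration with made-up names, not the actual Trajectory/EpisodesBuffer code) of a buffer whose state trace holds one more entry than its action trace, so that transition row i is the view (state[i], next_state = state[i+1], action[i]). It reproduces the pattern in the printout above: the very first PreEpisodeStage push only seeds the buffer, while the PreEpisodeStage push of the next episode shows up as an extra dummy row.

# Toy buffer with made-up names; the state trace has one extra slot compared to actions.
struct ToyBuffer
    states::Vector{Int}
    actions::Vector{Int}
end
ToyBuffer() = ToyBuffer(Int[], Int[])

push_state!(b::ToyBuffer, s) = push!(b.states, s)     # PreEpisodeStage / PostActStage record a state
push_action!(b::ToyBuffer, a) = push!(b.actions, a)   # PostActStage records the action taken

# A transition row only materializes once both state[i+1] and action[i] exist.
rows(b::ToyBuffer) =
    [(state = b.states[i], next_state = b.states[i + 1], action = b.actions[i])
     for i in 1:min(length(b.states) - 1, length(b.actions))]

b = ToyBuffer()
push_state!(b, 4)                       # first PreEpisodeStage: seeds the buffer, no row yet
@show length(rows(b))                   # 0 -- matches "nothing is inserted here" above
push_state!(b, 5); push_action!(b, 2)   # one environment step: the first row appears
@show rows(b)                           # [(state = 4, next_state = 5, action = 2)]
# Second episode: reset! and another PreEpisodeStage push. If that push also records a
# placeholder action (as the dummy action = 0 row in the printout suggests), it pairs the
# last state of the previous episode with the reset state:
push_state!(b, 4); push_action!(b, 0)
@show rows(b)                           # extra (state = 5, next_state = 4, action = 0) "dummy" row

If this toy model is anywhere near what the real buffer does, the extra item is simply the boundary row between two episodes rather than a problem in OfflineAgent's loop.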
Thanks for looking into this!!! I’ll dive into it tomorrow. :)
I wish I could set breakpoints in tests (I'm using VSCode), but that seems to be impossible. I read the existing Discourse threads and experimented with TestItemRunner, to no avail.
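Not something from this thread, but a possible workaround when the VSCode debugger won't attach to test runs: Infiltrator.jl's @infiltrate macro gives a REPL "breakpoint" that also works when tests are run from a plain Julia session. A minimal sketch, assuming running the tests from the REPL fits your workflow (the function and values below are made up):

using Infiltrator   # pkg> add Infiltrator

function count_up_to(limit)
    total = 0
    for i in 1:limit
        total += i
        @infiltrate total > 5   # conditional breakpoint: drops into an infil> prompt once the condition holds
    end
    return total
end

count_up_to(10)   # at the infil> prompt, inspect total and i; type @continue to resume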
FYI, we first hit the AbstractAgent push method and then jump over to the Trajectory push method.
I feel stupid now, but this is the classic case of solving a problem by talking to a rubber duck. Asking for help works just as well :-). I missed the EpisodesBuffer push! method, which is likely the one eating up a push. Digging deeper!
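To make that dispatch chain concrete, here is a toy sketch (assumed names, not the real RLCore/RLTrajectories types) of push! layered across several wrappers, where only the innermost method decides whether the container actually grows:

struct InnerBuffer            # stand-in for EpisodesBuffer
    rows::Vector{Int}
end
struct ToyTrajectory          # stand-in for Trajectory
    buf::InnerBuffer
end
struct ToyAgent               # stand-in for the Agent / OfflineAgent layer
    traj::ToyTrajectory
end

Base.push!(a::ToyAgent, x) = push!(a.traj, x)         # forwards
Base.push!(t::ToyTrajectory, x) = push!(t.buf, x)     # forwards again
Base.push!(b::InnerBuffer, x) = isnothing(x) ? b.rows : push!(b.rows, x)   # may silently swallow

agent = ToyAgent(ToyTrajectory(InnerBuffer(Int[])))
push!(agent, nothing)                  # swallowed at the innermost layer
push!(agent, 42)                       # actually lands in the container
@show length(agent.traj.buf.rows)      # 1

A printout at the outer layer still sees every call, which is exactly why a "missing" insert is so confusing from the outside.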
This seems to allow eb.traces to be empty after an insert. How can the traces be empty after an insert, though?
What I would like to figure out is who is making the decision to count or not count the first trace. If the first trace does not get inserted at all then how does it show up when I print out the traces at the end of the test?
I printed out the value of partial and it's true on the first insert of PreEpisodeStage as well as on the second (if the reset condition is triggered mid-episode).
There seems to be nothing wrong with the code; the test itself is buggy. I'm fixing the test instead.
Awesome. Merged. Thanks!
Oof, sorry, I was on vacation; I could have helped. Brave of you to dig into all that.
Thank you Henri! I learned a lot about Julia in the process.
Trying to get into reinforcement learning now, pun intended!
I have to re-run the tests a few times to get a failure, and it's always the same test that fails.