Executing RLBase.plan! after end of experiment

Mytolo commented 1 year ago

Why do we execute RLBase.plan! after the experiment is done? Also we should test if the environment is terminated?

For environments with FULL_ACTION_SET Action_style, this yields an error as the legal_action_mask is empty. Or the policies should be updated s.t. they do not return anything when environment is terminated for plan!? BUT this behavior would also be unexpected.

src\ReinforcementLearningCore\src\policies\agent\multi_agent.jl:133:140

if check_stop(stop_condition, policy, env)
       is_stop = true
       push!(multiagent_policy, PreActStage(), env)
       optimise!(multiagent_policy, PreActStage())
       push!(multiagent_hook, PreActStage(), policy, env)
       RLBase.plan!(multiagent_policy, env)  # let the policy see the last observation
       break
end

In my opinion, this should be completely omitted: Only do

is_stop = true
break

HenriDeh commented 1 year ago

I keep writing answers to this but it's a little confusing.

Imagine the stop_condition is after 1 step. The run loop would be

PreAct: s0
plan!: a0 
<s0, a0> are in the trajectory
optimise!: nothing, we don't have a full step yet
PostAct: r0, t0
check_stop: true, we did 1 step
PreAct: s1
optimise!: nothing because we don't have a full step yet (did not call plan! to push r0 and t0)
plan!: a1
<s0,a0,r0,t0,s1,a1> are in the trajectory

It indeed makes no sense as we did not optimise even though we completed a step. But deleting the content of the if block would also prevent that optimization to happen, if it was setup correctly. Maybe we can move this the the PostEpisodeStage? This would not solve your issue. But we must push the last step in the cache. Maybe the trajectory push should not happen at plan!, to decouple the cache pushing at the action query ? Thoughts @jeremiahpslewis ?

Mytolo commented 1 year ago

In my opinion, this should be included in the normal run loop without the need to take care after the stop_condition is reached. It should be added right after the action was passed to the environment. push!(policy, PostActStage(), env) would be the perfect place for that, right?

Especially for sequential multi agent environment this off-sync (out of sync with environment steps and run loop step) causes a big problem when thinking about trajectory similarity. I recognized for pettingzoo which i included in PR #782: Although there are environments with totally same reward at end of each agents state, the trajectories include different (shifted) rewards.

Mytolo commented 1 year ago

What i basically mean is that the default run cycle of an RL Algorithm should be like:

signals from the environment / action of agent initial state $s_t$ planned action $a_t$

execution on environment of planned action $a_t$

next state $s{t+1}$ reward $r{t+1}$ terminal signal $d_{t+1}$

For policies with trajectories / experience buffer: Use push!(policy, PostActStage(), env) to add tuple $(s_t, at, r{t+1}, s{t+1}, d{t+1})$ to add to trajectory buffer.

For the sake of completness (in src/ReinforcementLearningCore/src/core/run.jl):

use push!(policy, PostActStage(), env) on line 104 and delete everything that is connected to let the policy see the last observation, namely lines 110 - 113 and change behaviour of push!(policy, PostEpisodeStage(), env) for agent / multi_agent. Moreover, to reduce push!(policy, PostEpisodeStage(), env) to just taking care of the last state is in my opinion not adequeat and is missleading: Multiple agents may communicate after each episode or there may be some sort of specail integration of the experience gained from one episode. This command should be deleted also.

I am currently working on 3 Pull Request showing this issue in detail:

Independent Q-Learning for TicTacToe
Adjustment of Pettingzoo to the new mutli agent implementation
Baseline policy for the multi particle environment simple spread

Especially in the last one, we can see that the rewards are not correctly assigned to the agents. Hence, we have to think about a generic way of collecting the signals during run. Otherwise we have to have a look on how the policies take care of it by incorporating strange and not easy understandable hacks like the one with the trajectory here. BUT in my opinion the described way is the cleanest and easiest way.

jeremiahpslewis commented 1 year ago

@HenriDeh Sorry for the delayed response. I agree that the push! sequence doesn't work as well as it might. In my project that depends on RL.jl, I implemented a modified cache and push sequence for SART policies, here: https://github.com/jeremiahpslewis/AlgorithmicCompetition.jl/blob/main/src/AIAPC2020/rt_cache_sart_qlearning.jl

Mytolo commented 1 year ago

@HenriDeh @jeremiahpslewis I edited the comment above. Please feel free to give me feedback on my thoughts. Till this is not resolved, policies may be simply not working because of falsely gathered information of the environments.

HenriDeh commented 1 year ago

Hence, we have to think about a generic way of collecting the signals during run.

I'm currently reworking Trajectories to correctly store transitions (see my mention above). It will have to be accompanied by a change of the current run loop. I will ensure to fix this issue in the process and tag you for a review.

Although I don't understand what you mean by

to just taking care of the last state is in my opinion not adequeat and is missleading: Multiple agents may communicate after each episode or there may be some sort of specail integration of the experience gained from one episode. This command should be deleted also.

I agree that we still need to query the last action for algorithms such as SARSA, but can you elaborate what you mean with multiple agents doing stuff ? Sounds like something which is algorithm-specific and should be dealt with by overloading/specializing the push method.

Mytolo commented 1 year ago

That is just a minor point, but there is a command in the _run code which states: Let the policy see the last observation for this push! method. For sure this is algorithm specific, but in my opinion in this hook there are more possibilities than just seeing the last observation. That is what i meant by missleading.

Mytolo commented 1 year ago

I'm currently reworking Trajectories to correctly store transitions (see my mention above). It will have to be accompanied by a change of the current run loop. I will ensure to fix this issue in the process and tag you for a review.

As i described it? Or in another way? Because i think the previous one was pretty hard to understand how things are stored. That can be done efficiently while keeping readability in my opinion. (And eliminates the source of failures because of not correct stored experience)

HenriDeh commented 1 year ago

Your suggestion involves storing $(s_t, at, r{t+1}, s{t+1}, d{t+1})$ at every step. The problem with that is that is stores each state twice, once when it's the next_state, then a second time when it's the current state. So it'll be a bit different. Probably PreEpisode: store s0 PreAct: plan! act! PostAct: store a0, r0, s1, d1 PostEpisode: nothing.

The cache would be cleared after plan! because we need the cached state from the previous iteration.

jpfr commented 1 year ago

@HenriDeh Sounds good. In that case the if check_stop() code would simplify just as @Mytolo suggested.

Mytolo commented 1 year ago

Your suggestion involves storing (st,at,rt+1,st+1,dt+1) at every step. The problem with that is that is stores each state twice, once when it's the next_state, then a second time when it's the current state. So it'll be a bit different. Probably PreEpisode: store s0 PreAct: plan! act! PostAct: store a0, r0, s1, d1 PostEpisode: nothing.

The cache would be cleared after plan! because we need the cached state from the previous iteration.

But wouldn't this also store the states twice? But what you suggest is what I meant. Feel free to mention me in the PR or if there are any design choices I might also provide help.

JuliaReinforcementLearning / ReinforcementLearning.jl

Executing RLBase.plan! after end of experiment #913