Extract reward before observation

gasse commented 3 years ago

In-between two environment steps, the following happens:

features are computed (observation function code)
an action is taken (user code)

From the perspective of an agent, if it can control the observation computation process as well as the action decision process, then those two operations can actually be fused into one, where the observation function directly returns the action which is to be taken.

Consider the following scenario: I use a configuring environment with the solving time as a reward. I want to discard the initial reward, as I do not want to account for the time it takes to load the instance file into SCIP. With the code as it is now, this will also discard the time it takes to compute the initial observation. As such, if I drop the initial reward, smart users might move all the computations required to evaluate their policy into the observation function, which will not be accounted for in the final reward.

The two operations (oi, ai) should be accounted for jointly in-between two environment steps. For example, think of the configuring environment. Previously the time was measured as follows:

t0
the environment transitions to s0
observation o0 is computed
t1
reward r0=(t1-t0) is returned, accounts for s0 and o0
action a0 is taken
the environment transitions to s1
observation o1 is computed
t2
reward r1=(t2-t1) is returned, accounts for a0, s1 and o1

It makes more sense, for credit assignment, to account for the time it takes to compute an observation and take a decision with respect to that observation within the same time period:

t0
- the environment transitions to s0
- t1
- reward r0=(t1-t0) is returned, accounts for s0
- observation o0 is computed
- action a0 is taken
- the environment transitions to s1
- t2
- reward r1=(t2-t1) is returned, accounts for o0, a0 and s1
- observation o1 is computed

dchetelat commented 3 years ago

Yes, that makes sense. I'm ok with this.

Only minor thing, I have another PR where I'm playing with this exact same code. I think there will be a merge conflict: do you want me to do your suggestion in my PR or we keep them separately and we fix the conflict later?

dchetelat commented 3 years ago

Or maybe let's merge this PR first, I need to rebase mine anyways I think.

jdumouchelle commented 3 years ago

Good idea, I agree that this way of computing makes more sense.

gasse commented 3 years ago

Merged. Thanks !

AntoinePrv commented 3 years ago

The same should be done in the C++ file.

gasse commented 3 years ago

Hmmm good point.

ds4dm / ecole

Extract reward before observation #209