Closed gasse closed 3 years ago
Yes, that makes sense. I'm ok with this.
Only minor thing, I have another PR where I'm playing with this exact same code. I think there will be a merge conflict: do you want me to do your suggestion in my PR or we keep them separately and we fix the conflict later?
Or maybe let's merge this PR first, I need to rebase mine anyways I think.
Good idea, I agree that this way of computing makes more sense.
Merged. Thanks !
The same should be done in the C++ file.
Hmmm good point.
In-between two environment steps, the following happens:
From the perspective of an agent, if it can control the observation computation process as well as the action decision process, then those two operations can actually be fused into one, where the observation function directly returns the action which is to be taken.
Consider the following scenario: I use a configuring environment with the solving time as a reward. I want to discard the initial reward, as I do not want to account for the time it takes to load the instance file into SCIP. With the code as it is now, this will also discard the time it takes to compute the initial observation. As such, if I drop the initial reward, smart users might move all the computations required to evaluate their policy into the observation function, which will not be accounted for in the final reward.
The two operations (oi, ai) should be accounted for jointly in-between two environment steps. For example, think of the configuring environment. Previously the time was measured as follows:
It makes more sense, for credit assignment, to account for the time it takes to compute an observation and take a decision with respect to that observation within the same time period: