Closed troyrock closed 3 years ago
In general, you need to handle whatever rewards/statistics you want yourself. When you create a loop to run your environment and policy, you can simply keep a variable that accumulates the reward obtained from the environment at each timestep.
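A minimal sketch of that loop is below. The environment and policy interfaces here (`CoinFlipEnv`, `step`, the policy callable) are hypothetical stand-ins for illustration; substitute the actual classes from your library.

```python
import random

class CoinFlipEnv:
    """Toy stand-in environment: reward 1 with probability p, else 0."""
    def __init__(self, p=0.5, seed=0):
        self.p = p
        self.rng = random.Random(seed)

    def step(self, action):
        # Single absorbing state; only the reward is stochastic.
        reward = 1.0 if self.rng.random() < self.p else 0.0
        next_state = 0
        return next_state, reward

def run_episode(env, policy, n_steps):
    """Accumulate the reward obtained at each timestep."""
    state = 0
    total_reward = 0.0
    for _ in range(n_steps):
        action = policy(state)
        state, reward = env.step(action)
        total_reward += reward
    return total_reward

total = run_episode(CoinFlipEnv(p=1.0), lambda s: 0, n_steps=10)
print(total)  # 10.0, since with p=1.0 every step yields reward 1
```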
That said, there are a couple of classes you can use to make the process simpler. One is the Statistics class, which automatically computes the average reward per timestep and the average cumulative reward, including standard deviations. A similar one, which you probably don't need but I mention just to be safe, is the Experience class: it is mainly used in model-based RL to learn environments, but it also keeps track of things like the average reward seen per transition.
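If you end up rolling this yourself instead, the bookkeeping is small: collect one cumulative reward per episode, then report mean and standard deviation. The function below is a sketch of that idea, not the Statistics class's actual API.

```python
import statistics

def summarize_returns(returns):
    """Average cumulative reward across episodes, with standard deviation."""
    mean = statistics.mean(returns)
    # stdev needs at least two samples; report 0.0 for a single episode.
    std = statistics.stdev(returns) if len(returns) > 1 else 0.0
    return mean, std

mean, std = summarize_returns([10.0, 12.0, 8.0])
print(mean, std)  # 10.0 2.0
```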
Ah, one last thing. In case you are doing planning (say, with value iteration), if you plan for n timesteps, then the output value function effectively contains the expected return after n timesteps of following the optimal policy, for every state. This does not require running any experiments.
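To make that concrete, here is a small sketch of undiscounted value iteration on a toy deterministic MDP (the MDP and function are made up for illustration): after n sweeps starting from V = 0, each entry of V is the expected reward accumulated over n timesteps under the optimal policy from that state, with no simulation involved.

```python
def value_iteration(P, R, n_steps):
    """n sweeps of undiscounted value iteration from V = 0.

    P[s][a] = next state (deterministic for simplicity),
    R[s][a] = immediate reward.
    Returns the n-step optimal expected return for every state.
    """
    n_states = len(P)
    V = [0.0] * n_states
    for _ in range(n_steps):
        V = [max(R[s][a] + V[P[s][a]] for a in range(len(P[s])))
             for s in range(n_states)]
    return V

# Toy MDP: in state 0, action 0 earns 1 and stays put; action 1 earns 0
# and moves to absorbing state 1, which earns nothing forever.
P = [[0, 1], [1, 1]]
R = [[1.0, 0.0], [0.0, 0.0]]
print(value_iteration(P, R, 5))  # [5.0, 0.0]
```

Note that the optimal policy (keep taking action 0 in state 0) is found implicitly by the max; the value function directly gives the 5-step accumulated reward without ever running the environment.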
Thanks for the response. I was able to get exactly what I wanted out of the system. I really appreciate it.
Is it possible to extract the expected value of the reward accrued up to a specific timestep? I'm modeling the effect of different drugs on MS patients, and I would like to extract the accumulated reward (quality-adjusted life years) after n timesteps (years). Thank you.