WillNichols726 closed this issue 5 years ago
Never mind, I'm dumb. I forgot we needed to compute the discounted reward.
You can still stop early and use a forward window to approximate the discounted reward. Say you have a gamma of 0.99, a batch size of 1024, and a forward window of 512: run the environment for 1536 steps, compute the discounted rewards over all of them, then discard the last 512 steps. The returns you keep are almost the same as if you had run to the end of the episode, because anything beyond the window is discounted by at least 0.99 ** 512 ≈ 0.6%. You do lose some experiences, but that is still far preferable to running for a huge number of steps per batch.
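A minimal sketch of what I mean, assuming scalar rewards and a `done` flag per step; the function name and the `batch_size`/`window` parameters are just illustrative, not anything from this repo:

```python
import numpy as np

def truncated_discounted_returns(rewards, dones, gamma=0.99,
                                 batch_size=1024, window=512):
    """Compute discounted returns over batch_size + window steps,
    then discard the last `window` steps so the kept returns are
    nearly exact despite truncating the episode early."""
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=bool)
    total = batch_size + window
    assert len(rewards) >= total

    returns = np.zeros(total)
    running = 0.0
    # Standard backward pass for discounted returns, resetting at episode ends.
    for t in reversed(range(total)):
        if dones[t]:
            running = 0.0
        running = rewards[t] + gamma * running
        returns[t] = running

    # Keep only the first batch_size returns; whatever the truncation cuts off
    # is discounted by at least gamma ** window (~0.6% for 0.99 ** 512).
    return returns[:batch_size]
```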
In get_batch, it looks like the function won't return until an episode is complete. I've got an environment with extremely long episodes (~2e5 timesteps), and this is a huge bottleneck for me. Do you see any reason I can't just have my models update when the buffer gets filled?