Hi Maxim,
First of all, thank you so much for the book! It is helping me a lot with my thesis!
Second, I think that the ExperienceReplayBuffer stores the second-to-last transition twice, which could bias training if an environment only has a few steps per episode (like mine).
Maybe I have overlooked something, but this is my minimal example showing the described behaviour:
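(A sketch of the kind of setup I mean, assuming a toy environment that counts from state 0 to 3 and terminates at state 3, a dummy agent that always picks action 0, and the classic gym reset/step API; the class names ToyEnv and DummyAgent are just illustrative.)

```python
import gym
import ptan


class ToyEnv(gym.Env):
    """Tiny environment: observation is a counter 0..3, episode ends at state 3.
    Classic gym API assumed; newer gym/gymnasium versions use different
    reset/step signatures."""

    def __init__(self):
        self.observation_space = gym.spaces.Discrete(4)
        self.action_space = gym.spaces.Discrete(1)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += 1
        done = self.state == 3
        return self.state, 1.0, done, {}


class DummyAgent(ptan.agent.BaseAgent):
    """Agent that always returns action 0 for every observation."""

    def __call__(self, states, agent_states=None):
        return [0] * len(states), agent_states


env = ToyEnv()
agent = DummyAgent()
exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=1.0, steps_count=1)
buffer = ptan.experience.ExperienceReplayBuffer(exp_source, buffer_size=100)

buffer.populate(8)         # pull roughly two short episodes' worth of transitions
for exp in buffer.buffer:  # inspect the raw stored transitions (internal list)
    print(exp)
```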
Notice how the transition from state 2 to 3 is stored twice each time. I used the ptan version currently available via pip. Could you look into this?
Best regards!