Closed · kmakeev closed this 5 years ago
Hi @kasimte,
I've checked your PR. It looks like you added a decaying version of epsilon for the off-policy algorithms, and I have some suggestions: you actually don't need to pass an episode parameter to the `choose_action` function; just using `self.episode` is enough, because `self.episode` is updated whenever the `learn` function is invoked.
I'm not sure whether we should keep exploration in inference mode or use pure exploitation. I think we should select the action with complete certainty (i.e., greedily) in inference mode.
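A minimal sketch of both points, with illustrative names (the `evaluation` flag and the agent's attributes here are assumptions, not the repo's actual API):

```python
import numpy as np

class Agent:
    def __init__(self, n_actions, init_eps=1.0, min_eps=0.01, eps_decay=0.99):
        self.n_actions = n_actions
        self.init_eps = init_eps
        self.min_eps = min_eps
        self.eps_decay = eps_decay
        self.episode = 0  # tracked internally, so callers never pass it

    @property
    def epsilon(self):
        # Exponential decay driven by the internally tracked episode count.
        return max(self.min_eps, self.init_eps * self.eps_decay ** self.episode)

    def choose_action(self, q_values, evaluation=False):
        if evaluation:
            # Inference mode: pure exploitation, no randomness.
            return int(np.argmax(q_values))
        # Training mode: epsilon-greedy, with epsilon read from self.episode.
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(q_values))

    def learn(self, batch, episode):
        # Refreshing self.episode here keeps choose_action's epsilon in sync.
        self.episode = episode
        # ... gradient update on `batch` elided ...
```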
Thank you for your great help.
I'm going to merge this and modify it a little bit later (maybe tonight).
Added "Exploration and exploitation compromise", test for him. Used in dqn model. if accepted, can be applied to all models and used in evaluation.
Example:
```
--gym -a dqn -g -n train_using_gym --gym-env Acrobot-v1 --render-episode 10 --max-step 500 --gym-agents 4
```
Output:
```
Episode: 128 | step: 159 | last_done_step 159 | rewards: [-121. -115. -110. -158.]
Episode: 129 | step: 144 | last_done_step 144 | rewards: [-103. -134. -82. -143.]
Evaluate episode: 129 evaluate number: 100 | average step: 90 | average reward: -89.33 | SOLVED: True
Episode: 130 | step: 110 | last_done_step 110 | rewards: [ -91. -104. -109. -104.]

Process finished with exit code 0
```