chainer / chainerrl

ChainerRL is a deep reinforcement learning library built on top of Chainer.

Argument t_max in PCL misused #236

Status: Open · lyx-x opened this issue 6 years ago

lyx-x commented 6 years ago

According to the docstring below, t_max in PCL is equivalent to update_interval in DQN.

https://github.com/chainer/chainerrl/blob/4ca100cd8f0dab5e5e333cf3bdc65fb6e6e30651/chainerrl/agents/pcl.py#L49

However, t_max defaults to None (also in the example script), and in that case the update at line 388 (below) will never be executed.

https://github.com/chainer/chainerrl/blob/4b051a4cc0e13634b9b0f521c0ad18db91019d40/chainerrl/agents/pcl.py#L388
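For illustration, here is a minimal sketch of the suspected problem, assuming the on-policy update is gated on an equality check against t_max (the names are illustrative, not the verbatim ChainerRL source):

```python
# With t_max=None, an equality check like "t - t_start == t_max" can
# never hold, so the update branch is silently skipped forever.
t_max = None
t, t_start = 0, 0
for step in range(1000):
    t += 1
    if t - t_start == t_max:   # always False when t_max is None
        print("update at step", t)
        t_start = t
# -> nothing is ever printed
```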

In addition, when sampling from the replay buffer, I think max_len should be set by a separate parameter rather than by t_max, unless we change the docstring. (I also think the original paper imposes no such limit on episode length.)

https://github.com/chainer/chainerrl/blob/4b051a4cc0e13634b9b0f521c0ad18db91019d40/chainerrl/agents/pcl.py#L284
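Roughly, the concern is that the same t_max ends up controlling two unrelated things: the update interval and the cap on sampled episode length. A hedged sketch of how these could be decoupled (the class, constructor, and argument names here are hypothetical, not the actual ChainerRL API or my pull request):

```python
class PCLSketch:
    def __init__(self, replay_buffer, batchsize=10,
                 t_max=None, max_episode_len=None):
        self.replay_buffer = replay_buffer
        self.batchsize = batchsize
        self.t_max = t_max                      # how often to run an on-policy update
        self.max_episode_len = max_episode_len  # cap on sampled episode length

    def update_from_replay(self):
        # Sample with the dedicated cap instead of reusing t_max.
        return self.replay_buffer.sample_episodes(
            self.batchsize, max_len=self.max_episode_len)
```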

muupan commented 6 years ago

When t_max is set to None, PCL is expected to work as in the original paper, i.e. to sample full episodes. When t_max is set to a non-None value, it is expected to work like ACER but with PCL's loss function.
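In other words (my paraphrase; the helper below is hypothetical pseudocode, not the ChainerRL implementation), the two modes would decide when to update roughly like this:

```python
def should_update(t, t_start, t_max, episode_done):
    """Hypothetical illustration of the two intended modes."""
    if t_max is None:
        # Original-paper mode: train only on complete episodes.
        return episode_done
    # ACER-like mode: also cut the rollout every t_max steps.
    return episode_done or (t - t_start >= t_max)

# Example: with t_max=None an update only happens at episode end;
# with t_max=16 it also happens every 16 steps within an episode.
assert should_update(t=40, t_start=24, t_max=None, episode_done=False) is False
assert should_update(t=40, t_start=24, t_max=16, episode_done=False) is True
```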

It seems the docstring of PCL is wrong, not only about t_max but also on other points. We should fix this. Thank you for pointing it out.

lyx-x commented 6 years ago

My pull request is probably not what you expected: I added a new argument for limiting the episode length. I should probably rename the argument.

By the way, the docstring of t_max in ACER may also be wrong.

https://github.com/chainer/chainerrl/blob/4b051a4cc0e13634b9b0f521c0ad18db91019d40/chainerrl/agents/acer.py#L256