toslunar opened this issue 7 years ago
The PCL implementation causes a minor error when it is used with `step_offset`, because of

`if self.t - self.t_start == self.t_max:`

in pcl.py, while `self.t` is overwritten by `train_agent`. To be precise,

`assert self.t_max is None or self.t - self.t_start <= self.t_max`

will fail at the beginning of `PCL.update_on_policy` (https://github.com/toslunar/chainerrl/commit/f8d07b385d11cd63aea03558cfc4eb1db632d370).

The implementations of A3C and ACER seem to have the same issue if they are trained by `train_agent` instead of `train_agent_async`.
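To make the failure mode concrete, here is a minimal sketch. `ToyAgent` and its methods are hypothetical stand-ins, not the actual chainerrl classes; only the two marked lines are the ones quoted from pcl.py. It assumes that resuming with `step_offset` overwrites `self.t` while `self.t_start` keeps its initial value:

```python
# Toy reproduction of the reported interaction (not the real chainerrl code).

class ToyAgent:
    def __init__(self, t_max=5):
        self.t = 0        # global step counter
        self.t_start = 0  # step at which the current on-policy rollout started
        self.t_max = t_max

    def update_on_policy(self):
        # Assertion quoted from PCL.update_on_policy.
        assert self.t_max is None or self.t - self.t_start <= self.t_max
        self.t_start = self.t  # flush the rollout

    def act_and_train(self):
        self.t += 1
        # Equality check quoted from pcl.py; it never fires once self.t has
        # been moved past the boundary from outside.
        if self.t - self.t_start == self.t_max:
            self.update_on_policy()


agent = ToyAgent(t_max=5)
agent.t = 1000            # roughly what resuming with step_offset=1000 does
agent.act_and_train()     # 1001 - 0 != 5, so the equality check is skipped
agent.update_on_policy()  # called later, e.g. at episode end -> AssertionError
```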
Good catch. The problem comes from the fact that resuming agent training via `step_offset` is not well tested.