Hi, you need to make sure your model implements the chainerrl.recurrent.Recurrent interface so that it can be treated as a recurrent model. I guess the easiest way to do it is inheriting chainerrl.recurrent.RecurrentChainMixin, like

class QFunctionRecurrent(chainer.Chain, StateQFunction, RecurrentChainMixin):

which will find L.LSTM by searching recursively through chainer.Chain and chainer.ChainList.
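For example, a minimal recurrent Q-function could look like the sketch below (untested; the layer names and sizes are just placeholders, not taken from your code):

import chainer
import chainer.functions as F
import chainer.links as L
from chainerrl.action_value import DiscreteActionValue
from chainerrl.q_function import StateQFunction
from chainerrl.recurrent import RecurrentChainMixin


class QFunctionRecurrent(chainer.Chain, StateQFunction, RecurrentChainMixin):

    def __init__(self, obs_size, n_actions, n_hidden=64):
        super().__init__(
            fc=L.Linear(obs_size, n_hidden),
            lstm=L.LSTM(n_hidden, n_hidden),  # found by RecurrentChainMixin
            out=L.Linear(n_hidden, n_actions),
        )

    def __call__(self, x, test=False):
        h = F.relu(self.fc(x))
        h = self.lstm(h)  # the LSTM keeps its internal state between calls
        return DiscreteActionValue(self.out(h))

With this, the agent can manage the recurrent state (e.g. reset_state) through the Recurrent interface.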
Documentation on the usage of recurrent models is largely missing, so I opened another issue for it: #83. Thanks for reporting the issue!
Hi, thank you very much for your fast response! That indeed solved the issue.
Just for clarification:
If episodic_replay=True, then:
minibatch_size corresponds to the number of episodes used for the experience replay
and
episodic_update_len corresponds to the number of time steps used within each of those episodes, right?
Thus, if one episode has e.g. 50 time steps and episodic_update_len=16, will it draw 16 consecutive time steps from this episode for replay? Furthermore, if episodic_update_len=None, will it use all time steps within this episode?
Thanks again!
Unfortunately, my above statement about the issue being solved was a bit hasty.
I modified the Q-function, i.e.:
class QFunction(chainer.Chain, StateQFunction, RecurrentChainMixin):

    def __init__(self, n_input_channels=3, n_actions=4, bias=0.1):
        self.n_actions = n_actions
        self.n_input_channels = n_input_channels
        conv_layers = chainer.ChainList(
            L.Convolution2D(n_input_channels, 32, 8, stride=4, bias=bias),
            L.Convolution2D(32, 64, 4, stride=2, bias=bias),
            L.Convolution2D(64, 64, 3, stride=1, bias=bias),
            L.Convolution2D(64, 128, 7, stride=1, bias=bias),
        )
        lstm_layer = L.LSTM(128, 128)
        a_stream = MLP(128, n_actions, [2])
        v_stream = MLP(128, 1, [2])
        super().__init__(conv_layers=conv_layers, lstm_layer=lstm_layer,
                         a_stream=a_stream, v_stream=v_stream)

    def __call__(self, x, test=False):
        """
        Args:
            x (ndarray or chainer.Variable): An observation
            test (bool): a flag indicating whether it is in test mode
        """
        h = x
        for l in self.conv_layers:
            h = F.relu(l(h))
        h = self.lstm_layer(h)
        batch_size = x.shape[0]
        ya = self.a_stream(h, test=test)
        mean = F.reshape(F.sum(ya, axis=1) / self.n_actions, (batch_size, 1))
        ya, mean = F.broadcast(ya, mean)
        ya -= mean
        ys = self.v_stream(h, test=test)
        ya, ys = F.broadcast(ya, ys)
        q = ya + ys
        return chainerrl.action_value.DiscreteActionValue(q)
with
episodic_replay = True
minibatch_size = 4
episodic_update_len = None
but still, the average_loss is 0, whereas it isn't in the non-recurrent case.
However, there is a good chance that this is due to some bug in my code.
Below I have attached the full source code (without the environment), based on the train_dqn_gym.py you have provided.
Thanks again!
def main():
    import logging
    logging.basicConfig(level=logging.INFO)

    parser = argparse.ArgumentParser()
    parser.add_argument('--outdir', type=str, default='dqn_out')
    parser.add_argument('--env', type=str, default='Pendulum-v0')
    parser.add_argument('--seed', type=int, default=None)
    parser.add_argument('--gpu', type=int, default=0)
    parser.add_argument('--final-exploration-steps',
                        type=int, default=1000 * 50)
    parser.add_argument('--start-epsilon', type=float, default=1.0)
    parser.add_argument('--end-epsilon', type=float, default=.05)
    parser.add_argument('--demo', action='store_true', default=False)
    parser.add_argument('--load', type=str, default=None)
    parser.add_argument('--steps', type=int, default=500000)
    parser.add_argument('--prioritized-replay', action='store_true')
    parser.add_argument('--episodic-replay', type=bool, default=True)
    parser.add_argument('--replay-start-size', type=int, default=None)
    parser.add_argument('--target-update-frequency', type=int, default=1)
    parser.add_argument('--target-update-method', type=str, default='soft')
    parser.add_argument('--soft-update-tau', type=float, default=0.001)
    parser.add_argument('--update-frequency', type=int, default=1)
    parser.add_argument('--eval-n-runs', type=int, default=10)
    parser.add_argument('--eval-frequency', type=int, default=50 * 10)
    parser.add_argument('--n-hidden-channels', type=int, default=100)
    parser.add_argument('--n-hidden-layers', type=int, default=2)
    parser.add_argument('--gamma', type=float, default=0.99)
    parser.add_argument('--minibatch-size', type=int, default=None)
    parser.add_argument('--render-train', action='store_true')
    parser.add_argument('--render-eval', action='store_true')
    parser.add_argument('--monitor', action='store_true')
    parser.add_argument('--reward-scale-factor', type=float, default=.1)
    args = parser.parse_args()

    args.outdir = experiments.prepare_output_dir(
        args, args.outdir, argv=sys.argv)
    print('Output files are saved in {}'.format(args.outdir))

    if args.seed is not None:
        misc.set_random_seed(args.seed)

    def clip_action_filter(a):
        return np.clip(a, action_space.low, action_space.high)

    def make_env(for_eval):
        env = gym.make(args.env)
        if args.monitor:
            env = gym.wrappers.Monitor(env, args.outdir)
        if isinstance(env.action_space, spaces.Box):
            misc.env_modifiers.make_action_filtered(env, clip_action_filter)
        if not for_eval:
            misc.env_modifiers.make_reward_filtered(
                env, lambda x: x * args.reward_scale_factor)
        if ((args.render_eval and for_eval) or
                (args.render_train and not for_eval)):
            misc.env_modifiers.make_rendered(env)
        return env

    env = make_env(for_eval=False)
    timestep_limit = env.spec.tags.get(
        'wrapper_config.TimeLimit.max_episode_steps')
    obs_size = env.observation_space.low.size
    action_space = env.action_space
    n_actions = action_space.n

    class QFunction(chainer.Chain, StateQFunction, RecurrentChainMixin):

        def __init__(self, n_input_channels=3, n_actions=4, bias=0.1):
            self.n_actions = n_actions
            self.n_input_channels = n_input_channels
            conv_layers = chainer.ChainList(
                L.Convolution2D(n_input_channels, 32, 8, stride=4, bias=bias),
                L.Convolution2D(32, 64, 4, stride=2, bias=bias),
                L.Convolution2D(64, 64, 3, stride=1, bias=bias),
                L.Convolution2D(64, 128, 7, stride=1, bias=bias),
            )
            lstm_layer = L.LSTM(128, 128)
            a_stream = MLP(128, n_actions, [2])
            v_stream = MLP(128, 1, [2])
            super().__init__(conv_layers=conv_layers, lstm_layer=lstm_layer,
                             a_stream=a_stream, v_stream=v_stream)

        def __call__(self, x, test=False):
            """
            Args:
                x (ndarray or chainer.Variable): An observation
                test (bool): a flag indicating whether it is in test mode
            """
            h = x
            for l in self.conv_layers:
                h = F.relu(l(h))
            h = self.lstm_layer(h)
            batch_size = x.shape[0]
            ya = self.a_stream(h, test=test)
            mean = F.reshape(
                F.sum(ya, axis=1) / self.n_actions, (batch_size, 1))
            ya, mean = F.broadcast(ya, mean)
            ya -= mean
            ys = self.v_stream(h, test=test)
            ya, ys = F.broadcast(ya, ys)
            q = ya + ys
            return chainerrl.action_value.DiscreteActionValue(q)

    explorer = explorers.LinearDecayEpsilonGreedy(
        args.start_epsilon, args.end_epsilon, args.final_exploration_steps,
        action_space.sample)

    q_func = QFunction(3, 4)
    opt = optimizers.Adam()
    opt.setup(q_func)

    rbuf_capacity = 100000
    if args.episodic_replay:
        print('episodic replay')
        if args.minibatch_size is None:
            args.minibatch_size = 4
        if args.replay_start_size is None:
            args.replay_start_size = 10
        if args.prioritized_replay:
            betasteps = \
                (args.steps - timestep_limit * args.replay_start_size) \
                // args.update_frequency
            rbuf = replay_buffer.PrioritizedEpisodicReplayBuffer(
                rbuf_capacity, betasteps=betasteps)
        else:
            rbuf = replay_buffer.EpisodicReplayBuffer(rbuf_capacity)
    else:
        if args.minibatch_size is None:
            args.minibatch_size = 32
        if args.replay_start_size is None:
            args.replay_start_size = 1000
        if args.prioritized_replay:
            betasteps = (args.steps - args.replay_start_size) \
                // args.update_frequency
            rbuf = replay_buffer.PrioritizedReplayBuffer(
                rbuf_capacity, betasteps=betasteps)
        else:
            rbuf = replay_buffer.ReplayBuffer(rbuf_capacity)

    def phi(obs):
        # Reorder axes and scale pixel values to [0, 1]
        return (np.swapaxes(obs, 0, 2).astype(np.float32)) / 255.

    gym.undo_logger_setup()  # Turn off gym's default logger settings
    logging.basicConfig(level=logging.DEBUG, stream=sys.stdout, format='')

    agent = DoubleDQN(q_func, opt, rbuf, gpu=args.gpu, gamma=args.gamma,
                      explorer=explorer,
                      replay_start_size=args.replay_start_size,
                      target_update_interval=args.target_update_frequency,
                      update_interval=args.update_frequency,
                      phi=phi, minibatch_size=args.minibatch_size,
                      target_update_method=args.target_update_method,
                      soft_update_tau=args.soft_update_tau,
                      episodic_update=args.episodic_replay,
                      episodic_update_len=None)

    if args.load:
        agent.load(args.load)

    eval_env = make_env(for_eval=True)

    if args.demo:
        mean, median, stdev = experiments.eval_performance(
            env=eval_env,
            agent=agent,
            n_runs=args.eval_n_runs,
            max_episode_len=50)
        print('n_runs: {} mean: {} median: {} stdev: {}'.format(
            args.eval_n_runs, mean, median, stdev))
    else:
        experiments.train_agent_with_evaluation(
            agent=agent, env=env, steps=args.steps,
            eval_n_runs=args.eval_n_runs, eval_interval=args.eval_frequency,
            outdir=args.outdir, eval_env=eval_env,
            max_episode_len=50)


if __name__ == '__main__':
    main()
@kfeeeeee Can you give me the complete code (including the import statements) and the command line arguments?
@muupan Sure thing.
Gym-Environment: gridworld.py
Train-Script: train_dqn_gym.py
And command execution:
python train_dqn_gym.py --env 'Gridworld-v0'
For the rest I use the default args defined in train_dqn_gym.py
Thanks again for looking into this!
Thanks for your code!
Just for clarification:
If episodic_replay=True, then:
minibatch_size corresponds to the number of episodes used for the experience replay
and
episodic_update_len corresponds to the number of time steps used within each of those episodes, right?
Thus, if one episode has e.g. 50 time steps and episodic_update_len=16, will it draw 16 consecutive time steps from this episode for replay? Furthermore, if episodic_update_len=None, will it use all time steps within this episode?
You are correct. minibatch_size is the number of episodes to sample for an update. Each sampled episode's length is at most episodic_update_len.
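As an illustrative sketch (not your exact script), settings like the following would sample 4 episodes per update and use at most 16 consecutive time steps from each of them:

agent = DoubleDQN(
    q_func, opt, rbuf,
    gamma=0.99,
    explorer=explorer,
    minibatch_size=4,        # number of episodes sampled per update
    episodic_update=True,    # sample (sub-)episodes from the replay buffer
    episodic_update_len=16,  # use at most 16 consecutive steps per episode
)

With episodic_update_len=None, each sampled episode is used in full.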
As for average_loss, it turned out to be a bug in ChainerRL. Losses are computed and the model is updated as usual. However, the value of average_loss is not updated at all when episodic_update=True. I'll open an issue for it and fix it soon. Thanks for reporting it!
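Until the fix is in, one rough way to convince yourself that the model is still being updated (despite the reported average_loss of 0) is to snapshot a parameter and compare it after some training steps. A sketch, using the lstm_layer name from your code:

import numpy as np
from chainer import cuda

# Copy one LSTM weight before training, then compare after some updates
before = np.copy(cuda.to_cpu(q_func.lstm_layer.upward.W.data))
# ... let the agent act and update for a while ...
after = cuda.to_cpu(q_func.lstm_layer.upward.W.data)
print('parameters changed:', not np.allclose(before, after))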
Thank you very much for your effort!
Trying these two different Q-functions:
(non-recurrent)
(recurrent)
I found that for the non-recurrent version the loss is not zero and the agent will eventually master the provided gym environment.
However, after changing nothing other than adding an LSTM layer and setting episodic_replay to True, the average_loss becomes 0 all the time and the agent is not able to learn to interact better with its environment.
First, I thought that this was due to some kind of rounding issue, so I set minibatch_size=1 and episodic_update_len=1 (assuming that one episodic replay would now only contain one time step), but still no change.
I wonder if this is some kind of bug or (which I think is more likely) an error on my side.
Any help is very much appreciated!