Sorry, my bad. Please correct me if I'm wrong:
`storage.prev_o`: $o_{t-1}$
`storage.o`: $o_t \sim P(o_t \mid o_{t-1}, s_t)$
`storage.q_o`: $Q_\Omega(s_t, \cdot)$ (the option values at $s_t$)
So the current implementation is correct.
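To make the $o_t \sim P(o_t \mid o_{t-1}, s_t)$ part concrete, here is a minimal sketch of how option selection is typically realized in option-critic. This is not the repo's code; the function and argument names (`sample_option`, `q_values`, `betas`, etc.) are assumptions:

```python
import torch

def sample_option(q_values, betas, prev_option, epsilon):
    """Sketch of o_t ~ P(o_t | o_{t-1}, s_t) in option-critic.

    q_values:    [num_workers, num_options]  Q_Omega(s_t, .)
    betas:       [num_workers, 1]            termination prob. of o_{t-1} at s_t
    prev_option: [num_workers, 1]            o_{t-1}
    epsilon:     scalar                      exploration prob. over options
    """
    num_workers, num_options = q_values.shape

    # Epsilon-greedy option chosen when the previous option terminates.
    greedy = q_values.argmax(dim=-1, keepdim=True)
    random = torch.randint(num_options, (num_workers, 1))
    explore = torch.rand(num_workers, 1) < epsilon
    new_option = torch.where(explore, random, greedy)

    # With probability beta the previous option terminates and a new one is
    # picked; otherwise o_t = o_{t-1}.
    terminate = torch.rand(num_workers, 1) < betas
    return torch.where(terminate, new_option, prev_option)
```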
Sorry Shangtong, I got one more question.
On this line https://github.com/ShangtongZhang/DeepRL/blob/64145c3ae755dbc47bc6b902114600ccd43c808c/deep_rl/agent/OptionCritic_agent.py#L103, if I'm correct, we are using q = $Q(O_t = O_{t-1}, S_t)$ and v = $V(S_t)$. However, according to the option-critic paper, shouldn't this be q = $Q(O_{t+1} = O_t, S_{t+1})$ and v = $V(S_{t+1})$ instead?
Mathematically they are the same, but synchronizing the updates for the policies and the termination functions avoids retain_graph=True.
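For anyone reading along, a standalone toy example (unrelated to the repo's code) of the retain_graph issue being referred to:

```python
import torch

x = torch.randn(4, 3, requires_grad=True)

# One forward pass whose intermediate node feeds two losses
# (analogous to the policy/critic and termination terms sharing layers).
shared = (x * 2).sum(dim=-1)
loss_a = shared.mean()
loss_b = (shared ** 2).mean()

# Backpropagating the two losses separately frees the shared graph after the
# first call, so the second call fails unless retain_graph=True is passed.
loss_a.backward(retain_graph=True)
loss_b.backward()

# Synchronizing the updates (one combined backward) needs no retain_graph.
x.grad = None
shared = (x * 2).sum(dim=-1)
(shared.mean() + (shared ** 2).mean()).backward()
```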
That makes sense. Many thanks for your answer:D
Hi Shangtong,
Sorry to raise this issue again. I just had a second thought. Since q and v are only used to calculate beta_adv, which is detached from the graph, we would not have to retain the graph on backward:
with torch.no_grad():
    prediction = self.target_network(self.states)
    epsilon = config.random_option_prob(config.num_workers)
    storage.add(prediction)
    storage.add({'eps': epsilon})
    storage.placeholder()

for i in reversed(range(config.rollout_length)):
    # v: [num_workers, 1]
    # storage.q_o[i + 1].max(dim=-1, keepdim=True).values: [num_workers, 1]
    # storage.q_o[i + 1].mean(-1).unsqueeze(-1): [num_workers, 1]
    v = storage.q_o[i + 1].max(dim=-1, keepdim=True).values * (1 - storage.eps[i + 1]) + \
        storage.q_o[i + 1].mean(-1).unsqueeze(-1) * storage.eps[i + 1]
    # q: [num_workers, 1]
    q = storage.q_o[i + 1].gather(1, storage.o[i])
    # beta_adv: [num_workers, 1]
    storage.beta_adv[i] = (q - v).detach() + config.beta_reg
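For reference, reading off the snippet (with $\epsilon$ the exploration probability over options and $\xi$ standing for `config.beta_reg`), this computes the termination advantage from the option-critic paper:

$$V_\Omega(s_{t+1}) \approx (1-\epsilon)\,\max_{\omega} Q_\Omega(s_{t+1}, \omega) + \frac{\epsilon}{|\Omega|} \sum_{\omega} Q_\Omega(s_{t+1}, \omega), \qquad \texttt{beta\_adv}_t = Q_\Omega(s_{t+1}, o_t) - V_\Omega(s_{t+1}) + \xi,$$

i.e. $A_\Omega(s_{t+1}, o_t)$ plus the termination regularizer.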
Do you think this makes a difference, or is it an unnecessary change? I appreciate your time and patience. Love to hear your thoughts :D
They are also used in computing the loss. The root cause is that q, v, and beta share layers, so one of {synchronize them, retain_graph, two forward passes} is necessary. retain_graph is not quite compatible with the step function.
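To illustrate the three options {sync, retain_graph, two forward passes} given the shared-layer root cause, here is a toy sketch with hypothetical layer names and dummy losses, not the repo's architecture:

```python
import torch
import torch.nn as nn

# q and beta heads sit on top of one shared body (the root cause above).
body = nn.Linear(8, 16)
q_head = nn.Linear(16, 4)     # option values Q(s, .)
beta_head = nn.Linear(16, 4)  # termination probabilities beta(s, .)
s = torch.randn(2, 8)

# Option A: one forward pass, two separate backward passes -> retain_graph.
phi = body(s)
q_loss = q_head(phi).pow(2).mean()                 # dummy loss
beta_loss = torch.sigmoid(beta_head(phi)).mean()   # dummy loss
q_loss.backward(retain_graph=True)
beta_loss.backward()

# Option B: two forward passes -> each loss owns its graph, no retain_graph.
for p in (*body.parameters(), *q_head.parameters(), *beta_head.parameters()):
    p.grad = None
q_head(body(s)).pow(2).mean().backward()
torch.sigmoid(beta_head(body(s))).mean().backward()

# Option C: synchronize the updates, i.e. combine the terms into a single
# loss and run a single backward pass.
```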
Hi Shangtong,
I was refreshing myself on the OC implementation and found one line of code I am not certain about. According to Sutton et al., 1999, Eq. 21, should the `storage.o[i]` in the following line https://github.com/ShangtongZhang/DeepRL/blob/64145c3ae755dbc47bc6b902114600ccd43c808c/deep_rl/agent/OptionCritic_agent.py#L97 be `storage.q_o[i-1].gather(1, storage.prev_o[i])`? Many thanks!
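If I'm reading the question right, the comparison is between indexing the option values with the option selected at step i versus the option that was active on entering step i, i.e. something like the two variants below. This is a paraphrase of the question only; the exact content of #L97 is not reproduced here:

```python
# variant using the option selected at step i (the storage.o[i] the question refers to)
q = storage.q_o[i].gather(1, storage.o[i])

# variant proposed in the question, using the previous option
q = storage.q_o[i - 1].gather(1, storage.prev_o[i])
```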