ShangtongZhang / DeepRL

Modularized Implementation of Deep RL Algorithms in PyTorch
MIT License

Option Critic Q value update question #86

Closed spacegoing closed 4 years ago

spacegoing commented 4 years ago

Hi Shangtong,

I was refreshing myself on the OC implementation and found one line of code I am not certain about. According to Sutton et al., 1999, Eq. 21, should the storage.o[i] in the following line: https://github.com/ShangtongZhang/DeepRL/blob/64145c3ae755dbc47bc6b902114600ccd43c808c/deep_rl/agent/OptionCritic_agent.py#L97 be storage.q_o[i-1].gather(1, storage.prev_o[i])?

Many thanks!

spacegoing commented 4 years ago

Sorry, my bad. Please correct me if I'm wrong:

storage.prev_o: $o_{t-1}$
storage.o: $o_t \sim P(o_t \mid o_{t-1}, s_t)$
storage.q_o: $Q_\Omega(s_t, \cdot)$, the option values at $s_t$

So the current implementation is correct.
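
(A tiny illustrative example of the gather-based lookup, with made-up shapes, in case it helps anyone else reading this:)

    import torch

    num_workers, num_options = 2, 4
    # Option values Q(s_t, .), one row per worker: [num_workers, num_options]
    q_o = torch.randn(num_workers, num_options)
    # Sampled option indices o_t: [num_workers, 1]
    o = torch.randint(num_options, (num_workers, 1))
    # Selects Q(s_t, o_t) per worker -> [num_workers, 1]
    q_selected = q_o.gather(1, o)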

spacegoing commented 4 years ago

Sorry Shangtong, I got one more question.

On this line https://github.com/ShangtongZhang/DeepRL/blob/64145c3ae755dbc47bc6b902114600ccd43c808c/deep_rl/agent/OptionCritic_agent.py#L103, if I'm correct, we are using q = $Q(O_t = O_{t-1}, S_t)$ and v = $V(S_t)$. However, according to the option-critic paper, shouldn't this be q = $Q(O_{t+1} = O_t, S_{t+1})$ and v = $V(S_{t+1})$ instead?
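
(For reference, the termination-gradient term I have in mind from the paper is, roughly, $-\nabla_\nu \beta_{O_t,\nu}(S_{t+1})\,\big(Q(O_t, S_{t+1}) - V(S_{t+1})\big)$, i.e. the advantage evaluated at $S_{t+1}$ for the option $O_t$ that was active at time $t$; I may be glossing over the regularization term.)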

ShangtongZhang commented 4 years ago

Mathematically they are the same, but synchronizing the updates for the policies and the termination functions avoids needing retain_graph=True.
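
Roughly what I mean, as a toy example (the names and shapes are made up, not the agent's actual graph): when two losses come from the same forward pass through shared layers, backpropagating them together needs no retain_graph, while backpropagating them separately does.

    import torch

    x = torch.randn(4, 8)
    trunk = torch.nn.Linear(8, 8)        # shared layers
    q_head = torch.nn.Linear(8, 3)
    beta_head = torch.nn.Linear(8, 3)

    h = trunk(x)
    q_loss = q_head(h).pow(2).mean()
    beta_loss = beta_head(h).pow(2).mean()

    # Synchronized: a single backward over the shared graph.
    (q_loss + beta_loss).backward()

    # Unsynchronized (two separate backward calls) would instead require:
    #   q_loss.backward(retain_graph=True)
    #   beta_loss.backward()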

spacegoing commented 4 years ago

That makes sense. Many thanks for your answer:D

spacegoing commented 4 years ago

Hi Shangtong,

Sorry to raise this issue again; I just had a second thought. Since q and v are only used to calculate beta_adv, which is detached from the graph, we would not have to retain the graph during backward. Something like this:

    with torch.no_grad():
        prediction = self.target_network(self.states)
        epsilon = config.random_option_prob(config.num_workers)
        storage.add(prediction)
        storage.add({'eps': epsilon})
        storage.placeholder()
        # betas: [num_workers]

    for i in range(config.rollout_length):
        # v: [num_workers, 1]
        # storage.q_o[i+1].max(dim=-1, keepdim=True).values: [num_workers, 1]
        # storage.q_o[i+1].mean(-1).unsqueeze(-1): [num_workers, 1]
        v = storage.q_o[i + 1].max(dim=-1, keepdim=True).values * (1 - storage.eps[i + 1]) + \
            storage.q_o[i + 1].mean(-1).unsqueeze(-1) * storage.eps[i + 1]
        # q: [num_workers, 1]
        q = storage.q_o[i + 1].gather(1, storage.o[i])
        # beta_adv: [num_workers, 1]
        storage.beta_adv[i] = (q - v).detach() + config.beta_reg

Do you think this makes a difference, or is it an unnecessary change? I appreciate your time and patience. Would love your thoughts :D

ShangtongZhang commented 4 years ago

They are also used in computing the loss. The root cause is that q, v, and beta share layers, so one of {synchronize them, retain_graph=True, two forward passes} is necessary. retain_graph is not quite compatible with the step function.
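
As for why retain_graph does not play well with the step function, a toy illustration (made-up names, not the agent code): once the optimizer has updated the shared weights in place, tensors saved for backward in the retained graph are stale, and PyTorch complains about the in-place modification on the next backward.

    import torch

    trunk = torch.nn.Linear(8, 8)                     # shared layers
    q_head = torch.nn.Linear(8, 3)
    beta_head = torch.nn.Linear(8, 3)
    params = (list(trunk.parameters()) + list(q_head.parameters())
              + list(beta_head.parameters()))
    opt = torch.optim.SGD(params, lr=0.1)

    h = trunk(torch.randn(4, 8))
    q_loss = q_head(h).pow(2).mean()
    beta_loss = beta_head(h).pow(2).mean()

    q_loss.backward(retain_graph=True)
    opt.step()  # in-place update of the shared and head weights
    # beta_loss.backward()  # would raise: a tensor saved for backward
    #                       # (e.g. beta_head.weight) was modified in place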