DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io

[enhancement] Polyak Averaging could be done faster #93

Closed: m-rph closed this issue 4 years ago

m-rph commented 4 years ago

This is rather minor, but polyak averaging in DQN/SAC/TD3 could be done faster, with far fewer intermediate tensors, using torch.addcmul_ (https://pytorch.org/docs/stable/torch.html#torch.addcmul), roughly as sketched below.
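
A minimal sketch of the idea (tau and the two parameter tensors are placeholders, not the actual SB3 code):

import torch as th

tau = 0.005
param = th.randn(256, 256)         # stands in for an online-network parameter
target_param = th.randn(256, 256)  # stands in for the matching target-network parameter

# current approach: allocates intermediate tensors on every update
target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

# with addcmul_: two in-place ops, no per-update intermediate tensors
one = th.ones(1, requires_grad=False)
target_param.data.mul_(1 - tau)
target_param.data.addcmul_(param.data, one, value=tau)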

araffin commented 4 years ago

Why not, do you have a benchmark for that?

I would first run line profiler and see where the bottlenecks are before optimizing things without knowing whether they have a real impact.
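
For reference, a rough sketch of how such a line_profiler run could look in IPython (the env id and the profiled method are assumptions, not necessarily the exact setup used):

# pip install line_profiler
from stable_baselines3 import SAC

model = SAC("MlpPolicy", "Pendulum-v0")

%load_ext line_profiler
# line-by-line timing of the train() method while running a short learn() call
%lprun -f model.train model.learn(total_timesteps=2000)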

m-rph commented 4 years ago

Sure, since it's a small change I will benchmark both and see if it makes any difference.

araffin commented 4 years ago

After a quick run of line profiler, here are the results for SAC: 9% of the time is spent collecting rollouts and 91% training. In collecting rollouts, these two lines take the most time:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   359                                                           # Select action randomly or according to policy
   360      2000    1017804.0    508.9     71.8                  action, buffer_action = self._sample_action(learning_starts, action_noise)
   361                                           
   362                                                           # Rescale and perform action
   363      2000     294299.0    147.1     20.8                  new_obs, reward, done, infos = env.step(action)

One of them is hard to reduce (env.step()).

In the train method, the time is more evenly distributed; the most time-consuming operations (each around 10%) are:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   177      1900    1149457.0    605.0      7.9              actions_pi, log_prob = self.actor.action_log_prob(replay_data.observations)

   202      1900    1017249.0    535.4      7.0                  next_actions, next_log_prob = self.actor.action_log_prob(replay_data.next_observations)

   220      1900    1013137.0    533.2      7.0              critic_loss.backward()
   221      1900    1819593.0    957.7     12.6              self.critic.optimizer.step()

   232      1900    2293261.0   1207.0     15.8              actor_loss.backward()
   233      1900    1254770.0    660.4      8.7              self.actor.optimizer.step()

   237     24700     410007.0     16.6      2.8                  for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
   238     22800     929857.0     40.8      6.4                      target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

So optimizing the polyak averaging sounds good ;)

m-rph commented 4 years ago

I tested these on SAC; there is a good 1.5-1.8x speedup here. The gain is larger on GPU than on CPU because of data transfers.

import torch as th
from stable_baselines3 import SAC

# proposed: in-place ops via addcmul_, no intermediate tensors per parameter
def fast_polyak(agent):
    one = th.ones(1, requires_grad=False).to(agent.device)
    for param, target_param in zip(agent.critic.parameters(), agent.critic_target.parameters()):
        target_param.data.mul_(1 - agent.tau)
        target_param.data.addcmul_(param.data, one, value=agent.tau)

# current SB3 approach: builds intermediate tensors, then copies the result
def slow_polyak(agent):
    for param, target_param in zip(agent.critic.parameters(), agent.critic_target.parameters()):
        target_param.data.copy_((1 - agent.tau) * target_param.data + agent.tau * param.data)

# how OpenAI does it in their codebase: one in-place op, one intermediate tensor
def openai_polyak(agent):
    for param, target_param in zip(agent.critic.parameters(), agent.critic_target.parameters()):
        target_param.data.mul_(1 - agent.tau)
        target_param.data.add_(agent.tau * param.data)

agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[32, 32, 32])).learn(1000)
%timeit for _ in range(10): fast_polyak(agent)
agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[32, 32, 32])).learn(1000)
%timeit for _ in range(10): slow_polyak(agent)
agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[32, 32, 32])).learn(1000)
%timeit for _ in range(10): openai_polyak(agent)

# 100 loops, best of 3: 9.61 ms per loop
# 100 loops, best of 3: 17.4 ms per loop
# 100 loops, best of 3: 12.1 ms per loop

agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[512, 512, 512])).learn(1000)
%timeit for _ in range(10): fast_polyak(agent)
agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[512, 512, 512])).learn(1000)
%timeit for _ in range(10): slow_polyak(agent)
agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[512, 512, 512])).learn(1000)
%timeit for _ in range(10): openai_polyak(agent)

# 100 loops, best of 3: 9.55 ms per loop
# 100 loops, best of 3: 17.4 ms per loop
# 100 loops, best of 3: 11.9 ms per loop

agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[32, 32]), device='cpu').learn(1000)
%timeit for _ in range(10): fast_polyak(agent)
agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[32, 32]), device='cpu').learn(1000)
%timeit for _ in range(10): slow_polyak(agent)
agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[32, 32]), device='cpu').learn(1000)
%timeit for _ in range(10): openai_polyak(agent)

# 100 loops, best of 3: 2.91 ms per loop
# 100 loops, best of 3: 4.58 ms per loop
# 100 loops, best of 3: 3.82 ms per loop

This is actually quite significant: at 1 million polyak updates, this shaves off 28 minutes on CPU and 2 hours 11 minutes on GPU.

araffin commented 4 years ago

@PartiallyTyped Could you quickly try on CPU but with num_threads=1?

That's the only case where I did not see an improvement yet.
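
For reference, pinning PyTorch to a single CPU thread for that test could look like this (a sketch reusing the benchmark functions above, not a prescribed setup):

import torch as th
th.set_num_threads(1)  # restrict intra-op parallelism to a single CPU thread

agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[32, 32]), device='cpu').learn(1000)
%timeit for _ in range(10): fast_polyak(agent)
%timeit for _ in range(10): slow_polyak(agent)
%timeit for _ in range(10): openai_polyak(agent)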