Why not? Do you have a benchmark for that?
I would first run a line profiler to see where the bottlenecks are, before optimizing things without knowing whether they have a real impact.
Sure, since it's a small change, I will benchmark both and see if it makes any difference.
After a quick run of line profiler, here are the results for SAC: 9% of the time is spent collecting rollouts and 91% training. Within rollout collection, these two lines take the most time:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
359 # Select action randomly or according to policy
360 2000 1017804.0 508.9 71.8 action, buffer_action = self._sample_action(learning_starts, action_noise)
361
362 # Rescale and perform action
363 2000 294299.0 147.1 20.8 new_obs, reward, done, infos = env.step(action)
One of them is hard to reduce (env.step()).
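For anyone wanting to reproduce these numbers, here is a minimal sketch of one way to wire up line_profiler (the exact script and profiled methods used for the run above are assumptions on my part):
from line_profiler import LineProfiler
from stable_baselines3 import SAC

lp = LineProfiler()
# Wrap the two methods of interest: collect_rollouts is inherited from
# OffPolicyAlgorithm, train is defined on SAC itself.
SAC.collect_rollouts = lp(SAC.collect_rollouts)
SAC.train = lp(SAC.train)

model = SAC("MlpPolicy", "MountainCarContinuous-v0")
model.learn(total_timesteps=2000)
lp.print_stats()  # prints per-line Hits / Time / % Time tables like the ones here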
In the train method, the time is spread more evenly; the most time-consuming operations (each around 10%) are:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
177 1900 1149457.0 605.0 7.9 actions_pi, log_prob = self.actor.action_log_prob(replay_data.observations)
202 1900 1017249.0 535.4 7.0 next_actions, next_log_prob = self.actor.action_log_prob(replay_data.next_observations)
220 1900 1013137.0 533.2 7.0 critic_loss.backward()
221 1900 1819593.0 957.7 12.6 self.critic.optimizer.step()
232 1900 2293261.0 1207.0 15.8 actor_loss.backward()
233 1900 1254770.0 660.4 8.7 self.actor.optimizer.step()
237 24700 410007.0 16.6 2.8 for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
238 22800 929857.0 40.8 6.4 target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)
So optimizing the polyak averaging sounds good ;)
I tested these on SAC; there is a good 1.5-1.8x speedup here, more on the GPU than on the CPU because of data transfers.
import torch as th

def fast_polyak(agent):
    # In-place update via addcmul_: no per-parameter temporaries are allocated
    # (the scalar `one` is created once per call and reused for every parameter).
    one = th.ones(1, requires_grad=False).to(agent.device)
    for param, target_param in zip(agent.critic.parameters(), agent.critic_target.parameters()):
        target_param.data.mul_(1 - agent.tau)
        target_param.data.addcmul_(param.data, one, value=agent.tau)

def slow_polyak(agent):
    # Current implementation: builds two temporary tensors per parameter, then copies.
    for param, target_param in zip(agent.critic.parameters(), agent.critic_target.parameters()):
        target_param.data.copy_((1 - agent.tau) * target_param.data + agent.tau * param.data)

# how OpenAI does it in their codebase
def openai_polyak(agent):
    # In-place, but `agent.tau * param.data` still allocates one temporary per parameter.
    for param, target_param in zip(agent.critic.parameters(), agent.critic_target.parameters()):
        target_param.data.mul_(1 - agent.tau)
        target_param.data.add_(agent.tau * param.data)
agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[32,32,32])).learn(1000)
%timeit for _ in range(10): fast_polyak(agent)
agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[32,32,32])).learn(1000)
%timeit for _ in range(10): slow_polyak(agent)
agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[32,32,32])).learn(1000)
%timeit for _ in range(10): openai_polyak(agent)
# 100 loops, best of 3: 9.61 ms per loop
# 100 loops, best of 3: 17.4 ms per loop
# 100 loops, best of 3: 12.1 ms per loop
agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[512,512,512])).learn(1000)
%timeit for _ in range(10): fast_polyak(agent)
agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[512,512,512])).learn(1000)
%timeit for _ in range(10): slow_polyak(agent)
agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[512,512,512])).learn(1000)
%timeit for _ in range(10): openai_polyak(agent)
# 100 loops, best of 3: 9.55 ms per loop
# 100 loops, best of 3: 17.4 ms per loop
# 100 loops, best of 3: 11.9 ms per loop
agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[32,32]), device='cpu').learn(1000)
%timeit for _ in range(10): fast_polyak(agent)
agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[32,32]), device='cpu').learn(1000)
%timeit for _ in range(10): slow_polyak(agent)
agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[32,32]), device='cpu').learn(1000)
%timeit for _ in range(10): openai_polyak(agent)
# 100 loops, best of 3: 2.91 ms per loop
# 100 loops, best of 3: 4.58 ms per loop
# 100 loops, best of 3: 3.82 ms per loop
This adds up: over 1 million polyak updates (each timed loop above covers 10 of them), the faster version shaves off roughly 3 minutes on CPU and 13 minutes on GPU.
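As a sanity check on that extrapolation (each %timeit loop above executes 10 updates, and I am assuming the first benchmark block is the GPU run):
# Extrapolate the per-update saving to 1 million polyak updates
updates = 1_000_000
gpu_saved = (17.4e-3 - 9.61e-3) / 10 * updates  # seconds saved on GPU
cpu_saved = (4.58e-3 - 2.91e-3) / 10 * updates  # seconds saved on CPU
print(gpu_saved / 60, cpu_saved / 60)  # ~13.0 min on GPU, ~2.8 min on CPU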
@PartiallyTyped Could you quickly try on CPU but with num_threads=1? That's the only case where I did not see an improvement yet.
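A minimal way to set that run up, assuming the same IPython session as the benchmarks above:
import torch as th

th.set_num_threads(1)  # force single-threaded CPU execution

agent = SAC("MlpPolicy", "MountainCarContinuous-v0", policy_kwargs=dict(net_arch=[32,32]), device='cpu').learn(1000)
%timeit for _ in range(10): fast_polyak(agent)
%timeit for _ in range(10): slow_polyak(agent)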
This is rather minor, but polyak averaging in DQN/SAC/TD3 could be done faster, with far fewer intermediate tensors, using torch.addcmul_ (https://pytorch.org/docs/stable/torch.html#torch.addcmul).
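For illustration, a minimal sketch of the proposed update on plain tensors (the shapes are arbitrary stand-ins for network parameters):
import torch as th

tau = 0.005
param = th.randn(256, 256)   # stand-in for a critic parameter
target = th.randn(256, 256)  # stand-in for the matching target-network parameter
one = th.ones(1)             # constant multiplier, can be created once and reused

# target <- (1 - tau) * target + tau * param, fully in place
target.mul_(1 - tau)
target.addcmul_(param, one, value=tau)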