🚀 Feature
There seem to be a fair few inefficiencies in the RL model code.
In both the VPG and DQN code, the network forward pass is run twice: once to generate the trajectory and then again inside the loss function.
https://github.com/PyTorchLightning/pytorch-lightning-bolts/blob/master/pl_bolts/datamodules/experience_source.py#L165
https://github.com/PyTorchLightning/pytorch-lightning-bolts/blob/master/pl_bolts/models/rl/vanilla_policy_gradient_model.py#L146
https://github.com/PyTorchLightning/pytorch-lightning-bolts/blob/master/pl_bolts/losses/rl.py#L35
Because of the way PyTorch stores the computational graph, it is sufficient to run the network once when generating the trajectory, keep the output, and compute the loss on that stored output at each training step. The current approach needlessly doubles the computational cost.
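As a rough illustration, here is a minimal sketch of the single-forward-pass pattern being suggested. The network, observation shapes, and variable names are hypothetical and not the actual bolts code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical policy network and observation, purely for illustration.
policy_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
obs = torch.randn(1, 4)

# Current pattern: one forward pass while generating the trajectory...
with torch.no_grad():
    action = torch.distributions.Categorical(logits=policy_net(obs)).sample()
# ...and a second, redundant forward pass inside the loss function.
loss = F.cross_entropy(policy_net(obs), action)

# Suggested pattern: run the network once, keep the output (and its graph),
# and compute the loss directly on that stored output.
logits = policy_net(obs)                                # single forward pass
action = torch.distributions.Categorical(logits=logits).sample()
loss = F.cross_entropy(logits, action)                  # reuses the stored graph
loss.backward()
```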
Furthermore, in both the VPG and DQN code, multiple environments are allowed, but no parallelisation is applied across them. This takes away a significant proportion of the advantage of using multiple environments in the first place (a batched sketch follows below): https://github.com/PyTorchLightning/pytorch-lightning-bolts/blob/master/pl_bolts/models/rl/vanilla_policy_gradient_model.py#L202
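One possible direction, again with hypothetical names and not the bolts code itself, is to at least batch the forward pass across environments rather than calling the network once per env in a Python loop; the env stepping could additionally be parallelised, e.g. with subprocess-based vectorised environments:

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical policy network and gym-style env list, purely for illustration.
policy_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

def act_in_all_envs(envs, observations):
    """Choose actions for every environment with a single batched forward pass,
    instead of running the network separately for each env inside a loop."""
    obs_batch = torch.as_tensor(np.stack(observations), dtype=torch.float32)  # (num_envs, obs_dim)
    with torch.no_grad():
        logits = policy_net(obs_batch)                                        # one pass for all envs
    actions = torch.distributions.Categorical(logits=logits).sample()
    # Env stepping itself is still serial here; it could be parallelised further
    # with subprocess-based vectorised environments.
    return [env.step(int(a)) for env, a in zip(envs, actions)]
```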