Closed: smiler80 closed this issue 5 years ago
Hi @bbacem80, good to know that you are looking into enhancements!
The actions are indeed determined based on the current policy. In that particular A2C example, the Carla environment has a continuous action space, so the actions are sampled from a continuous Multivariate Gaussian distribution whose parameters are learned by the policy.
Below are a few more lines of code (including and just above the line you quoted) that show how the continuous-valued action is selected (sampled) from the action_distribution learned by the (current) policy:
https://github.com/PacktPublishing/Hands-On-Intelligent-Agents-with-OpenAI-Gym/blob/df9ab3984237b3a02998e2c3d3df482f557945f9/ch8/a2c_agent.py#L140-L144
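For reference, here is a minimal, hypothetical sketch of the idea (the class and variable names below are illustrative, not the book's exact code): the policy network outputs the parameters of a Gaussian over the continuous action space, and the action is drawn from that distribution.

```python
import torch
from torch.distributions import MultivariateNormal


# Hypothetical policy head: maps an observation to the parameters (mean and
# diagonal covariance) of a Gaussian over the continuous action space.
class GaussianPolicyHead(torch.nn.Module):
    def __init__(self, obs_dim: int, action_dim: int):
        super().__init__()
        self.mu_layer = torch.nn.Linear(obs_dim, action_dim)
        # Learnable log standard deviation, one value per action dimension
        self.log_std = torch.nn.Parameter(torch.zeros(action_dim))

    def forward(self, obs: torch.Tensor) -> MultivariateNormal:
        mu = self.mu_layer(obs)
        cov = torch.diag(self.log_std.exp() ** 2)
        return MultivariateNormal(mu, covariance_matrix=cov)


# Action selection: sample from the distribution conditioned on the current
# policy's output for this observation (not an unconditioned random action).
policy = GaussianPolicyHead(obs_dim=8, action_dim=2)
obs = torch.randn(8)
action_distribution = policy(obs)
action = action_distribution.sample()
log_prob = action_distribution.log_prob(action)  # used later in the A2C loss
```

As the policy improves during training, the mean and covariance it predicts change, so the sampled actions track the learned behavior while still providing exploration.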
@bbacem80: Did my response above answer your questions?
@praveen-palanisamy
Many thanks.
Hello @praveen-palanisamy
I'm now evaluating several strategies for training the A2C RL agent on Carla. Since visual inspection in TensorBoard is not showing the expected progress in the returns, I'm checking the parts of the code that I could potentially enhance.
For example at this level:
https://github.com/PacktPublishing/Hands-On-Intelligent-Agents-with-OpenAI-Gym/blob/df9ab3984237b3a02998e2c3d3df482f557945f9/ch8/a2c_agent.py#L144
It seems that actions are still being sampled randomly during training; aren't they supposed to be predicted by the current policy? Did I misunderstand or miss some details?
Thanks
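As a side note on the point answered above: the call to `.sample()` draws from the distribution that the policy just predicted for the current observation, so the randomness is the stochastic policy's own exploration rather than an unconditioned random action. A minimal, hypothetical sketch (the helper name and values are illustrative, not from the book's code) of how one could contrast the stochastic training-time action with a deterministic choice, such as the distribution mean, at evaluation time:

```python
import torch
from torch.distributions import MultivariateNormal


# Hypothetical helper illustrating the difference between the stochastic
# action used during training (exploration) and a deterministic action one
# could use when evaluating a trained agent.
def select_action(action_distribution: MultivariateNormal,
                  deterministic: bool = False) -> torch.Tensor:
    if deterministic:
        # Greedy choice: the mean of the Gaussian the policy predicted
        return action_distribution.mean
    # Stochastic choice: a sample conditioned on the policy's output
    return action_distribution.sample()


# Example usage with a fixed distribution standing in for the policy output
dist = MultivariateNormal(torch.tensor([0.5, -0.2]),
                          covariance_matrix=0.1 * torch.eye(2))
train_action = select_action(dist)                     # varies run to run
eval_action = select_action(dist, deterministic=True)  # always the mean
```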