Hi @fangchuan ,
RL algorithms in general require a lot of samples to learn optimal policies. Depending on the complexity of the environment you used, it may take tens of millions of steps for the A2C agent to learn a good/best policy. Note that learning from pixels will take more steps than learning from the low-dimensional observation/state information.
That said, there are ways to improve the sample efficiency of A2C, or you could even use a different algorithm that directly optimizes the policy (like PPO), which may do well depending on the type of tasks you want to solve. Also, the agent implementation in this repo may not be the most efficient in terms of performance; the code was optimized for usability and readability to help readers get up to speed quickly.
The `tcpserver : error reading message: End of file` messages you see are normal and come from the Carla server-client communication. Although it could have been handled better, the Carla client API (v0.8.x) receives that message every time the environment is reset.
If you have rendering enabled, you may see the agent driving the car in each episode. If you have turned off rendering, you can infer the progress from the stdout, specifically `mean_ep_rew` (mean episode reward). In your output it changes between episodes, which shows that the agent is learning by driving the car in the environment and receiving rewards depending on how well it drove.
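For illustration only, here is a minimal sketch of what that statistic means (not the exact code from this repo): `mean_ep_rew` is simply the running average of the total reward collected in each completed episode.

```python
# Hypothetical sketch: running statistics over completed episodes,
# matching the kind of values printed in the agent's stdout log.
episode_rewards = []  # total reward accumulated in each finished episode

def log_episode(ep_reward):
    episode_rewards.append(ep_reward)
    mean_ep_rew = sum(episode_rewards) / len(episode_rewards)
    best_ep_reward = max(episode_rewards)
    print(f"ep_reward:{ep_reward} mean_ep_rew:{mean_ep_rew} "
          f"best_ep_reward:{best_ep_reward}")
```

A steadily increasing `mean_ep_rew` is the quickest signal that training is making progress.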
I really appreciate your reply, @praveen-palanisamy. I have reviewed the code line by line, and there are some questions I want to figure out in a2c_agent.py:

1. In the calculate_loss() function:
The book recommends using the MSE loss for the critic network, while the L1 loss is used in the code. Also, the actor network uses the log odds of actions sampled from the action distribution; I don't think that is a sensible choice, since the cross entropy between the sampled actions and the action distribution is widely used in many papers. What's more, could you explain the 'use_entropy bonus'?
2. About torch.device():
In a2c_agent.py, you use torch.device('cpu') everywhere. However, in deep.py, torch.device('cuda') is used (according to the parameters in the JSON configuration file) when building the network structure. I am a bit confused. Do you mean that we should put the actor network and the critic network on the GPU, and keep the Gaussian calculation, the loss, and the action sampling on the CPU?
@fangchuan : I edited your comment to format the code better so that it is clear.
In this context of Actor-Critic methods and with reference to the code, `y` is the `td_target` and `f(x)` is the `critic_prediction`.
That said, you can uncomment line #182 and comment out line #181 in the code (the snippet in your comment) to use the MSE loss instead of the Huber loss.
> Also, the actor network uses the log odds of actions sampled from the action distribution; I don't think that is a sensible choice, since the cross entropy between the sampled actions and the action distribution is widely used in many papers.
The actor loss calculation is done directly using the policy gradient theorem. To help with understanding the (n-step advantage) actor and critic loss implementation in the code, here is a snippet pasted directly from the book (page 179, Chapter 8) as a summary:

> From the description of the n-step deep actor-critic algorithm we went over previously, you may remember that the critic, represented using a neural network, is trying to solve a problem that is similar to what we saw in Chapter 6, Implementing an Intelligent Agent for Optimal Discrete Control using Deep Q-Learning, which is to represent the value function (similar to the action-value function we used in this chapter, but a bit simpler). We can use the standard Mean Squared Error (MSE) loss or the smoother L1 loss/Huber loss, calculated based on the critic's predicted values and the n-step returns (TD targets) computed in the previous step.
>
> For the actor, we will use the results obtained with the policy gradient theorem, and specifically the advantage actor-critic version, where the advantage value function is used to guide the gradient updates of the actor policy. We will use the TD error, which is an unbiased estimate of the advantage value function. In summary, the critic's and actor's losses are as follows:
That is exactly what is implemented in the following lines: https://github.com/PacktPublishing/Hands-On-Intelligent-Agents-with-OpenAI-Gym/blob/b2f6e78f146276d802e32c8a7d0b3a2f0b698721/ch8/a2c_agent.py#L180-L181
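For illustration only, here is a minimal sketch of that loss computation; the function and variable names (and the entropy-bonus coefficient) are placeholders and not necessarily what a2c_agent.py uses:

```python
import torch
import torch.nn.functional as F

def a2c_losses(td_targets, critic_predictions, action_log_probs,
               entropies=None, entropy_coeff=0.01):
    """Sketch of n-step advantage actor-critic losses (placeholder names)."""
    # The TD error (n-step return minus predicted value) serves as an
    # estimate of the advantage; it is detached so the actor's gradients
    # do not flow into the critic.
    advantages = (td_targets - critic_predictions).detach()

    # Critic: regress predicted values towards the n-step TD targets.
    # Smooth L1 (Huber) loss here; swap in F.mse_loss for the standard MSE.
    critic_loss = F.smooth_l1_loss(critic_predictions, td_targets)

    # Actor: policy-gradient loss, -log pi(a|s) * advantage.
    actor_loss = -(action_log_probs * advantages).mean()

    # Optional entropy bonus: discourages a prematurely deterministic policy.
    if entropies is not None:
        actor_loss = actor_loss - entropy_coeff * entropies.mean()

    return actor_loss, critic_loss
```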
Regarding your second question about torch.device(): the CUDA device is used only when a GPU is available and the corresponding parameter in the JSON configuration file is set to True. If both conditions are satisfied, the tensors are transferred to that particular CUDA device to do the operations, which is especially helpful with the deep networks used for the value function approximation and the policy.
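For illustration, here is a minimal sketch of that device-selection logic; the config key name (use_cuda) is a placeholder and may differ from the actual parameter in the repo's JSON file:

```python
import torch

def get_device(params):
    """Pick the compute device from a config dict (key name is illustrative)."""
    if params.get("use_cuda", False) and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

# The actor and critic networks and the tensors they consume are then
# moved to this device, for example:
#   device = get_device(params)
#   actor.to(device); critic.to(device)
#   obs = torch.from_numpy(obs).float().to(device)
```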
I really appreciate your answer, @praveen-palanisamy
Hi, today I studied the a2c_agent.py implementation of the actor-critic algorithm. I tested it in several simple environments, and I think this implementation needs millions of steps to reach the optimal policy. Then I wanted to try it in the carla-gym environment, but I always get this kind of error message:
```
Initializing new Carla server...
Start pos 36 ([0.0, 3.0]), end 40 ([0.0, 3.0])
Starting new episode...
actor0:Episode#:0 ep_reward:0.08275407035052769 mean_ep_rew:0.08275407035052769 best_ep_reward:0.08275407035052769
ERROR: tcpserver 35940 : error reading message: End of file
Start pos 36 ([0.0, 2.0]), end 40 ([-1.0, 2.0])
Starting new episode...
actor0:Episode#:1 ep_reward:0.003965873271226895 mean_ep_rew:0.04335997181087729 best_ep_reward:0.08275407035052769
ERROR: tcpserver 35940 : error reading message: End of file
Start pos 36 ([0.0, 2.0]), end 40 ([-1.0, 2.0])
Starting new episode...
actor0:Episode#:2 ep_reward:0.010448467731475838 mean_ep_rew:0.03238947045107681 best_ep_reward:0.08275407035052769
ERROR: tcpserver 35940 : error reading message: End of file
```
It seems the program has fallen into a loop of restarting episodes. I have reviewed the code but failed to locate the bug. Could you help me?