It doesn't show the rendered gif without executing the notebook cells.
I'll add an NBviewer link to README.md. It works well.
The `step` function in `Agent` has an improper comment:

`# if the last state is not a terminal state, store done as false`
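For reference, a minimal sketch of how the corrected comment could read, assuming the common pattern where `step` stores the transition in a replay buffer (the `self.transition`, `self.memory`, and `self.is_test` names are illustrative, not taken from the notebook):

```python
def step(self, action):
    """Take an action and return the environment's response."""
    next_state, reward, done, _ = self.env.step(action)

    if not self.is_test:
        # store the transition with the actual done flag from the env,
        # rather than forcing done to false
        self.transition += [reward, next_state, done]
        self.memory.store(*self.transition)

    return next_state, reward, done
```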
In `__init__()` of `Agent`, a parameter is never used:

`self.weight_decay = 1e-6`
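If the intent was to follow the DDPG paper, that parameter would typically be passed to the critic's optimizer as an L2 penalty. A minimal sketch inside `__init__()` (the learning rate and optimizer choice here are assumptions, not the notebook's values):

```python
import torch.optim as optim

# apply L2 weight decay to the critic only, as in the DDPG paper
self.critic_optimizer = optim.Adam(
    self.critic.parameters(),
    lr=1e-3,
    weight_decay=self.weight_decay,
)
```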
I think the action is normalized to (low, high), so this comment should be changed from (-1, 1) to (low, high):

ActionNormalizer is an action wrapper class to normalize the action values into the range (-1, 1). Thanks to this class, we can make the agent simply select action values within the zero-centered range (-1, 1).
Well done. Here are some minor comments mainly about the sentences.
The author constructs an exploration => The authors construct an exploration
sampled from a noise process N to our actor policy => sampled from a noise process N to the actor policy
The authors used an Ornstein-Uhlenbeck process => The authors used the Ornstein-Uhlenbeck process
We used an Ornstein-Uhlenbeck process like the explanation in paper to generate temporally correlated exploration for exploration efficiency in physical control problems with inertia. => The Ornstein-Uhlenbeck process generates temporally correlated exploration, which effectively copes with physical control problems with inertia.
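For readers unfamiliar with it, a minimal sketch of the Ornstein-Uhlenbeck noise process (the parameter values are commonly used defaults, not necessarily the notebook's):

```python
import numpy as np

class OUNoise:
    """Temporally correlated noise: dx = theta * (mu - x) + sigma * dW."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.reset()

    def reset(self):
        # start each episode from the long-term mean
        self.state = np.copy(self.mu)

    def sample(self):
        # mean-reverting drift plus Gaussian diffusion
        dx = self.theta * (self.mu - self.state)
        dx += self.sigma * np.random.randn(len(self.state))
        self.state = self.state + dx
        return self.state
```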
We are going to use two networks for actor and critic. => We are going to use two separate networks for the actor and critic.
tanh for output layer => tanh for the output layer
Actor network => the actor network
Like actor, Critic network has three fully connected layers. In a different from actor, it used two non-linearity functions that only ReLU. Also input sizes of critic network are sum of state sizes and action sizes. => Like the actor, the critic network has three fully connected layers, but it uses ReLU for both hidden-layer activations. Plus, its input size is the sum of the state dimension and the action dimension.
The final layer weights and biases of both actor and critic are initialized from uniform distribution. => One thing to note is that we initialize the final layer's weights and biases so that they are uniformly distributed.
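To make the architecture description concrete, here is a sketch of the two networks as described above (the hidden size 128 and the init bound 3e-3 are assumptions, not taken from the notebook):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def init_uniformly(layer: nn.Linear, bound: float = 3e-3):
    """Initialize a layer's weights and biases from a uniform distribution."""
    layer.weight.data.uniform_(-bound, bound)
    layer.bias.data.uniform_(-bound, bound)

class Actor(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.out = nn.Linear(128, action_dim)
        init_uniformly(self.out)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        # tanh for the output layer keeps actions in (-1, 1)
        return torch.tanh(self.out(x))

class Critic(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        # the critic's input is the state and action concatenated
        self.fc1 = nn.Linear(state_dim + action_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.out = nn.Linear(128, 1)
        init_uniformly(self.out)

    def forward(self, state, action):
        x = torch.cat((state, action), dim=-1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.out(x)
```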
We implement the action wrapper called ActionNormalizer. It makes range of continuous action which change varience in depends on environment to be (-1, 1). Then, our agent selects only (-1, 1) range action. => ActionNormalizer is an action wrapper class to normalize the action values into the range (-1, 1). Thanks to this class, we can make the agent simply select action values within the zero-centered range (-1, 1).
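A sketch of such a wrapper, assuming a standard `gym.ActionWrapper` that maps the agent's (-1, 1) actions to the environment's (low, high) range (the exact notebook implementation may differ):

```python
import gym
import numpy as np

class ActionNormalizer(gym.ActionWrapper):
    """Rescale actions from the agent's (-1, 1) range to the env's (low, high)."""

    def action(self, action: np.ndarray) -> np.ndarray:
        low = self.action_space.low
        high = self.action_space.high
        scale = (high - low) / 2.0
        center = (high + low) / 2.0
        # map (-1, 1) -> (low, high)
        return np.clip(action * scale + center, low, high)

    def reverse_action(self, action: np.ndarray) -> np.ndarray:
        low = self.action_space.low
        high = self.action_space.high
        scale = (high - low) / 2.0
        center = (high + low) / 2.0
        # map (low, high) -> (-1, 1)
        return np.clip((action - center) / scale, -1.0, 1.0)
```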