Closed: Curt-Park closed this 5 years ago
@MrSyee I fixed some issues in DDPG via 355cd13 and ace0ed5. Colab Link
@mclearning2 I hope you take a look at these changes for your reference.
@MrSyee @mclearning2 We'd better not use F.tanh or F.sigmoid because they are deprecated as of PyTorch 0.4.1. See the PR and release notes:
https://github.com/pytorch/pytorch/pull/8748 https://github.com/pytorch/pytorch/releases/tag/v0.4.1
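As a minimal sketch of what that change could look like (the Actor class and layer names here are illustrative, not the exact ones in the PR):

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Illustrative actor head using the non-deprecated activation."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.out = nn.Linear(in_dim, out_dim)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # torch.tanh replaces F.tanh, which is deprecated since PyTorch 0.4.1.
        return torch.tanh(self.out(state))
```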
I think it would be better to initialize the weights and biases of the Actor and Critic networks in the same way:
```python
import torch.nn as nn

def initialize_uniformly(layer: nn.Linear, init_w: float = 3e-3):
    """Initialize the weights and bias in [-init_w, init_w]."""
    layer.weight.data.uniform_(-init_w, init_w)
    layer.bias.data.uniform_(-init_w, init_w)
```
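A hypothetical use of the helper defined above inside both networks' constructors could look like this (the layer names are illustrative, not the exact ones in the notebook):

```python
import torch.nn as nn


class Critic(nn.Module):
    """Illustrative critic whose output layer uses the shared initializer."""

    def __init__(self, in_dim: int):
        super().__init__()
        self.hidden = nn.Linear(in_dim, 128)
        self.out = nn.Linear(128, 1)
        initialize_uniformly(self.out)  # same scheme as the Actor's output layer
```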
@mclearning2 I agree
In class ReplayBuffer, done_buf is initialized with a different shape argument, but the result of np.zeros(10) is the same as np.zeros([10]). I think it should be changed for consistency.
```python
self.acts_buf = np.zeros([size], dtype=np.float32)
self.rews_buf = np.zeros([size], dtype=np.float32)
self.done_buf = np.zeros(size, dtype=np.float32)
```
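For reference, a quick check that the two forms produce the same array:

```python
import numpy as np

# An int and a one-element list give the same 1-D shape.
assert np.zeros(10).shape == np.zeros([10]).shape == (10,)
```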
In the update_model() function of class DDPGAgent, there is no need to add .to(device) here:
```python
masks = 1 - done
next_action = self.actor_target(next_state)
next_value = self.critic_target(next_state, next_action)
# The operands already live on the device, so .to(device) is redundant.
curr_return = (reward + self.gamma * next_value * masks)#.to(device)
```
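A small illustration of why the extra call is redundant, assuming the sampled batch tensors were already moved with .to(device) beforehand, as elsewhere in the notebook (the tensors below are stand-ins, not the actual batch):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-ins for the sampled reward, the critic_target output, and the masks.
reward = torch.zeros(32, 1).to(device)
next_value = torch.zeros(32, 1).to(device)
masks = torch.ones(32, 1).to(device)

curr_return = reward + 0.99 * next_value * masks
print(curr_return.device)  # already on `device`; no trailing .to(device) needed
```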
@mclearning2 Cleaned up in 87a61e0.
@MrSyee @mclearning2 I added a full description of SAC. Please review.
I think it's more natural to phrase it like this:
In the paper, the authors show that Soft Policy Iteration guarantees convergence in a tabular setting (4.1), and they extend it to a practical approximation for large continuous domains (4.2). Firstly, the soft value function is trained to minimize the squared residual error:
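For reference, the squared residual error that sentence refers to is, in the notation of the SAC paper (Haarnoja et al., 2018, quoted here rather than taken from the PR text):

$$
J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[ \tfrac{1}{2}\left( V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}\left[ Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t) \right] \right)^2 \right]
$$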
@mclearning2 I fixed `extend`, but as for `firstly`, it doesn't matter at all.
@MrSyee Please review this
@mclearning2 @MrSyee Please give this a review score unless there is any remaining issue. I hope you don't lose attention to the ongoing PRs.
@Curt-Park Okay, I'll try it. I looked over your code again. There is one minor comment on the SACAgent description. You did a good job!
The temperature parameter α determines the relative importance of the entropy term against the reward, and thus controls the ~~schochasticity~~ stochasticity of the optimal policy.
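For context, the maximum entropy objective in which α appears (taken from the SAC paper, not from the notebook text) is:

$$
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha \, \mathcal{H}\left(\pi(\cdot \mid s_t)\right) \right]
$$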
@mclearning2 I will fix it tonight.
@MrSyee thank you man
@Curt-Park I'm sorry for reviewing so late. I fixed the typo, and I'm testing this agent. After the test, I'll approve and merge this branch. Thank you.
Added the implementation without a description. A description of the method will be added in this PR soon.
You can see and run this here: SAC, DDPG.