MrSyee / pg-is-all-you-need

Policy Gradient is all you need! A step-by-step tutorial for well-known PG methods.
MIT License
848 stars 119 forks

Add sac #4

Closed Curt-Park closed 5 years ago

Curt-Park commented 5 years ago

Added the implementation without a description. A description of the method will be added to this PR soon.

You can view and run the notebooks here: SAC, DDPG.

Curt-Park commented 5 years ago

@MrSyee I fixed some issues in DDPG via 355cd13 and ace0ed5. Colab Link

@mclearning2 I hope you take a look at these changes for your reference.

Curt-Park commented 5 years ago

@MrSyee @mclearning2 We'd better not use F.tanh or F.sigmoid because they were deprecated in PyTorch 0.4.1. See the PR and release notes.

https://github.com/pytorch/pytorch/pull/8748 https://github.com/pytorch/pytorch/releases/tag/v0.4.1
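
For reference, a minimal sketch of the replacement (a throwaway tensor x, not code from the notebook):

import torch
import torch.nn.functional as F

x = torch.randn(4)

# Deprecated since PyTorch 0.4.1:
# y = F.tanh(x)
# z = F.sigmoid(x)

# Preferred replacements:
y = torch.tanh(x)
z = torch.sigmoid(x)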

mclearning2 commented 5 years ago

I think it would be better to initialize the weights and biases of the Actor and Critic networks in the same way:

def initialize_uniformly(layer: nn.Linear, init_w: float = 3e-3):
    """Initialize the weights and bias in [-init_w, init_w]."""
    layer.weight.data.uniform_(-init_w, init_w)
    layer.bias.data.uniform_(-init_w, init_w)
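
For example, it could be applied to the output layers of both networks (hypothetical layer sizes, just to illustrate the idea):

import torch.nn as nn

# Hypothetical final layers of the Actor and Critic networks.
actor_out = nn.Linear(128, 1)
critic_out = nn.Linear(128, 1)

# Same uniform initialization for both.
initialize_uniformly(actor_out)
initialize_uniformly(critic_out)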

Curt-Park commented 5 years ago

@mclearning2 I agree

mclearning2 commented 5 years ago

In the ReplayBuffer class, done_buf is initialized with a differently written shape argument. The result of np.zeros(10) is the same as np.zeros([10]), but I think it should be written consistently to be clear.

self.acts_buf = np.zeros([size], dtype=np.float32)
self.rews_buf = np.zeros([size], dtype=np.float32)
self.done_buf = np.zeros(size, dtype=np.float32)
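
A quick sanity check of the equivalence, and the consistent form (a sketch, not the notebook code):

import numpy as np

size = 10
assert np.zeros(size).shape == np.zeros([size]).shape  # both are (size,)

# Written consistently with the other buffers:
done_buf = np.zeros([size], dtype=np.float32)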

mclearning2 commented 5 years ago

In the update_model() function of the DDPGAgent class, there is no need to call .to(device) here:

masks = 1 - done
next_action = self.actor_target(next_state)
next_value = self.critic_target(next_state, next_action)
curr_return = (reward + self.gamma * next_value * masks)#.to(device)
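
A minimal check of why the call is redundant (hypothetical shapes; the point is that an op on tensors already living on device produces a result on that same device):

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-ins for tensors that update_model() has already moved to `device`.
reward = torch.ones(32, 1, device=device)
next_value = torch.randn(32, 1, device=device)
masks = torch.ones(32, 1, device=device)

curr_return = reward + 0.99 * next_value * masks  # 0.99 stands in for self.gamma
assert curr_return.device.type == device.type  # already on `device`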

Curt-Park commented 5 years ago

@mclearning2 Fixed in 87a61e0.

Curt-Park commented 5 years ago

@MrSyee @mclearning2 I added a full description of SAC. Please review.

mclearning2 commented 5 years ago

I think it would be more natural to phrase it like this:

In the paper, the authors show that Soft Policy Iteration guarantees convergence based on a tabular setting (4.1), and they extends it to a practical approximation for large continuous domains (4.2). Firstly, the soft value function is trained to minimize the squared residual error:
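
For reference, the squared residual error mentioned here is, if I recall the SAC paper (Haarnoja et al., 2018) correctly, roughly:

J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}} \Big[ \tfrac{1}{2} \big( V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi} [ Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t) ] \big)^2 \Big]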

Curt-Park commented 5 years ago

@mclearning2 I fixed "extends" but left "Firstly" as it is; it doesn't matter much either way.

Curt-Park commented 5 years ago

@MrSyee Please review this

Curt-Park commented 5 years ago

@mclearning2 @MrSyee Please give this a review score unless there is any remaining issue. I hope you don't lose track of the ongoing PRs.

mclearning2 commented 5 years ago

@Curt-Park Okay, I'll try to. I looked over your code again, and I have one minor comment on the SACAgent description. You did a good job!

The temperature parameter α determines the relative importance of the entropy term against the reward, and thus controls the ~~schochasticity~~ stochasticity of the optimal policy.
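
For context, α appears in the maximum entropy objective that SAC optimizes, roughly (from the paper, if I recall correctly):

J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \big[ r(s_t, a_t) + \alpha \, \mathcal{H}(\pi(\cdot \mid s_t)) \big]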

Curt-Park commented 5 years ago

@mclearning2 I will fix it tonight.

Curt-Park commented 5 years ago

@MrSyee thank you man

MrSyee commented 5 years ago

@Curt-Park I'm sorry for reviewing so late. I fixed the typo, and I'm testing the agent now. After the test, I'll approve and merge this branch. Thank you.