-
### Please check that this issue hasn't been reported before.
- [X] I searched previous [Bug Reports](https://github.com/axolotl-ai-cloud/axolotl/labels/bug) and didn't find any similar reports.
###…
-
Hi there,
I noticed that even though the policy net and the value net share some parameters (in a3c/estimators.py), their gradients were [clipped](https://github.com/dennybritz/reinforcement-learning/blob/m…
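For illustration, here is a hedged PyTorch sketch (the repo itself is TensorFlow, and every name below is made up) of why this matters: with a shared trunk, clipping each loss's gradients separately and then summing them is not the same as clipping the gradient of the combined loss once.
```
import torch
import torch.nn as nn

def clip_by_global_norm(grads, max_norm):
    # scale all gradients so their joint L2 norm does not exceed max_norm
    total = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    scale = min(1.0, max_norm / (float(total) + 1e-8))
    return [g * scale for g in grads]

trunk = nn.Linear(8, 32)     # parameters shared by the policy and value heads
pi_head = nn.Linear(32, 4)
v_head = nn.Linear(32, 1)
shared = list(trunk.parameters())

obs = torch.randn(16, 8)
h = torch.tanh(trunk(obs))
pi_loss = pi_head(h).pow(2).mean()   # stand-in for the policy loss
v_loss = v_head(h).pow(2).mean()     # stand-in for the value loss

# Separate clipping (what the issue describes): each loss's gradients w.r.t. the shared
# parameters are clipped on their own, and both clipped sets are then applied.
pi_grads = clip_by_global_norm(torch.autograd.grad(pi_loss, shared, retain_graph=True), 0.5)
v_grads = clip_by_global_norm(torch.autograd.grad(v_loss, shared, retain_graph=True), 0.5)
separate = [gp + gv for gp, gv in zip(pi_grads, v_grads)]

# Joint clipping: clip the gradient of the combined loss once.
joint = clip_by_global_norm(torch.autograd.grad(pi_loss + v_loss, shared), 0.5)
```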
-
### 📚 Documentation
I obtained optimal hyperparameters for training CartPole-v1 from [RLZoo3][1]. I have created a minimal example demonstrating the performance of my CartPole agent. As per the off…
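A minimal sketch of such an example, assuming Stable-Baselines3 PPO (RL Zoo3 is built on top of SB3); the hyperparameter values below are illustrative placeholders rather than the tuned zoo values:
```
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

# vectorised CartPole-v1 environments
env = make_vec_env("CartPole-v1", n_envs=8)

# placeholder hyperparameters; substitute the values exported from the zoo config
model = PPO(
    "MlpPolicy",
    env,
    n_steps=32,
    batch_size=256,
    n_epochs=20,
    gamma=0.98,
    gae_lambda=0.8,
    ent_coef=0.0,
    verbose=1,
)
model.learn(total_timesteps=100_000)

mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")
```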
-
### What happened + What you expected to happen
I assume the error is related to the action space being large, because I cannot reproduce it when the action space is much smaller (i.e. 10 times fewer…
-
In each policy update step, the penalty function is called with the policy and the current state; this results in a gradient.
Currently I have two ideas for such a penalty function, both need a (co…
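A minimal sketch of that update scheme (all names are hypothetical, and the penalty itself is left abstract as `penalty_fn`):
```
def update_step(policy, optimizer, obs, actions, advantages, penalty_fn, penalty_coef=0.1):
    # policy(obs) is assumed to return a torch.distributions object
    dist = policy(obs)
    pg_loss = -(dist.log_prob(actions) * advantages).mean()

    # differentiable penalty evaluated on the current policy and state
    penalty = penalty_fn(policy, obs)

    loss = pg_loss + penalty_coef * penalty
    optimizer.zero_grad()
    loss.backward()        # the penalty contributes its own gradient here
    optimizer.step()
    return loss.item()
```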
-
```
def update():
    data = buf.get()
    # Get loss and info values before update
    pi_l_old, pi_info_old = compute_loss_pi(data)
    pi_l_old = pi_l_old.item()
    …
```
-
Got this error when I initially ran the policy_gradient example script.
![metadata_issue](https://github.com/osinenkop/regelum-control/assets/161316313/4a899de9-a9e7-4baa-9969-7ad3c7f0b211)
-
I replicated the experiments of pythia28 on hh (Anthropic/hh-rlhf) using the open-source code. Here are some of the experimental results:
**SFT1**:
~~~
python -u train.py exp_name=sft gradient_ac…
~~~
-
The rough idea is the following (a minimal sketch follows below):
- Share the policy network.
- Collect experience asynchronously.
- Accumulate gradients.
- Update the policy network.
**Related issues:**
#391 #438
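A minimal sketch of the idea, assuming PyTorch and `torch.multiprocessing`; the environment interaction is stubbed out with random tensors, and none of the names below come from the project itself:
```
import torch
import torch.nn as nn
import torch.multiprocessing as mp

class Policy(nn.Module):
    def __init__(self, obs_dim=4, act_dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def worker(shared_policy, steps_per_update=32):
    # each worker keeps a local copy, collects experience, and accumulates gradients locally
    opt = torch.optim.SGD(shared_policy.parameters(), lr=1e-3)
    local_policy = Policy()
    local_policy.load_state_dict(shared_policy.state_dict())  # sync with the shared network
    for _ in range(steps_per_update):
        obs = torch.randn(4)    # placeholder for an environment observation
        dist = local_policy(obs)
        action = dist.sample()
        ret = torch.randn(())   # placeholder for a computed return
        (-dist.log_prob(action) * ret).backward()  # gradients accumulate in local_policy
    # copy the accumulated gradients onto the shared network and apply the update
    for lp, sp in zip(local_policy.parameters(), shared_policy.parameters()):
        sp.grad = lp.grad.clone()
    opt.step()

if __name__ == "__main__":
    shared = Policy()
    shared.share_memory()       # keep the shared network's weights in shared memory
    workers = [mp.Process(target=worker, args=(shared,)) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```
In a real worker the random observations and returns would come from environment rollouts, and a shared-state optimizer is usually preferred so that optimizer statistics are not duplicated per worker.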
-
I am confused by your code.
In the paper, it is mentioned that a policy gradient method [1] is used. But more specifically, I think it is implemented as Actor-Critic.
If I am wrong, please tell m…
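For what it's worth, the two are not mutually exclusive: actor-critic methods are policy-gradient methods in which a learned critic supplies the baseline/advantage. A small illustrative sketch (hypothetical tensor arguments, not the paper's implementation):
```
import torch

def reinforce_loss(log_probs, returns):
    # "plain" policy gradient (REINFORCE): weight log-probabilities by the sampled return
    return -(log_probs * returns).mean()

def actor_critic_loss(log_probs, returns, values, value_coef=0.5):
    # same policy-gradient form, but weighted by the advantage (return minus the critic's
    # value estimate), plus a regression term that trains the critic itself
    advantages = (returns - values).detach()
    policy_loss = -(log_probs * advantages).mean()
    value_loss = (returns - values).pow(2).mean()
    return policy_loss + value_coef * value_loss
```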