-
In PPO training, I would like to apply a customized, non-parametric reward function, for instance rule-based rewards computed from textual features of the generated texts. In this case, I don't need to use rewar…
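For concreteness, here is a minimal sketch of the kind of rule-based reward I have in mind (the function name and the rules are just illustrative, and the exact way the rewards are handed to the trainer depends on its API; many PPO trainers simply expect one scalar reward tensor per sample in place of a reward-model score):

```
import torch

def rule_based_reward(text: str) -> float:
    """Illustrative rule-based reward computed purely from textual features."""
    reward = 0.0
    # Rule 1: keep responses within a length budget.
    reward += 1.0 if len(text.split()) <= 128 else -0.5
    # Rule 2: penalize a boilerplate phrase.
    if "as an ai language model" in text.lower():
        reward -= 1.0
    return reward

# One scalar reward (as a tensor) per generated sample, which is the shape
# most PPO trainers expect when the reward model is bypassed entirely.
generated_texts = ["a short, on-topic answer", "a long rambling answer ..."]
rewards = [torch.tensor(rule_based_reward(t)) for t in generated_texts]
```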
-
After some errors that led me to learn that Xcode is also a dependency for this project, I was finally able to compile the C++ files/actions when running `make train_and_show` or `make train`. However, I a…
-
[Reference](https://epolicy.dpss.lacounty.gov/epolicy/epolicy/server/general/projects_responsive/ePolicyMaster/mergedProjects/CalWORKs/CalWORKs/44-211_6_Pregnancy_Special_Need/44-211_6_Pregnancy_Speci…
-
### ❓ Question
When training PPO-Recurrent over multiple epochs, we do not update the stored LSTM states even though the LSTM weights get updated. Is there a reason for this, or is it just to save compute and…
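To make sure I understand the behaviour, here is a toy illustration of what I mean (not the library's actual code): the initial hidden state captured at rollout time is reused unchanged in every optimization epoch, while the LSTM weights themselves keep changing.

```
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.Adam(lstm.parameters(), lr=1e-3)

obs_seq = torch.randn(4, 10, 8)   # rollout observations (batch, time, features)
h0 = torch.zeros(1, 4, 16)        # hidden state stored at collection time
c0 = torch.zeros(1, 4, 16)

for epoch in range(4):
    # Re-run BPTT with the *current* weights, but from the *stored* initial
    # state; recomputing h0/c0 with the new weights would require replaying
    # everything that happened before the rollout window.
    out, _ = lstm(obs_seq, (h0, c0))
    loss = out.pow(2).mean()      # stand-in for the PPO loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```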
-
### System Info
In the init function of the new PPO trainer (renamed from PPO trainer v2), it says:
```
if ref_policy is policy:
    raise ValueError(
        "`policy` and `ref…
-
Does this support PPO with a step-level PRM? Currently I only see scripts for PPO with a token-level RM. Specifically, how can we train PPO with [OpenRLHF/Mistral-7b-PRM-Math-Shepherd](https://huggingface…
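For concreteness, this is roughly what I imagine step-level rewards would look like when mapped onto token-level PPO (the `score_steps` helper is a hypothetical placeholder for the PRM call, not an API from this repo, and the step/token bookkeeping would come from the tokenizer):

```
import torch

def score_steps(prompt: str, steps: list[str]) -> list[float]:
    # Hypothetical placeholder for the PRM forward pass
    # (e.g. a Math-Shepherd-style per-step score in [0, 1]).
    return [0.0 for _ in steps]

def step_to_token_rewards(num_response_tokens: int,
                          step_end_positions: list[int],
                          step_scores: list[float]) -> torch.Tensor:
    """Place each step's PRM score on the last token of that step,
    leaving every other token's reward at zero."""
    token_rewards = torch.zeros(num_response_tokens)
    for pos, score in zip(step_end_positions, step_scores):
        token_rewards[pos] = score
    return token_rewards

# Example: steps delimited by newlines; step_end_positions would be the index
# of the last token of each step after tokenizing the response.
response = "Step 1: 2 * 3 = 6\nStep 2: 2 + 6 = 8"
scores = score_steps("Solve 2 + 2 * 3", response.split("\n"))
```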
-
Where do the first-stage pretrained weights come from? Could you please provide them? (/pretrain/ImageNet_premodels/ppo_model.pth)
-
I'm working on a custom PPO agent where the actor learns both the mean and variance of the action distribution. To implement this, I've overridden the `get_action` method and modified the actor's `for…
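For reference, here is a minimal sketch of the structure I'm using (class and layer names are illustrative stand-ins, not my exact code, and the log-std is state-independent for simplicity): the forward pass returns the mean and standard deviation, and `get_action` samples from the resulting `Normal` and returns the summed log-probability.

```
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianActor(nn.Module):
    """Actor that learns both the mean and the (log) standard deviation."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden, act_dim)
        # Learned, state-independent log standard deviation.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor):
        mean = self.mean_head(self.body(obs))
        std = self.log_std.exp().expand_as(mean)
        return mean, std

    def get_action(self, obs: torch.Tensor):
        mean, std = self(obs)
        dist = Normal(mean, std)
        action = dist.sample()
        # Sum over action dimensions so PPO sees one log-prob per sample.
        log_prob = dist.log_prob(action).sum(dim=-1)
        return action, log_prob
```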
-
Very nice work!
I'm running PPO using the hh-rlhf dataset in the verl repo, and the error is below.
```
File "/home/syx/rlhf/verl/single_controller/ray/base.py", line 395, in func
return getattr…
-
Hi veRL team, thanks for open-sourcing this great framework. I have successfully run PPO training of qwen2-7b on 2 nodes, so I think there is no problem with my environment. But I encountered an…