# Implementing Proximal Policy Optimisation
I've used some of the [PyTorch RFC](https://github.com/pytorch/rfcs/blob/master/README.md) template here for clarity.
**Authors:**
* @salmanmohammadi…
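For context, here is a minimal sketch of the clipped surrogate objective at the core of PPO, in PyTorch. The function and argument names (`ppo_clip_loss`, `old_log_probs`, `advantages`) are illustrative placeholders, not part of the RFC:
```
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate policy loss from the PPO paper (Schulman et al., 2017).

    log_probs: log pi_theta(a|s) under the current policy
    old_log_probs: log pi_theta_old(a|s) from the rollout policy (detached)
    advantages: advantage estimates, e.g. from GAE
    """
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old
    ratio = torch.exp(log_probs - old_log_probs)
    # Unclipped and clipped surrogate objectives
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (elementwise minimum) of the two; negate for gradient descent
    return -torch.min(unclipped, clipped).mean()
```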
-
By training on hypothetical world models, we may need less data from the original environment. Does our algorithm actually need fewer samples than typical RL trained directly on the real environment? Us…
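To make the question concrete, here is a minimal tabular Dyna-Q sketch of the idea: every real transition also trains a learned model, which then generates imagined updates, so fewer real samples are needed. It assumes a Gymnasium-style discrete environment; all names are illustrative:
```
import random
from collections import defaultdict

def dyna_q(env, episodes=200, planning_steps=10, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Dyna-Q: each real transition also updates a learned model,
    which then generates `planning_steps` imagined (hypothetical) updates."""
    actions = list(range(env.action_space.n))
    Q = defaultdict(float)   # Q[(state, action)]
    model = {}               # learned deterministic model: (s, a) -> (r, s', done)

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda b: Q[(s, b)]))
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            update(s, a, r, s2, done)        # direct RL from the real sample
            model[(s, a)] = (r, s2, done)    # learn the world model
            for _ in range(planning_steps):  # plan on imagined transitions
                (ps, pa), (pr, ps2, pd) = random.choice(list(model.items()))
                update(ps, pa, pr, ps2, pd)
            s = s2
    return Q
```
With `planning_steps > 0`, the same Q-values are typically reached with far fewer real episodes than plain Q-learning, which is exactly the sample-efficiency effect being asked about.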
-
**Is your feature request related to a problem? Please describe.**
Converting an HF reward model to .nemo doesn't seem to work right now. See discussion in #109 for details.
**Describe the soluti…
-
More of a question than a bug: will you be working on examples of using unsloth to train Reward Models (https://huggingface.co/docs/trl/main/en/reward_trainer) as well?
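For reference, a minimal sketch of the plain (non-unsloth) TRL `RewardTrainer` path from the linked docs; the model and dataset names here are placeholders, and exact keyword names vary across TRL versions:
```
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Reward models are trained as single-logit sequence classifiers
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

# A preference dataset with "chosen" and "rejected" columns
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = RewardConfig(output_dir="reward-model", per_device_train_batch_size=2)
trainer = RewardTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL versions use `tokenizer=` instead
)
trainer.train()
```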
-
Hi, thanks for your impactful work. :)
Recently, my coauthors and I submitted a paper, and we found that our model, Gemma-MMPO, shows state-of-the-art results among 7B DPO models (first place when …
-
Eurus-RM-7b cannot predict the score correctly.
1. I run:
```
from transformers import AutoTokenizer, AutoModel
import torch

def test(model_path):
    dataset = [  # cases in webgpt; we …
```
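For comparison, a hedged sketch of scoring a single example. My reading of the Eurus-RM-7b model card is that it ships custom modeling code (so `trust_remote_code=True` is needed) and that its forward pass returns a scalar reward, but both details are worth verifying; the input text below is illustrative:
```
from transformers import AutoTokenizer, AutoModel
import torch

model_path = "openbmb/Eurus-RM-7b"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Assumption: Eurus-RM-7b ships custom modeling code, so trust_remote_code is required
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

# Illustrative input; the expected prompt template may affect scores
text = "[INST] What is the capital of France? [/INST] Paris."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    reward = model(**inputs).item()  # assumption: forward returns a scalar reward
print(reward)
```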
-
Two new reward models are available: Ray2333/GRM-llama3-8B-distill (https://huggingface.co/Ray2333/GRM-llama3-8B-distill), Ray2333/Gemma-2B-rewardmodel-baseline (https://huggingface.co/Ray2333/Gemma-2…
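If useful for integration, a hedged sketch of scoring with one of these models, assuming it loads as a single-logit sequence classifier and that the tokenizer provides a chat template; the model choice and messages are illustrative:
```
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Ray2333/GRM-llama3-8B-distill"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Assumption: the checkpoint is a sequence-classification reward head (num_labels=1)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.float16, num_labels=1
)

messages = [
    {"role": "user", "content": "Explain photosynthesis in one sentence."},
    {"role": "assistant", "content": "Plants turn sunlight, water, and CO2 into sugar and oxygen."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
with torch.no_grad():
    score = model(input_ids).logits[0].item()  # single logit = scalar reward
print(score)
```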
-
Hello, I followed the steps outlined in "InstructVideo (CVPR 2024)." I'm trying to run the evaluation step: `bash configs/instructvideo/eval_generate_videos.sh`, but I encounter the error below. I checke…
-
Hi, I just followed your architecture and ran the code based on https://github.com/Toshihiro-Ota/decision-mamba, but the training time is unacceptable: one epoch takes 8 hours. Do you have any suggestio…