-
In the RLHF workflow paper, the reward model is used to annotate new data generated by the LLM during the iterative DPO process, producing scalar reward values. According to Algorithm 1, the traditional R…
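For concreteness, here is a minimal sketch of what that annotation step could look like: a sequence-classification reward model maps each (prompt, response) pair to a single scalar. The model name, input format, and pairing heuristic below are my own illustrative assumptions, not taken from the paper.

```python
# Hedged sketch of reward-model annotation in iterative DPO (my reading of
# Algorithm 1, not the paper's code). Assumes an RM with a scalar output head;
# the OpenAssistant model below is just a convenient public example.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_name)

def score(prompt: str, response: str) -> float:
    """Map one (prompt, response) pair to a single scalar reward."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0].item()

# In the iterative loop, the best- and worst-scoring generations per prompt
# can then be paired as (chosen, rejected) data for the next DPO round.
prompt = "Explain KV caching in one sentence."
candidates = ["response A ...", "response B ..."]
ranked = sorted(candidates, key=lambda r: score(prompt, r), reverse=True)
chosen, rejected = ranked[0], ranked[-1]
```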
-
After finishing the install successfully, I got this error when running this command: python e2e_rlhf.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu
![capture](…
-
Does Optimum Neuron have support for [TRL](https://huggingface.co/docs/trl/index) supervised fine-tuning, reward modelling, and PPO using Trainium? Is TRL the best path to support RLHF?
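To make the ask concrete, this is roughly the TRL workload that would need to run on Trainium. A minimal sketch following TRL's documented SFT quickstart pattern (exact kwargs have moved between TRL versions, and the model and dataset here are placeholders):

```python
# Minimal TRL SFT smoke test; the open question is whether Optimum Neuron can
# run this (and the RewardTrainer / PPO equivalents) on Trainium hardware.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("imdb", split="train[:1%]")  # tiny slice, smoke test only

trainer = SFTTrainer(
    model="facebook/opt-350m",
    train_dataset=dataset,
    args=SFTConfig(output_dir="opt-350m-sft", dataset_text_field="text"),
)
trainer.train()
```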
-
Traceback (most recent call last):
  File "/mnt/d/ai/RLHF/test.py", line 3, in <module>
    tokenizer = AutoTokenizer.from_pretrained("/mnt/d/ai/pretrain_models/pangu", trust_remote_code=True)
  File "/hom…
-
Any plans to release the DPO code, or to give a brief intro to how you conducted long-context DPO?
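In the meantime, for anyone reading along, the core objective is small enough to sketch. This is the standard DPO loss from Rafailov et al. (2023); long-context DPO would apply the same objective to long (chosen, rejected) pairs. The inputs are assumed to be per-response summed log-probs under the policy and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps        # implicit reward, chosen
    rejected_margin = policy_rejected_logps - ref_rejected_logps  # implicit reward, rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```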
-
If this model was SFT'd from Llama 2: judging from the Llama 2 paper, Llama 2 itself does not seem to have gone through RLHF (Llama-2-chat did). Has Taiwan LLaMa 2 been trained with RLHF? If not, alignment for Traditional Chinese could be done with RLHF rather than SFT. As for the comparison dataset, you could consider generating it with ChatGPT; I wonder whether this has been tried. Thanks.
![image](https://github.com/…
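A hedged sketch of the comparison-data suggestion above: ask ChatGPT to pick the better of two candidate responses and store the result as a (chosen, rejected) pair. The model name, judging prompt, and verdict format are all illustrative assumptions:

```python
# Illustrative only: generate one preference pair with ChatGPT as the judge.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def make_comparison(prompt: str, resp_a: str, resp_b: str) -> dict:
    judge = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any ChatGPT-family model
        messages=[{
            "role": "user",
            "content": (
                f"Question: {prompt}\n\nAnswer A: {resp_a}\n\nAnswer B: {resp_b}\n\n"
                "Which answer is better Traditional Chinese? Reply with exactly A or B."
            ),
        }],
    )
    verdict = judge.choices[0].message.content.strip()
    chosen, rejected = (resp_a, resp_b) if verdict.startswith("A") else (resp_b, resp_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```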
-
Hi, XTuner Team
Could you please add a citation for the source of the Ray+vLLM-based RLHF architecture, OpenRLHF, for example in the README.md file: https://github.com/InternLM/xtuner?tab=readme-ov-fi…
-
Could you add some RLHF data?
-
**Describe the bug**
I get the following error just by changing the model from `llava1_6-mistral-7b-instruct` to `llava-onevision-qwen2-0_5b-ov` in the first DPO example [here](https://github.com/m…
-
I just noticed that we don't have an open issue for RLHF support yet. I think this is a super important feature, since recent models like Llama 2 showed that it's really worthwhile. I can also see that we…