-
In the RLHF process, are the actor, reference, critic, and reward models all 7B? Is offload enabled? I am using 4×80 GB GPUs; with offload enabled, memory usage already reaches 60 GB right after loading the models, and with batch size = 4 the GPU memory is completely full.
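For context, a minimal sketch of the kind of ZeRO-3 CPU-offload configuration this usually refers to; whether it applies depends on the framework in use, and all values here are illustrative placeholders, not the reporter's actual config:
```python
# Illustrative DeepSpeed ZeRO-3 config with CPU offload; the keys follow
# the public DeepSpeed JSON schema, but every value is a placeholder.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
}
```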
-
Typically, the PPO algorithm collects one episode of data and computes the discounted return / advantage / GAE over the whole episode to update the critic.
In a sentiment-analysis or dialogue task, what counts as an episode?
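For reference, a minimal sketch of GAE over one episode (the function name and defaults are illustrative). In common RLHF implementations, one episode is a single sampled response (prompt plus completion): each generated token is a step, and the reward-model score lands on the final token.
```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE over one episode.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), length T + 1 including a bootstrap value
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Walk backwards through the episode, accumulating the TD residuals.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + np.asarray(values[:-1])  # critic regression targets
    return advantages, returns
```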
-
**Is your feature request related to a problem? Please describe.**
We should include a tutorial for SFT. Although we have SteerLM, including an SFT tutorial is important because it is the simple…
-
When I run the inference logic using the following script, I get a `RuntimeError: No available kernel. Aborting execution.` error:
```
A100 GPU detected, using flash attention if input tensor is on…
```
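Since the script itself is truncated, this is only an assumed workaround sketch: disabling the flash backend so `scaled_dot_product_attention` can fall back to the math or memory-efficient kernels (PyTorch 2.x API):
```python
import torch
import torch.nn.functional as F

# Assumed workaround: turn off the flash backend so SDPA falls back to
# the math / memory-efficient kernels instead of raising "No available
# kernel". Tensor shapes here are arbitrary placeholders.
q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
with torch.backends.cuda.sdp_kernel(
    enable_flash=False, enable_math=True, enable_mem_efficient=True
):
    out = F.scaled_dot_product_attention(q, k, v)
```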
-
**Describe the bug**
When I follow the instructions at https://labelstud.io/tags/ranker.html to create a Ranker tag, it is not displayed in the interface.
**To Reproduce**
Steps to repr…
-
When training the PPO model, I turned on gradient_checkpointing_enable. If you want to compute the ptx loss, the actor will forward twice. In your code, these two losses are executed backward once se…
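To make the pattern being discussed concrete, a toy sketch (the model, inputs, and `ptx_coef` are placeholders, not the repository's code):
```python
import torch

# Placeholder actor and data; ptx_coef is a hypothetical weighting term.
actor = torch.nn.Linear(4, 4)
x_rl, x_ptx = torch.randn(2, 4), torch.randn(2, 4)
ptx_coef = 0.5

actor_loss = actor(x_rl).pow(2).mean()  # stand-in for the PPO policy loss
ptx_loss = actor(x_ptx).pow(2).mean()   # stand-in for the pretraining loss

# Summing into one scalar means autograd traverses both graphs in a single
# backward pass, rather than calling backward() separately on each loss.
(actor_loss + ptx_coef * ptx_loss).backward()
```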
-
### 🚀 The feature, motivation, and pitch
Add JAX support for RLHF on TPUs.
### Alternatives
_No response_
### Additional context
_No response_
-
**Bug**
Hello,
I am trying to run the summarize_rlhf example using [this blog on wandb](https://wandb.ai/carperai/summarize_RLHF/reports/Implementing-RLHF-Learning-to-Summarize-with-trlX--VmlldzozMzA…
-
### 🐛 Describe the bug
GPU: 8*A6000
CUDA Version: 11.7
Python Version: 3.8.10
colossalai Version: 0.2.8
When I train PPO with
```
torchrun --standalone --nproc_per_node=8 train_prompts.py \
…
```
-
Hi, @xujz18 @Xiao9905
Thanks for this nice contribution. I noticed that we can load ImageReward data with:
`datasets.load_dataset("THUDM/ImageRewardDB", "8k")`
However, the loaded data seem to…
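For anyone reproducing this, a quick way to inspect what the loader actually returns (the split name `"train"` is an assumption):
```python
from datasets import load_dataset

# Load the 8k subset as in the report, then inspect its structure.
ds = load_dataset("THUDM/ImageRewardDB", "8k")
print(ds)                    # available splits
print(ds["train"].features)  # column names and types
print(ds["train"][0])        # first example
```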