OpenLMLab / MOSS-RLHF · Apache License 2.0 · 1.3k stars · 101 forks
Issues
#58 · Question about reward model training in the second paper · Syaoran1 · opened 1 month ago · 0 comments
#57 · Issue when merging llama with diff to generate English policy model · foxlf823 · opened 3 months ago · 0 comments
#56 · On the linear relationship between the square root of KL divergences and rewards · shirosheep000 · closed 5 months ago · 0 comments
#55 · RM data construction · tcxia · opened 8 months ago · 1 comment
#54 · Has anyone compared this training framework to TRL? · StarrySeas1 · opened 8 months ago · 1 comment
#53 · Seeking clarification on some unclear points in the second paper · Obr00007576 · opened 8 months ago · 0 comments
#52 · The paper mentions that in the PPO pipeline the other models can be frozen and the reward model trained first until the value loss reaches 0; how exactly is this training carried out? · HCHCXY · opened 8 months ago · 1 comment
#51 · Generation of the meta dataset in Part 2 · yata0 · opened 9 months ago · 1 comment
#50 · Question about the scale of the training set · Macvh · opened 9 months ago · 1 comment
#49 · PPOSFTDataset bug report and related questions · DZ9 · opened 9 months ago · 1 comment
#48 · Question about the LM loss computation in the RM · DZ9 · opened 9 months ago · 1 comment
#47 · adding citation of part 2 · fakerbaby · closed 8 months ago · 0 comments
#46 · bash train_ppo_en.sh error · robotzheng · closed 9 months ago · 4 comments
#45 · Question about the RM contrastive learning training method in the paper · yhhh777 · opened 10 months ago · 4 comments
#44 · Issues with using the released hh dataset · jltchiu · opened 10 months ago · 2 comments
#43 · On the RM training strategy and loss function · tonylin52 · opened 10 months ago · 12 comments
#42 · Clarification on MetaRM-optimization Implementation · Benjamin-eecs · opened 10 months ago · 2 comments
#41 · release the code for training the reward model · refrain-wbh · closed 10 months ago · 0 comments
#40 · [Question] Adaptive Margin · eyuansu62 · closed 10 months ago · 3 comments
#39 · Is Mistral-7b currently supported as the base model? · YijuGuo · opened 10 months ago · 1 comment
#38 · Is it feasible to retrain the RM with our own base model and our own SFT weights? · camposs1979 · opened 11 months ago · 1 comment
#37 · Why are you not releasing your reward model for English? · AmanSinghal927 · opened 11 months ago · 1 comment
#36 · Inference with SFT and Policy EN models · henrypapadatos · opened 1 year ago · 1 comment
#35 · Question about the KL divergence in the code · rigorosyangffff · opened 1 year ago · 1 comment
#34 · Problem merging weights · red-tie · opened 1 year ago · 7 comments
#33 · Question about merging the reward model weights · HuipengXu · opened 1 year ago · 1 comment
#32 · Resource usage question · Ming-Di · opened 1 year ago · 3 comments
#31 · Is there a planned timeline for Part 2 on the reward model? · SpongebBob · opened 1 year ago · 13 comments
#30 · Any benchmark vs SFT? · guotong1988 · opened 1 year ago · 2 comments
#29 · Question about deepspeed parameter_offload · LiangZhuuu · closed 1 year ago · 1 comment
#28 · PPO GPU memory usage issue · LiangZhuuu · closed 1 year ago · 0 comments
#27 · PPO data en · borisshapa · opened 1 year ago · 1 comment
#26 · Question about reward score computation in the PPO stage · mengyanggithub · opened 1 year ago · 5 comments
#25 · typo · chosenone75 · closed 1 year ago · 1 comment
#24 · Question about merging the Chinese reward-model parameters · hannlp · opened 1 year ago · 4 comments
#23 · About setting up the environment · zjutkarma · closed 1 year ago · 2 comments
#22 · Which capabilities does reward model training cover? · yuanhuachao · opened 1 year ago · 1 comment
#21 · Some confusion about reward model scoring · hannlp · opened 1 year ago · 12 comments
#20 · English PPO data · QYHcrossover · opened 1 year ago · 1 comment
#19 · Training on 8 Nvidia RTX A6000 · Top34051 · opened 1 year ago · 1 comment
#18 · Value model vs. reward model · KUANWB · opened 1 year ago · 2 comments
#17 · PPO training stability issue · hust-kevin · opened 1 year ago · 5 comments
#16 · Script for training the reward model · wangzhao88 · opened 1 year ago · 3 comments
#15 · reward_model accuracy · mingrenbuke · opened 1 year ago · 1 comment
#14 · Training script of reward model · zwhe99 · closed 1 year ago · 2 comments
#13 · Technical report PART 2 · snowkcon · opened 1 year ago · 3 comments
#12 · High memory usage issue · QYHcrossover · closed 1 year ago · 2 comments
#11 · Reward Model · Cyber-Axe · opened 1 year ago · 2 comments
#10 · About the reward model · skepsun · closed 1 year ago · 5 comments
#9 · support lora training · akk-123 · closed 1 year ago · 1 comment