-
When attempting to run the [stack_llama](https://github.com/lvwerra/trl/tree/main/examples/stack_llama/scripts) example, I was able to run the first two steps:
`torchrun --nnodes=1 --nproc_per_nod…
-
Hi @Ram81,
Do you have model weights trained on the hm3d dataset?
If so, could you share them with us?
Thanks!
-
Have you released the dataset, or could you explain how to download it? Thanks!
-
Firstly, thanks for your innovative and excellent work! I got an error when I tried to reproduce the results of the paper (in the pretraining stage).
Could you please help me? Of course, I'll try…
-
I tried to deploy the PPO and ILQL algorithms with the same bloom3B model under examples/summarize_rlhf/, and changed the reward model to a naive calculation. My GPU is an A100 with 32 GB.
I need to adjust t…
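For reference, here is a minimal sketch of what swapping the learned reward model for a naive calculation can look like in a trlx PPO run. The reward logic, prompts, config path, and model override are placeholders/assumptions, not the repository's actual setup, and the exact `reward_fn` signature may differ across trlx versions:

```python
import trlx
from trlx.data.configs import TRLConfig

def naive_reward_fn(samples, **kwargs):
    # Placeholder reward: prefer shorter outputs (purely illustrative).
    return [-float(len(s.split())) for s in samples]

# Path and model name are assumed for illustration.
config = TRLConfig.load_yaml("configs/ppo_config.yml")
config.model.model_path = "bigscience/bloom-3b"

trainer = trlx.train(
    reward_fn=naive_reward_fn,
    prompts=["Summarize: ..."] * 8,
    config=config,
)
```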
-
Hi, thanks for the nice work. I'm trying to reproduce the results reported in the paper. However, I couldn't find details about the training parameters (e.g. learning rate, number of epochs) of the second stage f…
-
Thanks for the great work! Would it be possible to share the code for the whole RL fine-tuning framework (Actor & Critic updates based on the reward defined in the paper) for better reproducibility? For…
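To make the request concrete, below is a minimal sketch (not the authors' code) of how actor/critic PPO updates on a scalar reward are typically wired up with the `trl` library; the model name is a placeholder, the reward is a constant stand-in for the paper's reward, and the exact API may vary across trl versions:

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # placeholder policy, not the paper's model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# The LM weights act as the actor; the attached value head acts as the critic.
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1), model, ref_model, tokenizer)

query = tokenizer("Describe the scene:", return_tensors="pt").input_ids[0]
response = ppo_trainer.generate(query, return_prompt=False, max_new_tokens=20)[0]

# A constant stands in for the paper's reward definition.
reward = [torch.tensor(1.0)]
stats = ppo_trainer.step([query], [response], reward)
```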
-
A large part of building the assistant is teaching it to follow instructions. While training with RLHF seems like the main ingredient, there are already prepared supervised instruction-following datase…
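As a quick sketch of what using one of these prepared datasets could look like, the snippet below loads an instruction dataset and flattens it into prompt/response strings for supervised fine-tuning; the dataset name, field names, and prompt template are illustrative assumptions:

```python
from datasets import load_dataset

ds = load_dataset("databricks/databricks-dolly-15k", split="train")

def to_text(example):
    # Fold optional context into the prompt, then format a single training string.
    prompt = example["instruction"]
    if example["context"]:
        prompt += "\n" + example["context"]
    return {"text": f"User: {prompt}\nAssistant: {example['response']}"}

sft_ds = ds.map(to_text, remove_columns=ds.column_names)
print(sft_ds[0]["text"][:200])
```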
-
# Action Plan for ML-Team
### 1. Data mixes
- [ ] create a list of all datasets under consideration for OA SFT, identify datasets that need further processing (e.g. multi-turn and need to be con…
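As a rough illustration of what a data mix could look like in practice, here is a small sketch that interleaves two instruction datasets with explicit sampling probabilities using the `datasets` library; the dataset names and weights are placeholders, not the actual OA SFT mix:

```python
from datasets import interleave_datasets, load_dataset

alpaca = load_dataset("tatsu-lab/alpaca", split="train")
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

# Normalize both sources to a single "text" column before mixing.
alpaca = alpaca.map(lambda ex: {"text": ex["instruction"] + "\n" + ex["output"]},
                    remove_columns=alpaca.column_names)
dolly = dolly.map(lambda ex: {"text": ex["instruction"] + "\n" + ex["response"]},
                  remove_columns=dolly.column_names)

# Sample 70% / 30% from the two sources (weights are illustrative).
mix = interleave_datasets([alpaca, dolly], probabilities=[0.7, 0.3], seed=42)
print(mix)
```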
-
At a meta level, PPO-based RLHF performs minor adjustments to the weights to align the model with human feedback.
Can we just replace PPO+RLHF with a preference model that's basically a transformer encoder +…