allenai / RL4LMs

A modular RL library to fine-tune language models to human preferences
https://rl4lms.apps.allenai.org/
Apache License 2.0

Off-policy RL algorithms support #23

Open Div99 opened 1 year ago

Div99 commented 1 year ago

Hi, first of all, great work. This is a very useful library for research on RL and NLP. It would be very helpful to add off-policy RL methods such as Q-learning and SAC, along with benchmarks for them.

Also, newer offline RL methods applied to NLP, such as ILQL, could be very interesting for human alignment, and supporting them would further enhance the value of this codebase.
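
For concreteness, here is a minimal sketch of the kind of token-level TD + expectile loss that ILQL builds on (a rough illustration under my own naming, not RL4LMs code and not a faithful ILQL implementation, which also uses target networks and a CQL-style regularizer):

```python
import torch
import torch.nn.functional as F

def ilql_style_loss(q_values, v_values, actions, rewards, dones,
                    gamma=0.99, expectile=0.7):
    """One-step TD + expectile losses over a batch of token sequences.

    q_values: (B, T, vocab) Q(s_t, .) from a Q head on LM hidden states
    v_values: (B, T)        V(s_t) from a separate value head
    actions:  (B, T)        token ids actually taken
    rewards:  (B, T)        per-token rewards (often zero except at the end)
    dones:    (B, T)        1.0 at terminal steps, else 0.0
    """
    # Q(s_t, a_t) for the tokens actually generated
    q_taken = q_values.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    # TD target: r_t + gamma * V(s_{t+1}), bootstrapping off the value head
    next_v = torch.cat([v_values[:, 1:], torch.zeros_like(v_values[:, :1])], dim=1)
    td_target = rewards + gamma * (1.0 - dones) * next_v.detach()
    q_loss = F.mse_loss(q_taken, td_target)

    # Expectile regression pushes V toward an upper expectile of Q,
    # the offline-RL trick ILQL inherits from IQL.
    diff = q_taken.detach() - v_values
    weight = torch.abs(expectile - (diff < 0).float())
    v_loss = (weight * diff.pow(2)).mean()

    return q_loss + v_loss
```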

rajcscw commented 1 year ago

Agreed! This is on our to-do list too. If you are interested and have time, contributions are welcome.

ghadiaravi13 commented 1 year ago

Hi, probably somewhat related to this: having a forced_decoder_ids argument in the policy.generate() function might help with the offline RL setting. Is there a specific reason not to have it in the current generate function under hf_generation_utils.py? Also, is there a plan to add it in the near future?
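
For context, in stock transformers (recent 4.x releases) forced_decoder_ids is a list of [position, token_id] pairs that pin decoder positions during generation; positions count from 1 because position 0 is the decoder start token. A minimal example with a standard seq2seq model:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tok("translate English to German: I love RL.", return_tensors="pt")

# Force the output to start with a given prefix: position i is pinned
# to the i-th prefix token (position 0 is the decoder start token).
prefix_ids = tok("Ich liebe", add_special_tokens=False).input_ids
forced = [[i + 1, tid] for i, tid in enumerate(prefix_ids)]

out = model.generate(**inputs, forced_decoder_ids=forced, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```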

rajcscw commented 1 year ago

@ghadiaravi13 I think this is probably because of the transformers version that hf_generation_utils.py was adapted from. Once we upgrade it to a recent version, we can support this.
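
In the meantime, assuming the vendored generate() still accepts a logits_processor list like the upstream one does, a small custom processor could serve as a stopgap (ForceTokens is a hypothetical name, mirroring upstream's ForceTokensLogitsProcessor):

```python
import torch
from transformers import LogitsProcessor

class ForceTokens(LogitsProcessor):
    """Pin specific decoder positions to specific token ids."""

    def __init__(self, force_map):  # e.g. {1: 1234, 2: 567}
        self.force_map = dict(force_map)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        cur_len = input_ids.shape[-1]
        if cur_len in self.force_map:
            forced_id = self.force_map[cur_len]
            scores[:] = float("-inf")  # rule out every other token
            scores[:, forced_id] = 0.0
        return scores
```

This would be passed as logits_processor=LogitsProcessorList([ForceTokens(...)]) at generation time.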

ghadiaravi13 commented 1 year ago

Got it, thanks!

Ji4chenLi commented 3 months ago

Hi @rajcscw,

Any update on this issue?

I'm wondering whether Q-learning methods can work for LLM training 🤔 I would be extremely grateful if you could share your experience with this.