allenai / RL4LMs

A modular RL library to fine-tune language models to human preferences
https://rl4lms.apps.allenai.org/
Apache License 2.0

Some questions about n_steps,n_envs and padding_side. #8

Closed drxmy closed 1 year ago

drxmy commented 1 year ago

Thanks for such great work! I am familiar with NLP but new to RL. What does n_steps mean, and what does it control? In text generation, does it mean generating n_steps tokens?

For n_envs, why do I need more than one env? Will the losses from the different envs be averaged?

Lastly, I saw that the update function in observation.py only considers left padding. Doesn't it also need to handle right padding?

Thanks again, and sorry for so many questions.

rajcscw commented 1 year ago

- n_steps: the number of interaction steps with the environment (each interaction corresponds to observing, taking an action, and getting a reward)
- n_envs: the number of parallel environments

n_envs and n_steps together control the total number of interactions collected before we make an update using the algorithm. In particular, we generate n_envs sentences in parallel using actions from the policy, up to a maximum length (specified in the generation kwargs), and once we reach n_envs x n_steps interactions, we stop and use the collected interactions to update the policy.
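To make the arithmetic concrete, here is a minimal Python sketch (not RL4LMs code) of how n_envs and n_steps bound the size of a rollout before a policy update; `collect_step` and `update_policy` are hypothetical placeholders for the actual rollout and update logic:

```python
# Minimal sketch of rollout collection, assuming hypothetical helpers
# collect_step(env_id) and update_policy(buffer).
n_envs = 10      # parallel environments, i.e. sentences generated in parallel
n_steps = 128    # interaction steps per environment before an update

rollout_buffer = []
for step in range(n_steps):
    for env_id in range(n_envs):
        # one interaction = observe, take an action (next token), get a reward
        transition = collect_step(env_id)
        rollout_buffer.append(transition)

# after n_envs * n_steps interactions, the buffer is used for the policy update
assert len(rollout_buffer) == n_envs * n_steps
update_policy(rollout_buffer)
```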

With respect to padding, the padding you see in the update function of observation.py applies only to the context (which holds previously generated tokens), and left padding is sufficient for that. The padding of the input text is controlled by setting the padding side in the tokenizer config (https://github.com/allenai/RL4LMs/blob/d2a8f4ff519df0ba263734bcdb3b7e355ffe8306/scripts/training/task_configs/synthetic_generate_increasing_numbers/gpt2_ppo.yml#L3), which supports both right and left padding.
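For reference, a standalone Hugging Face tokenizer snippet illustrating the two padding sides (plain `transformers` usage, not RL4LMs code; in RL4LMs the choice is taken from the padding side setting in the linked task config):

```python
from transformers import AutoTokenizer

# GPT-2 has no pad token by default, so reuse EOS for padding
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# "left" pads before the text (common for decoder-only generation),
# "right" pads after it; both are supported for the input text
tokenizer.padding_side = "left"

batch = tokenizer(
    ["a short prompt", "a somewhat longer prompt"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # both sequences padded to the same length
```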

Hope this provides some clarification.

drxmy commented 1 year ago

Thank you! It helps a lot.

kushalchawla commented 1 year ago

Hi @rajcscw. Congrats on the great work!

Another beginner here. Adding a couple more minor related questions:

  1. To verify my understanding: each interaction consists of taking one of the sentences (i.e., one of the envs), predicting the next token (the action), and getting a reward (by completing the sentence via rollouts and then computing the reward metrics). Is that right?
  2. Thanks for the description above. Could you please build on it to explain the roles of alg/args/n_epochs and train_evaluation/n_iters, and how they differ from the epochs and iterations used in a standard ML / NLP task?