n_steps - the number of interaction steps with the environment (each interaction corresponds to observing, taking an action, and getting a reward).
n_envs - the number of parallel environments.
n_envs and n_steps together control the total number of interactions collected before we make an update using the algorithm. In particular, this means we generate n_envs sentences using actions from the policy, up to a length specified in the generation kwargs, and once we reach n_envs x n_steps interactions we stop and use the collected interactions to update the policy.
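For intuition, here is a minimal Python sketch of how n_envs and n_steps together bound the rollout size. The helpers policy and env_step are hypothetical stand-ins, not RL4LMs code:

```python
import random

n_envs = 2   # number of parallel environments
n_steps = 4  # interaction steps collected per environment

def policy(obs):
    # stand-in policy: pick a random "token" as the action
    return random.randint(0, 9)

def env_step(env_id, action):
    # stand-in environment: return (next observation, reward)
    return f"obs-{env_id}", float(action == 7)

rollout = []
obs = [f"obs-{i}" for i in range(n_envs)]  # initial observations
for _ in range(n_steps):                   # n_steps per environment
    for env_id in range(n_envs):           # stepped in parallel in practice
        action = policy(obs[env_id])       # observe, take action ...
        obs[env_id], reward = env_step(env_id, action)  # ... get reward
        rollout.append((obs[env_id], action, reward))

# exactly n_envs * n_steps interactions are collected, then the policy updates
assert len(rollout) == n_envs * n_steps
print(f"collected {len(rollout)} interactions; updating policy now")
```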
With respect to padding: the padding you see in the update function of observation.py applies only to the context (which holds previously generated tokens), and left padding is sufficient for this. Padding of the input text is controlled by setting the padding side in the tokenizer (https://github.com/allenai/RL4LMs/blob/d2a8f4ff519df0ba263734bcdb3b7e355ffe8306/scripts/training/task_configs/synthetic_generate_increasing_numbers/gpt2_ppo.yml#L3), which supports both right and left padding.
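For example, with a HuggingFace tokenizer the padding side is just an attribute, which is what the padding_side field in the linked YAML config controls. A minimal standalone sketch (not RL4LMs internals):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
tokenizer.padding_side = "left"            # "right" is also supported

batch = tokenizer(
    ["a short prompt", "a somewhat longer input prompt"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"])  # the shorter prompt is padded on the left
```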
Hope this provides some clarification.
Thank you! It helps a lot.
Hi @rajcscw. Congrats on the great work!
Another beginner here. Adding a couple more minor related questions:
Thanks for such great work! I am familiar with NLP but new to RL. What does n_steps mean, or what does it control? In text generation, does it mean generating n_steps tokens?
For n_envs, why do I need more than 1 env? Will the losses from different envs be averaged?
Lastly, I saw that the update function in observation.py only considers left padding. I think it also needs a right-padding version?
Thanks again, and sorry for so many questions.