OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0

Is left-padding in PPO strictly necessary? #232

Open · mgerstgrasser opened this issue 4 months ago

mgerstgrasser commented 4 months ago

I noticed that RemoteExperienceMaker left-pads the input sequences even when using vllm for generation: https://github.com/OpenLLMAI/OpenRLHF/blob/dcd379a44eea56625626d1a0832cd3eeda048b21/openrlhf/trainer/ppo_utils/experience_maker.py#L346

I can see that a few lines down, self.actor.process_sequences() assumes this left-padding: it calculates the action mask in a way that hinges on all inputs terminating at the same index.

Other than that, are there any other parts of the code that assume that inputs are left-padded, or that all inputs terminate at the same index?

If not, I'd like to open a PR that skips the left-padding and calculates the action mask directly - the left-padding can be inefficient, and action-masking this way doesn't generalise to multi-turn conversations.
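For reference, here is a minimal sketch of the assumption as I understand it (uniform_action_mask is a hypothetical helper, not the actual process_sequences() code): with left-padded prompts of a uniform total length, every response starts at the same index, so the mask is one shared slice.

```python
import torch

def uniform_action_mask(sequences: torch.Tensor, input_len: int,
                        pad_token_id: int) -> torch.Tensor:
    # All prompts end at the same index, so the response block is a
    # single uniform slice; only right-side padding needs masking out.
    responses = sequences[:, input_len:]
    return responses != pad_token_id
```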

hijkzzz commented 4 months ago

Yes, this is necessary to reduce the padding of the training samples.

Left-padding the prompt samples allows us to remove the PAD tokens on both sides and then re-pad dynamically depending on the training samples in each batch.
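Roughly like this (a hedged sketch of the idea, not the actual remove_padding_in_sequences() code; strip_and_repad is a hypothetical name): because left-padded prompts plus right-padded responses leave no PADs in the middle, stripping both ends per sequence and re-padding to the batch max is safe.

```python
import torch
import torch.nn.functional as F

def strip_and_repad(sequences: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    trimmed = []
    for seq in sequences:
        # Indices of all non-PAD tokens; PADs only appear at the two ends.
        keep = (seq != pad_token_id).nonzero(as_tuple=True)[0]
        trimmed.append(seq[keep[0]:keep[-1] + 1])
    # Re-pad only to this batch's max length.
    max_len = max(t.size(0) for t in trimmed)
    return torch.stack([F.pad(t, (0, max_len - t.size(0)), value=pad_token_id)
                        for t in trimmed])
```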

mgerstgrasser commented 4 months ago

Ah, you mean remove_padding_in_sequences()? Wouldn't that still work with only right-padding?

hijkzzz commented 4 months ago

> Ah, you mean remove_padding_in_sequences()? Wouldn't that still work with only right-padding?

That would leave a lot of PAD tokens in the middle of the sequences.

mgerstgrasser commented 4 months ago

Ah, no, to be clear, what I mean is the following: Right now, the padding is done like this ('promp' - a prompt token, 'respo' - a response token):

| [PAD]   [PAD]   promp   promp   promp | respo   respo   [EOS]   [PAD] |
| promp   promp   promp   promp   promp | respo   respo   [EOS]   [PAD] |
| [PAD]   [PAD]   [PAD]   promp   promp | respo   respo   respo   [EOS] |

What I have in mind is instead to do this:

| promp   promp   promp | respo   respo   [EOS]   [PAD]   [PAD] |
| promp   promp   promp   promp   promp | respo   respo   [EOS] |
| promp   promp | respo   respo   respo   [EOS]   [PAD]   [PAD] |

So, less padding overall, and no padding in the middle. The only thing that is now a little different is that the index where the prompt stops and the response starts isn't the same for each sequence - will that break anything?
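A sketch of what the mask computation might look like under this layout (per_sequence_action_mask is a hypothetical helper, not existing OpenRLHF code): the boundary differs per row, so the mask is built from per-sequence prompt lengths instead of one shared slice index.

```python
import torch

def per_sequence_action_mask(sequences: torch.Tensor,
                             prompt_lens: torch.Tensor,
                             pad_token_id: int) -> torch.Tensor:
    # positions: (1, L); prompt_lens: (B, 1) -> broadcast to (B, L).
    positions = torch.arange(sequences.size(1), device=sequences.device)
    after_prompt = positions.unsqueeze(0) >= prompt_lens.unsqueeze(1)
    # Action tokens are everything after each row's own prompt, minus padding.
    return after_prompt & (sequences != pad_token_id)
```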

hijkzzz commented 4 months ago

> Ah, no, to be clear, what I mean is the following: Right now, the padding is done like this ('promp' - a prompt token, 'respo' - a response token):
>
> | [PAD]   [PAD]   promp   promp   promp | respo   respo   [EOS]   [PAD] |
> | promp   promp   promp   promp   promp | respo   respo   [EOS]   [PAD] |
> | [PAD]   [PAD]   [PAD]   promp   promp | respo   respo   respo   [EOS] |
>
> What I have in mind is instead to do this:
>
> | promp   promp   promp | respo   respo   [EOS]   [PAD]   [PAD] |
> | promp   promp   promp   promp   promp | respo   respo   [EOS] |
> | promp   promp | respo   respo   respo   [EOS]   [PAD]   [PAD] |
>
> So, less padding overall, and no padding in the middle. The only thing that is now a little different is that the index where the prompt stops and the response starts isn't the same for each sequence - will that break anything?

They would then look like this, with PADs in the middle:

| promp   promp   promp   [PAD]   [PAD] | respo   respo   [EOS]   [PAD] |
| promp   promp   promp   promp   promp | respo   respo   [EOS]   [PAD] |
| promp   promp   [PAD]   [PAD]   [PAD] | respo   respo   respo   [EOS] |

mgerstgrasser commented 4 months ago

That's not what I am proposing though! What I mean is: if I return the sequences from _generate_vllm() without the pads in the middle, would that break anything? (No worries if I'm not making sense though, I can just try it and see what happens.)
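For concreteness, a hedged sketch of the packing I have in mind (pack_right_padded is hypothetical, not the actual _generate_vllm() code): concatenate each prompt with its generated response, then right-pad the batch, so no PAD ever lands between prompt and response.

```python
import torch

def pack_right_padded(prompts, responses, pad_token_id):
    # prompts/responses: lists of 1-D token tensors, one pair per sample.
    rows = [torch.cat([p, r]) for p, r in zip(prompts, responses)]
    max_len = max(row.size(0) for row in rows)
    out = torch.full((len(rows), max_len), pad_token_id, dtype=rows[0].dtype)
    for i, row in enumerate(rows):
        out[i, :row.size(0)] = row
    # Per-row prompt lengths, needed later to locate each response start.
    prompt_lens = torch.tensor([p.size(0) for p in prompts])
    return out, prompt_lens
```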