CarperAI / trlx

A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)

Add tensor padding options to improve hardware utilization #284

Open jon-tow opened 1 year ago

jon-tow commented 1 year ago

🚀 The feature, motivation, and pitch

Recent discussion on Twitter highlighted the importance of tensor padding for improving hardware utilization. From discussions with @cat-state, NeMo already seems to support GPU-friendly vocab-size padding, but we should also consider adding an optional pad_to_multiple_of argument to our tokenizer calls in the accelerate/transformers backend to satisfy Tensor Core requirements.
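For reference, a minimal sketch of what the Hugging Face tokenizer call already supports (the model name here is just for illustration): pad_to_multiple_of rounds the padded sequence length up to the given multiple, e.g. 8 for fp16 Tensor Cores.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

batch = tokenizer(
    ["a short prompt", "a somewhat longer prompt for comparison"],
    padding=True,
    pad_to_multiple_of=8,   # round the padded length up to a multiple of 8
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # the sequence dimension is a multiple of 8
```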


The implementer should also provide system plots to display any improvements/findings.

Alternatives

No response

Additional context

No response

Rami-Ismael commented 1 year ago

Can I do this? I have experience writing documentation for single-GPU implementations of NLP models.

jon-tow commented 1 year ago

@Rami-Ismael Sure! Give it a go; I don't believe anyone else is working on this at the moment.

Rami-Ismael commented 1 year ago

I was wondering: would it be best to add pad_to_multiple_of as an argument in the TrainConfig?

jon-tow commented 1 year ago

We have a TokenizerConfig that could be used to store this option for padding with the tokenizer (and if you wanted to add vocab size padding we could add that to ModelConfig). What do you think?
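A rough sketch of what that could look like (the existing-style field names below are illustrative rather than the actual trlx definitions; only pad_to_multiple_of is the new proposal):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenizerConfig:
    tokenizer_path: str                # illustrative existing-style field
    padding_side: str = "left"         # illustrative existing-style field
    truncation_side: str = "right"     # illustrative existing-style field
    # Proposed: round padded sequence lengths up to this multiple
    # (e.g. 8 for fp16/bf16 Tensor Cores); None keeps current behavior.
    pad_to_multiple_of: Optional[int] = None
```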

Rami-Ismael commented 1 year ago

I was looking at the code. I had expected all of the TokenizerConfig arguments to be fed into the Hugging Face PreTrainedTokenizer constructor; that would produce an error, since pad_to_multiple_of is an argument of the Hugging Face tokenizer's call function instead.

Vocab-size padding should be easy to implement, as there is an existing implementation in the Megatron project. Is the plan to make the vocab size divisible by the same value as pad_to_multiple_of, or by a different value?
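For the vocab-size side, a minimal Megatron-style rounding helper could look like the sketch below (the function name and default multiple are assumptions, not existing trlx code). Note that this multiple is independent of the tokenizer's pad_to_multiple_of, since it controls the embedding/output dimension rather than the padded sequence length.

```python
def pad_vocab_size(vocab_size: int, multiple: int = 128) -> int:
    """Round vocab_size up to the nearest multiple (Megatron-style padding)."""
    remainder = vocab_size % multiple
    if remainder == 0:
        return vocab_size
    return vocab_size + multiple - remainder


# e.g. GPT-2's 50257-token vocab rounds up to 50304 with multiple=128
assert pad_vocab_size(50257, 128) == 50304
```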

jon-tow commented 1 year ago

The options in a TokenizerConfig get picked apart and used where needed; they're not all fed through AutoTokenizer.from_pretrained as a glob of kwargs. See, for example, how the tokenizer object is instantiated and its fields assigned: https://github.com/CarperAI/trlx/blob/206d885a2fbcbfd848b174714c96c1de903e4f54/trlx/trainer/accelerate_base_trainer.py#L62-L66 So if pad_to_multiple_of is in TokenizerConfig, we might be able to pass it along to the tokenizer() calls?
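Something along these lines at the call site, assuming the option lives in TokenizerConfig (the surrounding variable names are assumptions):

```python
# Inside a trainer/pipeline that has access to the config and the tokenizer.
pad_to_multiple_of = getattr(config.tokenizer, "pad_to_multiple_of", None)

batch = tokenizer(
    samples,                               # list of strings to tokenize
    padding=True,
    pad_to_multiple_of=pad_to_multiple_of,
    return_tensors="pt",
)
```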

Rami-Ismael commented 1 year ago

Some RL algorithms tokenize the dialogue, e.g. https://github.com/CarperAI/trlx/blob/206d885a2fbcbfd848b174714c96c1de903e4f54/trlx/pipeline/offline_pipeline.py#LL25-L28C38, which applies only to ILQL. In accelerate_base_trainer.py there are no tokenizer() calls. Where is tokenizer() being called in the training loop?
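If it turns out the padding happens in a collate function rather than a direct tokenizer() call, transformers' DataCollatorWithPadding also accepts pad_to_multiple_of; a sketch (not existing trlx code, model name just for illustration):

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

collator = DataCollatorWithPadding(
    tokenizer,
    pad_to_multiple_of=8,   # pad each batch's length up to a multiple of 8
    return_tensors="pt",
)

features = [tokenizer(text) for text in ["short", "a longer example sequence"]]
batch = collator(features)
print(batch["input_ids"].shape)  # sequence dimension is a multiple of 8
```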

Rami-Ismael commented 1 year ago

Can you add this link to the issue for context: https://twitter.com/cHHillee/status/1630274804795445248? It is being referenced here. The Twitter post is from a PyTorch and hardware-utilization expert, and it gives more context to the Andrej Karpathy tweet.