jon-tow opened this issue 1 year ago
Can I take this on? I have experience building single-GPU implementations of NLP models.
@Rami-Ismael Sure! Give it a go; I don't believe anyone else is working on this at the moment.
I was wondering: would it be best to add `pad_to_multiple_of` as an argument in the `TrainConfig`?
We have a `TokenizerConfig` that could be used to store this option for padding with the tokenizer (and if you wanted to add vocab-size padding, we could add that to `ModelConfig`). What do you think?
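For concreteness, a minimal sketch of what that field might look like (the surrounding field names only approximate trlx's actual `TokenizerConfig`; they're not copied from it):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TokenizerConfig:
    tokenizer_path: str = "gpt2"
    padding_side: str = "left"
    truncation_side: str = "right"
    # New: pad tokenized batches to a multiple of this value (e.g. 8 for fp16
    # Tensor Cores); None keeps the current behavior of padding only to the
    # longest sequence in the batch.
    pad_to_multiple_of: Optional[int] = None
```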
I was looking at the code. I had expected all of the `TokenizerConfig` arguments to be fed into the Hugging Face `PreTrainedTokenizer` constructor, but that would produce an error, since `pad_to_multiple_of` is an argument of the tokenizer's `__call__` method rather than of `from_pretrained`.
The vocab-size padding should be easy to implement, since there is a reference implementation in the Megatron project. Is the plan for the vocab size to be divisible by the same value as `pad_to_multiple_of`, or by a different one?
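For reference, a sketch of the Megatron-style rounding (the helper name and the multiple of 128 are illustrative, not taken from either codebase; 128 plays the role of Megatron's `make-vocab-size-divisible-by`):

```python
import math

def pad_vocab_size(vocab_size: int, multiple: int = 128) -> int:
    # Round the vocabulary up to the nearest multiple.
    return multiple * math.ceil(vocab_size / multiple)

# e.g. GPT-2's 50257-token vocab padded to a multiple of 128 becomes 50304
assert pad_vocab_size(50257, 128) == 50304

# With a Hugging Face model this could be applied via:
# model.resize_token_embeddings(pad_vocab_size(len(tokenizer), 128))
```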
The options in a `TokenizerConfig` get picked apart and used where needed; they're not all fed through `AutoTokenizer.from_pretrained` as a glob of kwargs. E.g., see how the tokenizer objects are instantiated and assigned fields:
https://github.com/CarperAI/trlx/blob/206d885a2fbcbfd848b174714c96c1de903e4f54/trlx/trainer/accelerate_base_trainer.py#L62-L66
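Roughly paraphrased (not copied verbatim from the linked lines), the instantiation looks like this, with literal values standing in for the config fields:

```python
from transformers import AutoTokenizer

# Options are read off the config one field at a time, not splatted into
# from_pretrained as **kwargs.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # config.tokenizer.tokenizer_path
tokenizer.padding_side = "left"                    # config.tokenizer.padding_side
tokenizer.truncation_side = "right"                # config.tokenizer.truncation_side
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
```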
So if `pad_to_multiple_of` is in `TokenizerConfig`, we might be able to pass it around to the `tokenizer()` calls?
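Since `pad_to_multiple_of` is already a supported argument of the tokenizer's `__call__` (and of `tokenizer.pad`) in transformers, threading it through might look like:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(
    ["a short prompt", "a somewhat longer prompt than the first one"],
    padding=True,            # pad to the longest sequence in the batch...
    pad_to_multiple_of=8,    # ...then round that length up to a multiple of 8
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # sequence dim is a multiple of 8
```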
Some RL algorithms use the tokenized dialogues, e.g. https://github.com/CarperAI/trlx/blob/206d885a2fbcbfd848b174714c96c1de903e4f54/trlx/pipeline/offline_pipeline.py#LL25-L28C38, but that is only ILQL. In `accelerate_base_trainer.py` there are no `tokenizer()` calls, so where is `tokenizer()` being called in the training loop?
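If the padding instead happens at collation time, as in the offline/ILQL pipeline, `tokenizer.pad` accepts the same argument; a hypothetical `collate_fn` sketch (not trlx's actual collator):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def collate_fn(tokenized_examples):
    # tokenized_examples: a list of encodings with "input_ids" (and
    # "attention_mask"), as produced upstream by tokenizer() without padding.
    return tokenizer.pad(
        tokenized_examples,
        padding=True,
        pad_to_multiple_of=8,
        return_tensors="pt",
    )

batch = collate_fn([tokenizer("hello there"), tokenizer("general kenobi, you are bold")])
print(batch["input_ids"].shape)
```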
Can you add this tweet to the context: https://twitter.com/cHHillee/status/1630274804795445248? It references this issue. The post is from a PyTorch and hardware-utilization expert, and it gives more context to the Andrej Karpathy tweet.
🚀 The feature, motivation, and pitch
Recent discussion on Twitter has highlighted the importance of tensor padding for improving hardware utilization. NeMo already seems to support GPU-friendly vocab-size padding (from discussions with @cat-state), but we should also consider adding an optional `pad_to_multiple_of` argument to our tokenizer calls in the accelerate/transformers backend to satisfy Tensor Core requirements. The implementer should also provide system plots to display any improvements/findings.
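As a starting point for those plots, a rough timing sketch (requires a CUDA GPU with Tensor Cores; exact numbers vary by hardware, but dims that are one element off the aligned size typically fall off the fast path):

```python
import torch

def time_matmul(n: int, iters: int = 50) -> float:
    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)
    a @ b  # warm-up so allocation/kernel selection isn't timed
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per matmul

print("4096:", time_matmul(4096), "ms")
print("4095:", time_matmul(4095), "ms")  # usually slower despite fewer FLOPs
```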
Alternatives
No response
Additional context
No response