Closed yuvalkirstain closed 2 years ago
I am willing to work on this but I'd like to know how urgent it is :)
@SeanNaren any comment about this? :)
🚀 Feature
Provide an option to batch examples by a maximum number of tokens rather than by a fixed number of examples.
Motivation
Motivation In NLP, the memory consumption of each example is determined by the number of tokens that it possesses. Some examples might be very long (and require a lot of memory), while others might be short (and require little memory). Thus, the batch size should be determined by the number of tokens rather than the number of examples (even if the implementation is somewhat trickier). Imagine an extreme case that there is a single example with 2k tokens, while the rest have 10 tokens. If the max memory consumption is 2k tokens, we will use a batch size of 1. Thus, each epoch will take x200 longer!
Pitch
Let us train language models more efficiently!
Additional context
Fairseq implements this feature, but I like PyTorch Lightning :)