This PR adds a speed optimization:
For longer texts, tokenization can take up 90-95% of our time, because we tokenize the entire text even though we usually only keep the first N (usually 512) tokens. So it makes sense to only really tokenize roughly the first N tokens' worth of text. We therefore truncate each text to N * median(token_length) characters before tokenization. A test on Wikipedia shows that this doesn't lead to any unnecessary truncation, i.e., it never truncated a text to fewer than 512 tokens.
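A minimal sketch of the idea, assuming a Hugging Face tokenizer; the function name `truncate_before_tokenizing`, the sample-size heuristic, and the example texts are illustrative, not the actual implementation:

```python
from statistics import median

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
N = 512  # max_seq_length: number of tokens the model actually keeps


def truncate_before_tokenizing(texts, sample_size=1000):
    """Shorten raw strings so the tokenizer only sees roughly the
    characters that can fit into the first N tokens."""
    # Estimate the typical character length of a token from a small sample;
    # slicing each sample text keeps this estimate itself cheap.
    sample = [text[:10_000] for text in texts[:sample_size] if text]
    token_lengths = [len(tok) for text in sample for tok in tokenizer.tokenize(text)]
    if not token_lengths:
        return texts
    max_chars = int(N * median(token_lengths))
    # Character-level slicing is cheap; tokenization then runs on far less text.
    return [text[:max_chars] for text in texts]


texts = ["a very long document " * 5000, "a short one"]
shortened = truncate_before_tokenizing(texts)
encoded = tokenizer(shortened, truncation=True, max_length=N)
```

The final `truncation=True, max_length=N` still applies, so the character-level cut only needs to be generous enough that it never removes tokens the model would have kept.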