This PR adds a speed optimization:
For longer texts, tokenization can take up 90-95% of our time, because we tokenize the entire text even though we usually only keep the first N (usually 512) tokens. So it makes sense to only really tokenize roughly the first N tokens' worth of text. We therefore truncate each text to N * median(token_length) characters before tokenization. A test on Wikipedia shows that this doesn't lead to any unnecessary truncation, i.e., it never truncated a text to fewer than 512 tokens.
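A minimal sketch of the idea, assuming a Hugging Face tokenizer; the function name `truncate_before_tokenizing`, the sample-size heuristic, and the example texts are illustrative, not the actual implementation:

```python
from statistics import median

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
N = 512  # max_seq_length: number of tokens the model actually keeps


def truncate_before_tokenizing(texts, sample_size=1000):
    """Shorten raw strings so the tokenizer only sees roughly the
    characters that can fit into the first N tokens."""
    # Estimate the typical character length of a token from a small sample;
    # slicing each sample text keeps this estimate itself cheap.
    sample = [text[:10_000] for text in texts[:sample_size] if text]
    token_lengths = [len(tok) for text in sample for tok in tokenizer.tokenize(text)]
    if not token_lengths:
        return texts
    max_chars = int(N * median(token_lengths))
    # Character-level slicing is cheap; tokenization then runs on far less text.
    return [text[:max_chars] for text in texts]


texts = ["a very long document " * 5000, "a short one"]
shortened = truncate_before_tokenizing(texts)
encoded = tokenizer(shortened, truncation=True, max_length=N)
```

The final `truncation=True, max_length=N` still applies, so the character-level cut only needs to be generous enough that it never removes tokens the model would have kept.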