Update tokenizers requirement from <0.20,>=0.10 to >=0.10,<0.21

Updates the requirements on tokenizers to permit the latest version.
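
To make the effect of the widened range concrete, here is a minimal sketch that checks a few release numbers against the old and new bounds. The helper names (`parse_version`, `pad`, `satisfies`) are hypothetical and not part of pip or tokenizers. The comparison assumes plain dotted numeric versions and ignores pre-release or local version suffixes, so it is an illustration rather than a replacement for a real specifier parser.

```python
def parse_version(text):
    """Turn a dotted release string such as '0.20.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def pad(version, length):
    """Right-pad a version tuple with zeros so tuples compare component-wise."""
    return version + (0,) * (length - len(version))


def satisfies(version, lower, upper):
    """Return True if lower <= version < upper (hypothetical helper)."""
    v, lo, hi = parse_version(version), parse_version(lower), parse_version(upper)
    width = max(len(v), len(lo), len(hi))
    return pad(lo, width) <= pad(v, width) < pad(hi, width)


# Old constraint: >=0.10,<0.20 ; new constraint: >=0.10,<0.21
for candidate in ("0.19.1", "0.20.0", "0.21.0"):
    old_ok = satisfies(candidate, "0.10", "0.20")
    new_ok = satisfies(candidate, "0.10", "0.21")
    print(f"{candidate}: old={old_ok} new={new_ok}")
```

Under these assumptions, 0.20.0 is rejected by the old range and accepted by the new one, while 0.21.0 stays excluded by both.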

Release notes

Release v0.20.0: faster encode, better python support

Release v0.20.0

This release focuses on performance and user experience.

Performance:

First off, we did a bit of benchmarking and found some room for improvement on our side! With a few minor changes (mostly #1587), here is what we get on Llama3 running on a g6 instance on AWS (benchmark script: https://github.com/huggingface/tokenizers/blob/main/bindings/python/benches/test_tiktoken.py):

Python API

We shipped better deserialization errors in general, and support for __str__ and __repr__ on all objects. This makes debugging much easier, for example:
>>> from tokenizers import Tokenizer
>>> tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
>>> print(tokenizer)
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, ...}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, "[unused2]":3, "[unused3]":4, ...}))
>>> tokenizer
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, ...}))
The pre_tokenizers.Sequence and normalizers.Sequence are also more accessible now:
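from tokenizers import normalizers
norm = normalizers.Sequence([normalizers.Strip(), normalizers.BertNormalizer()])
norm[0]  # index into the sequence to get the Strip normalizer
norm[1].lowercase = False  # set an attribute on the BertNormalizer item

What's Changed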

remove enforcement of non special when adding tokens by @ArthurZucker in huggingface/tokenizers#1521

[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder by @Narsil in huggingface/tokenizers#1513

Make USED_PARALLELISM atomic by @nathaniel-daniel in huggingface/tokenizers#1532

Fixing for clippy 1.78 by @Narsil in huggingface/tokenizers#1548

feat(ci): add trufflehog secrets detection by @McPatate in huggingface/tokenizers#1551

Switch from cached_download to hf_hub_download in tests by @Wauplin in huggingface/tokenizers#1547

Fix "dictionnary" typo by @nprisbrey in huggingface/tokenizers#1511

make sure we don't warn on empty tokens by @ArthurZucker in huggingface/tokenizers#1554

Enable dropout = 0.0 as an equivalent to none in BPE by @mcognetta in huggingface/tokenizers#1550

Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) … by @ArthurZucker in huggingface/tokenizers#1569

Add bytelevel normalizer to fix decode when adding tokens to BPE by @ArthurZucker in huggingface/tokenizers#1555

Fix clippy + feature test management. by @Narsil in huggingface/tokenizers#1580

Bump spm_precompiled to 0.1.3 by @MikeIvanichev in huggingface/tokenizers#1571

Add benchmark vs tiktoken by @Narsil in huggingface/tokenizers#1582

Fixing the benchmark. by @Narsil in huggingface/tokenizers#1583

Tiny improvement by @Narsil in huggingface/tokenizers#1585

Enable fancy regex by @Narsil in huggingface/tokenizers#1586

Fixing release CI strict (taken from safetensors). by @Narsil in huggingface/tokenizers#1593

Adding some serialization testing around the wrapper. by @Narsil in huggingface/tokenizers#1594

... (truncated)

Commits

a5adaac version 0.20.0
a8def07 Merge branch 'fix_release' of github.com:huggingface/tokenizers into branch_v...
fe50673 Fix CI
b253835 push cargo
fc3bb76 update dependencies
bfd9cde Perf improvement 16% by removing offsets. (#1587)
bd27fa5 add deserialize for pre tokenizers (#1603)
56c9c70 Tests + Deserialization improvement for normalizers. (#1604)
49dafd7 Fix strip python type (#1602)
bded212 Support None to reset pre_tokenizers and normalizers, and index sequences (...
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

LoicGrobol / zeldarose

Update tokenizers requirement from <0.20,>=0.10 to >=0.10,<0.21 #110