Release v0.20.0: faster encode, better python support
Release v0.20.0
This release is focused on performances and user experience.
Performances:
First off, we did a bit of benchmarking, and found some place for improvement for us!
With a few minor changes (mostly #1587) here is what we get on Llama3 running on a g6 instances on AWS https://github.com/huggingface/tokenizers/blob/main/bindings/python/benches/test_tiktoken.py :
Python API
We shipped better deserialization errors in general, and support for __str__ and __repr__ for all the object. This allows for a lot easier debugging see this:
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Updates the requirements on tokenizers to permit the latest version.
Release notes
Sourced from tokenizers's releases.
... (truncated)
Commits
a5adaac
version 0.20.0a8def07
Merge branch 'fix_release' of github.com:huggingface/tokenizers into branch_v...fe50673
Fix CIb253835
push cargofc3bb76
update dependenciesbfd9cde
Perf improvement 16% by removing offsets. (#1587)bd27fa5
add deserialize for pre tokenizers (#1603)56c9c70
Tests + Deserialization improvement for normalizers. (#1604)49dafd7
Fix strip python type (#1602)bded212
SupportNone
to reset pre_tokenizers and normalizers, and index sequences (...Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting
@dependabot rebase
.Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show