benbrandt / text-splitter

Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python.
MIT License
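As a quick orientation to the library this PR targets, here is a minimal sketch of splitting text from Python. The `semantic-text-splitter` package name and the constructor-based capacity are assumptions based on the published bindings, not something stated in this PR:

```python
# Minimal sketch of the Python bindings (package name and constructor
# signature assumed from the published `semantic-text-splitter` release;
# older versions passed the capacity to `chunks` instead).
from semantic_text_splitter import TextSplitter

# Split by character count, producing chunks of at most 1000 characters.
splitter = TextSplitter(1000)
chunks = splitter.chunks("Some long document text ...")
print(len(chunks))
```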

Bump the minor group with 6 updates #317

Closed · dependabot[bot] closed this pull request 1 month ago

dependabot[bot] commented 1 month ago

Bumps the minor group with 6 updates:

| Package | From | To |
| --- | --- | --- |
| tokenizers | 0.19.1 | 0.20.0 |
| clap | 4.5.13 | 4.5.14 |
| clap_builder | 4.5.13 | 4.5.14 |
| filetime | 0.2.23 | 0.2.24 |
| redox_syscall | 0.4.1 | 0.5.2 |
| ureq | 2.10.0 | 2.10.1 |

Updates tokenizers from 0.19.1 to 0.20.0

Release notes

Sourced from tokenizers's releases.

Release v0.20.0: faster encode, better python support

Release v0.20.0

This release is focused on performance and user experience.

Performance:

First off, we did a bit of benchmarking and found some room for improvement. With a few minor changes (mostly #1587), here is what we get on Llama3 running on a g6 instance on AWS (benchmark: https://github.com/huggingface/tokenizers/blob/main/bindings/python/benches/test_tiktoken.py): [benchmark results image]
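For a rough sense of what such a benchmark measures, here is a hypothetical timing sketch, not the linked test_tiktoken.py script and using a smaller public model, that times the batch-encode path with the tokenizers Python API:

```python
# Hypothetical sketch for timing batch encoding with the tokenizers
# Python API; the model choice and corpus are placeholders, not the
# benchmark linked above.
import time
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
texts = ["The quick brown fox jumps over the lazy dog."] * 10_000

start = time.perf_counter()
tokenizer.encode_batch(texts)
elapsed = time.perf_counter() - start
print(f"{len(texts) / elapsed:,.0f} sequences/sec")
```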

Python API

We shipped better deserialization errors in general, and support for __str__ and __repr__ for all the objects. This makes debugging a lot easier; see this:

>>> from tokenizers import Tokenizer
>>> tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
>>> print(tokenizer)
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, ...}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, "[unused2]":3, "[unused3]":4, ...}))

>>> tokenizer
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, ...}))

The pre_tokenizers.Sequence and normalizers.Sequence are also more accessible now: you can index into them and modify individual components in place:

from tokenizers import normalizers

# Build a pipeline of normalizers from a list.
norm = normalizers.Sequence([normalizers.Strip(), normalizers.BertNormalizer()])
norm[0]                    # index into the sequence to access the Strip normalizer
norm[1].lowercase = False  # modify the BertNormalizer in place
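Presumably the same index access applies to pre_tokenizers.Sequence, as the release notes suggest; a small sketch under that assumption:

```python
from tokenizers import pre_tokenizers

# Assumes pre_tokenizers.Sequence gained the same index access that the
# release notes show for normalizers.Sequence.
pre = pre_tokenizers.Sequence([pre_tokenizers.Whitespace(), pre_tokenizers.Digits()])
print(pre[0])  # the Whitespace pre-tokenizer
print(pre[1])  # the Digits pre-tokenizer
```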

What's Changed

... (truncated)

Commits
  • a5adaac version 0.20.0
  • a8def07 Merge branch 'fix_release' of github.com:huggingface/tokenizers into branch_v...
  • fe50673 Fix CI
  • b253835 push cargo
  • fc3bb76 update dependencies
  • bfd9cde Perf improvement 16% by removing offsets. (#1587)
  • bd27fa5 add deserialize for pre tokenizers (#1603)
  • 56c9c70 Tests + Deserialization improvement for normalizers. (#1604)
  • 49dafd7 Fix strip python type (#1602)
  • bded212 Support None to reset pre_tokenizers and normalizers, and index sequences (...
  • Additional commits viewable in compare view


Updates clap from 4.5.13 to 4.5.14

Release notes

Sourced from clap's releases.

v4.5.14

[4.5.14] - 2024-08-08

Features

  • (unstable-ext) Added Arg::add for attaching arbitrary state, like completion hints, to Arg without Arg knowing about it
Changelog

Sourced from clap's changelog.

[4.5.14] - 2024-08-08

Features

  • (unstable-ext) Added Arg::add for attaching arbitrary state, like completion hints, to Arg without Arg knowing about it
Commits


Updates clap_builder from 4.5.13 to 4.5.14

Release notes

Sourced from clap_builder's releases.

v4.5.14

[4.5.14] - 2024-08-08

Features

  • (unstable-ext) Added Arg::add for attaching arbitrary state, like completion hints, to Arg without Arg knowing about it
Changelog

Sourced from clap_builder's changelog.

[4.5.14] - 2024-08-08

Features

  • (unstable-ext) Added Arg::add for attaching arbitrary state, like completion hints, to Arg without Arg knowing about it
Commits
  • d222ae4 chore: Release
  • a8abcb4 docs: Update changelog
  • 2690e1b Merge pull request #5621 from shannmu/dynamic_valuehint
  • 7fd7b3e feat(clap_complete): Support to complete custom value of argument
  • fc6aaca Merge pull request #5638 from epage/cargo
  • 631e54b docs(cookbook): Style cargo plugin
  • 6fb49d0 Merge pull request #5636 from gibfahn/styles_const
  • 6f215ee refactor(styles): make styles example use a const
  • bbb2e6f test: Add test case for completing custom value of argument
  • 999071c fix: Change visible to hidden
  • Additional commits viewable in compare view


Updates filetime from 0.2.23 to 0.2.24

Commits


Updates redox_syscall from 0.4.1 to 0.5.2

Updates ureq from 2.10.0 to 2.10.1

Changelog

Sourced from ureq's changelog.

2.10.1

  • The default ureq Rustls TLS config was updated to avoid a panic for applications that activate the default Rustls aws-lc-rs feature without setting a process-wide crypto provider. ureq will now use *ring* in this circumstance instead of panicking.
Commits


Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
  • `@dependabot rebase` will rebase this PR
  • `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
  • `@dependabot merge` will merge this PR after your CI passes on it
  • `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
  • `@dependabot cancel merge` will cancel a previously requested merge and block automerging
  • `@dependabot reopen` will reopen this PR if it is closed
  • `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
  • `@dependabot ignore <dependency name> major version` will close this group update PR and stop Dependabot creating any more for the specific dependency's major version (unless you unignore this specific dependency's major version or upgrade to it yourself)
  • `@dependabot ignore <dependency name> minor version` will close this group update PR and stop Dependabot creating any more for the specific dependency's minor version (unless you unignore this specific dependency's minor version or upgrade to it yourself)
  • `@dependabot ignore <dependency name>` will close this group update PR and stop Dependabot creating any more for the specific dependency (unless you unignore this specific dependency or upgrade to it yourself)
  • `@dependabot unignore <dependency name>` will remove all of the ignore conditions of the specified dependency
  • `@dependabot unignore <dependency name> <ignore condition>` will remove the ignore condition of the specified dependency and ignore conditions
codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 99.50%. Comparing base (6206cc3) to head (7630e5d). Report is 1 commit behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #317      +/-   ##
==========================================
- Coverage   99.60%   99.50%   -0.10%
==========================================
  Files          11       11
  Lines        2037     2037
==========================================
- Hits         2029     2027       -2
- Misses          8       10       +2
```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.