This pull request updates the tokenization process in src/chonkie/chunker/base.py and adds tests in tests/chunker/test_token_chunker.py to cover the change. The most important changes are that the tokenizer encoding methods now exclude special tokens, and that new test cases exercise different tokenizers.
Changes to tokenization methods:
src/chonkie/chunker/base.py: Updated the _get_tokenizer_counter, _encode, and _encode_batch methods to exclude special tokens by setting add_special_tokens=False in the encode and batch_encode_plus methods for the "transformers" and "tokenizers" backends. [1][2][3]
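For reviewers, the effect of the flag can be illustrated with a minimal sketch. `ToyTokenizer` and `count_tokens` below are hypothetical stand-ins written for this description, not code from base.py; the only thing borrowed from the real backends is the `add_special_tokens` keyword on `encode`. The sketch assumes a tokenizer that wraps every sequence in one start and one end marker, which is how special tokens typically inflate counts.

```python
from typing import Dict, List


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer backend (not a real library class)."""

    def __init__(self, vocab: Dict[str, int], bos_id: int, eos_id: int) -> None:
        self.vocab = vocab
        self.bos_id = bos_id
        self.eos_id = eos_id

    def encode(self, text: str, add_special_tokens: bool = True) -> List[int]:
        # Whitespace split stands in for real subword tokenization.
        ids = [self.vocab[word] for word in text.split()]
        if add_special_tokens:
            # Assumed behaviour: wrap the sequence in start/end markers.
            return [self.bos_id, *ids, self.eos_id]
        return ids


def count_tokens(tokenizer: ToyTokenizer, text: str) -> int:
    """Count only content tokens, mirroring the intent of this change."""
    return len(tokenizer.encode(text, add_special_tokens=False))


tokenizer = ToyTokenizer({"hello": 2, "world": 3}, bos_id=0, eos_id=1)

with_special = tokenizer.encode("hello")
without_special = tokenizer.encode("hello", add_special_tokens=False)

print(len(with_special), len(without_special))
print(count_tokens(tokenizer, "hello world"))
```

In this sketch a single-token text encodes to three ids with the default setting and to one id with `add_special_tokens=False`, so a counter built on the default would overstate the length of every piece of text by the number of special tokens.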
Addition of new test cases:
tests/chunker/test_token_chunker.py: Added test_token_chunker_single_token_text_hf and test_token_chunker_single_token_text_tik to test the TokenChunker with single-token text for the "transformers" and "tiktoken" backends, respectively.