bhavnicksm / chonkie

🦛 CHONK your texts with Chonkie ✨ - The no-nonsense RAG chunking library
https://pypi.org/project/chonkie/
MIT License
1.55k stars, 57 forks

[Fix] Token counts from Tokenizers and Transformers adding special tokens #52

Closed bhavnicksm closed 2 days ago

bhavnicksm commented 2 days ago

This pull request updates the tokenization process in src/chonkie/chunker/base.py and adds tests in tests/chunker/test_token_chunker.py to verify the change. The most important changes are modifications to the tokenizer encoding methods so that they exclude special tokens, which keeps those tokens out of the token counts, and new test cases covering different tokenizers.
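To illustrate the kind of miscount this fix targets, here is a minimal sketch using the Hugging Face tokenizers library. It is not the code from this PR: the tiny in-memory vocabulary, the [CLS]/[SEP] template, and the count_tokens helper are all assumptions made up for the example. The only library behavior it relies on is the add_special_tokens argument of Tokenizer.encode (the encode method of Transformers tokenizers accepts an argument of the same name).

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Hypothetical toy vocabulary, built in memory so nothing is downloaded.
vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "chonk": 3, "your": 4, "texts": 5}

tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
# BERT-style post-processing: wrap every encoded sequence in [CLS] ... [SEP].
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)


def count_tokens(tok: Tokenizer, text: str) -> int:
    """Illustrative helper (not chonkie's API): count tokens without special tokens."""
    return len(tok.encode(text, add_special_tokens=False).ids)


text = "chonk your texts"
with_special = len(tokenizer.encode(text).ids)  # default adds [CLS] and [SEP]
without_special = count_tokens(tokenizer, text)

print(with_special, without_special)
```

With the default encoding, every chunk is over-counted by the number of special tokens the post-processor adds (two in this toy setup), so a chunker enforcing a token budget would produce chunks smaller than requested.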

Changes to tokenization methods:

Addition of new test cases: