benbrandt / text-splitter

Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python.

TextSplitter by Character trims leading whitespaces #298

Closed lambda-science closed 1 month ago

lambda-science commented 1 month ago

Describe the bug TextSplitter by character with trim=False removes leading whitespace. This is not the case with the Hugging Face tokenizer text splitter.

To Reproduce

from semantic_text_splitter import TextSplitter
splitter = TextSplitter(20, trim=False, overlap=5)
split_text = splitter.chunks("This is a test document. It has two sentences. Maybe three ? To update your OXE type 'please help me' in the command line interface.")
print(split_text)
Output:
['This is a test ', 'test document. ', 'It has two sentences', '. Maybe three ? ', 'To update your OXE ', " OXE type 'please ", " help me' in the ", ' the command line ', 'line interface.']

In the list element at index 1, the leading overlapping whitespace in front of "test" is missing. This then causes issues when we want to compute the overlap and merge texts back together.

Expected behavior If trim=False, do not remove leading whitespace. TextSplitter.from_huggingface_tokenizer() behaves exactly this way: leading whitespace is not removed.


benbrandt commented 1 month ago

Hi @lambda-science, this example actually looks correct to me. You have an overlap of 5 characters, and the last 5 characters of the first chunk are "test " (the word plus its trailing space).
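You can check this directly against the output above, just by slicing the reported chunks in plain Python:

chunks = ['This is a test ', 'test document. ']
print(repr(chunks[0][-5:]))  # 'test ' -- the last 5 characters of the first chunk...
print(repr(chunks[1][:5]))   # 'test ' -- ...reappear verbatim at the start of the next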

The reason it works with Hugging Face is that if the tokenizer has a whitespace-prefix setting enabled (which many do), the vocabulary contains tokens for many words both with and without a leading space. Including the whitespace therefore doesn't increase the token count; it's the same number of tokens either way.
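For example, a minimal sketch with the tokenizers library (GPT-2 here is just an illustration of a byte-level BPE with a whitespace prefix, not something this crate requires):

from tokenizers import Tokenizer

# GPT-2's vocabulary has separate entries for "test" and " test"
# (the latter stored as "Ġtest"), so the leading space costs no extra tokens.
tokenizer = Tokenizer.from_pretrained("gpt2")
print(tokenizer.encode("test").tokens)   # ['test']  -> 1 token
print(tokenizer.encode(" test").tokens)  # ['Ġtest'] -> 1 token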

It isn't being trimmed; it's just that including the whitespace would exceed the allowed character count for the overlap.

If you want the behavior you are describing, you would likely need to use the Hugging Face tokenizer for your use case.
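Something like this sketch, based on the token-based constructor from the README (I'm assuming trim and overlap are passed the same way as in your character-based example):

from semantic_text_splitter import TextSplitter
from tokenizers import Tokenizer

# Sizing by tokens: with a whitespace-prefixing tokenizer, the leading
# space survives in the overlap because it doesn't add to the token count.
tokenizer = Tokenizer.from_pretrained("gpt2")
splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, 20, trim=False, overlap=5)
print(splitter.chunks("This is a test document. It has two sentences."))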