benbrandt / text-splitter

Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python.

TextSplitter by Character trims leading whitespaces #298

Closed lambda-science closed 1 month ago

lambda-science commented 1 month ago

Describe the bug TextSplitter by character with trim=False removes leading whitespace. This is not the case with the Hugging Face tokenizer text splitter.

To Reproduce

from semantic_text_splitter import TextSplitter
splitter = TextSplitter(20, trim=False, overlap=5)
split_text = splitter.chunks("This is a test document. It has two sentences. Maybe three ? To update your OXE type 'please help me' in the command line interface.")
print(split_text)
Output:
['This is a test ', 'test document. ', 'It has two sentences', '. Maybe three ? ', 'To update your OXE ', " OXE type 'please ", " help me' in the ", ' the command line ', 'line interface.']

In the list element at index 1, the leading overlapping whitespace in front of "test" is missing. This then causes issues when we want to compute the overlap and merge texts back together.

Expected behavior If trim=False, do not remove leading whitespace. TextSplitter.from_huggingface_tokenizer() behaves exactly this way: leading whitespace is not removed.


benbrandt commented 1 month ago

Hi @lambda-science, this example actually looks correct to me. You have an overlap of 5 characters, and the last 5 characters of the first chunk are "test " (the word plus its trailing space).
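You can check this directly against the output above, just by slicing the reported chunks in plain Python:

chunks = ['This is a test ', 'test document. ']
print(repr(chunks[0][-5:]))  # 'test ' -- the last 5 characters of the first chunk...
print(repr(chunks[1][:5]))   # 'test ' -- ...reappear verbatim at the start of the next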

The reason it works with Hugging Face is that if the tokenizer has a whitespace-prefix setting enabled (which many do), the vocabulary contains tokens for many words both with and without a leading space. Including the whitespace therefore doesn't increase the token count; it's the same number of tokens either way.
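For example, a minimal sketch with the tokenizers library (GPT-2 here is just an illustration of a byte-level BPE with a whitespace prefix, not something this crate requires):

from tokenizers import Tokenizer

# GPT-2's vocabulary has separate entries for "test" and " test"
# (the latter stored as "Ġtest"), so the leading space costs no extra tokens.
tokenizer = Tokenizer.from_pretrained("gpt2")
print(tokenizer.encode("test").tokens)   # ['test']  -> 1 token
print(tokenizer.encode(" test").tokens)  # ['Ġtest'] -> 1 token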

It isn't being trimmed; it's just that including the whitespace would exceed the allowed character count for the overlap.

If you want the behavior you are describing, you would likely need to use the Hugging Face tokenizer for your use case.
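Something like this sketch, based on the token-based constructor from the README (I'm assuming trim and overlap are passed the same way as in your character-based example):

from semantic_text_splitter import TextSplitter
from tokenizers import Tokenizer

# Sizing by tokens: with a whitespace-prefixing tokenizer, the leading
# space survives in the overlap because it doesn't add to the token count.
tokenizer = Tokenizer.from_pretrained("gpt2")
splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, 20, trim=False, overlap=5)
print(splitter.chunks("This is a test document. It has two sentences."))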