Closed: lambda-science closed this issue 1 month ago.
Hi @lambda-science, this example actually looks correct to me. You have an overlap of 5 characters, and the last 5 characters of the first chunk are `test`.
The reason it works with Hugging Face is that if the tokenizer has a whitespace-prefix setting enabled (as many do), it has tokens for many words both with and without the leading whitespace, so including the whitespace doesn't increase the token count.
The whitespace isn't being trimmed; it is just that including it would push the overlap over its allowed character count.
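A quick way to see the difference between token-based and character-based sizing, using the `tokenizers` library (the gpt2 model here is just an illustrative choice):

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("gpt2")

# Token count: gpt2 has distinct single tokens for "test" and " test",
# so the leading whitespace does not add a token.
print(len(tok.encode("test", add_special_tokens=False).ids))   # 1
print(len(tok.encode(" test", add_special_tokens=False).ids))  # 1

# Character count: the whitespace does count, so " test" is one
# character longer and can push an overlap past its budget.
print(len("test"), len(" test"))  # 4 5
```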
If you want the behavior you are describing, you would likely need to use the Hugging Face tokenizer for your use case.
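A minimal sketch of that, assuming the `semantic_text_splitter` Python bindings and their `overlap`/`trim` keyword arguments (the sizes are placeholders):

```python
from semantic_text_splitter import TextSplitter
from tokenizers import Tokenizer

# With tokenizer-based sizing, capacity and overlap are measured in
# tokens, so the leading whitespace of an overlapped chunk costs nothing.
tokenizer = Tokenizer.from_pretrained("gpt2")
splitter = TextSplitter.from_huggingface_tokenizer(
    tokenizer, 10, overlap=5, trim=False  # placeholder sizes
)
print(splitter.chunks("this is a test of the overlap behaviour"))
```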
Describe the bug

`TextSplitter` by character with `trim=False` removes leading whitespaces. This is not the case with the Hugging Face tokenizer text splitter.

To Reproduce
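A minimal sketch of the kind of setup being described (the capacity, overlap, and input string here are hypothetical, not the reporter's exact values):

```python
from semantic_text_splitter import TextSplitter

# Hypothetical sizes: 10-character chunks with a 5-character overlap,
# trimming disabled.
splitter = TextSplitter(10, overlap=5, trim=False)
chunks = splitter.chunks("this is a test string")
print(chunks)
# The element at index 1 reportedly starts with "test" rather than
# " test": the leading overlap whitespace has been dropped.
```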
In the list element at index 1, the leading overlapping whitespace in front of `test` is missing, which then causes issues when we want to calculate the overlap and merge texts.

Expected behavior

If `trim=False`, do not remove leading whitespaces. The behaviour of `TextSplitter.from_huggingface_tokenizer()` is exactly this: leading whitespaces are not removed.

Desktop (please complete the following information):