OoriData / OgbujiPT

Client-side toolkit for using large language models, including where self-hosted
Apache License 2.0

chunk_overlap should be able to be 0 or not passed in #30

Closed choccccy closed 3 months ago

choccccy commented 1 year ago

There are some situations where having lots of chunk overlap isn't super useful; being able to set it as 0 (or, perhaps, just not set it, and have it assume it should be 0) would be nice.

I've also thought that maybe the chunk overlap should default to a percentage of the chunk size.
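To make the two options concrete, here is a minimal sketch of how the effective overlap could be resolved. `resolve_chunk_overlap` and its `overlap_fraction` parameter are hypothetical names made up for illustration, not part of OgbujiPT; the only behavior taken from this issue is "0 or not passed means no overlap" and "maybe default to a percentage of the chunk size".

```python
def resolve_chunk_overlap(chunk_size, chunk_overlap=None, overlap_fraction=0.0):
    '''
    Hypothetical helper (an assumption, not OgbujiPT API): work out the
    overlap to use when the caller may omit chunk_overlap entirely.
    '''
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    if chunk_overlap is None:
        # Not passed in: derive it from a fraction of chunk_size.
        # The default fraction of 0.0 means no overlap at all.
        chunk_overlap = int(chunk_size * overlap_fraction)
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must be >= 0 and smaller than chunk_size')
    return chunk_overlap
```

With this shape, `resolve_chunk_overlap(100)` and `resolve_chunk_overlap(100, 0)` both give 0, while `resolve_chunk_overlap(100, overlap_fraction=0.25)` gives 25.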

uogbuji commented 1 year ago

Yeah, as you reason through it, I agree. Default to 0.

choccccy commented 1 year ago

The issue specifically is that the chunks are overlapping by about two "sections"/words when tested with " " as the separator, so when I put in the classic lorem ipsum:

Lorem Ipsum

```
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce vestibulum nisl eget mauris malesuada, quis facilisis arcu vehicula. Sed consequat, quam ut auctor volutpat, augue ex tincidunt massa, in varius nulla ex vel ipsum. Nullam vitae eros nec ante sagittis luctus. Nullam scelerisque dolor eu orci iaculis, at convallis nulla luctus. Praesent eget ex id arcu facilisis varius vel id neque. Donec non orci eget elit aliquam tempus. Sed at tortor at tortor congue dictum. Nulla varius erat at libero lacinia, id dignissim risus auctor. Ut eu odio vehicula, tincidunt justo ac, viverra erat. Sed nec sem sit amet erat malesuada finibus. Nulla sit amet diam nec dolor tristique dignissim. Sed vehicula, justo nec posuere eleifend, libero ligula interdum neque, at lacinia arcu quam non est. Integer aliquet, erat id dictum euismod, felis libero blandit lorem, nec ullamcorper quam justo at elit.
```

and assign a chunk size of 100 and overlap of 0, I would expect that my chunks would look like this:

Expected result

```python
chunks[0]: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce vestibulum nisl eget mauris malesuada,'
chunks[1]: 'quis facilisis arcu vehicula. Sed consequat, quam ut auctor volutpat, augue ex tincidunt massa, in varius'
chunks[2]: 'nulla ex vel ipsum. Nullam vitae eros nec ante sagittis luctus. Nullam scelerisque dolor eu orci iaculis,'
chunks[3]: 'at convallis nulla luctus. Praesent eget ex id arcu facilisis varius vel id neque. Donec non orci eget'
chunks[4]: 'elit aliquam tempus. Sed at tortor at tortor congue dictum. Nulla varius erat at libero lacinia, id dignissim'
chunks[5]: 'risus auctor. Ut eu odio vehicula, tincidunt justo ac, viverra erat. Sed nec sem sit amet erat malesuada'
chunks[6]: 'finibus. Nulla sit amet diam nec dolor tristique dignissim. Sed vehicula, justo nec posuere eleifend,'
chunks[7]: 'libero ligula interdum neque, at lacinia arcu quam non est. Integer aliquet, erat id dictum euismod, felis'
chunks[8]: 'libero blandit lorem, nec ullamcorper quam justo at elit.'
```

but instead, when tested, they overlap by about 2 words (after the first and presumably before the last chunk):

Actual result

```python
chunk[0]: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce vestibulum nisl eget mauris malesuada,'
chunk[1]: 'mauris malesuada, quis facilisis arcu vehicula. Sed consequat, quam ut auctor volutpat, augue ex tincidunt massa,'
chunk[2]: 'tincidunt massa, in varius nulla ex vel ipsum. Nullam vitae eros nec ante sagittis luctus. Nullam scelerisque dolor'
chunk[3]: 'scelerisque dolor eu orci iaculis, at convallis nulla luctus. Praesent eget ex id arcu facilisis varius vel id neque.'
chunk[4]: 'id neque. Donec non orci eget elit aliquam tempus. Sed at tortor at tortor congue dictum. Nulla varius erat'
chunk[5]: 'varius erat at libero lacinia, id dignissim risus auctor. Ut eu odio vehicula, tincidunt justo ac, viverra erat.'
chunk[6]: 'viverra erat. Sed nec sem sit amet erat malesuada finibus. Nulla sit amet diam nec dolor tristique dignissim.'
chunk[7]: 'tristique dignissim. Sed vehicula, justo nec posuere eleifend, libero ligula interdum neque, at lacinia arcu quam'
chunk[8]: 'arcu quam non est. Integer aliquet, erat id dictum euismod, felis libero blandit lorem, nec ullamcorper quam'
```

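For reference, the zero-overlap behavior described above can be sketched in a few lines. This is an illustration only, not OgbujiPT's implementation: `split_no_overlap` is a hypothetical name, and the rule "close a chunk once it reaches at least `chunk_size` characters" is an assumption inferred from the expected chunks above, some of which run a little past 100 characters.

```python
def split_no_overlap(text, chunk_size, separator=' '):
    '''
    Hypothetical sketch (not the library's code): yield chunks of roughly
    chunk_size characters, with no text shared between neighboring chunks.
    '''
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    current = []      # pieces collected for the chunk in progress
    current_len = 0   # length the chunk would have if joined right now
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        # Joining adds one separator between consecutive pieces
        current_len += len(piece) + (len(separator) if current else 0)
        current.append(piece)
        if current_len >= chunk_size:
            # Chunk is full: emit it and start fresh, carrying nothing over
            yield separator.join(current)
            current, current_len = [], 0
    if current:
        yield separator.join(current)  # whatever is left at the end


print(list(split_no_overlap('aa bb cc dd ee', 5)))
# ['aa bb', 'cc dd', 'ee']
```

Because nothing is carried over between chunks, joining the chunks back together with the separator reproduces the input (for input without repeated separators).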
uogbuji commented 4 months ago

I'm nudged by an article I'm working on to look at this. Here is my plan, in brief:

uogbuji commented 4 months ago

Actually, after another look, I'll create text_split_fuzzy and text_split as generators, then leave text_splitter as before, though deprecated.
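The general pattern described here (a new generator-based function, with the old list-returning function kept but deprecated) might look roughly like the sketch below. The names `split_chunks` and `legacy_splitter` are hypothetical stand-ins for the new generator and the old `text_splitter`; the signatures and the splitting logic are assumptions for illustration, not the library's actual code.

```python
import warnings


def split_chunks(text, chunk_size, separator=' '):
    '''
    Hypothetical generator stand-in: lazily yield non-overlapping chunks
    no longer than chunk_size (unless a single piece is longer by itself).
    '''
    current = ''
    for piece in text.split(separator):
        candidate = f'{current}{separator}{piece}' if current else piece
        if current and len(candidate) > chunk_size:
            yield current      # adding this piece would overflow the chunk
            current = piece    # so it starts the next chunk instead
        else:
            current = candidate
    if current:
        yield current


def legacy_splitter(text, chunk_size, separator=' '):
    '''
    Hypothetical deprecated wrapper: same list-returning behavior as before,
    now delegating to the generator and warning callers to migrate.
    '''
    warnings.warn(
        'legacy_splitter is deprecated; use split_chunks instead',
        DeprecationWarning,
        stacklevel=2,
    )
    return list(split_chunks(text, chunk_size, separator))
```

Callers that only iterate over the chunks can switch to the generator without materializing the whole list, while existing callers of the old function keep working and see a `DeprecationWarning`.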