bhavnicksm / chonkie

🦛 CHONK your texts with Chonkie ✨ - The no-nonsense RAG chunking library
https://pypi.org/project/chonkie/
MIT License

[BUG] Newlines are not removed after pre-processing in SemanticChunker #55

Closed Pringled closed 3 days ago

Pringled commented 4 days ago

Describe the bug

Currently, raw_sentences includes "\n" strings. This means that embeddings are created for these "\n" strings, which can have a big impact on the semantic chunking when the text contains a lot of empty lines (such as in books).

To Reproduce

Run the following:

from chonkie import SemanticChunker

chunker = SemanticChunker(chunk_size=512, similarity_threshold=0.7)

text = """This is a text with empty newlines.

Some more text.

"""

chunks = chunker.chunk(text)

Printing raw_sentences inside the chunker then gives the following output: ['This is a text with empty newlines.', '\n', '\n', 'Some more text.', '\n', '\n'].

Expected behavior

Newline-only strings should be removed during pre-processing, for example by running something like raw_sentences = [sentence for sentence in raw_sentences if sentence.strip()] right after raw_sentences is created. This way, no embeddings are created for these strings and they don't affect the semantic chunking.
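As a rough illustration of the filter suggested above, here it is applied to the output reported in this issue. This is only a sketch on a plain list: the helper name drop_blank_sentences is hypothetical and is not part of Chonkie's API.

```python
# Output reported above for the reproduction text.
raw_sentences = [
    'This is a text with empty newlines.', '\n', '\n',
    'Some more text.', '\n', '\n',
]


def drop_blank_sentences(sentences):
    """Hypothetical helper: keep only entries with non-whitespace content."""
    # str.strip() returns '' for whitespace-only strings, which is falsy.
    return [sentence for sentence in sentences if sentence.strip()]


filtered = drop_blank_sentences(raw_sentences)
print(filtered)
# → ['This is a text with empty newlines.', 'Some more text.']
```

Only whitespace-only entries are dropped; sentences that merely end in a newline are kept unchanged.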

bhavnicksm commented 4 days ago

Hey @Pringled,

Just merged #58 which is a fix for this issue; could you test it out via an install from source?

It seemed to work decently on the testing example provided above.

Thanks!

Pringled commented 4 days ago

Hey @bhavnicksm,

This looks much better already, thanks! One edge case I found: at the end of a chapter in a book, or of a section in a paper, the last sentence of the previous one ends up with a lot of newlines appended to it, e.g.

from chonkie import SemanticChunker

chunker = SemanticChunker(chunk_size=512, similarity_threshold=0.7)

text = """
Having restored the condition of time under which all events occur,
we find that a command is executed only when it is related to a
corresponding series of events. Restoring the essential condition of
relation between those who command and those who execute, we find that
by the very nature of the case those who command take the smallest part
in the action itself and that their activity is exclusively directed to
commanding.

CHAPTER VII
"""

chunks = chunker.chunk(text)

Gives

['\n', 'Having restored the condition of time under which all events occur,\n', 'we find that a command is executed only when it is related to a\n', 'corresponding series of events.', ' Restoring the essential condition of\n', 'relation between those who command and those who execute, we find that\n', 'by the very nature of the case those who command take the smallest part\n', 'in the action itself and that their activity is exclusively directed to\n', 'commanding.\n\n\n\n\n\n', 'CHAPTER VII\n']

So in this case, 'commanding.\n\n\n\n\n\n' ends up with many newlines appended to it. To be honest, though, I would say that this falls under "sensible preprocessing" on the user side; fixing it on the Chonkie side would mean making assumptions about the user's use case. Thanks again for fixing the initial issue!
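For anyone who wants to handle this on the user side, one possible preprocessing step is to cap runs of consecutive newlines before chunking. This is a minimal sketch that assumes collapsing long runs is acceptable for your text; collapse_blank_lines and its max_newlines parameter are hypothetical names, not part of Chonkie.

```python
import re


def collapse_blank_lines(text, max_newlines=2):
    """Hypothetical helper: cap runs of consecutive newlines at max_newlines."""
    # Match any run longer than the allowed maximum...
    pattern = r"\n{%d,}" % (max_newlines + 1)
    # ...and replace it with exactly max_newlines newlines.
    return re.sub(pattern, "\n" * max_newlines, text)


sample = "commanding.\n\n\n\n\n\nCHAPTER VII\n"
cleaned = collapse_blank_lines(sample)
print(repr(cleaned))
# → 'commanding.\n\nCHAPTER VII\n'
```

The cleaned string can then be passed to chunker.chunk(...) as in the example above. Whether paragraph breaks should be kept at all (max_newlines=2) or flattened further depends on the use case.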

Pringled commented 3 days ago

Closing this, thanks for fixing this on such short notice!