Open jgen1 opened 1 month ago
@scanny do you have any thoughts on this?
I like the idea of a "sentence boundaries" splitting option. I think we'd have to introduce something like nltk.sent_tokenize() to do that, though.
The first problem with just adding "." to .text_splitting_characters is that the splitting characters are assumed to be whitespace and are removed during the splitting process. So the last sentence in each chunk wouldn't have a period after it.
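To make the dropped-period problem concrete, here is a minimal sketch of a separator-based splitter. It is a simplified stand-in written for illustration, not unstructured's actual implementation; the function name `split_on_separators` and its parameters are hypothetical. It cuts at the last occurrence of the highest-priority separator that fits in the window and discards the separator, which is the behavior described above.

```python
def split_on_separators(text, max_characters, separators=("\n", " ")):
    """Hypothetical, simplified separator-based splitter (not unstructured's code).

    Cuts `text` at the last occurrence of the first separator (in priority
    order) found inside the window, and drops the separator itself.
    """
    chunks = []
    remaining = text
    while len(remaining) > max_characters:
        for sep in separators:
            # Allow the separator to start right at the size limit.
            window = remaining[: max_characters + len(sep)]
            cut = window.rfind(sep)
            if cut > 0:
                chunks.append(remaining[:cut])
                remaining = remaining[cut + len(sep):]  # separator is discarded
                break
        else:
            # No separator in the window: hard cut at the limit.
            chunks.append(remaining[:max_characters])
            remaining = remaining[max_characters:]
    if remaining:
        chunks.append(remaining)
    return chunks


text = "This is a test string. This is the second sentence. Here is the third."

# Whitespace-only separators: sentences are cut mid-way.
print(split_on_separators(text, 30))

# Adding ". " splits at sentence ends, but the period is removed with the separator.
print(split_on_separators(text, 30, separators=(". ", "\n", " ")))
```

With ". " as a separator the chunks do end at sentence boundaries, but every chunk that was cut there loses its trailing period, so a real fix would need to keep the punctuation and drop only the whitespace.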
More broadly, though, I think identifying sentence boundaries is just trickier than a simple regex can handle. This SO answer outlines some of the challenges: https://stackoverflow.com/a/25735848/1902513
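As a small illustration of why a simple regex falls short, the sketch below splits on whitespace that follows ".", "!" or "?". The pattern and the sample text are made up for this example; they are not taken from unstructured or from the linked answer.

```python
import re

# Naive rule: a sentence ends at ".", "!" or "?" followed by whitespace.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_sentences(text):
    """Split text into "sentences" using the naive boundary rule."""
    return NAIVE_BOUNDARY.split(text)


sample = "Dr. Smith paid $3.50 for coffee. Then she left."
print(naive_sentences(sample))
# -> ['Dr.', 'Smith paid $3.50 for coffee.', 'Then she left.']
```

The abbreviation "Dr." is treated as a complete sentence. Handling abbreviations, initials, quotations and similar cases is what tokenizers such as nltk.sent_tokenize() are designed for.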
If I wanted to experiment, though, my first thought would be to mock ChunkingOptions.text_splitting_characters:
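    from unittest.mock import PropertyMock, patch

    # -- import path assumed; adjust to where ChunkingOptions lives in your version --
    from unstructured.chunking.base import ChunkingOptions

    # -- replace the property on the class so every instance sees the new separators --
    _patch = patch.object(
        ChunkingOptions,
        "text_splitting_characters",
        new_callable=PropertyMock,
        return_value=(". ", "\n", " "),  # -- etc., whatever you decide --
    )
    _patch.start()
    chunk_things()  # ...
    _patch.stop()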
There are probably easier ways to monkey patch; this is just the first thing that comes to mind for me.
Problem
The current chunking text-splitting solution breaks sentences in the middle when a chunk exceeds the max character count. It is nice that it won't break up words, because it uses text_splitting_separators, which by default are "\n" and " ", but I would like it to not break up sentences if it can avoid it.
Example
string = "This is a test string. This is the second sentence. Here is the third."
If max_characters = 30 (the 30th character is the "s" in "is" in the second sentence):
CURRENT TEXT-SPLITTING
string_1 = "This is a test string. This is"
string_2 = "the second sentence. Here is"
string_3 = "the third."
But I would like:
PREFERRED TEXT-SPLITTING
string_1 = "This is a test string."
string_2 = "This is the second sentence."
string_3 = "Here is the third."
This kind of splitting is also in line with the idea of unstructured.io -- trying to maintain sections and semantic meaning as best as possible in chunks, constrained by various parameters. It is clearly better in this case to text-split in this way.
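The preferred behavior can be sketched as a greedy packer: split the text into sentences, then add whole sentences to a chunk while they fit, and fall back to word boundaries only when a single sentence exceeds the limit. This is an illustrative sketch under stated assumptions, not unstructured's implementation. The function name `chunk_by_sentence` is hypothetical, and the sentence detection is a naive regex that would mis-split abbreviations.

```python
import re


def chunk_by_sentence(text, max_characters):
    """Hypothetical sketch: pack whole sentences into chunks of limited size."""
    # Naive sentence detection: whitespace that follows ".", "!" or "?".
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_characters:
            current = candidate  # the whole sentence still fits in this chunk
            continue
        if current:
            chunks.append(current)  # close the chunk at a sentence boundary
        # Start a new chunk; split on words only if the sentence itself is too long.
        current = ""
        for word in sentence.split():
            candidate = f"{current} {word}" if current else word
            if len(candidate) <= max_characters or not current:
                current = candidate  # an over-long single word is kept whole
            else:
                chunks.append(current)
                current = word
    if current:
        chunks.append(current)
    return chunks


string = "This is a test string. This is the second sentence. Here is the third."
for chunk in chunk_by_sentence(string, 30):
    print(repr(chunk))
```

With max_characters = 30 this produces the three chunks listed under PREFERRED TEXT-SPLITTING. A larger limit lets several sentences share a chunk, while still never cutting inside a sentence that fits.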
Solution
This text_splitting_separators argument is already a keyword option in the ChunkingOptions class. So I would like it to be passable through the chunk_by_title and/or chunk_elements functions. Essentially, add text_splitting_separators as an argument to the chunking function, defaulting to ["\n", " "].
I understand if you do not want to expose this much to the user, who in many cases may not care about this level of detail. If that is the case, is there any way to do "soft" support of it where it may not be directly exposed but I can input it as an argument? Or some other way to allow me to edit this argument without making a fork of the project? Maybe the default should be ["\n", ".", " "]?
@scanny seems like you are the person to talk to about this