Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

feat(chunk): split on sentence boundaries #3484

Open jgen1 opened 1 month ago

jgen1 commented 1 month ago

Problem

The current chunking text-splitting logic breaks a sentence in the middle when a chunk exceeds the max character count. It is nice that it won't break up words, since it uses text_splitting_separators (by default "\n" and " "), but I would like it to avoid breaking up sentences when it can.

Example

string = "This is a test string. This is the second sentence. Here is the third."

If max_characters = 30 (the 30th character is the "s" in "is" in the second sentence):

CURRENT TEXT-SPLITTING
string_1 = "This is a test string. This is"
string_2 = "the second sentence. Here is"
string_3 = "the third"

But I would like:

PREFERRED TEXT-SPLITTING
string_1 = "This is a test string."
string_2 = "This is the second sentence."
string_3 = "Here is the third."

This kind of splitting is also in line with the idea of unstructured.io -- maintaining sections and semantic meaning as well as possible within chunks, constrained by the various parameters. It is clearly better in this case to split the text this way.
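The preferred behaviour can be sketched with a greedy sentence-packer. This is a rough illustration, not unstructured's implementation: split_on_sentences and its naive boundary regex are my own, and a real version would need proper sentence tokenization.

```python
import re


def split_on_sentences(text: str, max_characters: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_characters.

    Naive sketch: sentence boundaries are detected with a simple regex
    (whitespace following ., ?, or !), and a single sentence longer than
    max_characters is left as one oversized chunk.
    """
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_characters:
            # Adding this sentence would overflow the chunk; start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With the example string and max_characters = 30, this yields the three preferred chunks, each ending on a sentence boundary.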

Solution

This text_splitting_separators argument is already a keyword option in the ChunkingOptions class:

@lazyproperty
def text_splitting_separators(self) -> tuple[str, ...]:
    """Sequence of text-splitting target strings to be used in order of preference."""
    text_splitting_separators_arg = self._kwargs.get("text_splitting_separators")
    return (
        ("\n", " ")
        if text_splitting_separators_arg is None
        else tuple(text_splitting_separators_arg)
    )

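The kwargs-with-default pattern above can be reproduced in a few lines for illustration. MiniChunkingOptions is a stand-in of my own, using functools.cached_property in place of the library's @lazyproperty:

```python
from functools import cached_property


class MiniChunkingOptions:
    """Stand-in (not the real class) showing the kwargs-with-default pattern."""

    def __init__(self, **kwargs):
        self._kwargs = kwargs

    @cached_property  # stands in for unstructured's @lazyproperty
    def text_splitting_separators(self) -> tuple[str, ...]:
        arg = self._kwargs.get("text_splitting_separators")
        # Fall back to the built-in default when the caller passed nothing.
        return ("\n", " ") if arg is None else tuple(arg)
```

So plumbing the option through chunk_by_title / chunk_elements would mostly be a matter of forwarding the keyword argument into this existing kwargs dict.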
So I would like this to be a passable option through the chunk_by_title and/or chunk_elements functions -- essentially, adding text_splitting_separators as an argument to the chunking function, defaulting to ["\n", " "].

I understand if you do not want to expose this much detail to users, many of whom may not care about it. If that is the case, is there any way to provide "soft" support, where it is not directly exposed but I can still pass it in as an argument? Or some other way to override it without forking the project? Maybe the default should be ["\n", ".", " "]?

@scanny seems like you are the person to talk to about this

jgen1 commented 3 weeks ago

@scanny do you have any thoughts on this?

scanny commented 3 weeks ago

I like the idea of a "sentence boundaries" splitting option. I think we'd have to introduce something like nltk.sent_tokenize() to do that though.

The first problem with just adding "." to .text_splitting_separators is that the splitting characters are assumed to be whitespace and are removed during the splitting process. So the last sentence in each chunk wouldn't have a period after it.
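The separator-consumption problem can be seen with plain str.split(), which is only a rough analogy for what the chunker's splitter does, but it loses the separator in the same way:

```python
# str.split() consumes the separator, which is why adding "." to the
# splitting characters would strip the trailing period from each chunk.
text = "First sentence. Second sentence."
parts = text.split(". ")
# The first sentence's period is gone from the output.
```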

More broadly, though, I think identifying sentence boundaries is trickier than a simple regex can handle. This SO answer outlines some of the challenges: https://stackoverflow.com/a/25735848/1902513
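One concrete pitfall, as a quick sketch: a naive period-based boundary regex (my own, for illustration) mis-splits abbreviations, producing four fragments from what are really two sentences:

```python
import re

# Split on whitespace that follows a period -- naive sentence boundaries.
naive = re.split(r"(?<=\.)\s+", "Dr. Smith arrived. He left at 5 p.m. sharp.")
# "Dr." and "p.m." both trigger false boundaries.
```

This is the kind of case where something like nltk.sent_tokenize(), which knows about common abbreviations, earns its keep.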

If I wanted to experiment, though, my first thought would be to mock ChunkingOptions.text_splitting_separators:

from unittest.mock import PropertyMock, patch

from unstructured.chunking.base import ChunkingOptions  # import path may vary by version

_patch = patch.object(
    ChunkingOptions,
    "text_splitting_separators",
    new_callable=PropertyMock,
    return_value=(". ", "\n", " "),  # -- etc., whatever you decide --
)
_patch.start()

chunk_things()  # -- your chunking calls here --

_patch.stop()

There are probably easier ways to monkey-patch; this is just the first thing that comes to mind for me.
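The mock-based approach above can be tried without the library installed by using a stand-in class. Options here is a placeholder of my own, not the real ChunkingOptions, but the patching mechanics are identical:

```python
from unittest.mock import PropertyMock, patch


class Options:
    """Stand-in for ChunkingOptions, just to demonstrate the patch."""

    @property
    def text_splitting_separators(self):
        return ("\n", " ")


# Patch the property for the duration of the with-block only.
with patch.object(
    Options,
    "text_splitting_separators",
    new_callable=PropertyMock,
    return_value=(". ", "\n", " "),
):
    patched = Options().text_splitting_separators

restored = Options().text_splitting_separators  # original property is back
```

Using the context-manager form instead of start()/stop() guarantees the patch is undone even if the chunking call raises.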