devstein / langchain

Add support for regular expression in CharacterTextSplitter #11

Closed: devstein closed this issue 1 year ago

devstein commented 1 year ago

Feature request

The current version of CharacterTextSplitter does not support regular expressions in the separator argument.

Motivation

Many documents need to be split on slightly varying patterns, which a regular expression can handle.

Your contribution

dosubot[bot] commented 1 year ago

File: langchain/text_splitter.py

To add support for regular expressions in the CharacterTextSplitter, you can make the following changes:

  1. Import the re module at the beginning of the file.
  2. Update the __init__ method of the CharacterTextSplitter class to accept a use_regex parameter with a default value of False.
  3. Update the split_text method of the CharacterTextSplitter class to use the re.split() function if use_regex is True.

Here's the updated code for the CharacterTextSplitter class:

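import re
from typing import Any, List

class CharacterTextSplitter(TextSplitter):
    """Implementation of splitting text that looks at characters."""

    def __init__(self, separator: str = "\n\n", use_regex: bool = False, **kwargs: Any):
        """Create a new TextSplitter."""
        super().__init__(**kwargs)
        self._separator = separator
        # When True, `separator` is interpreted as a regular expression pattern.
        self._use_regex = use_regex

    def split_text(self, text: str) -> List[str]:
        """Split incoming text and return chunks."""
        # First we naively split the large input into a bunch of smaller ones.
        if not self._separator:
            # Empty separator: fall back to one split per character.
            splits = list(text)
        elif self._use_regex:
            # Note: capturing groups in the pattern are returned by re.split() as extra items.
            splits = re.split(self._separator, text)
        else:
            splits = text.split(self._separator)
        # Caveat: the raw separator string is passed on to _merge_splits, so in regex
        # mode any re-joining there would use the pattern text, not the matched text.
        return self._merge_splits(splits, self._separator)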

With these changes, passing use_regex=True makes CharacterTextSplitter treat the separator as a regular expression.
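
To make the difference concrete, here is a minimal standalone sketch of just the naive splitting step. It does not import langchain: `naive_split` is a hypothetical helper written for this illustration, and the sample text and patterns are assumptions chosen to show paragraph breaks written in slightly different ways.

```python
import re
from typing import List


def naive_split(text: str, separator: str, use_regex: bool = False) -> List[str]:
    """Hypothetical helper mirroring only the first step of split_text."""
    if not separator:
        # No separator: one piece per character.
        return list(text)
    if use_regex:
        # Treat the separator as a regular expression pattern.
        return re.split(separator, text)
    # Treat the separator as a literal string.
    return text.split(separator)


# Paragraph breaks written three slightly different ways.
text = "first\n\nsecond\n \nthird\n\n\nfourth"

literal = naive_split(text, "\n\n")
pattern = naive_split(text, r"\n\s*\n", use_regex=True)
captured = naive_split(text, r"(\n\s*\n)", use_regex=True)

print(literal)   # the literal separator misses the "\n \n" break
print(pattern)   # the regex catches all three variants
print(captured)  # a capturing group keeps the matched separators as items
```

The literal separator only recognises an exact "\n\n", so it leaves "second" and "third" in one piece and a stray newline in front of "fourth". The pattern form yields four clean pieces. If the pattern contains a capturing group, re.split() also returns the matched separators, which may or may not be what a caller wants.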