Add support for regular expression in CharacterTextSplitter

File: langchain/text_splitter.py

To add support for regular expressions in the CharacterTextSplitter, you can make the following changes:

1. Import the re module at the beginning of the file.
2. Update the __init__ method of the CharacterTextSplitter class to accept a use_regex parameter with a default value of False.
3. Update the split_text method of the CharacterTextSplitter class to call re.split() when use_regex is True.

Here's the updated code for the CharacterTextSplitter class:

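import re
from typing import Any, List


class CharacterTextSplitter(TextSplitter):
    """Implementation of splitting text that looks at characters."""

    def __init__(self, separator: str = "\n\n", use_regex: bool = False, **kwargs: Any):
        """Create a new TextSplitter.

        If use_regex is True, separator is treated as a regular expression pattern.
        """
        super().__init__(**kwargs)
        self._separator = separator
        self._use_regex = use_regex

    def split_text(self, text: str) -> List[str]:
        """Split incoming text and return chunks."""
        # First we naively split the large input into a bunch of smaller ones.
        if self._use_regex and self._separator:
            # Treat the separator as a regex pattern.
            splits = re.split(self._separator, text)
        elif self._separator:
            # Treat the separator as a literal string.
            splits = text.split(self._separator)
        else:
            # No separator: split into individual characters.
            splits = list(text)
        # Caveat: the separator is also handed to _merge_splits; in regex mode
        # that is the raw pattern, not the text it matched.
        return self._merge_splits(splits, self._separator)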

With these changes, passing use_regex=True makes CharacterTextSplitter treat its separator as a regular expression pattern. The default (use_regex=False) keeps the existing literal-string behavior.
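To see how the two modes differ, here is a minimal stand-alone sketch of the splitting branch only. The helper name naive_split is hypothetical and not part of langchain; it leaves out chunk merging entirely and only contrasts str.split() with re.split() on the same input.

```python
import re
from typing import List


def naive_split(text: str, separator: str, use_regex: bool = False) -> List[str]:
    """Hypothetical helper mirroring the branch logic described above."""
    if use_regex and separator:
        # Treat the separator as a regular expression pattern.
        return re.split(separator, text)
    if separator:
        # Treat the separator as a literal string.
        return text.split(separator)
    # No separator: fall back to individual characters.
    return list(text)


sample = "first\n\nsecond\n\n\nthird"

# Literal mode: a run of three newlines leaves one newline behind.
literal_pieces = naive_split(sample, "\n\n")
print(literal_pieces)  # ['first', 'second', '\nthird']

# Regex mode: "\n{2,}" consumes any run of two or more newlines.
regex_pieces = naive_split(sample, r"\n{2,}", use_regex=True)
print(regex_pieces)  # ['first', 'second', 'third']
```

One thing to keep in mind when choosing a pattern: if it contains capturing groups, re.split() also returns the matched separator text as extra list items, so non-capturing groups such as (?:...) are usually the safer choice here.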

devstein / langchain

Add support for regular expression in CharacterTextSplitter #11
