Closed devstein closed 1 year ago
File: langchain/text_splitter.py
To add support for regular expressions in the CharacterTextSplitter, you can make the following changes:
1. Import the re module at the beginning of the file.
2. Update the __init__ method of the CharacterTextSplitter class to accept a use_regex parameter with a default value of False.
3. Update the split_text method of the CharacterTextSplitter class to use the re.split() function if use_regex is True.

Here's the updated code for the CharacterTextSplitter class:
import re
from typing import Any, List


class CharacterTextSplitter(TextSplitter):
    """Implementation of splitting text that looks at characters."""

    def __init__(self, separator: str = "\n\n", use_regex: bool = False, **kwargs: Any):
        """Create a new TextSplitter."""
        super().__init__(**kwargs)
        self._separator = separator
        # When True, the separator is treated as a regular expression pattern.
        self._use_regex = use_regex

    def split_text(self, text: str) -> List[str]:
        """Split incoming text and return chunks."""
        # First we naively split the large input into a bunch of smaller ones.
        # An empty separator always falls through to per-character splitting.
        if self._use_regex and self._separator:
            splits = re.split(self._separator, text)
        elif self._separator:
            splits = text.split(self._separator)
        else:
            splits = list(text)
        # Caveat: the separator string is passed on unchanged, so in regex mode
        # _merge_splits receives the raw pattern text, not the matched delimiter.
        return self._merge_splits(splits, self._separator)
With these changes, you can now use regular expressions as separators in the CharacterTextSplitter class.
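To see what the use_regex switch changes without installing anything, here is a minimal standalone sketch of the same branching. The function name naive_split is hypothetical and is not part of langchain; it only mirrors the three-way choice between re.split(), str.split(), and per-character splitting, and it assumes an empty separator should always mean per-character splitting.

```python
import re
from typing import List


def naive_split(text: str, separator: str, use_regex: bool = False) -> List[str]:
    """Hypothetical helper mirroring the branching in split_text."""
    if use_regex and separator:
        # Treat the separator as a regular expression pattern.
        return re.split(separator, text)
    if separator:
        # Treat the separator as a literal string.
        return text.split(separator)
    # No separator: fall back to one piece per character.
    return list(text)


# The same separator string behaves differently in the two modes.
regex_pieces = naive_split("a1b22c", r"\d+", use_regex=True)
literal_pieces = naive_split("a1b22c", r"\d+")
print(regex_pieces)    # ['a', 'b', 'c']
print(literal_pieces)  # ['a1b22c']
```

One thing to watch for: if the pattern contains a capturing group, re.split() also returns the captured text, so re.split(r"(\d+)", "a1b22c") yields ['a', '1', 'b', '22', 'c']. Use a non-capturing group, (?:...), when the delimiters should be dropped.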
Feature request
The current version of CharacterTextSplitter does not provide support for a regular expression in the separator argument.
Motivation
Many documents can be split by slightly different patterns that a single regular expression can handle.
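As an illustration of this motivation, the sketch below uses only Python's standard re module on a made-up sample string, where paragraphs are separated by blank lines that vary slightly (an extra newline, or a line containing only a space). The sample text and the pattern are assumptions chosen for the example, not taken from a real document.

```python
import re

# Made-up sample: three paragraph breaks, each written slightly differently.
sample = "A\n\nB\n \nC\n\n\nD"

# A literal separator only matches the exact "\n\n" sequence.
literal_pieces = sample.split("\n\n")

# A pattern matches a newline, any run of whitespace, then another newline.
regex_pieces = re.split(r"\n\s*\n", sample)

print(literal_pieces)  # ['A', 'B\n \nC', '\nD']
print(regex_pieces)    # ['A', 'B', 'C', 'D']
```

The literal split misses the break that contains a space and leaves a stray newline in front of the last piece, while the pattern covers all three variants.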
Your contribution