Open davidsbatista opened 1 week ago
@davidsbatista This sounds great! One idea I had is some way to indicate that we'd like to use something like NLTK for sentence splitting. Normally, I think the list of separators would look like ["\n\n", ".", " "] to split by paragraph, then by sentence, and then by word. I was wondering if we could replace "." with something like "nltk" or some other tag to indicate that we'd like to use a separate algorithm to handle the splitting.
What do you think?
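To make the idea concrete, here is a minimal sketch of how a tag in the separator list could dispatch to a separate splitting algorithm. Everything named here is hypothetical: `SPECIAL_SPLITTERS`, `split_once`, `recursive_split`, and the "sentence" tag are not part of any existing API, and `naive_sentence_split` is only a regex stand-in so the example runs without extra dependencies. An "nltk" tag could map to `nltk.sent_tokenize` instead.

```python
import re
from typing import Callable, Dict, List


def naive_sentence_split(text: str) -> List[str]:
    """Hypothetical stand-in for an NLTK-style sentence tokenizer.

    Splits on whitespace that follows '.', '!' or '?'.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


# Hypothetical registry: separator tags that trigger a dedicated algorithm
# instead of a literal string split. With NLTK installed, one could add
# {"nltk": nltk.sent_tokenize} here.
SPECIAL_SPLITTERS: Dict[str, Callable[[str], List[str]]] = {
    "sentence": naive_sentence_split,
}


def split_once(text: str, separator: str) -> List[str]:
    """Split by a registered tag if there is one, else by the literal separator."""
    if separator in SPECIAL_SPLITTERS:
        return SPECIAL_SPLITTERS[separator](text)
    return [part for part in text.split(separator) if part]


def recursive_split(text: str, separators: List[str], max_chars: int) -> List[str]:
    """Try separators in order until every chunk fits in max_chars.

    Simplifications: literal separators are dropped from the output, small
    chunks are not merged back together, and a chunk that is still too long
    once the separators run out is returned as-is.
    """
    if len(text) <= max_chars or not separators:
        return [text]
    chunks: List[str] = []
    for part in split_once(text, separators[0]):
        chunks.extend(recursive_split(part, separators[1:], max_chars))
    return chunks


text = "First sentence here. Second one!\n\nNew paragraph."
print(recursive_split(text, ["\n\n", "sentence", " "], max_chars=25))
```

With this layout, the separator list stays a plain list of strings, and adding a new algorithm only means registering another tag.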
Also, I wanted to ask: will the splitting by separators (e.g. ["\n\n", ".", " "]) be handled using a regex splitter? I think supporting regex would be great, so we could provide more complicated separators to better handle complex documents and do things like header detection.
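As a rough illustration of what regex separators could enable, here is a small sketch that treats every separator as a regular expression and uses a lookahead to split before Markdown-style headers. The function name `regex_recursive_split` and the patterns are assumptions made for this example, not a proposed API.

```python
import re
from typing import List


def regex_recursive_split(text: str, patterns: List[str], max_chars: int) -> List[str]:
    """Recursively split text, treating each separator as a regex pattern.

    Patterns should use non-capturing groups, because re.split also returns
    the contents of capturing groups. A chunk that is still too long once the
    patterns run out is returned as-is.
    """
    if len(text) <= max_chars or not patterns:
        return [text]
    chunks: List[str] = []
    for part in re.split(patterns[0], text):
        if part:  # skip empty strings produced by adjacent matches
            chunks.extend(regex_recursive_split(part, patterns[1:], max_chars))
    return chunks


# Assumed patterns for this example:
#  1. a newline followed by a Markdown header ("# " through "###### "); the
#     lookahead keeps the header attached to the section it introduces
#  2. whitespace that follows a period (a crude sentence boundary)
patterns = [r"\n(?=#{1,6} )", r"(?<=\.)\s+"]

doc = "# Intro\nShort text.\n## Details\nMore text here. And more."
print(regex_recursive_split(doc, patterns, max_chars=30))
```

Plain string separators such as "\n\n" would still work under this scheme as long as they are passed through `re.escape` first, since characters like "." have a special meaning in a pattern.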
That's a good suggestion, I will take it into consideration.
Use a set of predefined separators to split text recursively. The process follows these steps: