deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

feat: Find alternative for NLTK usage in our DocumentSplitter #5922

Open sjrl opened 1 year ago

sjrl commented 1 year ago

Original issue: https://github.com/deepset-ai/haystack/issues/5675

While we have merged a basic version of TextDocumentSplitter, it doesn't support whitespace cleaning or tokenization, so let's keep this issue open. @sjrl You wanted to share your opinion on NLTK usage in the preprocessor?

Originally posted by @julian-risch in https://github.com/deepset-ai/haystack/issues/5675#issuecomment-1737133620

I just wanted to say that I think it is still worth supporting NLTK; however, I think we could also benefit from looking at other options. The Sol team has often run into cases where sentence detection works poorly on documents that contain bullet points and other markdown-like elements (e.g., code, headers): those elements aren't detected as separate "sentences", which in the case of bullet points sometimes leads to extremely long documents, since all the bullet points get grouped into a single document.

So I was wondering if it would be possible to look into other text-processing libraries that natively handle these cases better than NLTK. I realize this might be out of scope for now, but I wanted to bring it up.

Originally posted by @sjrl in https://github.com/deepset-ai/haystack/issues/5675#issuecomment-1737208535
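For context, a minimal standalone snippet (not Haystack code) showing the kind of input described above being run through NLTK's sentence tokenizer; the exact grouping depends on the punkt model, but bullet items without sentence-ending punctuation tend to be merged into one "sentence":

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # punkt model is required by sent_tokenize

text = (
    "The release includes:\n"
    "- faster indexing\n"
    "- new retrievers\n"
    "- bug fixes\n"
)
# Bullet items carry no ".", "?" or "!", so they are usually not split apart.
print(sent_tokenize(text))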

Timoeller commented 1 year ago

I changed the name of the issue to make it clearer that we should look into alternatives to the existing NLTK-based implementation.

GivAlz commented 1 month ago

I have recently used DocumentSplitter and was not happy with the split_by="sentence" behavior.

I noticed that the current solution simply splits at the "." symbol:

def _split_into_units(self, text: str, split_by: Literal["word", "sentence", "passage", "page"]) -> List[str]:
    if split_by == "page":
        self.split_at = "\f"
    elif split_by == "passage":
        self.split_at = "\n\n"
    elif split_by == "sentence":
        self.split_at = "." <--- here
    elif split_by == "word":
        self.split_at = " "
    else:
        raise NotImplementedError(
            "DocumentSplitter only supports 'word', 'sentence', 'page' or 'passage' split_by options."
        )
    units = text.split(self.split_at)
    # Add the delimiter back to all units except the last one
    for i in range(len(units) - 1):
        units[i] += self.split_at
    return units

Would it make sense to use a regex that splits at multiple symbols, e.g., "." or "?" or "!"? If the objective is to split at bullet points, those could be added too...
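A rough sketch of what such a regex split could look like; the pattern and helper name are illustrative, not existing Haystack code:

import re
from typing import List

def split_sentences_regex(text: str) -> List[str]:
    # Split on whitespace that follows ".", "?" or "!"; the lookbehind keeps
    # the punctuation attached to the preceding unit.
    units = re.split(r"(?<=[.!?])\s+", text)
    return [u for u in units if u]

print(split_sentences_regex("Is it fast? Yes! It scales well."))
# -> ['Is it fast?', 'Yes!', 'It scales well.']

# Bullet points could be covered by also splitting before list markers, e.g.
# re.split(r"(?<=[.!?])\s+|\n(?=[-*] )", text)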

The other option would be to use something like spaCy's sentencizer, but there might be performance/licensing issues.
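For reference, the spaCy route could look roughly like this, using the rule-based sentencizer component (no trained model download needed):

import spacy

nlp = spacy.blank("en")       # blank pipeline, no statistical model
nlp.add_pipe("sentencizer")   # rule-based sentence boundary detection

doc = nlp("First point. Second point? Third point!")
print([sent.text for sent in doc.sents])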

If regex is allowed, I think the other split methods could be improved too (e.g., passage could be split at [\n]{2,}).
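And a sketch of the passage case, under the same assumption that a regex is acceptable:

import re

text = "First paragraph.\n\n\nSecond paragraph.\n\nThird paragraph."
# Two or more consecutive newlines count as a passage boundary,
# instead of requiring exactly "\n\n".
print(re.split(r"\n{2,}", text))
# -> ['First paragraph.', 'Second paragraph.', 'Third paragraph.']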

If that's fine, I'd try to make a PR with these changes.


Edit: I noticed issue #8111 ... as an alternative, would it make sense to let the user pass a splitting function or class?

Something like:
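A purely illustrative sketch of that idea; the class and parameter names below are invented for illustration, not an existing Haystack API:

from typing import Callable, List

class CallableDocumentSplitter:
    """Hypothetical splitter that delegates unit detection to a user-supplied callable."""

    def __init__(self, splitting_function: Callable[[str], List[str]]):
        self.splitting_function = splitting_function

    def split(self, text: str) -> List[str]:
        return self.splitting_function(text)

# Any sentence detector could be plugged in, e.g. the regex helper sketched above.
splitter = CallableDocumentSplitter(lambda t: t.split("\n\n"))
print(splitter.split("one\n\ntwo"))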

anakin87 commented 6 days ago

We have introduced the NLTKDocumentSplitter in Haystack 2.x, so I think we can close this issue.
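A hedged usage sketch of that component; the import path and parameter names below are from memory and may differ between Haystack 2.x releases, so check the current docs:

from haystack import Document
from haystack.components.preprocessors import NLTKDocumentSplitter

# Assumed parameters: split by sentence, three sentences per chunk.
splitter = NLTKDocumentSplitter(split_by="sentence", split_length=3)
result = splitter.run(documents=[Document(content="One. Two. Three. Four. Five.")])
for doc in result["documents"]:
    print(doc.content)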