deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0
17.88k stars 1.93k forks source link

Add a Recursive Chunking strategy #8548

Open davidsbatista opened 1 week ago

davidsbatista commented 1 week ago

Use a set of predefined separators to split text recursively. The process follows these steps:

sjrl commented 6 days ago

@davidsbatista This sounds great! One idea I had for this is some way to indicate that we'd like to utilize something like NLTK to do sentence splitting. So normally I think the list of separator characters would look like ["\n\n", ".", " "] to accomplish splitting by paragrah, then sentence, and then by word. And I was wondering if we could replace "." with something like "nltk" or some other tag to indicate we'd like to use a separate algorithm to handle the splitting.

What do you think?

sjrl commented 6 days ago

Also I wanted to ask will the splitting by separators (e.g. ["\n\n", ".", " "]) be handled using a regex splitter? I think supporting regex would be great so we could provide more complicated separators to better handle complex documents and do things like header detection.

davidsbatista commented 6 days ago

that's a good suggestions, I will take it into consideration