deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

feat: Enhance DocumentSplitter to support semantic document splitting #8111

Open sjrl opened 4 months ago

sjrl commented 4 months ago

Is your feature request related to a problem? Please describe.
Currently, the DocumentSplitter in Haystack is relatively basic, and semantic splitting has recently gained a lot of popularity.

For example, see Partitioning and Chunking in Unstructured.

Another example is the https://github.com/segment-any-text/wtpsplit package, which shows great results for sentence splitting across many languages. This could be used to greatly improve the current sentence splitting in the DocumentSplitter, which simply splits on the period character.
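To make the limitation concrete, here is a minimal sketch of period-only splitting (my own illustration, not Haystack's actual implementation) showing how it mangles common abbreviations, which is exactly the case a learned splitter like wtpsplit handles:

```python
def naive_period_split(text: str) -> list[str]:
    # Split on the period character only, mirroring the basic
    # strategy described above.
    return [s.strip() for s in text.split(".") if s.strip()]

text = "Dr. Smith earned her Ph.D. in 2020. She now teaches NLP."
print(naive_period_split(text))
# Abbreviations become spurious "sentences" (5 fragments instead of 2):
# ['Dr', 'Smith earned her Ph', 'D', 'in 2020', 'She now teaches NLP']
```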

Describe the solution you'd like
It would be great to enhance Haystack's splitting/chunking strategies with these newer methods, which have been shown to boost the quality of RAG applications.

Additional context
I think doing some research and surveying popular libraries (e.g. via the Haystack Discord) would be a good way to find a starting point.

vblagoje commented 2 months ago

@sjrl this task landed in my sprint. How about we implement https://x.com/JinaAI_/status/1826649439324254291? I've seen a lot of buzz on X about it, and it seems relatively straightforward to implement. LMK

sjrl commented 2 months ago

Hey @vblagoje, that approach certainly looks interesting. There doesn't seem to be a standard implementation for it yet, so I wonder whether something like that deserves to be its own separate splitter component.

Also, FYI, I migrated the sentence splitting from v1 into v2 using the NLTK package in this custom component here. So I think, as part of this ticket, it would be good to bring this feature into Haystack to improve our sentence splitting capabilities.

vblagoje commented 2 months ago

Ok, so what you are suggesting @sjrl is to implement DeepsetDocumentSplitter for this ticket and then enhanced semantic paragraph splitting via JinaAI in another issue?

sjrl commented 2 months ago

> Ok, so what you are suggesting @sjrl is to implement DeepsetDocumentSplitter for this ticket and then enhanced semantic paragraph splitting via JinaAI in another issue?

I think the DeepsetDocumentSplitter should take priority, but it's up to you whether the enhanced semantic splitting should be done in a separate issue or at a later time.

vblagoje commented 2 months ago

Yes, I agree: DeepsetDocumentSplitter first. The other can come next sprint or so.

vblagoje commented 2 months ago

I'll leave this open because we haven't actually implemented semantic splitting yet; we completed https://github.com/deepset-ai/haystack/pull/8350 instead.

cc @julian-risch

davidsbatista commented 1 week ago

A quick overview of Chunking Techniques:

paulmartrencharpro commented 17 hours ago

Both LangChain and LlamaIndex have a semantic splitter that uses embeddings (https://blog.lancedb.com/chunking-techniques-with-langchain-and-llamaindex/). From the code (https://github.com/langchain-ai/langchain-experimental/blob/main/libs/experimental/langchain_experimental/text_splitter.py), I see they split by sentence, then embed each sentence and compute the cosine distance between each sentence and the next one. If the distance is low, the two sentences are placed in the same chunk.
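The approach described above can be sketched in a few lines. This is a toy illustration, not LangChain's actual code: the bag-of-words `embed` function stands in for a real neural embedding model (e.g. a SentenceTransformers model), and the `threshold` value is arbitrary. A new chunk starts whenever the cosine similarity between adjacent sentences drops (i.e. the cosine distance is high):

```python
from collections import Counter
from math import sqrt

def embed(sentence: str) -> Counter:
    # Toy bag-of-words "embedding"; a real semantic splitter would
    # call out to a neural embedding model here.
    return Counter(sentence.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    # Group consecutive sentences; start a new chunk when similarity
    # to the previous sentence falls below the threshold.
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) >= threshold:
            chunks[-1].append(cur)
        else:
            chunks.append([cur])
    return chunks

sents = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "Stock markets fell sharply today.",
]
print(semantic_chunks(sents))
# The two cat sentences share vocabulary, so they land in one chunk;
# the unrelated third sentence starts a new chunk:
# [['The cat sat on the mat.', 'The cat chased the mouse.'],
#  ['Stock markets fell sharply today.']]
```

With a real embedding model, only `embed` changes; the chunk-grouping logic stays the same.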