Open sjrl opened 4 months ago
@sjrl this task landed in my sprint. How about we implement https://x.com/JinaAI_/status/1826649439324254291 I've seen a lot of buzz on X about it and it seems relatively straightforward to implement. LMK
Hey @vblagoje that approach certainly looks interesting. There doesn't seem to be a standard implementation for it yet, so I wonder if something like that deserves to be its own separate splitter component.
Also FYI I did migrate the sentence splitting from v1 into v2 using the NLTK package in this custom component here. So I think as part of this ticket it could be good to bring this feature to Haystack to improve our sentence splitting capabilities.
Ok, so what you are suggesting @sjrl is to implement DeepsetDocumentSplitter for this ticket and then enhanced semantic paragraph splitting via JinaAI in another issue?
I think the DeepsetDocumentSplitter should take prio, but up to you if you think the enhanced semantic splitting should be done in a different issue/different time.
Yes, I agree DeepsetDocumentSplitter first. The other next sprint or so.
I'll leave this open because we haven't actually done semantic splitting but have completed https://github.com/deepset-ai/haystack/pull/8350 instead
cc @julian-risch
A quick overview of Chunking Techniques:
Both LangChain and LlamaIndex have a semantic splitter that uses embeddings (https://blog.lancedb.com/chunking-techniques-with-langchain-and-llamaindex/). From the code (https://github.com/langchain-ai/langchain-experimental/blob/main/libs/experimental/langchain_experimental/text_splitter.py), I see they split the text into sentences, compute an embedding for each sentence, and calculate the cosine distance between each sentence and the next one. If the distance is low, the two sentences go into the same chunk.
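That consecutive-sentence approach can be sketched roughly as follows. Note this is a simplified illustration, not LangChain's actual implementation; `toy_embed` is a hypothetical stand-in for a real embedding model (e.g. a sentence-transformers encoder), used only so the sketch runs without dependencies:

```python
import math


def toy_embed(sentence: str) -> list[float]:
    # Hypothetical stand-in for a real embedding model: a 26-dim
    # character-frequency vector. A real implementation would call
    # an actual sentence embedder here.
    vec = [0.0] * 26
    for ch in sentence.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(sentences: list[str], threshold: float = 0.3) -> list[list[str]]:
    # Start a new chunk whenever the distance between consecutive
    # sentence embeddings exceeds the threshold; otherwise keep
    # appending to the current chunk.
    chunks = [[sentences[0]]]
    prev = toy_embed(sentences[0])
    for sent in sentences[1:]:
        emb = toy_embed(sent)
        if cosine_distance(prev, emb) > threshold:
            chunks.append([sent])
        else:
            chunks[-1].append(sent)
        prev = emb
    return chunks
```

In practice the threshold is often chosen dynamically (e.g. a percentile of the observed distances) rather than fixed, which is what the LangChain experimental splitter does.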
**Is your feature request related to a problem? Please describe.**
Currently the `DocumentSplitter` in Haystack is relatively basic, and recently we have seen that semantic splitting has greatly gained in popularity. For example, see Partitioning and Chunking in Unstructured.
Or another example is the https://github.com/segment-any-text/wtpsplit package, which shows great results for sentence splitting across many languages. This could be used to greatly improve the current sentence splitting in the `DocumentSplitter`, which just uses a period symbol for sentence splits.

**Describe the solution you'd like**
It would be great to enhance Haystack's splitting/chunking strategies with these new types of methods, which have been shown to boost the quality of RAG applications.
**Additional context**
I think doing some research and finding popular libraries (e.g. by asking in the Haystack Discord) would be a good way to find a good place to start.