deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0
17.72k stars, 1.92k forks

feat: add `FilterByNumWords` component #8402

Closed aicam closed 1 month ago

aicam commented 1 month ago

Proposed Changes:

FilterByNumWords filters the output of retrievers in a pipeline so that it does not exceed the LLM's input size. The component is added to the preprocessors module and can be used in pipelines. It counts words naively, as the number of items returned by .split(' ').

Usage:

    from haystack import Pipeline
    from haystack.document_stores.in_memory import InMemoryDocumentStore
    from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
    # Module added by this PR
    from haystack.components.preprocessors.filter_by_num_words import FilterByNumWords

    rag_pipeline = Pipeline()
    rag_pipeline.add_component(instance=InMemoryBM25Retriever(document_store=InMemoryDocumentStore()), name="retriever")
    rag_pipeline.add_component(instance=FilterByNumWords(), name="filter_by_num_words")
    # Feed the retrieved documents into the filter
    rag_pipeline.connect("retriever", "filter_by_num_words.documents")
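
For illustration, the filtering idea described above might be sketched in plain Python as follows. This is not the code from this PR: the `Doc` stand-in class, the `max_num_words` parameter, and the choice to stop at the first document that would exceed the budget are all assumptions made for the example. Only the `.split(' ')` word count comes from the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document (not Haystack's Document)."""
    content: str


def count_words(text: str) -> int:
    # As in the description: split on single spaces and count the pieces.
    # Consecutive spaces yield empty strings, which are counted as well.
    return len(text.split(" "))


def filter_by_num_words(documents: List[Doc], max_num_words: int) -> List[Doc]:
    # Assumed behavior: keep documents in retrieval order until adding the
    # next one would push the running word total past max_num_words.
    kept: List[Doc] = []
    total = 0
    for doc in documents:
        n = count_words(doc.content)
        if total + n > max_num_words:
            break
        kept.append(doc)
        total += n
    return kept


docs = [Doc("one two three"), Doc("four five"), Doc("six seven eight nine")]
kept = filter_by_num_words(docs, max_num_words=5)
print([d.content for d in kept])  # ['one two three', 'four five']
```

Because the count splits on a single space, text with repeated spaces or newlines is counted differently than it would be with a bare `.split()`; a user adapting this sketch may prefer the latter or a tokenizer-based count.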

How did you test it?

Unit testing added.


CLAassistant commented 1 month ago

CLA assistant check
All committers have signed the CLA.

vblagoje commented 1 month ago

@aicam thanks for this contribution. I agree with you that this is something many users might need, but due to its simplicity I'm not sure we should include it as a main component, as most users can implement, and further customize, such a component themselves.

cc @julian-risch @shadeMe

aicam commented 1 month ago

> @aicam thanks for this contribution. I agree with you that this is something many users might need, but due to its simplicity I'm not sure we should include it as a main component, as most users can implement, and further customize, such a component themselves.
>
> cc @julian-risch @shadeMe

Thank you for your feedback. I understand the concern regarding the simplicity of the component. I'd be happy to work on making the component more customizable, allowing users to tailor it to their specific needs. Would this make it more suitable for inclusion as a main component? I’m open to suggestions on how to enhance its flexibility and usefulness for the broader user base.

vblagoje commented 1 month ago

@aicam thanks for your effort and eagerness to contribute – it’s always great to see enthusiasm from our community! However, after internal discussion, we believe it might not be the best fit as a main component: it is simple enough that most users can implement and customize it themselves.

Having said that, we’d welcome your skills applied to other tasks. If you’re interested, feel free to take a look at our Contributions wanted list, where you might find other issues that pique your interest. If in doubt, reach out to me or another core developer (cc @julian-risch). Looking forward to your future contributions!