deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0
16.93k stars 1.85k forks source link

feat: Port NLTKDocumentSplitter from dC to Haystack #8350

Closed vblagoje closed 1 week ago

vblagoje commented 2 weeks ago

Why:

Introduces a new document splitter component utilizing NLTK for enhanced text processing.

What:

How can it be used:

# Example usage of the NLTKDocumentSplitter component
from haystack.components.preprocessors import NLTKDocumentSplitter

document_splitter = NLTKDocumentSplitter(
    split_by="sentence",
    split_length=10,
    split_overlap=1,
    respect_sentence_boundary=True,
    language="en"
)

# Split documents
split_documents = document_splitter.run(documents=[your_input_document])

How did you test it:

Notes for the reviewer:

coveralls commented 2 weeks ago

Pull Request Test Coverage Report for Build 10792807724

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details


Files with Coverage Reduction New Missed Lines %
components/generators/azure.py 3 92.68%
components/generators/chat/azure.py 3 92.5%
<!-- Total: 6 -->
Totals Coverage Status
Change from base Build 10774951533: 0.09%
Covered Lines: 7315
Relevant Lines: 8092

💛 - Coveralls
vblagoje commented 2 weeks ago

@davidsbatista you won the lottery here but let's allow @sjrl a first pass to make sure all the pieces were migrated properly 🙏

sjrl commented 2 weeks ago

Hey @vblagoje broad question, would it be better to fold this functionality into the existing document splitter instead of creating a new component?

vblagoje commented 2 weeks ago

Forced pushed to properly credit @sjrl for all the work

vblagoje commented 2 weeks ago

Hey @vblagoje broad question, would it be better to fold this functionality into the existing document splitter instead of creating a new component?

I'm afraid of unintended side effect for the existing users of DocumentSplitter @sjrl Perhaps we can keep it as is now and carefully merge it for the next release I'd say, wdyt? wdyt @julian-risch ?

vblagoje commented 2 weeks ago

@davidsbatista I converted a few more methods to static, they seems to be really tied to SentenceSplitter and as such I didn't make them free standing

vblagoje commented 2 weeks ago

@sjrl please have another look. I spoke to @julian-risch and he also agreed we integrate NLTKDocumentSplitter and later investigate options to perhaps merge NLTKDocumentSplitter and DocumentSplitter

davidsbatista commented 2 weeks ago
Name                                                          Stmts   Miss  Cover   Missing
-------------------------------------------------------------------------------------------
haystack/components/preprocessors/__init__.py                     5      0   100%
haystack/components/preprocessors/document_cleaner.py           104      2    98%   90, 311
haystack/components/preprocessors/document_splitter.py           96      1    99%   127
haystack/components/preprocessors/nltk_document_splitter.py      98      0   100%
haystack/components/preprocessors/text_cleaner.py                29      0   100%
haystack/components/preprocessors/utils.py                       83     15    82%   91-95, 102-107, 174-176, 202, 208, 212, 230-231
-------------------------------------------------------------------------------------------
TOTAL                                                           415     18    96%

Running the test coverage locally it seems there's a few edge cases in utils.py that might be worth testing. This is what not currently being tested:

Do you think it's worth to extend the tests for this edge cases?

vblagoje commented 2 weeks ago

Sure @davidsbatista let's increase coverage and see about compiling those expressions 🙏

vblagoje commented 2 weeks ago

Ah pre-integration checks say we need to add a new documentation page for this component. Not yet ready for integration @davidsbatista @sjrl

vblagoje commented 1 week ago

@dfokina I created an initial version of the doc for this component The main info centers around why someone would choose this splitter over the default one.

vblagoje commented 1 week ago

What prevents us from integrating this PR @davidsbatista and @sjrl ?

davidsbatista commented 1 week ago

to be complete maybe just the docs - but I wouldn't hold the merging because of that

sjrl commented 1 week ago

@vblagoje I'm doing one last quick look over now!

sjrl commented 1 week ago

Thanks @vblagoje this looks great! Just left a few comments.

Also, all code in the utils.py file was contributed by @tstadel except for the CustomPunktLanguageVars class. So if possible it would be great to attribute him instead :)

vblagoje commented 1 week ago

Thanks @vblagoje this looks great! Just left a few comments.

Also, all code in the utils.py file was contributed by @tstadel except for the CustomPunktLanguageVars class. So if possible it would be great to attribute him instead :)

Ah, no problem, will do - thanks @davidsbatista and @sjrl 🙏

vblagoje commented 1 week ago

Spoke to @tstadel - he waived attributions. Merging this now. @dfokina let's not forget to include this component in 2.6 docs release