feat: Port NLTKDocumentSplitter from dC to Haystack

vblagoje commented 2 weeks ago

Why:

Introduces a new document splitter component utilizing NLTK for enhanced text processing.

What:

Implemented NLTKDocumentSplitter: A new component that leverages the Natural Language Toolkit (NLTK) for splitting documents based on words, sentences, passages, or pages.
Configuration Options: The splitter supports various configuration parameters such as split_by, split_length, split_overlap, respect_sentence_boundary, language, use_split_rules, and extend_abbreviations for fine-tuning the document splitting process.
Split Mechanics Enhancement: Improvements in split mechanics, including respecting sentence boundaries when splitting by words and addressing special cases with split rules and extended abbreviations for improved tokenization.
Testing: Comprehensive test suite for validating the functionality of the NLTKDocumentSplitter across different splitting scenarios and configurations.

How can it be used:

Dataset Preprocessing: Before feeding text data into machine learning models, use the splitter to preprocess documents into smaller, more manageable sizes or into specific formats required by downstream processing stages.

# Example usage of the NLTKDocumentSplitter component
from haystack.components.preprocessors import NLTKDocumentSplitter

document_splitter = NLTKDocumentSplitter(
    split_by="sentence",
    split_length=10,
    split_overlap=1,
    respect_sentence_boundary=True,
    language="en"
)

# Split documents
split_documents = document_splitter.run(documents=[your_input_document])

How did you test it:

Unit Tests: Implemented unit tests for each critical functionality of the NLTKDocumentSplitter, including various split_by configurations, handling different languages, and respecting sentence boundaries.
Integration Testing: Tested the component within a sample text preprocessing pipeline to ensure it integrates well with other components and handles real-world data as expected.

Notes for the reviewer:

Check if lazy imports are correctly used
This component has been used extensively in dC and is ported from dC into Haystack

coveralls commented 2 weeks ago

Pull Request Test Coverage Report for Build 10792807724

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
6 unchanged lines in 2 files lost coverage.
Overall coverage increased (+0.09%) to 90.398%

Files with Coverage Reduction	New Missed Lines	%
components/generators/azure.py	3	92.68%
components/generators/chat/azure.py	3	92.5%
<!--	Total:	6	-->

Totals
Change from base Build 10774951533:	0.09%
Covered Lines:	7315
Relevant Lines:	8092

💛 - Coveralls

vblagoje commented 2 weeks ago

@davidsbatista you won the lottery here but let's allow @sjrl a first pass to make sure all the pieces were migrated properly 🙏

sjrl commented 2 weeks ago

Hey @vblagoje broad question, would it be better to fold this functionality into the existing document splitter instead of creating a new component?

vblagoje commented 2 weeks ago

Forced pushed to properly credit @sjrl for all the work

vblagoje commented 2 weeks ago

Hey @vblagoje broad question, would it be better to fold this functionality into the existing document splitter instead of creating a new component?

I'm afraid of unintended side effect for the existing users of DocumentSplitter @sjrl Perhaps we can keep it as is now and carefully merge it for the next release I'd say, wdyt? wdyt @julian-risch ?

vblagoje commented 2 weeks ago

@davidsbatista I converted a few more methods to static, they seems to be really tied to SentenceSplitter and as such I didn't make them free standing

vblagoje commented 2 weeks ago

@sjrl please have another look. I spoke to @julian-risch and he also agreed we integrate NLTKDocumentSplitter and later investigate options to perhaps merge NLTKDocumentSplitter and DocumentSplitter

davidsbatista commented 2 weeks ago

Name                                                          Stmts   Miss  Cover   Missing
-------------------------------------------------------------------------------------------
haystack/components/preprocessors/__init__.py                     5      0   100%
haystack/components/preprocessors/document_cleaner.py           104      2    98%   90, 311
haystack/components/preprocessors/document_splitter.py           96      1    99%   127
haystack/components/preprocessors/nltk_document_splitter.py      98      0   100%
haystack/components/preprocessors/text_cleaner.py                29      0   100%
haystack/components/preprocessors/utils.py                       83     15    82%   91-95, 102-107, 174-176, 202, 208, 212, 230-231
-------------------------------------------------------------------------------------------
TOTAL                                                           415     18    96%

Running the test coverage locally it seems there's a few edge cases in utils.py that might be worth testing. This is what not currently being tested:

174-176 inside _apply_split_rules() tests never go inside the second while loop
202,208,212 _needs_join never falls into a return True case
230-231: _read_abbreviations always falls into the first return case

Do you think it's worth to extend the tests for this edge cases?

vblagoje commented 2 weeks ago

Sure @davidsbatista let's increase coverage and see about compiling those expressions 🙏

vblagoje commented 2 weeks ago

Ah pre-integration checks say we need to add a new documentation page for this component. Not yet ready for integration @davidsbatista @sjrl

vblagoje commented 1 week ago

@dfokina I created an initial version of the doc for this component The main info centers around why someone would choose this splitter over the default one.

vblagoje commented 1 week ago

What prevents us from integrating this PR @davidsbatista and @sjrl ?

davidsbatista commented 1 week ago

to be complete maybe just the docs - but I wouldn't hold the merging because of that

sjrl commented 1 week ago

@vblagoje I'm doing one last quick look over now!

sjrl commented 1 week ago

Thanks @vblagoje this looks great! Just left a few comments.

Also, all code in the utils.py file was contributed by @tstadel except for the CustomPunktLanguageVars class. So if possible it would be great to attribute him instead :)

vblagoje commented 1 week ago

Thanks @vblagoje this looks great! Just left a few comments.

Also, all code in the utils.py file was contributed by @tstadel except for the CustomPunktLanguageVars class. So if possible it would be great to attribute him instead :)

Ah, no problem, will do - thanks @davidsbatista and @sjrl 🙏

vblagoje commented 1 week ago

Spoke to @tstadel - he waived attributions. Merging this now. @dfokina let's not forget to include this component in 2.6 docs release

deepset-ai / haystack