Closed: vblagoje closed this pull request 1 week ago
This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| components/generators/azure.py | 3 | 92.68% |
| components/generators/chat/azure.py | 3 | 92.5% |
| Total: | 6 | |
| Totals | |
|---|---|
| Change from base Build 10774951533: | 0.09% |
| Covered Lines: | 7315 |
| Relevant Lines: | 8092 |
@davidsbatista you won the lottery here but let's allow @sjrl a first pass to make sure all the pieces were migrated properly 🙏
Hey @vblagoje broad question, would it be better to fold this functionality into the existing document splitter instead of creating a new component?
Forced pushed to properly credit @sjrl for all the work
> Hey @vblagoje broad question, would it be better to fold this functionality into the existing document splitter instead of creating a new component?
I'm afraid of unintended side effects for the existing users of DocumentSplitter, @sjrl. Perhaps we can keep it as is for now and carefully merge the two for the next release, I'd say. Wdyt? Wdyt, @julian-risch?
@davidsbatista I converted a few more methods to static; they seem to be really tied to SentenceSplitter, and as such I didn't make them free-standing.
@sjrl please have another look. I spoke to @julian-risch and he also agreed we integrate NLTKDocumentSplitter and later investigate options to perhaps merge NLTKDocumentSplitter and DocumentSplitter
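For readers unfamiliar with what NLTKDocumentSplitter adds over DocumentSplitter, the core idea is chunking text along sentence boundaries with optional overlap. The following is a simplified, stdlib-only sketch of that behavior (the naive regex stands in for NLTK's punkt tokenizer; function and parameter names mirror the PR's parameters but are illustrative, not the actual implementation):

```python
import re

def split_respecting_sentences(text, split_length=2, split_overlap=0):
    """Group sentences into chunks of `split_length` sentences,
    overlapping by `split_overlap` sentences between chunks.
    Assumes split_overlap < split_length."""
    # Naive sentence boundary detection; NLTK's punkt tokenizer is far more robust.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + split_length]))
        if start + split_length >= len(sentences):
            break
    return chunks
```

With `split_length=2, split_overlap=1`, the text `"One. Two. Three. Four."` yields three chunks, each sharing one sentence with its neighbor, which is the kind of boundary-respecting behavior the plain DocumentSplitter does not guarantee when splitting by word.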
```
Name                                                         Stmts   Miss  Cover   Missing
------------------------------------------------------------------------------------------
haystack/components/preprocessors/__init__.py                    5      0   100%
haystack/components/preprocessors/document_cleaner.py          104      2    98%   90, 311
haystack/components/preprocessors/document_splitter.py          96      1    99%   127
haystack/components/preprocessors/nltk_document_splitter.py     98      0   100%
haystack/components/preprocessors/text_cleaner.py               29      0   100%
haystack/components/preprocessors/utils.py                      83     15    82%   91-95, 102-107, 174-176, 202, 208, 212, 230-231
------------------------------------------------------------------------------------------
TOTAL                                                          415     18    96%
```
Running the test coverage locally, it seems there are a few edge cases in `utils.py` that might be worth testing. This is what's not currently being tested:

- `_apply_split_rules()`: tests never go inside the second while loop
- `_needs_join()`: never falls into the `return True` case
- `_read_abbreviations()`: always falls into the first return case

Do you think it's worth extending the tests for these edge cases?
Sure @davidsbatista let's increase coverage and see about compiling those expressions 🙏
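The "compiling those expressions" remark presumably refers to precompiling the regex patterns used by the splitter helpers so they aren't rebuilt on every call. A generic sketch of that pattern (names and the pattern itself are illustrative, not the actual `utils.py` code):

```python
import re

# Compile once at module import time instead of inside hot helper functions;
# re.compile on every call wastes work even with re's internal cache.
_ABBREVIATION_RE = re.compile(r"\b[A-Z][a-z]{0,3}\.")

def contains_abbreviation(sentence: str) -> bool:
    """Return True if the sentence contains a short capitalized token
    ending in a period, e.g. 'Dr.' or 'Inc.' (illustrative heuristic)."""
    return _ABBREVIATION_RE.search(sentence) is not None
```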
Ah pre-integration checks say we need to add a new documentation page for this component. Not yet ready for integration @davidsbatista @sjrl
@dfokina I created an initial version of the doc for this component. The main info centers around why someone would choose this splitter over the default one.
What prevents us from integrating this PR @davidsbatista and @sjrl ?
To be complete, maybe just the docs - but I wouldn't hold up the merging because of that.
@vblagoje I'm doing one last quick look over now!
Thanks @vblagoje this looks great! Just left a few comments.

Also, all code in the `utils.py` file was contributed by @tstadel except for the `CustomPunktLanguageVars` class. So if possible it would be great to attribute him instead :)
Ah, no problem, will do - thanks @davidsbatista and @sjrl 🙏
Spoke to @tstadel - he waived attributions. Merging this now. @dfokina let's not forget to include this component in 2.6 docs release
Why:

Introduces a new document splitter component utilizing NLTK for enhanced text processing.

What:

Adds `NLTKDocumentSplitter` with the parameters `split_by`, `split_length`, `split_overlap`, `respect_sentence_boundary`, `language`, `use_split_rules`, and `extend_abbreviations` for fine-tuning the document splitting process.

How can it be used:

How did you test it:

Tests cover various `split_by` configurations, handling different languages, and respecting sentence boundaries.

Notes for the reviewer:
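To make the `respect_sentence_boundary` parameter concrete, here is a stdlib-only sketch of splitting by word while never cutting a sentence in half (a simplified illustration of the intended behavior, not Haystack's implementation; the regex stands in for NLTK's punkt tokenizer):

```python
import re

def split_by_word_respecting_sentences(text, split_length=10):
    """Accumulate whole sentences into chunks of roughly `split_length`
    words; a chunk is flushed before a sentence would push it over the
    budget, so no sentence is ever split across chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        words = len(sent.split())
        # Flush the current chunk if adding this sentence would exceed the budget.
        if current and count + words > split_length:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Note that a single sentence longer than `split_length` still becomes its own chunk rather than being cut, which is the trade-off `respect_sentence_boundary=True` implies: chunk sizes become approximate in exchange for intact sentences.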