feat : DocumentSplitter, adding the option to split_by function

deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

https://haystack.deepset.ai

Apache License 2.0

17.72k stars 1.92k forks

feat : DocumentSplitter, adding the option to split_by function #8336

Closed GivAlz closed 2 months ago

GivAlz commented 2 months ago

Proposed Changes:

Adds the option to pass a function that customises how DocumentSplitter defines a unit.

This means a user can, for example, use the following to split and define units:

splitter_function = lambda text: re.split('[\n]{2,}', text)

(or use spacy or anything else).
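To make the idea concrete, here is a minimal, stdlib-only sketch of how a user-supplied function can define the units and how those units might then be grouped into chunks. `split_on_blank_lines` and `group_units` are hypothetical names used for illustration only; this is not the DocumentSplitter implementation, and the exact keyword used to pass the callable to the component is not shown in this thread.

```python
import re
from typing import Callable, List


def split_on_blank_lines(text: str) -> List[str]:
    """Custom unit definition: a unit is any run of text between blank lines."""
    # Two or more consecutive newlines mark a unit boundary; empty units are dropped.
    return [unit for unit in re.split(r"\n{2,}", text) if unit]


def group_units(
    text: str,
    splitting_function: Callable[[str], List[str]],
    split_length: int = 2,
) -> List[str]:
    """Hypothetical helper: group the units into chunks of `split_length` units.

    This only illustrates the idea; the real component may join units differently.
    """
    units = splitting_function(text)
    return [
        "\n\n".join(units[i : i + split_length])
        for i in range(0, len(units), split_length)
    ]


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph.\n\n\nThird paragraph."
    for chunk in group_units(sample, split_on_blank_lines, split_length=2):
        print(repr(chunk))
```

Any callable that takes a string and returns a list of strings would fit the same slot, for example a sentence segmenter built on spacy.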

How did you test it?

Added two tests with two "mock" splitter functions.

Notes for the reviewer

There are some open issues related to document splitting (#5922). Given that the current methods are very basic and the issues have been open for months, I think it would make sense to let the user define how text is split.

CLAassistant commented 2 months ago

All committers have signed the CLA.

vblagoje commented 2 months ago

Hey @GivAlz, this is an excellent idea, thank you for opening this PR. Would you please add a reno release note to this PR's branch so we can generate a nice release note about this feature in the upcoming release? See https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#release-notes for more details on how to create a reno release note 🙏

coveralls commented 2 months ago

Pull Request Test Coverage Report for Build 10831052207

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
2 unchanged lines in 1 file lost coverage.
Overall coverage increased (+0.01%) to 90.319%

Files with Coverage Reduction	New Missed Lines	%
components/preprocessors/document_splitter.py	2	98.26%
Total:	2

Totals
Change from base Build 10827600883:	0.01%
Covered Lines:	7202
Relevant Lines:	7974

💛 - Coveralls

GivAlz commented 2 months ago

I've added a release note. Please let me know if I need to modify its name or text content.

Thank you!!

vblagoje commented 2 months ago

@GivAlz thanks for the quick turnaround. To stay consistent, let's use all lowercase in the reno release note name (with - between words). And let's remove highlights, as that entry is reserved for major features we want to highlight to users. Although it's cool, perhaps this feature doesn't cross the highlights threshold this time :-)

vblagoje commented 2 months ago

Looks much better now @GivAlz. To integrate, you need to sign the contribution agreement, which is pretty much standard procedure in most bigger open source projects 🙏

vblagoje commented 2 months ago

The change will be reflected in docs in the upcoming 2.6 version.

vblagoje commented 2 months ago

@GivAlz on my last pass through before integration I realized we don't (de)serialize the function in this component. I'll add those changes directly on your branch.

GivAlz commented 2 months ago

Sorry I forgot about that; I guess it could be useful to note this in the doc string for the function.

vblagoje commented 2 months ago

No worries, I've been doing this for over a year and I forget all the time as well. Now I have pre-commit check notes :-) Please review https://github.com/deepset-ai/haystack/pull/8336/commits/6a592503e454b276cb7106b88c2063a6908ce113 and say if there is something off

GivAlz commented 2 months ago

LGTM! I'm just wondering whether it makes sense to add a note that, if the to_dict method is used, the function must be serialisable. It should be obvious, though, and I think the error thrown would be pretty clear. If you don't think it is necessary (or maybe a note could be added in the documentation), then I guess it's good to merge!

vblagoje commented 2 months ago

I think we treat it as serializable due to its simple string interface. We do this quite often throughout the codebase - it is ok most likely. I think we can merge this now 🚀
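For readers following the serialization point, below is a rough sketch of what serializing a callable through a "simple string interface" can look like: the function is stored as its dotted import path and re-imported on load. `callable_to_path` and `path_to_callable` are hypothetical helpers written for this illustration, not the code added in the commit, and the sketch assumes the function lives at module level so it can be imported again.

```python
import importlib
import re
from typing import Callable


def callable_to_path(func: Callable) -> str:
    """Hypothetical helper: represent a callable as 'module.qualname'."""
    qualname = func.__qualname__
    # Lambdas and functions defined inside other functions cannot be re-imported by name.
    if "<lambda>" in qualname or "<locals>" in qualname:
        raise ValueError(f"Cannot serialize {qualname!r}: not importable by name")
    return f"{func.__module__}.{qualname}"


def path_to_callable(path: str) -> Callable:
    """Hypothetical helper: import the callable named by a dotted path."""
    module_name, _, attribute = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


if __name__ == "__main__":
    path = callable_to_path(re.split)
    print(path)
    restored = path_to_callable(path)
    print(restored is re.split)
```

Under this scheme a lambda like the one in the PR description would fail at serialization time, which is the case the note about to_dict would cover.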