Closed GivAlz closed 2 months ago
Hey @GivAlz this is an excellent idea, thank you for opening this PR. Would you please add a reno release note to this PRs branch so we can generate a nice release note about this feature in the upcoming release. See https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#release-notes for more details on how to create reno release note 🙏
This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Files with Coverage Reduction | New Missed Lines | % | ||
---|---|---|---|---|
components/preprocessors/document_splitter.py | 2 | 98.26% | ||
<!-- | Total: | 2 | --> |
Totals | |
---|---|
Change from base Build 10827600883: | 0.01% |
Covered Lines: | 7202 |
Relevant Lines: | 7974 |
I've added a release note. Please let me know if I need to modify its name or text content.
Thank you!!
@GivAlz thanks for a quick turnaround. To stay consistent let's use all small caps in reno release note name (with - between words). And let's remove highlights as that entry is reserved only for major features we want to highlight to users. Although cool perhaps this feature doesn't cross the highlights threshold this time :-)
Looks much better now @GivAlz - to integrate you need to sign the contribution agreement- pretty much standard procedure in most bigger open source projects 🙏
@GivAlz on my last pass through b4 integration I realized we don't (de)serialize the function in this component. I'll add those changes directly on your branch
@GivAlz on my last pass through b4 integration I realized we don't (de)serialize the function in this component. I'll add those changes directly on your branch
Sorry I forgot about that; I guess it could be useful to note this in the doc string for the function.
@GivAlz on my last pass through b4 integration I realized we don't (de)serialize the function in this component. I'll add those changes directly on your branch
Sorry I forgot about that; I guess it could be useful to note this in the doc string for the function.
No worries, I've been doing this for over a year and I forget all the time as well. Now I have pre-commit check notes :-) Please review https://github.com/deepset-ai/haystack/pull/8336/commits/6a592503e454b276cb7106b88c2063a6908ce113 and say if there is something off
@GivAlz on my last pass through b4 integration I realized we don't (de)serialize the function in this component. I'll add those changes directly on your branch
Sorry I forgot about that; I guess it could be useful to note this in the doc string for the function.
No worries, I've been doing this for over a year and I forget all the time as well. Now I have pre-commit check notes :-) Please review 6a59250 and say if there is something off
LGTM! Just wondering if it makes sense to add a note on the fact that, if the method to_dict is used, the function must be serialisable, but it should be obvious and I think that the error thrown would be pretty clear...if you don't think it is necessary (or maybe a note could be added in the documentation), then I guess it's good to merge!
I think we treat it as serializable due to its simple string interface. We do this quite often throughout the codebase - it is ok most likely. I think we can merge this now 🚀
Proposed Changes:
Adding the possibility to pass a function and personalise the way in which DocumentSplitter defines a unit.
This means a user can, for example, use the following to split and define units:
splitter_function = lambda text: re.split('[\n]{2,}, text)
(or use spacy or anything else).
How did you test it?
Added two tests with two "mock" splitter functions.
Notes for the reviewer
There are some issues related to document splitting #5922 . Given the fact that the current methods are very basic and the issues have been open for moths, I think it would make sense to let the user define how text is split.