deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Add split by token to PreProcessor #4392

Open danielbichuetti opened 1 year ago

danielbichuetti commented 1 year ago

Is your feature request related to a problem? Please describe.

Describe the solution you'd like
Allow the use of any Hugging Face, Sentence Transformers, or OpenAI tokenizer. The user would choose a token-based split mode, set the tokenizer type (hf, sbert, or OpenAI), and the model name. The rest of the PreProcessor functionality would stay the same.

Describe alternatives you've considered
Keep it as it is, and let each user run their own experiments on word/sentence split length.

Additional context
This is just pseudocode (exported from our internal codebase), but it would look something like this:

from typing import List, Optional, Pattern, Union

# SplitMode and TokenizerMode are type aliases defined in our internal codebase (not shown here).
def __init__(
        self,
        mode: SplitMode = "words",
        tokenizer: Optional[TokenizerMode] = None,    # tokenizer backend, used when mode is token-based
        tokenizer_model: Optional[str] = None,        # model name used to load the tokenizer
        language: Optional[str] = None,
        clean_whitespace: Optional[bool] = None,
        clean_header_footer: Optional[bool] = None,
        clean_empty_lines: Optional[bool] = None,
        remove_substrings: Optional[List[str]] = None,
        remove_substrings_regex: Optional[Union[List[Pattern], List[str]]] = None,
        replace_newlines: Optional[bool] = None,
        split_length: int = 180,
        split_overlap: int = 25,
        split_respect_sentence_boundary: bool = False,
        id_hash_keys: Optional[List[str]] = None,
    ) -> None:
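For token-based splitting, the counting itself could look roughly like the sketch below. This is only an illustration of the idea, not the code we use: the helper name is made up, and it assumes tiktoken for the OpenAI mode and transformers' AutoTokenizer for the hf/sbert modes.

def _count_tokens(text: str, tokenizer_type: str, tokenizer_model: str) -> int:
    # Hypothetical helper: return the number of tokens in `text` for the chosen backend.
    if tokenizer_type == "openai":
        import tiktoken
        encoding = tiktoken.encoding_for_model(tokenizer_model)
        return len(encoding.encode(text))
    # "hf" and "sbert" models can both be loaded via transformers' AutoTokenizer.
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_model)
    return len(tokenizer.tokenize(text))

In practice the tokenizer would be loaded once in __init__ rather than on every call; the sketch keeps it local only for brevity.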
danielbichuetti commented 1 year ago

This implementation will work with the current PreProcessor and pipelines. The code is already tested.

@ZanSara Tagging you because of #4256.

ZanSara commented 1 year ago

Hey @danielbichuetti, if you already have a working implementation I'll be glad to have a look.

However, let me warn you: PreProcessor has almost no tests for the corner cases. My last attempt at introducing some revealed so many bugs hidden in the implementation that I had to drop the idea of ever touching it again. So... do you plan to add any tests? :sweat_smile: It might be way harder than it looks.

If you do open a PR please tag me and I'll do my best to follow it up.

danielbichuetti commented 1 year ago

@ZanSara We can add tests for all 3 token splitting modes. Or do you need me to add tests for the whole implementation? 😅
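For the token splitting modes specifically, a parametrized test could look roughly like this. It is only a sketch assuming the __init__ signature from the issue description; the mode value "tokens", the model names, and the import of the extended PreProcessor are placeholders.

import pytest
from haystack.schema import Document
from haystack.nodes import PreProcessor  # the extended version proposed in this issue

@pytest.mark.parametrize(
    "tokenizer_type, tokenizer_model",
    [
        ("hf", "bert-base-uncased"),
        ("sbert", "sentence-transformers/all-MiniLM-L6-v2"),
        ("openai", "gpt-3.5-turbo"),
    ],
)
def test_split_by_tokens(tokenizer_type, tokenizer_model):
    # Build the extended PreProcessor in token mode with a small budget per chunk.
    preprocessor = PreProcessor(
        mode="tokens",
        tokenizer=tokenizer_type,
        tokenizer_model=tokenizer_model,
        split_length=16,
        split_overlap=0,
    )
    docs = preprocessor.process([Document(content="some reasonably long text " * 50)])
    # A text much longer than the budget should end up in several chunks.
    assert len(docs) > 1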

I'll be honest: this node's code looks as if we had a house and kept adding features and extra constructions, adapting everything as we went. It's built, but without much planning hahahah. Sorry for being too honest.

I have one working implementation. We took the Haystack PreProcessor and built a DocumentProcessor from it, moving to a more strongly typed structure and adding a token splitter. I can do the reverse and adapt the DocumentProcessor back to the current PreProcessor.

The way it was adapted works with the current implementation. Of course, if we switched the token calculation to a more optimized approach (like computing front and tail counts), we would need to refactor many methods, and your concerns would probably become a real problem.

This implementation uses a linear approach.
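By "linear approach" I mean a single pass that accumulates units until the token budget is reached, something along these lines (a simplified sketch, not the DocumentProcessor code; overlap handling is omitted):

from typing import Callable, List

def split_by_tokens(
    sentences: List[str],
    count_tokens: Callable[[str], int],  # e.g. a token counter like the earlier sketch
    split_length: int = 180,
) -> List[str]:
    # Single pass over the sentences: cut a new chunk whenever the budget would be exceeded.
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and current_tokens + n > split_length:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks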

ZanSara commented 1 year ago

Ok, this sounds interesting! I'd be happy to see the DocumentProcessor. I can't guarantee I will have the time to review it quickly, but I can certainly look at it and consider whether it can be used in place of PreProcessor, at least in some situations. Unless I find some huge issues, I believe we could have it in the core.

For the tests: well, technically, if you add a new parameter you should test it against at least a few others... which might already surface problems if the tests are done properly. I didn't expect you to write tests for the rest of the codebase, don't worry :smile:

> I'll be honest: this node's code looks as if we had a house and kept adding features and extra constructions, adapting everything as we went. It's built, but without much planning hahahah. Sorry for being too honest.

I couldn't agree more with you on this point :joy: Let me share an anecdote. PreProcessor has one amazing feature that makes it amazingly fast: it loses track of whitespace when splitting :see_no_evil: And it's basically impossible to fix this bug of losing whitespace without slowing it down about 2x, because clearly not counting something will always be faster than counting something, and PreProcessor currently loses that whitespace very efficiently. You can imagine my joy when I found that out :sweat_smile:
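To make the whitespace bug concrete, here is a tiny sketch (not the actual PreProcessor code): splitting on bare whitespace and re-joining with single spaces cannot reproduce the original spacing, whereas keeping the separators can, at extra cost.

import re

text = "Title\n\n    indented body text"

# Fast but lossy: str.split() collapses every run of whitespace.
print(" ".join(text.split()))          # "Title indented body text"

# Lossless alternative: capture the separators so they can be put back.
parts = re.split(r"(\s+)", text)
print("".join(parts) == text)          # True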

danielbichuetti commented 1 year ago

@ZanSara Yes. The first time we ported it, we noticed that. Touching the original PreProcessor code felt like: hey, better to start from scratch than to modify it. We did try a non-linear approach to the calculation, but touching the old code was a nightmare; I can see what you went through. So we settled on a linear calculation.

I'll keep the name PreProcessor, since it's already a derivative implementation. It is, indeed, a modified PreProcessor that just adds features.

I'll open a PR, so we can explore it.

liorshk commented 1 year ago

+1

masci commented 8 months ago

I need this myself :) Moving this to a feature request for 2.x.