UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

How to chop up a long document into chunks of max sequence length? #2268

Open siddhsql opened 11 months ago

siddhsql commented 11 months ago

Given a long document, how do I chop it up into chunks so that each chunk is within the max sequence length of a model?

adilosa commented 11 months ago

This example from OpenAI should help: https://github.com/openai/chatgpt-retrieval-plugin/blob/main/services/chunks.py
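For reference, here is a minimal sketch of the same token-based idea using a Sentence Transformers model's own tokenizer; the model name and the way special tokens are budgeted are illustrative assumptions, not something taken from the linked code:

```python
# Minimal sketch: split a long document into chunks that fit the model's
# max sequence length by tokenizing once and slicing the token ids.
# Model name and chunk-size handling are illustrative assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
tokenizer = model.tokenizer
# Reserve room for the special tokens ([CLS]/[SEP]) the model adds itself.
chunk_size = model.max_seq_length - 2

def chunk_text(text: str) -> list[str]:
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    return [
        tokenizer.decode(token_ids[i : i + chunk_size], skip_special_tokens=True)
        for i in range(0, len(token_ids), chunk_size)
    ]

long_document = "your long document text ..."  # placeholder
chunks = chunk_text(long_document)
embeddings = model.encode(chunks)  # one embedding per chunk
```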

siddhsql commented 11 months ago

Thanks. I looked at that code, specifically https://github.com/openai/chatgpt-retrieval-plugin/blob/main/services/chunks.py#L39:

tokens = tokenizer.encode(text, disallowed_special=())

Isn't this going to cause a problem, because it's taking a long piece of text, which might overflow the context size, and trying to tokenize it? I.e., shouldn't it be chunking the text before tokenizing it in the first place?

https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.encode
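For reference, the encode call itself does not appear to enforce the model's context window: with the default (no truncation) it returns the full token id list (Hugging Face tokenizers may just print a length warning), and the length limit only matters once those ids are passed through the model, which seems to be why the linked code tokenizes the full text first and then slices the token list. A quick sketch of that check, with an illustrative model name:

```python
# Quick check: tokenization itself has no hard context-length limit.
# The tokenizer just maps text to token ids (it may warn on long inputs);
# the max sequence length only matters when the ids are fed to the model,
# which is why encoding first and chunking the ids afterwards works.
# Model name below is illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

long_text = "word " * 10_000  # far longer than the model's max sequence length
token_ids = tokenizer.encode(long_text, add_special_tokens=False)
print(len(token_ids))  # ~10000 ids: the full text was tokenized, not truncated
```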
