UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

How to chop up a long document into chunks of max sequence length? #2268

Open siddhsql opened 11 months ago

siddhsql commented 11 months ago

Given a long document, how do I chop it up into chunks so that each chunk is within the max sequence length of a model?

adilosa commented 11 months ago

This example from OpenAI should help: https://github.com/openai/chatgpt-retrieval-plugin/blob/main/services/chunks.py
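For reference, here is a minimal sketch of the same token-based idea using a Sentence Transformers model's own tokenizer; the model name and the way special tokens are budgeted are illustrative assumptions, not something taken from the linked code:

```python
# Minimal sketch: split a long document into chunks that fit the model's
# max sequence length by tokenizing once and slicing the token ids.
# Model name and chunk-size handling are illustrative assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
tokenizer = model.tokenizer
# Reserve room for the special tokens ([CLS]/[SEP]) the model adds itself.
chunk_size = model.max_seq_length - 2

def chunk_text(text: str) -> list[str]:
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    return [
        tokenizer.decode(token_ids[i : i + chunk_size], skip_special_tokens=True)
        for i in range(0, len(token_ids), chunk_size)
    ]

long_document = "your long document text ..."  # placeholder
chunks = chunk_text(long_document)
embeddings = model.encode(chunks)  # one embedding per chunk
```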

siddhsql commented 11 months ago

Thanks. I looked at that code, specifically https://github.com/openai/chatgpt-retrieval-plugin/blob/main/services/chunks.py#L39:

tokens = tokenizer.encode(text, disallowed_special=())

Isn't this going to cause a problem, because it's taking a long piece of text, which might overflow the context size, and trying to tokenize it? I.e., shouldn't it be chunking the text before tokenizing it in the first place?

https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.encode
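For reference, the encode call itself does not appear to enforce the model's context window: with the default (no truncation) it returns the full token id list (Hugging Face tokenizers may just print a length warning), and the length limit only matters once those ids are passed through the model, which seems to be why the linked code tokenizes the full text first and then slices the token list. A quick sketch of that check, with an illustrative model name:

```python
# Quick check: tokenization itself has no hard context-length limit.
# The tokenizer just maps text to token ids (it may warn on long inputs);
# the max sequence length only matters when the ids are fed to the model,
# which is why encoding first and chunking the ids afterwards works.
# Model name below is illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

long_text = "word " * 10_000  # far longer than the model's max sequence length
token_ids = tokenizer.encode(long_text, add_special_tokens=False)
print(len(token_ids))  # ~10000 ids: the full text was tokenized, not truncated
```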
