Open siddhsql opened 11 months ago
This example from OpenAI should help: https://github.com/openai/chatgpt-retrieval-plugin/blob/main/services/chunks.py
Thanks. I looked at that code, https://github.com/openai/chatgpt-retrieval-plugin/blob/main/services/chunks.py#L39 :
tokens = tokenizer.encode(text, disallowed_special=())
Isn't this going to cause a problem, since it takes a long piece of text that might overflow the context size and tries to tokenize it? I.e., shouldn't it chunk the text before tokenizing it in the first place?
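For what it's worth, tokenization itself has no context-length limit; only the model's forward pass does. So the chunks.py pattern of encoding the whole text once and then slicing the token-id list is safe. A minimal sketch of that pattern, with a toy whitespace tokenizer standing in for the real `tokenizer.encode` / `tokenizer.decode` (the names `toy_encode` and `toy_decode` are illustrative, not from the linked code):

```python
# Sketch of the tokenize-first, chunk-second pattern used in chunks.py.
# A toy word-level "tokenizer" stands in for the real one here; chunks.py
# calls tokenizer.encode(text, disallowed_special=()) instead.

def toy_encode(text):
    # One token per whitespace-separated word (illustrative only).
    return text.split()

def toy_decode(tokens):
    return " ".join(tokens)

def chunk_by_tokens(text, chunk_size):
    tokens = toy_encode(text)  # encoding a long text is cheap; no model limit applies
    chunks = []
    for i in range(0, len(tokens), chunk_size):
        # Slice the token list, then decode each slice back to text.
        chunks.append(toy_decode(tokens[i:i + chunk_size]))
    return chunks

print(chunk_by_tokens("one two three four five six seven", 3))
# → ['one two three', 'four five six', 'seven']
```

The key point is that the limit is enforced on the token list after encoding, so no chunk can exceed the budget even though the full text was tokenized in one pass.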
max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
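In other words, per that docs excerpt, passing `truncation` with a `max_length` just cuts the encoded sequence down to at most `max_length` tokens, falling back to the model's predefined maximum when `max_length` is unset. A dependency-free sketch of that behavior (the toy word-level `encode` and the constant `MODEL_MAX_LENGTH` are illustrative stand-ins, not the Hugging Face implementation):

```python
MODEL_MAX_LENGTH = 512  # illustrative predefined model maximum (e.g. BERT-style)

def encode(text):
    # Toy word-level encoder standing in for a real subword tokenizer.
    return text.split()

def encode_with_truncation(text, max_length=None, truncation=False):
    ids = encode(text)
    if truncation:
        # Unset max_length falls back to the model's predefined maximum.
        limit = max_length if max_length is not None else MODEL_MAX_LENGTH
        ids = ids[:limit]  # truncation keeps only the first `limit` tokens
    return ids

print(len(encode_with_truncation("a " * 1000, truncation=True)))  # → 512
```

Note that truncation simply drops everything past the limit, which is why it is not a substitute for chunking when the tail of the document matters.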
Given a long document, how do I chop it up into chunks so that each chunk is within the max sequence length of a model?
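One common recipe is to split on sentence boundaries and then greedily pack sentences into a chunk until adding the next one would exceed the token budget. A rough sketch, where word counts stand in for real token counts (with an actual model you would use `len(tokenizer.encode(text))` and the model's `max_seq_length` instead):

```python
import re

def count_tokens(text):
    # Stand-in for len(tokenizer.encode(text)); word count approximates tokens.
    return len(text.split())

def chunk_document(text, max_tokens):
    # Split on sentence-ending punctuation (a crude, dependency-free rule).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        candidate = (current + " " + sent).strip()
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would overflow the budget: close the chunk.
            chunks.append(current)
            current = sent
        else:
            # Note: a single sentence longer than max_tokens passes through
            # whole here; a real implementation would split it further.
            current = candidate
    if current:
        chunks.append(current)
    return chunks

print(chunk_document("A b c. D e. F g h i.", 5))
# → ['A b c. D e.', 'F g h i.']
```

Splitting at sentence boundaries keeps each chunk semantically coherent, which generally embeds better than cutting mid-sentence at a fixed token offset.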