eosphoros-ai / DB-GPT

AI Native Data App Development framework with AWEL(Agentic Workflow Expression Language) and Agents
https://docs.dbgpt.site
MIT License

[Bug] [Knowledge] What is the maximum value of chunk_size ? #1650

Open 0sengseng0 opened 3 weeks ago

0sengseng0 commented 3 weeks ago

Search before asking

Operating system information

Linux

Python version information

>=3.11

DB-GPT version

main

Related scenes

Installation Information

Device information

GPU

Models information

embedding: text2vec-large-chinese

What happened

What is the maximum value of chunk_size? Given that the max_seq_length for text2vec-large-chinese is 512, does truncation occur if the chunk size exceeds this limit?

What you expected to happen

1

How to reproduce

1

Additional context

No response

Are you willing to submit PR?

Aries-ckt commented 3 weeks ago

chunk_size is just the size you want each chunk to be when the document is split; there is no maximum value.
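
To make concrete what chunk_size controls, here is a minimal sketch of a character-based splitter. It is not DB-GPT's actual splitter (the function name and parameters are hypothetical); it only illustrates that chunk_size is a target length, with no upper bound enforced by the splitting step itself:

```python
# Hypothetical sketch, NOT DB-GPT's splitter: chunk_size is simply the target
# length of each chunk; nothing in the splitting step enforces a maximum.
def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - chunk_overlap  # step back so consecutive chunks overlap
    return chunks

document = "some long document text " * 1000
print(len(split_text(document, chunk_size=2000)))  # chunk_size can be arbitrarily large
```

Note that a splitter like this counts characters, while the embedding model's limit is counted in tokens, which is what the follow-up question below is about.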

0sengseng0 commented 3 weeks ago

> chunk_size is just the size you want each chunk to be when the document is split; there is no maximum value.

Given that the text2vec-large-chinese model has a maximum sequence length of 512 tokens, how do you guarantee that each chunk of text is completely transformed into a vector? I'm in the process of splitting docx documents, and a single chunk can be substantial, potentially spanning multiple pages. I'm concerned about content being truncated.
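
For reference, one way to check whether a given chunk will exceed the embedding model's limit before indexing, assuming the model is loaded through sentence-transformers (the Hugging Face repo id below is an assumption; substitute whatever path or name your deployment uses):

```python
from sentence_transformers import SentenceTransformer

# Repo id is an assumption; point this at the model path you actually use.
model = SentenceTransformer("GanymedeNil/text2vec-large-chinese")
print(model.max_seq_length)  # typically 512 for this model

chunk = "..."  # one chunk produced by the knowledge splitter
n_tokens = len(model.tokenizer(chunk)["input_ids"])
if n_tokens > model.max_seq_length:
    # sentence-transformers truncates the input to max_seq_length at encode time,
    # so text beyond that point does not influence the resulting vector
    print(f"chunk has {n_tokens} tokens; only the first {model.max_seq_length} are embedded")
```

In other words, over-long chunks are not rejected, they are silently truncated, so the practical answer is to choose a chunk_size small enough that chunks stay within the model's token limit.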

0sengseng0 commented 3 weeks ago

> chunk_size is just the size you want each chunk to be when the document is split; there is no maximum value.

I found that at embedding time the max_seq_length is still 512. So how are chunks with more than 512 tokens handled? Also, setting model_max_length doesn't seem to have any effect.

(screenshot attached)
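
On the model_max_length question: with sentence-transformers, the value that actually caps encoding is the model's max_seq_length attribute, not the tokenizer's model_max_length, and raising it above the position-embedding size the model was trained with (512 for BERT-style models such as text2vec-large-chinese) will not give usable longer inputs. A hedged sketch, again assuming sentence-transformers and the same hypothetical repo id:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("GanymedeNil/text2vec-large-chinese")  # repo id is an assumption

# What encode() actually enforces vs. the tokenizer's own setting:
print(model.max_seq_length)              # e.g. 512
print(model.tokenizer.model_max_length)  # tokenizer attribute; not the cap used by encode()

# Lowering the cap is safe; raising it beyond the model's position-embedding size is not.
model.max_seq_length = 256
embedding = model.encode("一段需要向量化的中文文本 ...")
```

If chunks genuinely need more than 512 tokens, the usual options are to switch to an embedding model with a longer context window or to embed sub-chunks and combine the vectors, rather than changing model_max_length.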