eosphoros-ai / DB-GPT

AI Native Data App Development framework with AWEL(Agentic Workflow Expression Language) and Agents
https://docs.dbgpt.site
MIT License

[Bug] [Knowledge] What is the maximum value of chunk_size ? #1650

Open 0sengseng0 opened 3 weeks ago

0sengseng0 commented 3 weeks ago

Search before asking

Operating system information

Linux

Python version information

>=3.11

DB-GPT version

main

Related scenes

Installation Information

Device information

GPU

Models information

embedding: text2vec-large-chinese

What happened

What is the maximum value of chunk_size? Given that the max_seq_length for text2vec-large-chinese is 512, does truncation occur if the chunk size exceeds this limit?

What you expected to happen

1

How to reproduce

1

Additional context

No response

Are you willing to submit PR?

Aries-ckt commented 3 weeks ago

chunk_size is just the size you want each chunk to be when the document is split; there is no maximum value.
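
To make concrete what chunk_size controls, here is a minimal sketch of a character-based splitter. It is not DB-GPT's actual splitter (the function name and parameters are hypothetical); it only illustrates that chunk_size is a target length, with no upper bound enforced by the splitting step itself:

```python
# Hypothetical sketch, NOT DB-GPT's splitter: chunk_size is simply the target
# length of each chunk; nothing in the splitting step enforces a maximum.
def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - chunk_overlap  # step back so consecutive chunks overlap
    return chunks

document = "some long document text " * 1000
print(len(split_text(document, chunk_size=2000)))  # chunk_size can be arbitrarily large
```

Note that a splitter like this counts characters, while the embedding model's limit is counted in tokens, which is what the follow-up question below is about.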

0sengseng0 commented 3 weeks ago

> chunk_size is just the size you want each chunk to be when the document is split; there is no maximum value.

Given that the text2vec-large-chinese model has a maximum sequence length of 512 tokens, how do you guarantee that each chunk of text is completely transformed into a vector? I'm in the process of splitting docx documents, and a single chunk can be substantial, potentially spanning multiple pages. I'm concerned about content being truncated.
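
For reference, one way to check whether a given chunk will exceed the embedding model's limit before indexing, assuming the model is loaded through sentence-transformers (the Hugging Face repo id below is an assumption; substitute whatever path or name your deployment uses):

```python
from sentence_transformers import SentenceTransformer

# Repo id is an assumption; point this at the model path you actually use.
model = SentenceTransformer("GanymedeNil/text2vec-large-chinese")
print(model.max_seq_length)  # typically 512 for this model

chunk = "..."  # one chunk produced by the knowledge splitter
n_tokens = len(model.tokenizer(chunk)["input_ids"])
if n_tokens > model.max_seq_length:
    # sentence-transformers truncates the input to max_seq_length at encode time,
    # so text beyond that point does not influence the resulting vector
    print(f"chunk has {n_tokens} tokens; only the first {model.max_seq_length} are embedded")
```

In other words, over-long chunks are not rejected, they are silently truncated, so the practical answer is to choose a chunk_size small enough that chunks stay within the model's token limit.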

0sengseng0 commented 3 weeks ago

> chunk_size is just the size you want each chunk to be when the document is split; there is no maximum value.

I found that at embedding time the max_seq_length is still 512. So how are chunks with more than 512 tokens handled? Also, setting model_max_length doesn't seem to have any effect.

(screenshot attached)
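
On the model_max_length question: with sentence-transformers, the value that actually caps encoding is the model's max_seq_length attribute, not the tokenizer's model_max_length, and raising it above the position-embedding size the model was trained with (512 for BERT-style models such as text2vec-large-chinese) will not give usable longer inputs. A hedged sketch, again assuming sentence-transformers and the same hypothetical repo id:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("GanymedeNil/text2vec-large-chinese")  # repo id is an assumption

# What encode() actually enforces vs. the tokenizer's own setting:
print(model.max_seq_length)              # e.g. 512
print(model.tokenizer.model_max_length)  # tokenizer attribute; not the cap used by encode()

# Lowering the cap is safe; raising it beyond the model's position-embedding size is not.
model.max_seq_length = 256
embedding = model.encode("一段需要向量化的中文文本 ...")
```

If chunks genuinely need more than 512 tokens, the usual options are to switch to an embedding model with a longer context window or to embed sub-chunks and combine the vectors, rather than changing model_max_length.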