Open 0sengseng0 opened 3 weeks ago
chunk_size depends on how large you want each chunk to be. There is no maximum value.
Given that the text2vec-large-chinese model has a maximum sequence length of 512 tokens, how do you guarantee that each chunk of text is completely transformed into a vector? I'm in the process of splitting docx documents, and a single chunk can be substantial, potentially spanning multiple pages. I'm concerned about content being truncated.
chunk_size depends on how large you want each chunk to be. There is no maximum value.

I found that when embedding, the max_seq_length is still 512. So how are inputs longer than 512 tokens handled? Also, setting model_max_length doesn't seem to have any effect.
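For anyone hitting the same issue, here is a minimal sketch of one workaround (not DB-GPT's internal splitter): re-split any oversized chunk with the embedding model's own tokenizer before embedding, so no piece exceeds the 512-token limit. The Hugging Face repo path below is an assumption; point it at wherever your text2vec-large-chinese weights live.

```python
# Sketch: token-aware re-splitting so chunks fit the 512-token limit.
# MODEL_PATH is an assumption, not a value taken from DB-GPT's config.
from transformers import AutoTokenizer

MODEL_PATH = "GanymedeNil/text2vec-large-chinese"  # assumed HF repo or local path
MAX_TOKENS = 510  # leave room for the [CLS]/[SEP] special tokens

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

def split_by_tokens(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Greedily split `text` into pieces whose token counts fit the model."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    for start in range(0, len(ids), max_tokens):
        piece = ids[start : start + max_tokens]
        chunks.append(tokenizer.decode(piece))
    return chunks

# Usage: any chunk from the docx splitter that is too long can be re-split
# here before embedding, instead of being silently truncated.
long_chunk = "..."  # stand-in for a chunk produced by your docx splitter
safe_chunks = split_by_tokens(long_chunk)
```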
Search before asking
Operating system information
Linux
Python version information
DB-GPT version
main
Related scenes
Installation Information
[ ] Installation From Source
[ ] Docker Installation
[ ] Docker Compose Installation
[ ] Cluster Installation
[ ] AutoDL Image
[X] Other
Device information
GPU
Models information
embedding: text2vec-large-chinese
What happened
What is the maximum value of chunk_size? Given that the max_seq_length for text2vec-large-chinese is 512, does truncation occur if the chunk size exceeds this limit?
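For reference, a minimal sketch (model path assumed; this is not DB-GPT code) of how to check whether a given chunk would exceed the model's sequence limit at embedding time:

```python
# Sketch: check whether a chunk is longer than the model's max_seq_length.
# The model path is an assumption; replace it with your local path if needed.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("GanymedeNil/text2vec-large-chinese")  # assumed path
print(model.max_seq_length)  # typically 512 for this model

chunk = "..." * 2000  # stand-in for a large chunk from a docx document
n_tokens = len(model.tokenizer.encode(chunk))
if n_tokens > model.max_seq_length:
    print(f"chunk has {n_tokens} tokens; anything past "
          f"{model.max_seq_length} will be silently truncated")
```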
What you expected to happen
1
How to reproduce
1
Additional context
No response
Are you willing to submit PR?