Closed dj-jack001 closed 17 hours ago
Same problem.
EMBEDDING_MODEL=proxy_tongyi proxy_tongyi_proxy_backend=text-embedding-v1 proxy_tongyi_proxy_api_key={your-api-key}
This config should be used, but using it produces another problem:
document embedding, failed: 123.md, Embedding dimension 1536 does not match collection dimensionality 1024
Same problem: Embedding dimension 1536 does not match collection dimensionality 1024.
Have you found a solution? @dusens
In the project there is a hard-coded 1024 value; you can search for it. Whether the vector dimension should be 1024 or 1536 depends on your business scenario. OpenAI has related functionality for vector compression that lets you compress from 1536 dimensions down to 1024. You can check this out. @kuschzzp
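As a sketch of the compression idea mentioned above: OpenAI documents a truncate-then-L2-renormalize scheme for shortening its text-embedding-3 vectors to fewer dimensions. The helper below is hypothetical (not part of DB-GPT), and whether this kind of truncation is acceptable for Tongyi's text-embedding-v1 output depends on your scenario:

```python
import math
from typing import List


def shorten_embedding(vec: List[float], dim: int) -> List[float]:
    """Hypothetical helper: truncate an embedding to `dim` dimensions
    and L2-renormalize it (the scheme OpenAI describes for shortening
    text-embedding-3 vectors)."""
    truncated = vec[:dim]
    norm = math.sqrt(sum(x * x for x in truncated))
    # Avoid division by zero for an all-zero prefix
    if norm == 0.0:
        return truncated
    return [x / norm for x in truncated]


# Example: a 1536-dimensional vector reduced to the collection's 1024 dimensions
vec_1024 = shorten_embedding([1.0] * 1536, 1024)
```

This only changes the vector length; the alternative is to recreate the collection with the dimensionality your embedding model actually emits.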
Why am I still getting errors when I use this configuration?
What version are you using? Your error messages are too brief.
The version is V0.5.10. I don't get any other error messages.
Replacing the embed_documents function in dbgpt/rag/embedding/embeddings.py with the following resolves this problem:
def embed_documents(self, texts: List[str]) -> List[List[float]]:
    """Get the embeddings for a list of texts.

    Args:
        texts (Documents): A list of texts to get embeddings for.

    Returns:
        Embedded texts as List[List[float]], where each inner List[float]
        corresponds to a single input text.
    """
    from dashscope import TextEmbedding

    embeddings = []
    # Process the texts in batches of at most 25 per request
    for i in range(0, len(texts), 25):
        batch_texts = texts[i : i + 25]
        resp = TextEmbedding.call(
            model=self.model_name, input=batch_texts, api_key=self._api_key
        )
        if "output" not in resp:
            raise RuntimeError(resp["message"])
        # Extract the embeddings and restore the original input order
        batch_embeddings = resp["output"]["embeddings"]
        sorted_embeddings = sorted(batch_embeddings, key=lambda e: e["text_index"])
        embeddings.extend([result["embedding"] for result in sorted_embeddings])
    return embeddings
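To illustrate why the sort by text_index matters, here is a minimal standalone sketch with a mocked response shaped like DashScope's output (no API call; the entries are deliberately out of input order):

```python
# Mocked "output"/"embeddings" entries, out of input order, shaped like
# the DashScope response that embed_documents parses.
batch_embeddings = [
    {"text_index": 2, "embedding": [0.3]},
    {"text_index": 0, "embedding": [0.1]},
    {"text_index": 1, "embedding": [0.2]},
]

# Sorting by text_index realigns results with the original batch order.
sorted_embeddings = sorted(batch_embeddings, key=lambda e: e["text_index"])
ordered = [e["embedding"] for e in sorted_embeddings]
# ordered is now aligned with the input texts: [[0.1], [0.2], [0.3]]
```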
If the tests pass, you can submit a fix PR to address this issue. 😊
Search before asking
Operating system information
Linux
Python version information
3.10
DB-GPT version
main
Related scenes
Installation Information
[X] Installation From Source
[ ] Docker Installation
[ ] Docker Compose Installation
[ ] Cluster Installation
[ ] AutoDL Image
[ ] Other
Device information
CPU
Models information
LLM: zhipu_proxyllm, Embedding model: proxy_tongyi
What happened
When I parse a pdf document with the tongyi embedding model, I get an error: document embedding, failed:海澜之家2023年报.pdf, 'NoneType' object is not subscriptable
log.txt
What you expected to happen
1. I configured tongyi embedding following the official documentation, but it still reports an error. 2. Other embedding models parse the document without issue, which shows the document itself is fine.
How to reproduce
1. Use tongyi embedding: text-embedding-v1. 2. Start DB-GPT and upload a pdf document for parsing.
Additional context
No response
Are you willing to submit PR?