eosphoros-ai / DB-GPT

AI Native Data App Development framework with AWEL(Agentic Workflow Expression Language) and Agents
http://docs.dbgpt.cn
MIT License

[Bug] [Embedding] Tongyiproxy embedding model error #1748

Closed · dj-jack001 closed 17 hours ago

dj-jack001 commented 3 months ago

Search before asking

Operating system information

Linux

Python version information

3.10

DB-GPT version

main

Related scenes

Installation Information

Device information

CPU

Models information

LLM: zhipu_proxyllm; Embedding model: proxy_tongyi

What happened

When I parse a PDF document with the Tongyi embedding model, I get an error: document embedding, failed:海澜之家2023年报.pdf, 'NoneType' object is not subscriptable

log.txt

What you expected to happen

  1. I followed the official documentation's guidelines to configure the Tongyi embedding model, but it still reports an error.
  2. Other embedding models parse the same document without any problem, so the document itself is fine.

How to reproduce

  1. Use the Tongyi embedding model text-embedding-v1.
  2. Start DB-GPT and pass in the PDF file for parsing.

Additional context

No response

Are you willing to submit PR?

helloworld1973 commented 3 months ago

same problem

dusens commented 3 months ago

    EMBEDDING_MODEL=proxy_tongyi
    proxy_tongyi_proxy_backend=text-embedding-v1
    proxy_tongyi_proxy_api_key={your-api-key}

You should use this config, but using it there is another problem:

document embedding, failed:123.md, Embedding dimension 1536 does not match collection dimensionality 1024
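For context, this mismatch comes from the vector store rather than the model: a collection's dimensionality is fixed by the first vectors written into it, so a knowledge space built with a 1024-dimension embedding model rejects the 1536-dimension vectors that text-embedding-v1 returns. A minimal sketch of how the error arises, assuming the default Chroma vector store (the collection name and vectors here are made up for illustration):

    import chromadb

    client = chromadb.Client()
    collection = client.create_collection("docs")

    # The first write fixes the collection's dimensionality, here 1024
    # (e.g. vectors produced by a previously configured embedding model).
    collection.add(ids=["a"], embeddings=[[0.0] * 1024])

    # A later write with 1536-dimension vectors (what text-embedding-v1
    # returns) raises:
    #   Embedding dimension 1536 does not match collection dimensionality 1024
    collection.add(ids=["b"], embeddings=[[0.0] * 1536])

Recreating the knowledge space (so the collection is rebuilt at 1536 dimensions) or switching back to a 1024-dimension model avoids the mismatch.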

kuschzzp commented 2 months ago

Same problem: Embedding dimension 1536 does not match collection dimensionality 1024. Have you found a solution? @dusens

dusens commented 2 months ago

There is a place in the project where the 1024 value is set; you can search for it. Whether the vectors are 1024 or 1536 dimensions depends on your business scenario. OpenAI has related functionality for vector compression that lets you reduce 1536-dimension vectors to 1024; you can check that out. @kuschzzp
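For reference, the OpenAI-side dimension reduction mentioned above is a parameter on the embeddings endpoint; a minimal sketch, assuming the official openai Python SDK and a text-embedding-3 model (Tongyi's text-embedding-v1 has no equivalent parameter):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # text-embedding-3-small natively returns 1536-dimension vectors;
    # the dimensions parameter shortens them server-side so they fit a
    # 1024-dimension collection.
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input="sample chunk",
        dimensions=1024,
    )
    vector = resp.data[0].embedding
    assert len(vector) == 1024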

tccgogogo commented 1 month ago

    EMBEDDING_MODEL=proxy_tongyi
    proxy_tongyi_proxy_backend=text-embedding-v1
    proxy_tongyi_proxy_api_key={your-api-key}

Why am I still getting errors when I use this configuration? (screenshot attached)

dusens commented 1 month ago


What version are you using? Your error messages are too brief.

tccgogogo commented 1 month ago


The version is V0.5.10. I don't have any more detailed error messages. (screenshot attached)

mzaispace commented 1 month ago


Replacing the embed_documents function in dbgpt/rag/embedding/embeddings.py with the following resolves the problem:

def embed_documents(self, texts: List[str]) -> List[List[float]]:
    """Get the embeddings for a list of texts.

    Args:
        texts (Documents): A list of texts to get embeddings for.

    Returns:
        Embedded texts as List[List[float]], where each inner List[float]
            corresponds to a single input text.
    """
    from dashscope import TextEmbedding

    embeddings = []

    # Process the texts in batches of at most 25, the largest batch size
    # the DashScope embedding API accepts per call.
    for i in range(0, len(texts), 25):
        batch_texts = texts[i : i + 25]
        resp = TextEmbedding.call(
            model=self.model_name, input=batch_texts, api_key=self._api_key
        )
        if "output" not in resp:
            raise RuntimeError(resp["message"])

        # Extract the embeddings and restore the original input order.
        batch_embeddings = resp["output"]["embeddings"]
        sorted_embeddings = sorted(batch_embeddings, key=lambda e: e["text_index"])
        embeddings.extend([result["embedding"] for result in sorted_embeddings])

    return embeddings
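A note on why the batching fixes the original 'NoneType' error: DashScope's text-embedding-v1 appears to accept at most 25 texts per request, which is why the patch batches by 25. With a larger batch the service returns an error response whose output field is None, so the unbatched resp["output"]["embeddings"] access raised 'NoneType' object is not subscriptable. A quick standalone check, assuming the dashscope SDK and a DASHSCOPE_API_KEY in the environment (the exact status code and message depend on the service):

    from dashscope import TextEmbedding

    # 30 inputs exceeds the per-call limit, so the call fails:
    # resp.output is None and resp.message describes the error, which is
    # exactly what made the unbatched resp["output"]["embeddings"] blow up.
    resp = TextEmbedding.call(model="text-embedding-v1", input=["hello"] * 30)
    print(resp.status_code, resp.message)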
dusens commented 1 month ago


If the tests pass, you can submit a fix PR to address this issue. 😊