Interesting question. I think you can try directly replacing the embedding model (though it's not guaranteed to work). However, I suspect that a multimodal model's text embeddings may be less detailed than those of a dedicated text embedding model, which could lead to a loss of accuracy. Another approach would be to use a multimodal large language model such as GPT-4o or qwen-VL-Max (100 million free tokens for new users) to first generate descriptive text from the image, and then use that text for querying and matching. This way, you can continue to use the text embedding model. I look forward to your further experimental results.
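For reference, here is a minimal sketch of that caption-first step, assuming the official OpenAI Python client and GPT-4o; the image path and prompt wording are just placeholders:

```python
# Sketch: turn an image into descriptive text with a multimodal LLM,
# so the existing text-embedding pipeline can be reused unchanged.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def describe_image(image_path: str) -> str:
    """Ask GPT-4o for a detailed textual description of the image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

print(describe_image("example.jpg"))  # hypothetical file name
```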
My understanding is to first use GPT-4o or another multimodal large model to generate a descriptive caption for the image I want to use, then concatenate it with the query intended for LightRAG retrieval (e.g., "What's this?") to form a new query, which is then used to produce the retrieval-based answer. Do you think this approach is reasonable?
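If it helps, a rough, hedged sketch of that concatenation step against LightRAG's query API; the working directory, caption string, and query mode are placeholders, and the caption would come from a multimodal model such as GPT-4o:

```python
# Sketch: prepend the image caption to the user's question and send the
# combined text through LightRAG's normal text retrieval path.
from lightrag import LightRAG, QueryParam

def query_with_image_caption(rag: LightRAG, caption: str, question: str) -> str:
    combined = f"Image description: {caption}\nQuestion: {question}"
    return rag.query(combined, param=QueryParam(mode="hybrid"))

rag = LightRAG(working_dir="./rag_storage")  # assumes an already-built index
caption = "A printed circuit board with a bulging capacitor."  # from GPT-4o etc.
print(query_with_image_caption(rag, caption, "What's this?"))
```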
Yes, I think it is reasonable.
I want to use images as queries for retrieval. Can I do this by directly replacing the model in embedding_func with a multimodal large model? If not, please tell me what needs to be changed. Thank you.
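For context, embedding_func is a constructor argument of LightRAG, so any replacement model would have to be wrapped in the same interface (an async batch function returning one vector per input text). A hedged sketch of where the swap would go, with my_multimodal_embed standing in for whatever multimodal model is chosen:

```python
# Sketch: the embedding_func hook where a multimodal embedding model
# could be plugged in, if its text embeddings prove detailed enough.
import numpy as np
from lightrag import LightRAG
from lightrag.utils import EmbeddingFunc

async def my_multimodal_embed(texts: list[str]) -> np.ndarray:
    # Hypothetical placeholder: a real implementation would call the chosen
    # multimodal embedding model and return one vector per input text.
    return np.zeros((len(texts), 1024))

rag = LightRAG(
    working_dir="./rag_storage",
    embedding_func=EmbeddingFunc(
        embedding_dim=1024,   # must match the chosen model's output size
        max_token_size=8192,  # assumption; depends on the model
        func=my_multimodal_embed,
    ),
)
```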