HKUDS / LightRAG

"LightRAG: Simple and Fast Retrieval-Augmented Generation"
https://arxiv.org/abs/2410.05779
MIT License

Regarding modifying LightRAG for multimodal tasks #237

Open SLKAlgs opened 4 days ago

SLKAlgs commented 4 days ago

I am currently planning to prepend an image to the query, so that the query consists of an image plus a question about it; the system then searches the provided documents to find the answer. My idea is to first use GPT-4o or another multimodal large model to generate a descriptive caption for the image, and then concatenate that caption with the query intended for LightRAG retrieval (e.g., "What's this?") to form a new query, which in turn produces the retrieval-based answer. Do you think this approach is reasonable?
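For what it's worth, a minimal sketch of that pipeline, assuming the OpenAI Python SDK for the captioning step and LightRAG's standard insert/query API (import paths may differ across versions); the working directory, file names, and prompt wording are placeholders:

```python
# Sketch only: caption the query image with a multimodal model, then
# prepend the caption to the textual question before querying LightRAG.
import base64

from openai import OpenAI
from lightrag import LightRAG, QueryParam
from lightrag.llm import gpt_4o_mini_complete  # path may vary by version


def caption_image(path: str) -> str:
    """Ask GPT-4o for a dense textual description of the image."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in detail for retrieval."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content


rag = LightRAG(
    working_dir="./rag_storage",          # assumes an already-indexed corpus
    llm_model_func=gpt_4o_mini_complete,
)

caption = caption_image("query_image.jpg")  # placeholder file name
question = "What's this?"

# Concatenate caption and question into the new query described above.
augmented_query = f"Image description: {caption}\nQuestion: {question}"
print(rag.query(augmented_query, param=QueryParam(mode="hybrid")))
```

Keeping the caption and the question as separately labeled segments in the query string may help the retriever distinguish image context from the actual question being asked.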

LarFii commented 1 day ago

This might require more detailed discussion. For multimodal tasks, the construction of the knowledge graph (KG) might need to differ. That said, I think it's also worth trying to convert all of the images into text using a multimodal large model.
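If it helps, a rough sketch of that indexing-side variant, reusing the hypothetical `caption_image()` helper from the sketch above; the corpus directory and the `[Image: ...]` tagging format are assumptions for illustration, not LightRAG conventions:

```python
# Sketch only: describe every corpus image as text up front, then insert
# those descriptions so the KG is built entirely from text.
from pathlib import Path

from lightrag import LightRAG
from lightrag.llm import gpt_4o_mini_complete  # path may vary by version

rag = LightRAG(
    working_dir="./rag_storage",
    llm_model_func=gpt_4o_mini_complete,
)

for img_path in sorted(Path("./corpus_images").glob("*.jpg")):
    description = caption_image(str(img_path))  # helper from the sketch above
    # Tag each description with its source file so the extracted entities
    # and relations keep provenance back to the original image.
    rag.insert(f"[Image: {img_path.name}] {description}")
```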