Open · SLKAlgs opened this issue 4 days ago
This might require more detailed discussion. For multimodal tasks, the construction of the KG might need to differ. However, I think it’s also worth trying to convert all the images into text using a multimodal large model.
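For the indexing side, something along these lines might be worth trying (a rough sketch only, not a tested implementation: it assumes the OpenAI Python SDK for GPT-4o captioning, and the LightRAG constructor call is simplified since the required arguments depend on the installed version; file paths, prompts, and the working directory are placeholders):

```python
import base64
from openai import OpenAI
from lightrag import LightRAG

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def image_to_text(image_path: str) -> str:
    """Ask a multimodal model to describe an image as plain text."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # any multimodal model with vision input would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in detail so it can be indexed as text."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content


# Convert every image in the corpus to text, then insert that text into the
# knowledge graph alongside the regular documents.
rag = LightRAG(working_dir="./rag_storage")  # simplified; version-dependent args omitted
for path in ["figure1.jpg", "figure2.jpg"]:  # hypothetical image files
    rag.insert(image_to_text(path))
```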
I am currently planning to prepend an image to the query, so each query consists of an image plus a question about it, and the system then searches the provided documents for the answer. My plan is to first use GPT-4o (or another multimodal large model) to generate a descriptive caption for the image, then concatenate that caption with the question intended for LightRAG retrieval (e.g., "What's this?") to form a new text-only query, which LightRAG answers via retrieval. Do you think this approach is reasonable?
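Here is a rough sketch of the query-side flow I have in mind (again assuming the OpenAI Python SDK for the GPT-4o captioning step and LightRAG's `query` API; the model choice, prompts, file names, and the simplified LightRAG setup are all placeholder assumptions, and the documents are assumed to have been inserted already):

```python
import base64
from openai import OpenAI
from lightrag import LightRAG, QueryParam

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def caption_image(image_path: str) -> str:
    """Generate a descriptive caption for the query image with a multimodal model."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Give a detailed, factual caption for this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content


# Simplified LightRAG setup; constructor arguments depend on the installed version.
rag = LightRAG(working_dir="./rag_storage")

question = "What's this?"                   # the original text query
caption = caption_image("query_image.jpg")  # hypothetical query image
combined_query = f"Image description: {caption}\nQuestion: {question}"

# Run normal text-only retrieval over the combined caption + question.
answer = rag.query(combined_query, param=QueryParam(mode="hybrid"))
print(answer)
```

The main thing I am unsure about is whether a single generic caption preserves enough visual detail for retrieval, or whether the captioning prompt should be conditioned on the question itself.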