ahnjaewoo / MPCHAT

📸 Code and Dataset for our ACL 2023 paper: "MPCHAT: Towards Multimodal Persona-Grounded Conversation"
Creative Commons Attribution 4.0 International

Multimodal Memory and Experiences for Dialog System Using MPCHAT #5

Closed — kassy11 closed this issue 6 months ago

kassy11 commented 6 months ago

Hello, thank you for your amazing work!

I believe that in order for AI and dialogue systems to operate autonomously, they should have multimodal memories and experiences, such as scene and episodic memories (much like the memories shown in the movie "After Yang" [1]).

For a prototype, I am considering the following system: it treats a multimodal vector DB as its own memory and experience, and responds to human utterances using that multimodal memory. The data stored in the DB is assumed to be combinations of images and text related to episodic memory, such as MPCHAT [2]. A rough code sketch follows the diagram below.

[Screenshot (2024-01-09): diagram of the proposed system]
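Concretely, here is a rough sketch of the retrieve-then-respond loop I have in mind. It assumes CLIP-style embeddings via open_clip and FAISS as the vector DB; the function names and the embedding choice are placeholders of my own, not tied to MPCHAT's released code.

```python
# Sketch: episodic-memory store + retrieval for response generation.
# Assumes open_clip for embeddings and FAISS for the vector DB;
# all names here are placeholders, not an existing implementation.
import faiss
import numpy as np
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

index = faiss.IndexFlatIP(512)   # inner product over normalized CLIP vectors
memories = []                    # parallel list of (image_path, caption) "episodes"

def add_memory(image_path: str, caption: str) -> None:
    """Embed an (image, text) episode and store it as the system's own memory."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(tokenizer([caption]))
    emb = torch.nn.functional.normalize(img_emb + txt_emb, dim=-1)
    index.add(emb.numpy().astype(np.float32))
    memories.append((image_path, caption))

def recall(utterance: str, k: int = 3):
    """Retrieve the k stored episodes most relevant to the user's utterance."""
    with torch.no_grad():
        q = model.encode_text(tokenizer([utterance]))
    q = torch.nn.functional.normalize(q, dim=-1).numpy().astype(np.float32)
    _, ids = index.search(q, k)
    return [memories[i] for i in ids[0]]

# The retrieved episodes (image + caption) would then be passed to an MLLM
# prompted to talk about them in the first person, as its own experiences.
```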

However, current multimodal LLMs specialize in understanding images from a third-person perspective and cannot treat images as the system's own memories or experiences.

When creating a prototype like this using MPCHAT, do I need to perform fine-tuning instead of using multimodal RAG? Also, do you have any plans for follow-up research to MPCHAT along these lines?

Thank you.

References:
[1] Yang's Memories Scene from AFTER YANG, https://youtu.be/cIJ8-HGWlKw?feature=shared
[2] MPCHAT: Towards Multimodal Persona-Grounded Conversation, https://arxiv.org/abs/2305.17388

ahnjaewoo commented 6 months ago

Hello,

Thank you so much for reaching out and for your kind words about our MPCHAT dataset! Your concept of a system that uses a multimodal vector database as its own memory and experience is truly fascinating.

For your prototype, employing a recent multimodal Large Language Model (MLLM) or a multimodal Retrieval-Augmented Generation (MRAG) approach would be a promising starting point. These models have been trained on extensive datasets, including the egocentric Ego4D dataset, which could also serve as episodic memories. This aligns with the ideas in Otter [1] and MIMIC-IT [2], where such models demonstrate an understanding of experiences from a first-person perspective. Once your MLLM is in place, fine-tuning it on the MPCHAT dataset could further tailor it to your requirements and strengthen its ability to use multimodal memories in dialogue.
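To make the fine-tuning step concrete, below is a minimal sketch of how one might format MPCHAT-style episodes into (image, prompt, target) triples for instruction tuning an MLLM. The field names (persona_image_path, persona_sentence, context, response) are illustrative assumptions rather than the exact schema of the released files, so please adapt them to the actual JSON fields.

```python
# Minimal sketch: turning an MPCHAT-style episode into a training triple
# for an MLLM. Field names are illustrative, not the released schema.
from dataclasses import dataclass

@dataclass
class TrainingExample:
    persona_image: str   # path to the speaker's multimodal-persona image
    prompt: str          # instruction + persona sentence + dialogue context
    target: str          # the speaker's ground-truth next response

def build_example(episode: dict) -> TrainingExample:
    context = "\n".join(
        f"{turn['speaker']}: {turn['text']}" for turn in episode["context"]
    )
    prompt = (
        "You are the speaker whose memory is shown in the image.\n"
        f"Memory caption: {episode['persona_sentence']}\n"
        f"Dialogue so far:\n{context}\n"
        "Reply as yourself, grounded in your memory:"
    )
    return TrainingExample(
        persona_image=episode["persona_image_path"],
        prompt=prompt,
        target=episode["response"],
    )
```

Any standard MLLM fine-tuning recipe could then consume these triples, pairing the persona image with the prompt as input and the response as the supervision target.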

As for follow-up research, while I am not currently conducting any, I am indeed interested in the idea of expanding MPCHAT into the video domain, particularly integrating AR/VR glasses to create a more immersive experience. This could potentially open up new avenues for research and development in multimodal dialogue systems.

[1] Otter: A Multi-Modal Model with In-Context Instruction Tuning, https://arxiv.org/abs/2305.03726
[2] MIMIC-IT: Multi-Modal In-Context Instruction Tuning, https://arxiv.org/abs/2306.05425