14 New submissions for Wed, 24 Jan 24

Keyword: alignment

Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment

Authors: Keming Lu, Bowen Yu, Chang Zhou, Jingren Zhou
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.12474
Pdf link: https://arxiv.org/pdf/2401.12474
Abstract Considerable efforts have been invested in augmenting the role-playing proficiency of open-source large language models (LLMs) by emulating proprietary counterparts. Nevertheless, we posit that LLMs inherently harbor role-play capabilities, owing to the extensive knowledge of characters and potential dialogues ingrained in their vast training corpora. Thus, in this study, we introduce Ditto, a self-alignment method for role-play. Ditto capitalizes on character knowledge, encouraging an instruction-following LLM to simulate role-play dialogues as a variant of reading comprehension. This method creates a role-play training set comprising 4,000 characters, surpassing the scale of currently available datasets by tenfold regarding the number of roles. Subsequently, we fine-tune the LLM using this self-generated dataset to augment its role-playing capabilities. Upon evaluating our meticulously constructed and reproducible role-play benchmark and the roleplay subset of MT-Bench, Ditto, in various parameter scales, consistently maintains a consistent role identity and provides accurate role-specific knowledge in multi-turn role-play conversations. Notably, it outperforms all open-source role-play baselines, showcasing performance levels comparable to advanced proprietary chatbots. Furthermore, we present the first comprehensive cross-supervision alignment experiment in the role-play domain, revealing that the intrinsic capabilities of LLMs confine the knowledge within role-play. Meanwhile, the role-play styles can be easily acquired with the guidance of smaller models. We open-source related resources at https://github.com/OFA-Sys/Ditto.
Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model
Authors: Zhiwei He, Xing Wang, Wenxiang Jiao, Zhuosheng Zhang, Rui Wang, Shuming Shi, Zhaopeng Tu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.12873
Pdf link: https://arxiv.org/pdf/2401.12873
Abstract Insufficient modeling of human preferences within the reward model is a major obstacle for leveraging human feedback to improve translation quality. Fortunately, quality estimation (QE), which predicts the quality of a given translation without reference, has achieved impressive alignment with human evaluations in the last two years. In this work, we investigate the potential of employing the QE model as the reward model (the QE-based reward model) to predict human preferences for feedback training. We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines. We examine the problem and argue that the vulnerability of the QE model might lead to high rewards for incorrect translations, resulting in overoptimization and error propagation. To address the problem, we adopt a simple yet effective method that uses heuristic rules to detect the incorrect translations and assigns a penalty term to the QE-based rewards for the detected incorrect translations. Experimental results show that the proposed QE-based feedback training achieves consistent and significant improvements across various settings, further verified through human preference studies. Our subsequent analysis demonstrates the high data efficiency of the proposed QE-based feedback training: the proposed approach using a small amount of monolingual data can outperform systems using larger parallel corpora.
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments
Authors: Zhengxuan Wu, Atticus Geiger, Jing Huang, Aryaman Arora, Thomas Icard, Christopher Potts, Noah D. Goodman
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.12631
Pdf link: https://arxiv.org/pdf/2401.12631
Abstract We respond to the recent paper by Makelov et al. (2023), which reviews subspace interchange intervention methods like distributed alignment search (DAS; Geiger et al. 2023) and claims that these methods potentially cause "interpretability illusions". We first review Makelov et al. (2023)'s technical notion of what an "interpretability illusion" is, and then we show that even intuitive and desirable explanations can qualify as illusions in this sense. As a result, their method of discovering "illusions" can reject explanations they consider "non-illusory". We then argue that the illusions Makelov et al. (2023) see in practice are artifacts of their training and evaluation paradigms. We close by emphasizing that, though we disagree with their core characterization, Makelov et al. (2023)'s examples and discussion have undoubtedly pushed the field of interpretability forward.
Gradient Flow of Energy: A General and Efficient Approach for Entity Alignment Decoding
Authors: Yuanyi Wang, Haifeng Sun, Jingyu Wang, Qi Qi, Shaoling Sun, Jianxin Liao
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.12798
Pdf link: https://arxiv.org/pdf/2401.12798
Abstract Entity alignment (EA), a pivotal process in integrating multi-source Knowledge Graphs (KGs), seeks to identify equivalent entity pairs across these graphs. Most existing approaches regard EA as a graph representation learning task, concentrating on enhancing graph encoders. However, the decoding process in EA - essential for effective operation and alignment accuracy - has received limited attention and remains tailored to specific datasets and model architectures, necessitating both entity and additional explicit relation embeddings. This specificity limits its applicability, particularly in GNN-based models. To address this gap, we introduce a novel, generalized, and efficient decoding approach for EA, relying solely on entity embeddings. Our method optimizes the decoding process by minimizing Dirichlet energy, leading to the gradient flow within the graph, to promote graph homophily. The discretization of the gradient flow produces a fast and scalable approach, termed Triple Feature Propagation (TFP). TFP innovatively channels gradient flow through three views: entity-to-entity, entity-to-relation, and relation-to-entity. This generalized gradient flow enables TFP to harness the multi-view structural information of KGs. Rigorous experimentation on diverse real-world datasets demonstrates that our approach significantly enhances various EA methods. Notably, the approach achieves these advancements with less than 6 seconds of additional computational time, establishing a new benchmark in efficiency and adaptability for future EA methods.
Red Teaming Visual Language Models
Authors: Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, Qi Liu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.12915
Pdf link: https://arxiv.org/pdf/2401.12915
Abstract VLMs (Vision-Language Models) extend the capabilities of LLMs (Large Language Models) to accept multimodal inputs. Since it has been verified that LLMs can be induced to generate harmful or inaccurate content through specific test cases (termed as Red Teaming), how VLMs perform in similar scenarios, especially with their combination of textual and visual inputs, remains a question. To explore this problem, we present a novel red teaming dataset RTVLM, which encompasses 10 subtasks (e.g., image misleading, multi-modal jail-breaking, face fairness, etc) under 4 primary aspects (faithfulness, privacy, safety, fairness). Our RTVLM is the first red-teaming dataset to benchmark current VLMs in terms of these 4 different aspects. Detailed analysis shows that 10 prominent open-sourced VLMs struggle with the red teaming in different degrees and have up to 31% performance gap with GPT-4V. Additionally, we simply apply red teaming alignment to LLaVA-v1.5 with Supervised Fine-tuning (SFT) using RTVLM, and this bolsters the models' performance with 10% in RTVLM test set, 13% in MM-Hal, and without noticeable decline in MM-Bench, overpassing other LLaVA-based models with regular alignment data. This reveals that current open-sourced VLMs still lack red teaming alignment. Our code and datasets will be open-source.
Keyword: aligning

There is no result

Keyword: align

The Ethics of Interaction: Mitigating Security Threats in LLMs
Authors: Ashutosh Kumar, Sagarika Singh, Shiv Vignesh Murty, Swathy Ragupathy
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.12273
Pdf link: https://arxiv.org/pdf/2401.12273
Abstract This paper comprehensively explores the ethical challenges arising from security threats to Language Learning Models (LLMs). These intricate digital repositories are increasingly integrated into our daily lives, making them prime targets for attacks that can compromise their training data and the confidentiality of their data sources. The paper delves into the nuanced ethical repercussions of such security threats on society and individual privacy. We scrutinize five major threats: prompt injection, jailbreaking, Personal Identifiable Information (PII) exposure, sexually explicit content, and hate based content, going beyond mere identification to assess their critical ethical consequences and the urgency they create for robust defensive strategies. The escalating reliance on LLMs underscores the crucial need for ensuring these systems operate within the bounds of ethical norms, particularly as their misuse can lead to significant societal and individual harm. We propose conceptualizing and developing an evaluative tool tailored for LLMs, which would serve a dual purpose, guiding developers and designers in preemptive fortification of backend systems and scrutinizing the ethical dimensions of LLM chatbot responses during the testing phase. By comparing LLM responses with those expected from humans in a moral context, we aim to discern the degree to which AI behaviors align with the ethical values held by a broader society. Ultimately, this paper not only underscores the ethical troubles presented by LLMs, it also highlights a path toward cultivating trust in these systems.
AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
Authors: Michael Ahn, Debidatta Dwibedi, Chelsea Finn, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Karol Hausman, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Sean Kirmani, Isabel Leal, Edward Lee, Sergey Levine, Yao Lu, Isabel Leal, Sharath Maddineni, Kanishka Rao, Dorsa Sadigh, Pannag Sanketi, Pierre Sermanet, Quan Vuong, Stefan Welker, Fei Xia, Ted Xiao, Peng Xu, Steve Xu, Zhuo Xu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.12963
Pdf link: https://arxiv.org/pdf/2401.12963
Abstract Foundation models that incorporate language, vision, and more recently actions have revolutionized the ability to harness internet scale data to reason about useful tasks. However, one of the key challenges of training embodied foundation models is the lack of data grounded in the physical world. In this paper, we propose AutoRT, a system that leverages existing foundation models to scale up the deployment of operational robots in completely unseen scenarios with minimal human supervision. AutoRT leverages vision-language models (VLMs) for scene understanding and grounding, and further uses large language models (LLMs) for proposing diverse and novel instructions to be performed by a fleet of robots. Guiding data collection by tapping into the knowledge of foundation models enables AutoRT to effectively reason about autonomy tradeoffs and safety while significantly scaling up data collection for robot learning. We demonstrate AutoRT proposing instructions to over 20 robots across multiple buildings and collecting 77k real robot episodes via both teleoperation and autonomous robot policies. We experimentally show that such "in-the-wild" data collected by AutoRT is significantly more diverse, and that AutoRT's use of LLMs allows for instruction following data collection robots that can align to human preferences.
Keyword: vision language

There is no result

Keyword: vision-language

The Neglected Tails of Vision-Language Models
Authors: Shubham Parashar, Zhiqiu Lin, Tian Liu, Xiangjue Dong, Yanan Li, Deva Ramanan, James Caverlee, Shu Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.12425
Pdf link: https://arxiv.org/pdf/2401.12425
Abstract Vision-language models (VLMs) excel in zero-shot recognition but exhibit drastically imbalanced performance across visual concepts. For example, CLIP, despite an impressive mean zero-shot accuracy on ImageNet (72.7%), yields $<$10% on ten concepts (e.g., gyromitra and night snake), presumably, because these concepts are under-represented in VLMs' imbalanced pretraining data. Yet, assessing this imbalance is challenging as it is non-trivial to calculate the frequency of specific concepts within VLMs' large-scale pretraining data. Our work makes the first attempt to measure the concept frequency by analyzing pretraining texts. We use off-the-shelf language models to help count relevant texts that contain synonyms of the given concepts and resolve linguistic ambiguity. We confirm that popular VLM datasets like LAION indeed exhibit long-tailed concept distributions, which strongly correlate with per-class accuracies. Further, contemporary multimodal systems, e.g., visual chatbots and text-to-image generators, also struggle with the rare concepts identified by our method. To mitigate VLMs' imbalanced performance in zero-shot recognition, we propose REtrieval-Augmented Learning REAL. First, instead of prompting VLMs using the original class names, REAL uses their most frequent synonyms found in VLMs' pretraining texts. This already outperforms human-engineered and LLM-generated prompts over nine benchmark datasets, likely because VLMs have seen more images associated with the frequently used synonyms. Second, REAL uses all the concept synonyms to retrieve a small, class-balanced set of pretraining data to train a robust classifier. REAL surpasses the recent retrieval-augmented solution REACT, using 400x less storage and 10,000x less training time!
Keyword: language-vision

There is no result

Keyword: phrase-grounding

There is no result

Keyword: phrase grounding

There is no result

Keyword: reference expression comprehension

There is no result

Keyword: chest

Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding
Authors: Mirac Suzgun, Adam Tauman Kalai
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2401.12954
Pdf link: https://arxiv.org/pdf/2401.12954
Abstract We introduce meta-prompting, an effective scaffolding technique designed to enhance the functionality of language models (LMs). This approach transforms a single LM into a multi-faceted conductor, adept at managing and integrating multiple independent LM queries. By employing high-level instructions, meta-prompting guides the LM to break down complex tasks into smaller, more manageable subtasks. These subtasks are then handled by distinct "expert" instances of the same LM, each operating under specific, tailored instructions. Central to this process is the LM itself, in its role as the conductor, which ensures seamless communication and effective integration of the outputs from these expert models. It additionally employs its inherent critical thinking and robust verification processes to refine and authenticate the end result. This collaborative prompting approach empowers a single LM to simultaneously act as a comprehensive orchestrator and a panel of diverse experts, significantly enhancing its performance across a wide array of tasks. The zero-shot, task-agnostic nature of meta-prompting greatly simplifies user interaction by obviating the need for detailed, task-specific instructions. Furthermore, our research demonstrates the seamless integration of external tools, such as a Python interpreter, into the meta-prompting framework, thereby broadening its applicability and utility. Through rigorous experimentation with GPT-4, we establish the superiority of meta-prompting over conventional scaffolding methods: When averaged across all tasks, including the Game of 24, Checkmate-in-One, and Python Programming Puzzles, meta-prompting, augmented with a Python interpreter functionality, surpasses standard prompting by 17.1%, expert (dynamic) prompting by 17.3%, and multipersona prompting by 15.2%.
Keyword: x-ray

There is no result

Keyword: clinical

There is no result

Keyword: biomedical

There is no result

Keyword: radiology

There is no result

Keyword: radiography

There is no result

Keyword: medical

There is no result

Keyword: active-learning

There is no result

Keyword: active learning

There is no result

Keyword: chexpert

There is no result

Keyword: vision

Cheap Learning: Maximising Performance of Language Models for Social Data Science Using Minimal Data
Authors: Leonardo Castro-Gonzalez, Yi-Ling Chung, Hannak Rose Kirk, John Francis, Angus R. Williams, Pica Johansson, Jonathan Bright
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.12295
Pdf link: https://arxiv.org/pdf/2401.12295
Abstract The field of machine learning has recently made significant progress in reducing the requirements for labelled training data when building new models. These cheaper' learning techniques hold significant potential for the social sciences, where development of large labelled training datasets is often a significant practical impediment to the use of machine learning for analytical tasks. In this article we review threecheap' techniques that have developed in recent years: weak supervision, transfer learning and prompt engineering. For the latter, we also review the particular case of zero-shot prompting of large language models. For each technique we provide a guide of how it works and demonstrate its application across six different realistic social science applications (two different tasks paired with three different dataset makeups). We show good performance for all techniques, and in particular we demonstrate how prompting of large language models can achieve high accuracy at very low cost. Our results are accompanied by a code repository to make it easy for others to duplicate our work and use it in their own research. Overall, our article is intended to stimulate further uptake of these techniques in the social sciences.
Keyword: visual

Development of an NLP-driven computer-based test guide for visually impaired students
Authors: Tubo Faustinah Nemieboka, Ikechukwu E. Onyenwe, Doris C. Asogwa
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.12375
Pdf link: https://arxiv.org/pdf/2401.12375
Abstract In recent years, advancements in Natural Language Processing (NLP) techniques have revolutionized the field of accessibility and exclusivity of testing, particularly for visually impaired students (VIS). CBT has shown in years back its relevance in terms of administering exams electronically, making the test process easier, providing quicker and more accurate results, and offering greater flexibility and accessibility for candidates. Yet, its relevance was not felt by the visually impaired students as they cannot access printed documents. Hence, in this paper, we present an NLP-driven Computer-Based Test guide for visually impaired students. It employs a speech technology pre-trained methods to provide real-time assistance and support to visually impaired students. The system utilizes NLP technologies to convert the text-based questions and the associated options in a machine-readable format. Subsequently, the speech technology pre-trained model processes the converted text enabling the VIS to comprehend and analyze the content. Furthermore, we validated that this pre-trained model is not perverse by testing for accuracy using sample audio datasets labels (A, B, C, D, E, F, G) to compare with the voice recordings obtained from 20 VIS which is been predicted by the system to attain values for precision, recall, and F1-scores. These metrics are used to assess the performance of the pre-trained model and have indicated that it is proficient enough to give its better performance to the evaluated system. The methodology adopted for this system is Object Oriented Analysis and Design Methodology (OOADM) where Objects are discussed and built by modeling real-world instances.
Keyword: visio-linguistic

There is no result

Keyword: cross-modal

There is no result

Keyword: modality

There is no result

Keyword: modalities

LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools
Authors: Qianli Wang, Tatiana Anikina, Nils Feldhus, Josef van Genabith, Leonhard Hennig, Sebastian Möller
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.12576
Pdf link: https://arxiv.org/pdf/2401.12576
Abstract Interpretability tools that offer explanations in the form of a dialogue have demonstrated their efficacy in enhancing users' understanding, as one-off explanations may occasionally fall short in providing sufficient information to the user. Current solutions for dialogue-based explanations, however, require many dependencies and are not easily transferable to tasks they were not designed for. With LLMCheckup, we present an easily accessible tool that allows users to chat with any state-of-the-art large language model (LLM) about its behavior. We enable LLMs to generate all explanations by themselves and take care of intent recognition without fine-tuning, by connecting them with a broad spectrum of Explainable AI (XAI) tools, e.g. feature attributions, embedding-based similarity, and prompting strategies for counterfactual and rationale generation. LLM (self-)explanations are presented as an interactive dialogue that supports follow-up questions and generates suggestions. LLMCheckup provides tutorials for operations available in the system, catering to individuals with varying levels of expertise in XAI and supports multiple input modalities. We introduce a new parsing strategy called multi-prompt parsing substantially enhancing the parsing accuracy of LLMs. Finally, we showcase the tasks of fact checking and commonsense question answering.
KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning
Authors: Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, Godawari Sudhakar Rao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.12863
Pdf link: https://arxiv.org/pdf/2401.12863
Abstract Large Language Models (LLMs) have demonstrated impressive performance in natural language processing tasks by leveraging chain of thought (CoT) that enables step-by-step thinking. Extending LLMs with multimodal capabilities is the recent interest, but incurs computational cost and requires substantial hardware resources. To address these challenges, we propose KAM-CoT a framework that integrates CoT reasoning, Knowledge Graphs (KGs), and multiple modalities for a comprehensive understanding of multimodal tasks. KAM-CoT adopts a two-stage training process with KG grounding to generate effective rationales and answers. By incorporating external knowledge from KGs during reasoning, the model gains a deeper contextual understanding reducing hallucinations and enhancing the quality of answers. This knowledge-augmented CoT reasoning empowers the model to handle questions requiring external context, providing more informed answers. Experimental findings show KAM-CoT outperforms the state-of-the-art methods. On the ScienceQA dataset, we achieve an average accuracy of 93.87%, surpassing GPT-3.5 (75.17%) by 18% and GPT-4 (83.99%) by 10%. Remarkably, KAM-CoT achieves these results with only 280M trainable parameters at a time, demonstrating its cost-efficiency and effectiveness.
Energy-based Automated Model Evaluation
Authors: Ru Peng, Heming Zou, Haobo Wang, Yawen Zeng, Zenan Huang, Junbo Zhao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.12689
Pdf link: https://arxiv.org/pdf/2401.12689
Abstract The conventional evaluation protocols on machine learning models rely heavily on a labeled, i.i.d-assumed testing dataset, which is not often present in real world applications. The Automated Model Evaluation (AutoEval) shows an alternative to this traditional workflow, by forming a proximal prediction pipeline of the testing performance without the presence of ground-truth labels. Despite its recent successes, the AutoEval frameworks still suffer from an overconfidence issue, substantial storage and computational cost. In that regard, we propose a novel measure -- Meta-Distribution Energy (MDE) -- that allows the AutoEval framework to be both more efficient and effective. The core of the MDE is to establish a meta-distribution statistic, on the information (energy) associated with individual samples, then offer a smoother representation enabled by energy-based learning. We further provide our theoretical insights by connecting the MDE with the classification loss. We provide extensive experiments across modalities, datasets and different architectural backbones to validate MDE's validity, together with its superiority compared with prior approaches. We also prove MDE's versatility by showing its seamless integration with large-scale models, and easy adaption to learning scenarios with noisy- or imbalanced- labels.
Keyword: multi-modal

There is no result

Keyword: multimodal

There is no result

PanagiotisFytas / get-daily-arxiv-noti

14 New submissions for Wed, 24 Jan 24 #447

Keyword: alignment

Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment

Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model

A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments

Gradient Flow of Energy: A General and Efficient Approach for Entity Alignment Decoding

Red Teaming Visual Language Models

Keyword: aligning

Keyword: align

The Ethics of Interaction: Mitigating Security Threats in LLMs

AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents

Keyword: vision language

Keyword: vision-language

The Neglected Tails of Vision-Language Models

Keyword: language-vision

Keyword: phrase-grounding

Keyword: phrase grounding

Keyword: reference expression comprehension

Keyword: chest

Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding

Keyword: x-ray

Keyword: clinical

Keyword: biomedical

Keyword: radiology

Keyword: radiography

Keyword: medical

Keyword: active-learning

Keyword: active learning

Keyword: chexpert

Keyword: vision

Cheap Learning: Maximising Performance of Language Models for Social Data Science Using Minimal Data

Keyword: visual

Development of an NLP-driven computer-based test guide for visually impaired students

Keyword: visio-linguistic

Keyword: cross-modal

Keyword: modality

Keyword: modalities

LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools

KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning

Energy-based Automated Model Evaluation

Keyword: multi-modal

Keyword: multimodal