35 New submissions for Tue, 16 Apr 24

Keyword: alignment

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming

Authors: Simone Tedeschi, Felix Friedrich, Patrick Schramowski, Kristian Kersting, Roberto Navigli, Huu Nguyen, Bo Li
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08676
Pdf link: https://arxiv.org/pdf/2404.08676
Abstract When building Large Language Models (LLMs), it is paramount to bear safety in mind and protect them with guardrails. Indeed, LLMs should never generate content promoting or normalizing harmful, illegal, or unethical behavior that may contribute to harm to individuals or society. This principle applies to both normal and adversarial use. In response, we introduce ALERT, a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It is designed to evaluate the safety of LLMs through red teaming methodologies and consists of more than 45k instructions categorized using our novel taxonomy. By subjecting LLMs to adversarial testing scenarios, ALERT aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models. Furthermore, the fine-grained taxonomy enables researchers to perform an in-depth evaluation that also helps one to assess the alignment with various policies. In our experiments, we extensively evaluate 10 popular open- and closed-source LLMs and demonstrate that many of them still struggle to attain reasonable levels of safety.
Analyzing the Impact of Data Selection and Fine-Tuning on Economic and Political Biases in LLMs
Authors: Ahmed Agiza, Mohamed Mostagir, Sherief Reda
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08699
Pdf link: https://arxiv.org/pdf/2404.08699
Abstract In an era where language models are increasingly integrated into decision-making and communication, understanding the biases within Large Language Models (LLMs) becomes imperative, especially when these models are applied in the economic and political domains. This work investigates the impact of fine-tuning and data selection on economic and political biases in LLM. We explore the methodological aspects of biasing LLMs towards specific ideologies, mindful of the biases that arise from their extensive training on diverse datasets. Our approach, distinct from earlier efforts that either focus on smaller models or entail resource-intensive pre-training, employs Parameter-Efficient Fine-Tuning (PEFT) techniques. These techniques allow for the alignment of LLMs with targeted ideologies by modifying a small subset of parameters. We introduce a systematic method for dataset selection, annotation, and instruction tuning, and we assess its effectiveness through both quantitative and qualitative evaluations. Our work analyzes the potential of embedding specific biases into LLMs and contributes to the dialogue on the ethical application of AI, highlighting the importance of deploying AI in a manner that aligns with societal values.
The Generation Gap:Exploring Age Bias in Large Language Models
Authors: Siyang Liu, Trish Maturi, Siqi Shen, Rada Mihalcea
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.08760
Pdf link: https://arxiv.org/pdf/2404.08760
Abstract In this paper, we explore the alignment of values in Large Language Models (LLMs) with specific age groups, leveraging data from the World Value Survey across thirteen categories. Through a diverse set of prompts tailored to ensure response robustness, we find a general inclination of LLM values towards younger demographics. Additionally, we explore the impact of incorporating age identity information in prompts and observe challenges in mitigating value discrepancies with different age cohorts. Our findings highlight the age bias in LLMs and provide insights for future work.
Compass: Large Multilingual Language Model for South-east Asia
Authors: Sophia Maria
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.09220
Pdf link: https://arxiv.org/pdf/2404.09220
Abstract Large language models have exhibited significant proficiency in languages endowed with extensive linguistic resources, such as English and Chinese. Nevertheless, their effectiveness notably diminishes when applied to languages characterized by limited linguistic resources, particularly within the Southeast Asian linguistic landscape, such as Indonesian. The scarcity of linguistic resources for these languages presents challenges associated with inadequate training, restricted vocabulary coverage, and challenging evaluation processes. In response to these exigencies, we have introduced CompassLLM, a large multilingual model specifically tailored for Southeast Asian languages, with the primary aim of supporting the developmental requirements of Shopee. Our methodology encompasses several key strategies. To progressively enhance multilingual proficiencies, we implemented a multi-stage pre-training strategy integrated with curriculum learning, gradually intensifying the focus on low-resource languages. Concurrently, to better accommodate low-resource human instructions, we curated and generated a repository of high-quality multilingual human instructions, culminating the CompassLLM-SFT model through supervised instruction fine-tuning. Finally, to reinforce the model's alignment with human preference behaviors, we have embraced the principle of Direct Preference Optimization (DPO) to obtain CompassLLM-DPO model. Preliminary evaluation of the CompassLLM model yields promising results, with our model surpassing benchmark models like Vicuna-7b-v1.5, Sealion, Falcon and SeaLLM, across diverse evaluation tasks, as verified through both automated and human-driven assessments. Notably, our model exhibits its superior performance in South-east Asia languages, such as Indonesian language.
Bridging the Gap between Different Vocabularies for LLM Ensemble
Authors: Yangyifan Xu, Jinliang Lu, Jiajun Zhang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.09492
Pdf link: https://arxiv.org/pdf/2404.09492
Abstract Ensembling different large language models (LLMs) to unleash their complementary potential and harness their individual strengths is highly valuable. Nevertheless, vocabulary discrepancies among various LLMs have constrained previous studies to either selecting or blending completely generated outputs. This limitation hinders the dynamic correction and enhancement of outputs during the generation process, resulting in a limited capacity for effective ensemble. To address this issue, we propose a novel method to Ensemble LLMs via Vocabulary Alignment (EVA). EVA bridges the lexical gap among various LLMs, enabling meticulous ensemble at each generation step. Specifically, we first learn mappings between the vocabularies of different LLMs with the assistance of overlapping tokens. Subsequently, these mappings are employed to project output distributions of LLMs into a unified space, facilitating a fine-grained ensemble. Finally, we design a filtering strategy to exclude models that generate unfaithful tokens. Experimental results on commonsense reasoning, arithmetic reasoning, machine translation, and data-to-text generation tasks demonstrate the superiority of our approach compared with individual LLMs and previous ensemble methods conducted on complete outputs. Further analyses confirm that our approach can leverage knowledge from different language models and yield consistent improvement.
Impact of Preference Noise on the Alignment Performance of Generative Language Models
Authors: Yang Gao, Dana Alon, Donald Metzler
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.09824
Pdf link: https://arxiv.org/pdf/2404.09824
Abstract A key requirement in developing Generative Language Models (GLMs) is to have their values aligned with human values. Preference-based alignment is a widely used paradigm for this purpose, in which preferences over generation pairs are first elicited from human annotators or AI systems, and then fed into some alignment techniques, e.g., Direct Preference Optimization. However, a substantial percent (20 - 40%) of the preference pairs used in GLM alignment are noisy, and it remains unclear how the noise affects the alignment performance and how to mitigate its negative impact. In this paper, we propose a framework to inject desirable amounts and types of noise to the preferences, and systematically study the impact of preference noise on the alignment performance in two tasks (summarization and dialogue generation). We find that the alignment performance can be highly sensitive to the noise rates in the preference data: e.g., a 10 percentage points (pp) increase of the noise rate can lead to 30 pp drop in the alignment performance (in win rate). To mitigate the impact of noise, confidence-based data filtering shows significant benefit when certain types of noise are present. We hope our work can help the community better understand and mitigate the impact of preference noise in GLM alignment.
Aligning LLMs for FL-free Program Repair
Authors: Junjielong Xu, Ying Fu, Shin Hwei Tan, Pinjia He
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08877
Pdf link: https://arxiv.org/pdf/2404.08877
Abstract Large language models (LLMs) have achieved decent results on automated program repair (APR). However, the next token prediction training objective of decoder-only LLMs (e.g., GPT-4) is misaligned with the masked span prediction objective of current infilling-style methods, which impedes LLMs from fully leveraging pre-trained knowledge for program repair. In addition, while some LLMs are capable of locating and repairing bugs end-to-end when using the related artifacts (e.g., test cases) as input, existing methods regard them as separate tasks and ask LLMs to generate patches at fixed locations. This restriction hinders LLMs from exploring potential patches beyond the given locations. In this paper, we investigate a new approach to adapt LLMs to program repair. Our core insight is that LLM's APR capability can be greatly improved by simply aligning the output to their training objective and allowing them to refine the whole program without first performing fault localization. Based on this insight, we designed D4C, a straightforward prompting framework for APR. D4C can repair 180 bugs correctly in Defects4J, with each patch being sampled only 10 times. This surpasses the SOTA APR methods with perfect fault localization by 10% and reduces the patch sampling number by 90%. Our findings reveal that (1) objective alignment is crucial for fully exploiting LLM's pre-trained capability, and (2) replacing the traditional localize-then-repair workflow with direct debugging is more effective for LLM-based APR methods. Thus, we believe this paper introduces a new mindset for harnessing LLMs in APR.
Learn Your Reference Model for Real Good Alignment
Authors: Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, Daniil Gavrilov
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.09656
Pdf link: https://arxiv.org/pdf/2404.09656
Abstract The complexity of the alignment problem stems from the fact that existing methods are unstable. Researchers continuously invent various tricks to address this shortcoming. For instance, in the fundamental Reinforcement Learning From Human Feedback (RLHF) technique of Language Model alignment, in addition to reward maximization, the Kullback-Leibler divergence between the trainable policy and the SFT policy is minimized. This addition prevents the model from being overfitted to the Reward Model (RM) and generating texts that are out-of-domain for the RM. The Direct Preference Optimization (DPO) method reformulates the optimization task of RLHF and eliminates the Reward Model while tacitly maintaining the requirement for the policy to be close to the SFT policy. In our paper, we argue that this implicit limitation in the DPO method leads to sub-optimal results. We propose a new method called Trust Region DPO (TR-DPO), which updates the reference policy during training. With such a straightforward update, we demonstrate the effectiveness of TR-DPO against DPO on the Anthropic HH and TLDR datasets. We show that TR-DPO outperforms DPO by up to 19%, measured by automatic evaluation with GPT-4. The new alignment approach that we propose allows us to improve the quality of models across several parameters at once, such as coherence, correctness, level of detail, helpfulness, and harmlessness.
Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Authors: Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Yoshua Bengio, Danqi Chen, Samuel Albanie, Tegan Maharaj, Jakob Foerster, Florian Tramer, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2404.09932
Pdf link: https://arxiv.org/pdf/2404.09932
Abstract This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose $200+$ concrete research questions.
Keyword: aligning

Is Your LLM Outdated? Benchmarking LLMs & Alignment Algorithms for Time-Sensitive Knowledge
Authors: Seyed Mahed Mousavi, Simone Alghisi, Giuseppe Riccardi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.08700
Pdf link: https://arxiv.org/pdf/2404.08700
Abstract We study the appropriateness of Large Language Models (LLMs) as knowledge repositories. We focus on the challenge of maintaining LLMs' factual knowledge up-to-date over time. Motivated by the lack of studies on identifying outdated knowledge within LLMs, we design and develop a dynamic benchmark with up-to-date ground truth answers for each target factual question. We evaluate eighteen open-source and closed-source state-of-the-art LLMs on time-sensitive knowledge retrieved in real-time from Wikidata. We select time-sensitive domain facts in politics, sports, and organizations, and estimate the recency of the information learned by the model during pre-training\fine-tuning. In the second contribution, we evaluate the effectiveness of knowledge editing methods for aligning LLMs with up-to-date factual knowledge and compare their performance with Retrieval Augmented Generation. The dynamic benchmark is designed to be used as-is to assess LLMs's up-to-dateness, as well as to be extended to other domains by sharing the code, the dataset, as well as evaluation and visualization scripts.
Keyword: align

The Comparison of Translationese in Machine Translation and Human Transation in terms of Translation Relations
Authors: Fan Zhou
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.08661
Pdf link: https://arxiv.org/pdf/2404.08661
Abstract This study explores the distinctions between neural machine translation (NMT) and human translation (HT) through the lens of translation relations. It benchmarks HT to assess the translation techniques produced by an NMT system and aims to address three key research questions: the differences in overall translation relations between NMT and HT, how each utilizes non-literal translation techniques, and the variations in factors influencing their use of specific non-literal techniques. The research employs two parallel corpora, each spanning nine genres with the same source texts with one translated by NMT and the other by humans. Translation relations in these corpora are manually annotated on aligned pairs, enabling a comparative analysis that draws on linguistic insights, including semantic and syntactic nuances such as hypernyms and alterations in part-of-speech tagging. The results indicate that NMT relies on literal translation significantly more than HT across genres. While NMT performs comparably to HT in employing syntactic non-literal translation techniques, it falls behind in semantic-level performance.
Real-world Instance-specific Image Goal Navigation for Service Robots: Bridging the Domain Gap with Contrastive Learning
Authors: Taichi Sakaguchi, Akira Taniguchi, Yoshinobu Hagiwara, Lotfi El Hafi, Shoichi Hasegawa, Tadahiro Taniguchi
Subjects: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.09645
Pdf link: https://arxiv.org/pdf/2404.09645
Abstract Improving instance-specific image goal navigation (InstanceImageNav), which locates the identical object in a real-world environment from a query image, is essential for robotic systems to assist users in finding desired objects. The challenge lies in the domain gap between low-quality images observed by the moving robot, characterized by motion blur and low-resolution, and high-quality query images provided by the user. Such domain gaps could significantly reduce the task success rate but have not been the focus of previous work. To address this, we propose a novel method called Few-shot Cross-quality Instance-aware Adaptation (CrossIA), which employs contrastive learning with an instance classifier to align features between massive low- and few high-quality images. This approach effectively reduces the domain gap by bringing the latent representations of cross-quality images closer on an instance basis. Additionally, the system integrates an object image collection with a pre-trained deblurring model to enhance the observed image quality. Our method fine-tunes the SimSiam model, pre-trained on ImageNet, using CrossIA. We evaluated our method's effectiveness through an InstanceImageNav task with 20 different types of instances, where the robot identifies the same instance in a real-world environment as a high-quality query image. Our experiments showed that our method improves the task success rate by up to three times compared to the baseline, a conventional approach based on SuperGlue. These findings highlight the potential of leveraging contrastive learning and image enhancement techniques to bridge the domain gap and improve object localization in robotic applications. The project website is https://emergentsystemlabstudent.github.io/DomainBridgingNav/.
Progressive Knowledge Graph Completion
Authors: Jiayi Li, Ruilin Luo, Jiaqi Sun, Jing Xiao, Yujiu Yang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.09897
Pdf link: https://arxiv.org/pdf/2404.09897
Abstract Knowledge Graph Completion (KGC) has emerged as a promising solution to address the issue of incompleteness within Knowledge Graphs (KGs). Traditional KGC research primarily centers on triple classification and link prediction. Nevertheless, we contend that these tasks do not align well with real-world scenarios and merely serve as surrogate benchmarks. In this paper, we investigate three crucial processes relevant to real-world construction scenarios: (a) the verification process, which arises from the necessity and limitations of human verifiers; (b) the mining process, which identifies the most promising candidates for verification; and (c) the training process, which harnesses verified data for subsequent utilization; in order to achieve a transition toward more realistic challenges. By integrating these three processes, we introduce the Progressive Knowledge Graph Completion (PKGC) task, which simulates the gradual completion of KGs in real-world scenarios. Furthermore, to expedite PKGC processing, we propose two acceleration modules: Optimized Top-$k$ algorithm and Semantic Validity Filter. These modules significantly enhance the efficiency of the mining procedure. Our experiments demonstrate that performance in link prediction does not accurately reflect performance in PKGC. A more in-depth analysis reveals the key factors influencing the results and provides potential directions for future research.
Keyword: vision language

There is no result

Keyword: vision-language

AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning
Authors: Yuwei Tang, Zhenyi Lin, Qilong Wang, Pengfei Zhu, Qinghua Hu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08958
Pdf link: https://arxiv.org/pdf/2404.08958
Abstract Recently, pre-trained vision-language models (e.g., CLIP) have shown great potential in few-shot learning and attracted a lot of research interest. Although efforts have been made to improve few-shot ability of CLIP, key factors on the effectiveness of existing methods have not been well studied, limiting further exploration of CLIP's potential in few-shot learning. In this paper, we first introduce a unified formulation to analyze CLIP-based few-shot learning methods from a perspective of logit bias, which encourages us to learn an effective logit bias for further improving performance of CLIP-based few-shot learning methods. To this end, we disassemble three key components involved in computation of logit bias (i.e., logit features, logit predictor, and logit fusion) and empirically analyze the effect on performance of few-shot classification. Based on analysis of key components, this paper proposes a novel AMU-Tuning method to learn effective logit bias for CLIP-based few-shot classification. Specifically, our AMU-Tuning predicts logit bias by exploiting the appropriate $\underline{\textbf{A}}$uxiliary features, which are fed into an efficient feature-initialized linear classifier with $\underline{\textbf{M}}$ulti-branch training. Finally, an $\underline{\textbf{U}}$ncertainty-based fusion is developed to incorporate logit bias into CLIP for few-shot classification. The experiments are conducted on several widely used benchmarks, and the results show AMU-Tuning clearly outperforms its counterparts while achieving state-of-the-art performance of CLIP-based few-shot learning without bells and whistles.
Keyword: language-vision

There is no result

Keyword: phrase-grounding

There is no result

Keyword: phrase grounding

There is no result

Keyword: reference expression comprehension

There is no result

Keyword: chest

There is no result

Keyword: x-ray

There is no result

Keyword: clinical

TLDR at SemEval-2024 Task 2: T5-generated clinical-Language summaries for DeBERTa Report Analysis
Authors: Spandan Das, Vinay Samuel, Shahriar Noroozizadeh
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.09136
Pdf link: https://arxiv.org/pdf/2404.09136
Abstract This paper introduces novel methodologies for the Natural Language Inference for Clinical Trials (NLI4CT) task. We present TLDR (T5-generated clinical-Language summaries for DeBERTa Report Analysis) which incorporates T5-model generated premise summaries for improved entailment and contradiction analysis in clinical NLI tasks. This approach overcomes the challenges posed by small context windows and lengthy premises, leading to a substantial improvement in Macro F1 scores: a 0.184 increase over truncated premises. Our comprehensive experimental evaluation, including detailed error analysis and ablations, confirms the superiority of TLDR in achieving consistency and faithfulness in predictions against semantically altered inputs.
DKE-Research at SemEval-2024 Task 2: Incorporating Data Augmentation with Generative Models and Biomedical Knowledge to Enhance Inference Robustness
Authors: Yuqi Wang, Zeqiang Wang, Wei Wang, Qi Chen, Kaizhu Huang, Anh Nguyen, Suparna De
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.09206
Pdf link: https://arxiv.org/pdf/2404.09206
Abstract Safe and reliable natural language inference is critical for extracting insights from clinical trial reports but poses challenges due to biases in large pre-trained language models. This paper presents a novel data augmentation technique to improve model robustness for biomedical natural language inference in clinical trials. By generating synthetic examples through semantic perturbations and domain-specific vocabulary replacement and adding a new task for numerical and quantitative reasoning, we introduce greater diversity and reduce shortcut learning. Our approach, combined with multi-task learning and the DeBERTa architecture, achieved significant performance gains on the NLI4CT 2024 benchmark compared to the original language models. Ablation studies validate the contribution of each augmentation method in improving robustness. Our best-performing model ranked 12th in terms of faithfulness and 8th in terms of consistency, respectively, out of the 32 participants.
Keyword: biomedical

Optimal path for Biomedical Text Summarization Using Pointer GPT
Authors: Hyunkyung Han, Jaesik Choi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.08654
Pdf link: https://arxiv.org/pdf/2404.08654
Abstract Biomedical text summarization is a critical tool that enables clinicians to effectively ascertain patient status. Traditionally, text summarization has been accomplished with transformer models, which are capable of compressing long documents into brief summaries. However, transformer models are known to be among the most challenging natural language processing (NLP) tasks. Specifically, GPT models have a tendency to generate factual errors, lack context, and oversimplify words. To address these limitations, we replaced the attention mechanism in the GPT model with a pointer network. This modification was designed to preserve the core values of the original text during the summarization process. The effectiveness of the Pointer-GPT model was evaluated using the ROUGE score. The results demonstrated that Pointer-GPT outperformed the original GPT model. These findings suggest that pointer networks can be a valuable addition to EMR systems and can provide clinicians with more accurate and informative summaries of patient medical records. This research has the potential to usher in a new paradigm in EMR systems and to revolutionize the way that clinicians interact with patient medical records.
Unveiling LLM Evaluation Focused on Metrics: Challenges and Solutions
Authors: Taojun Hu, Xiao-Hua Zhou
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.09135
Pdf link: https://arxiv.org/pdf/2404.09135
Abstract Natural Language Processing (NLP) is witnessing a remarkable breakthrough driven by the success of Large Language Models (LLMs). LLMs have gained significant attention across academia and industry for their versatile applications in text generation, question answering, and text summarization. As the landscape of NLP evolves with an increasing number of domain-specific LLMs employing diverse techniques and trained on various corpus, evaluating performance of these models becomes paramount. To quantify the performance, it's crucial to have a comprehensive grasp of existing metrics. Among the evaluation, metrics which quantifying the performance of LLMs play a pivotal role. This paper offers a comprehensive exploration of LLM evaluation from a metrics perspective, providing insights into the selection and interpretation of metrics currently in use. Our main goal is to elucidate their mathematical formulations and statistical interpretations. We shed light on the application of these metrics using recent Biomedical LLMs. Additionally, we offer a succinct comparison of these metrics, aiding researchers in selecting appropriate metrics for diverse tasks. The overarching goal is to furnish researchers with a pragmatic guide for effective LLM evaluation and metric selection, thereby advancing the understanding and application of these large language models.
Keyword: radiology

There is no result

Keyword: radiography

There is no result

Keyword: medical

Introducing L2M3, A Multilingual Medical Large Language Model to Advance Health Equity in Low-Resource Regions
Authors: Agasthya Gangavarapu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08705
Pdf link: https://arxiv.org/pdf/2404.08705
Abstract Addressing the imminent shortfall of 10 million health workers by 2030, predominantly in Low- and Middle-Income Countries (LMICs), this paper introduces an innovative approach that harnesses the power of Large Language Models (LLMs) integrated with machine translation models. This solution is engineered to meet the unique needs of Community Health Workers (CHWs), overcoming language barriers, cultural sensitivities, and the limited availability of medical dialog datasets. I have crafted a model that not only boasts superior translation capabilities but also undergoes rigorous fine-tuning on open-source datasets to ensure medical accuracy and is equipped with comprehensive safety features to counteract the risks of misinformation. Featuring a modular design, this approach is specifically structured for swift adaptation across various linguistic and cultural contexts, utilizing open-source components to significantly reduce healthcare operational costs. This strategic innovation markedly improves the accessibility and quality of healthcare services by providing CHWs with contextually appropriate medical knowledge and diagnostic tools. This paper highlights the transformative impact of this context-aware LLM, underscoring its crucial role in addressing the global healthcare workforce deficit and propelling forward healthcare outcomes in LMICs.
MING-MOE: Enhancing Medical Multi-Task Learning in Large Language Models with Sparse Mixture of Low-Rank Adapter Experts
Authors: Yusheng Liao, Shuyang Jiang, Yu Wang, Yanfeng Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.09027
Pdf link: https://arxiv.org/pdf/2404.09027
Abstract Large language models like ChatGPT have shown substantial progress in natural language understanding and generation, proving valuable across various disciplines, including the medical field. Despite advancements, challenges persist due to the complexity and diversity inherent in medical tasks which often require multi-task learning capabilities. Previous approaches, although beneficial, fall short in real-world applications because they necessitate task-specific annotations at inference time, limiting broader generalization. This paper introduces MING-MOE, a novel Mixture-of-Expert~(MOE)-based medical large language model designed to manage diverse and complex medical tasks without requiring task-specific annotations, thus enhancing its usability across extensive datasets. MING-MOE employs a Mixture of Low-Rank Adaptation (MoLoRA) technique, allowing for efficient parameter usage by maintaining base model parameters static while adapting through a minimal set of trainable parameters. We demonstrate that MING-MOE achieves state-of-the-art (SOTA) performance on over 20 medical tasks, illustrating a significant improvement over existing models. This approach not only extends the capabilities of medical language models but also improves inference efficiency.
Keyword: active-learning

There is no result

Keyword: active learning

Provable Interactive Learning with Hindsight Instruction Feedback
Authors: Dipendra Misra, Aldo Pacchiano, Robert E. Schapire
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2404.09123
Pdf link: https://arxiv.org/pdf/2404.09123
Abstract We study interactive learning in a setting where the agent has to generate a response (e.g., an action or trajectory) given a context and an instruction. In contrast, to typical approaches that train the system using reward or expert supervision on response, we study learning with hindsight instruction where a teacher provides an instruction that is most suitable for the agent's generated response. This hindsight labeling of instruction is often easier to provide than providing expert supervision of the optimal response which may require expert knowledge or can be impractical to elicit. We initiate the theoretical analysis of interactive learning with hindsight labeling. We first provide a lower bound showing that in general, the regret of any algorithm must scale with the size of the agent's response space. We then study a specialized setting where the underlying instruction-response distribution can be decomposed as a low-rank matrix. We introduce an algorithm called LORIL for this setting and show that its regret scales as $\sqrt{T}$ where $T$ is the number of rounds and depends on the intrinsic rank but does not depend on the size of the agent's response space. We provide experiments in two domains showing that LORIL outperforms baselines even when the low-rank assumption is violated.
Keyword: chexpert

There is no result

Keyword: vision

Multimodal Cross-Document Event Coreference Resolution Using Linear Semantic Transfer and Mixed-Modality Ensembles
Authors: Abhijnan Nath, Huma Jamil, Shafiuddin Rehan Ahmed, George Baker, Rahul Ghosh, James H. Martin, Nathaniel Blanchard, Nikhil Krishnaswamy
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.08949
Pdf link: https://arxiv.org/pdf/2404.08949
Abstract Event coreference resolution (ECR) is the task of determining whether distinct mentions of events within a multi-document corpus are actually linked to the same underlying occurrence. Images of the events can help facilitate resolution when language is ambiguous. Here, we propose a multimodal cross-document event coreference resolution method that integrates visual and textual cues with a simple linear map between vision and language models. As existing ECR benchmark datasets rarely provide images for all event mentions, we augment the popular ECB+ dataset with event-centric images scraped from the internet and generated using image diffusion models. We establish three methods that incorporate images and text for coreference: 1) a standard fused model with finetuning, 2) a novel linear mapping method without finetuning and 3) an ensembling approach based on splitting mention pairs by semantic and discourse-level difficulty. We evaluate on 2 datasets: the augmented ECB+, and AIDA Phase 1. Our ensemble systems using cross-modal linear mapping establish an upper limit (91.9 CoNLL F1) on ECB+ ECR performance given the preprocessing assumptions used, and establish a novel baseline on AIDA Phase 1. Our results demonstrate the utility of multimodal information in ECR for certain challenging coreference problems, and highlight a need for more multimodal resources in the coreference resolution space.
Adapting Mental Health Prediction Tasks for Cross-lingual Learning via Meta-Training and In-context Learning with Large Language Model
Authors: Zita Lifelo, Huansheng Ning, Sahraoui Dhelim
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.09045
Pdf link: https://arxiv.org/pdf/2404.09045
Abstract Timely identification is essential for the efficient handling of mental health illnesses such as depression. However, the current research fails to adequately address the prediction of mental health conditions from social media data in low-resource African languages like Swahili. This study introduces two distinct approaches utilising model-agnostic meta-learning and leveraging large language models (LLMs) to address this gap. Experiments are conducted on three datasets translated to low-resource language and applied to four mental health tasks, which include stress, depression, depression severity and suicidal ideation prediction. we first apply a meta-learning model with self-supervision, which results in improved model initialisation for rapid adaptation and cross-lingual transfer. The results show that our meta-trained model performs significantly better than standard fine-tuning methods, outperforming the baseline fine-tuning in macro F1 score with 18\% and 0.8\% over XLM-R and mBERT. In parallel, we use LLMs' in-context learning capabilities to assess their performance accuracy across the Swahili mental health prediction tasks by analysing different cross-lingual prompting approaches. Our analysis showed that Swahili prompts performed better than cross-lingual prompts but less than English prompts. Our findings show that in-context learning can be achieved through cross-lingual transfer through carefully crafted prompt templates with examples and instructions.
MMCode: Evaluating Multi-Modal Code Large Language Models with Visually Rich Programming Problems
Authors: Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, Jing Ma
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2404.09486
Pdf link: https://arxiv.org/pdf/2404.09486
Abstract Programming often involves converting detailed and complex specifications into code, a process during which developers typically utilize visual aids to more effectively convey concepts. While recent developments in Large Multimodal Models have demonstrated remarkable abilities in visual reasoning and mathematical tasks, there is little work on investigating whether these models can effectively interpret visual elements for code generation. To this end, we present MMCode, the first multi-modal coding dataset for evaluating algorithmic problem-solving skills in visually rich contexts. MMCode contains 3,548 questions and 6,620 images collected from real-world programming challenges harvested from 10 code competition websites, presenting significant challenges due to the extreme demand for reasoning abilities. Our experiment results show that current state-of-the-art models struggle to solve these problems. The results highlight the lack of powerful vision-code models, and we hope MMCode can serve as an inspiration for future works in this domain. The data and code are publicly available at https://github.com/happylkx/MMCode.
EIVEN: Efficient Implicit Attribute Value Extraction using Multimodal LLM
Authors: Henry Peng Zou, Gavin Heqing Yu, Ziwei Fan, Dan Bu, Han Liu, Peng Dai, Dongmei Jia, Cornelia Caragea
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08886
Pdf link: https://arxiv.org/pdf/2404.08886
Abstract In e-commerce, accurately extracting product attribute values from multimodal data is crucial for improving user experience and operational efficiency of retailers. However, previous approaches to multimodal attribute value extraction often struggle with implicit attribute values embedded in images or text, rely heavily on extensive labeled data, and can easily confuse similar attribute values. To address these issues, we introduce EIVEN, a data- and parameter-efficient generative framework that pioneers the use of multimodal LLM for implicit attribute value extraction. EIVEN leverages the rich inherent knowledge of a pre-trained LLM and vision encoder to reduce reliance on labeled data. We also introduce a novel Learning-by-Comparison technique to reduce model confusion by enforcing attribute value comparison and difference identification. Additionally, we construct initial open-source datasets for multimodal implicit attribute value extraction. Our extensive experiments reveal that EIVEN significantly outperforms existing methods in extracting implicit attribute values while requiring less labeled data.
State Space Model for New-Generation Network Alternative to Transformers: A Survey
Authors: Xiao Wang, Shiao Wang, Yuhe Ding, Yuehang Li, Wentao Wu, Yao Rong, Weizhe Kong, Ju Huang, Shihao Li, Haoxiang Yang, Ziwen Wang, Bo Jiang, Chenglong Li, Yaowei Wang, Yonghong Tian, Jin Tang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2404.09516
Pdf link: https://arxiv.org/pdf/2404.09516
Abstract In the post-deep learning era, the Transformer architecture has demonstrated its powerful performance across pre-trained big models and various downstream tasks. However, the enormous computational demands of this architecture have deterred many researchers. To further reduce the complexity of attention models, numerous efforts have been made to design more efficient methods. Among them, the State Space Model (SSM), as a possible replacement for the self-attention based Transformer model, has drawn more and more attention in recent years. In this paper, we give the first comprehensive review of these works and also provide experimental comparisons and analysis to better demonstrate the features and advantages of SSM. Specifically, we first give a detailed description of principles to help the readers quickly capture the key ideas of SSM. After that, we dive into the reviews of existing SSMs and their various applications, including natural language processing, computer vision, graph, multi-modal and multi-media, point cloud/event stream, time series data, and other domains. In addition, we give statistical comparisons and analysis of these models and hope it helps the readers to understand the effectiveness of different structures on various tasks. Then, we propose possible research points in this direction to better promote the development of the theoretical model and application of SSM. More related works will be continuously updated on the following GitHub: https://github.com/Event-AHU/Mamba_State_Space_Model_Paper_List.
Keyword: visual

Embodied Agents for Efficient Exploration and Smart Scene Description
Authors: Roberto Bigazzi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2301.07150
Pdf link: https://arxiv.org/pdf/2301.07150
Abstract The development of embodied agents that can communicate with humans in natural language has gained increasing interest over the last years, as it facilitates the diffusion of robotic platforms in human-populated environments. As a step towards this objective, in this work, we tackle a setting for visual navigation in which an autonomous agent needs to explore and map an unseen indoor environment while portraying interesting scenes with natural language descriptions. To this end, we propose and evaluate an approach that combines recent advances in visual robotic exploration and image captioning on images generated through agent-environment interaction. Our approach can generate smart scene descriptions that maximize semantic knowledge of the environment and avoid repetitions. Further, such descriptions offer user-understandable insights into the robot's representation of the environment by highlighting the prominent objects and the correlation between them as encountered during the exploration. To quantitatively assess the performance of the proposed approach, we also devise a specific score that takes into account both exploration and description skills. The experiments carried out on both photorealistic simulated environments and real-world ones demonstrate that our approach can effectively describe the robot's point of view during exploration, improving the human-friendly interpretability of its observations.
JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models
Authors: Yingchaojie Feng, Zhizhang Chen, Zhining Kang, Sijia Wang, Minfeng Zhu, Wei Zhang, Wei Chen
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2404.08793
Pdf link: https://arxiv.org/pdf/2404.08793
Abstract The proliferation of large language models (LLMs) has underscored concerns regarding their security vulnerabilities, notably against jailbreak attacks, where adversaries design jailbreak prompts to circumvent safety mechanisms for potential misuse. Addressing these concerns necessitates a comprehensive analysis of jailbreak prompts to evaluate LLMs' defensive capabilities and identify potential weaknesses. However, the complexity of evaluating jailbreak performance and understanding prompt characteristics makes this analysis laborious. We collaborate with domain experts to characterize problems and propose an LLM-assisted framework to streamline the analysis process. It provides automatic jailbreak assessment to facilitate performance evaluation and support analysis of components and keywords in prompts. Based on the framework, we design JailbreakLens, a visual analysis system that enables users to explore the jailbreak performance against the target model, conduct multi-level analysis of prompt characteristics, and refine prompt instances to verify findings. Through a case study, technical evaluations, and expert interviews, we demonstrate our system's effectiveness in helping users evaluate model security and identify model weaknesses.
Harnessing GPT-4V(ision) for Insurance: A Preliminary Exploration
Authors: Chenwei Lin, Hanjia Lyu, Jiebo Luo, Xian Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.09690
Pdf link: https://arxiv.org/pdf/2404.09690
Abstract The emergence of Large Multimodal Models (LMMs) marks a significant milestone in the development of artificial intelligence. Insurance, as a vast and complex discipline, involves a wide variety of data forms in its operational processes, including text, images, and videos, thereby giving rise to diverse multimodal tasks. Despite this, there has been limited systematic exploration of multimodal tasks specific to insurance, nor a thorough investigation into how LMMs can address these challenges. In this paper, we explore GPT-4V's capabilities in the insurance domain. We categorize multimodal tasks by focusing primarily on visual aspects based on types of insurance (e.g., auto, household/commercial property, health, and agricultural insurance) and insurance stages (e.g., risk assessment, risk monitoring, and claims processing). Our experiment reveals that GPT-4V exhibits remarkable abilities in insurance-related tasks, demonstrating not only a robust understanding of multimodal content in the insurance domain but also a comprehensive knowledge of insurance scenarios. However, there are notable shortcomings: GPT-4V struggles with detailed risk rating and loss assessment, suffers from hallucination in image understanding, and shows variable support for different languages. Through this work, we aim to bridge the insurance domain with cutting-edge LMM technology, facilitate interdisciplinary exchange and development, and provide a foundation for the continued advancement and evolution of future research endeavors.
Keyword: visio-linguistic

There is no result

Keyword: cross-modal

There is no result

Keyword: modality

There is no result

Keyword: modalities

There is no result

Keyword: multi-modal

PMG : Personalized Multimodal Generation with Large Language Models
Authors: Xiaoteng Shen, Rui Zhang, Xiaoyan Zhao, Jieming Zhu, Xi Xiao
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.08677
Pdf link: https://arxiv.org/pdf/2404.08677
Abstract The emergence of large language models (LLMs) has revolutionized the capabilities of text comprehension and generation. Multi-modal generation attracts great attention from both the industry and academia, but there is little work on personalized generation, which has important applications such as recommender systems. This paper proposes the first method for personalized multimodal generation using LLMs, showcases its applications and validates its performance via an extensive experimental study on two datasets. The proposed method, Personalized Multimodal Generation (PMG for short) first converts user behaviors (e.g., clicks in recommender systems or conversations with a virtual assistant) into natural language to facilitate LLM understanding and extract user preference descriptions. Such user preferences are then fed into a generator, such as a multimodal LLM or diffusion model, to produce personalized content. To capture user preferences comprehensively and accurately, we propose to let the LLM output a combination of explicit keywords and implicit embeddings to represent user preferences. Then the combination of keywords and embeddings are used as prompts to condition the generator. We optimize a weighted sum of the accuracy and preference scores so that the generated content has a good balance between them. Compared to a baseline method without personalization, PMG has a significant improvement on personalization for up to 8% in terms of LPIPS while retaining the accuracy of generation.
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning
Authors: Quang Minh Dinh, Minh Khoi Ho, Anh Quan Dang, Hung Phong Tran
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.09275
Pdf link: https://arxiv.org/pdf/2404.09275
Abstract Traffic video description and analysis have received much attention recently due to the growing demand for efficient and reliable urban surveillance systems. Most existing methods only focus on locating traffic event segments, which severely lack descriptive details related to the behaviour and context of all the subjects of interest in the events. In this paper, we present TrafficVLM, a novel multi-modal dense video captioning model for vehicle ego camera view. TrafficVLM models traffic video events at different levels of analysis, both spatially and temporally, and generates long fine-grained descriptions for the vehicle and pedestrian at different phases of the event. We also propose a conditional component for TrafficVLM to control the generation outputs and a multi-task fine-tuning paradigm to enhance TrafficVLM's learning capability. Experiments show that TrafficVLM performs well on both vehicle and overhead camera views. Our solution achieved outstanding results in Track 2 of the AI City Challenge 2024, ranking us third in the challenge standings. Our code is publicly available at https://github.com/quangminhdinh/TrafficVLM.
Keyword: multimodal

MM-PhyQA: Multimodal Physics Question-Answering With Multi-Image CoT Prompting
Authors: Avinash Anand, Janak Kapuriya, Apoorv Singh, Jay Saraf, Naman Lal, Astha Verma, Rushali Gupta, Rajiv Shah
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.08704
Pdf link: https://arxiv.org/pdf/2404.08704
Abstract While Large Language Models (LLMs) can achieve human-level performance in various tasks, they continue to face challenges when it comes to effectively tackling multi-step physics reasoning tasks. To identify the shortcomings of existing models and facilitate further research in this area, we curated a novel dataset, MM-PhyQA, which comprises well-constructed, high schoollevel multimodal physics problems. By evaluating the performance of contemporary LLMs that are publicly available, both with and without the incorporation of multimodal elements in these problems, we aim to shed light on their capabilities. For generating answers for questions consisting of multimodal input (in this case, images and text) we employed Zero-shot prediction using GPT-4 and utilized LLaVA (LLaVA and LLaVA-1.5), the latter of which were fine-tuned on our dataset. For evaluating the performance of LLMs consisting solely of textual input, we tested the performance of the base and fine-tuned versions of the Mistral-7B and LLaMA2-7b models. We also showcased the performance of the novel Multi-Image Chain-of-Thought (MI-CoT) Prompting technique, which when used to train LLaVA-1.5 13b yielded the best results when tested on our dataset, with superior scores in most metrics and the highest accuracy of 71.65% on the test set.
On Speculative Decoding for Multimodal Large Language Models
Authors: Mukul Gagrani, Raghavv Goel, Wonseok Jeon, Junyoung Park, Mingu Lee, Christopher Lott
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08856
Pdf link: https://arxiv.org/pdf/2404.08856
Abstract Inference with Multimodal Large Language Models (MLLMs) is slow due to their large-language-model backbone which suffers from memory bandwidth bottleneck and generates tokens auto-regressively. In this paper, we explore the application of speculative decoding to enhance the inference efficiency of MLLMs, specifically the LLaVA 7B model. We show that a language-only model can serve as a good draft model for speculative decoding with LLaVA 7B, bypassing the need for image tokens and their associated processing components from the draft model. Our experiments across three different tasks show that speculative decoding can achieve a memory-bound speedup of up to 2.37$\times$ using a 115M parameter language model that we trained from scratch. Additionally, we introduce a compact LLaVA draft model incorporating an image adapter, which shows marginal performance gains in image captioning while maintaining comparable results in other tasks.
Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization
Authors: Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, Soujanya Poria
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2404.09956
Pdf link: https://arxiv.org/pdf/2404.09956
Abstract Generative multimodal content is increasingly prevalent in much of the content creation arena, as it has the potential to allow artists and media personnel to create pre-production mockups by quickly bringing their ideas to life. The generation of audio from text prompts is an important aspect of such processes in the music and film industry. Many of the recent diffusion-based text-to-audio models focus on training increasingly sophisticated diffusion models on a large set of datasets of prompt-audio pairs. These models do not explicitly focus on the presence of concepts or events and their temporal ordering in the output audio with respect to the input prompt. Our hypothesis is focusing on how these aspects of audio generation could improve audio generation performance in the presence of limited data. As such, in this work, using an existing text-to-audio model Tango, we synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from. The loser outputs, in theory, have some concepts from the prompt missing or in an incorrect order. We fine-tune the publicly available Tango text-to-audio model using diffusion-DPO (direct preference optimization) loss on our preference dataset and show that it leads to improved audio output over Tango and AudioLDM2, in terms of both automatic- and manual-evaluation metrics.
MMInA: Benchmarking Multihop Multimodal Internet Agents
Authors: Ziniu Zhang, Shulin Tian, Liangyu Chen, Ziwei Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.09992
Pdf link: https://arxiv.org/pdf/2404.09992
Abstract Autonomous embodied agents live on an Internet of multimedia websites. Can they hop around multimodal websites to complete complex user tasks? Existing benchmarks fail to assess them in a realistic, evolving environment for their embodiment across websites. To answer this question, we present MMInA, a multihop and multimodal benchmark to evaluate the embodied agents for compositional Internet tasks, with several appealing properties: 1) Evolving real-world multimodal websites. Our benchmark uniquely operates on evolving real-world websites, ensuring a high degree of realism and applicability to natural user tasks. Our data includes 1,050 human-written tasks covering various domains such as shopping and travel, with each task requiring the agent to autonomously extract multimodal information from web pages as observations; 2) Multihop web browsing. Our dataset features naturally compositional tasks that require information from or actions on multiple websites to solve, to assess long-range reasoning capabilities on web tasks; 3) Holistic evaluation. We propose a novel protocol for evaluating an agent's progress in completing multihop tasks. We experiment with both standalone (multimodal) language models and heuristic-based web agents. Extensive experiments demonstrate that while long-chain multihop web tasks are easy for humans, they remain challenging for state-of-the-art web agents. We identify that agents are more likely to fail on the early hops when solving tasks of more hops, which results in lower task success rates. To address this issue, we propose a simple memory augmentation approach replaying past action trajectories to reflect. Our method significantly improved both the single-hop and multihop web browsing abilities of agents. See our code and data at https://mmina.cliangyu.com

PanagiotisFytas / get-daily-arxiv-noti

35 New submissions for Tue, 16 Apr 24 #561

Keyword: alignment

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming

Analyzing the Impact of Data Selection and Fine-Tuning on Economic and Political Biases in LLMs

The Generation Gap:Exploring Age Bias in Large Language Models

Compass: Large Multilingual Language Model for South-east Asia

Bridging the Gap between Different Vocabularies for LLM Ensemble

Impact of Preference Noise on the Alignment Performance of Generative Language Models

Aligning LLMs for FL-free Program Repair

Learn Your Reference Model for Real Good Alignment

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Keyword: aligning

Is Your LLM Outdated? Benchmarking LLMs & Alignment Algorithms for Time-Sensitive Knowledge

Keyword: align

The Comparison of Translationese in Machine Translation and Human Transation in terms of Translation Relations

Real-world Instance-specific Image Goal Navigation for Service Robots: Bridging the Domain Gap with Contrastive Learning

Progressive Knowledge Graph Completion

Keyword: vision language

Keyword: vision-language

AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning

Keyword: language-vision

Keyword: phrase-grounding

Keyword: phrase grounding

Keyword: reference expression comprehension

Keyword: chest

Keyword: x-ray

Keyword: clinical

TLDR at SemEval-2024 Task 2: T5-generated clinical-Language summaries for DeBERTa Report Analysis

DKE-Research at SemEval-2024 Task 2: Incorporating Data Augmentation with Generative Models and Biomedical Knowledge to Enhance Inference Robustness

Keyword: biomedical

Optimal path for Biomedical Text Summarization Using Pointer GPT

Unveiling LLM Evaluation Focused on Metrics: Challenges and Solutions

Keyword: radiology

Keyword: radiography

Keyword: medical

Introducing L2M3, A Multilingual Medical Large Language Model to Advance Health Equity in Low-Resource Regions

MING-MOE: Enhancing Medical Multi-Task Learning in Large Language Models with Sparse Mixture of Low-Rank Adapter Experts

Keyword: active-learning

Keyword: active learning

Provable Interactive Learning with Hindsight Instruction Feedback

Keyword: chexpert

Keyword: vision

Multimodal Cross-Document Event Coreference Resolution Using Linear Semantic Transfer and Mixed-Modality Ensembles

Adapting Mental Health Prediction Tasks for Cross-lingual Learning via Meta-Training and In-context Learning with Large Language Model

MMCode: Evaluating Multi-Modal Code Large Language Models with Visually Rich Programming Problems

EIVEN: Efficient Implicit Attribute Value Extraction using Multimodal LLM

State Space Model for New-Generation Network Alternative to Transformers: A Survey

Keyword: visual

Embodied Agents for Efficient Exploration and Smart Scene Description

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models

Harnessing GPT-4V(ision) for Insurance: A Preliminary Exploration

Keyword: visio-linguistic

Keyword: cross-modal

Keyword: modality

Keyword: modalities

Keyword: multi-modal

PMG : Personalized Multimodal Generation with Large Language Models

TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning

Keyword: multimodal

MM-PhyQA: Multimodal Physics Question-Answering With Multi-Image CoT Prompting

On Speculative Decoding for Multimodal Large Language Models

Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

MMInA: Benchmarking Multihop Multimodal Internet Agents