14 New submissions for Thu, 11 Apr 24

Keyword: alignment

Less is More for Improving Automatic Evaluation of Factual Consistency

Authors: Tong Wang, Ninad Kulkarni, Yanjun Qi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.06579
Pdf link: https://arxiv.org/pdf/2404.06579
Abstract Assessing the factual consistency of automatically generated texts in relation to source context is crucial for developing reliable natural language generation applications. Recent literature proposes AlignScore which uses a unified alignment model to evaluate factual consistency and substantially outperforms previous methods across many benchmark tasks. In this paper, we take a closer look of datasets used in AlignScore and uncover an unexpected finding: utilizing a smaller number of data points can actually improve performance. We process the original AlignScore training dataset to remove noise, augment with robustness-enhanced samples, and utilize a subset comprising 10\% of the data to train an improved factual consistency evaluation model, we call LIM-RA (Less Is More for Robust AlignScore). LIM-RA demonstrates superior performance, consistently outperforming AlignScore and other strong baselines like ChatGPT across four benchmarks (two utilizing traditional natural language generation datasets and two focused on large language model outputs). Our experiments show that LIM-RA achieves the highest score on 24 of the 33 test datasets, while staying competitive on the rest, establishing the new state-of-the-art benchmarks.
MathVC: An LLM-Simulated Multi-Character Virtual Classroom for Mathematics Education
Authors: Murong Yue, Wijdane Mifdal, Yixuan Zhang, Jennifer Suh, Ziyu Yao
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2404.06711
Pdf link: https://arxiv.org/pdf/2404.06711
Abstract Mathematical modeling (MM) is considered a fundamental skill for students in STEM disciplines. Practicing the MM skill is often the most effective when students can engage in group discussion and collaborative problem-solving. However, due to unevenly distributed teachers and educational resources needed to monitor such group activities, students do not always receive equal opportunities for this practice. Excitingly, large language models (LLMs) have recently demonstrated strong capability in both modeling mathematical problems and simulating characters with different traits and properties. Drawing inspiration from the advancement of LLMs, in this work, we present MATHVC, the very first LLM-powered virtual classroom containing multiple LLM-simulated student characters, with whom a human student can practice their MM skill. To encourage each LLM character's behaviors to be aligned with their specified math-relevant properties (termed "characteristics alignment") and the overall conversational procedure to be close to an authentic student MM discussion (termed "conversational procedural alignment"), we proposed three innovations: integrating MM domain knowledge into the simulation, defining a symbolic schema as the ground for character simulation, and designing a meta planner at the platform level to drive the conversational procedure. Through experiments and ablation studies, we confirmed the effectiveness of our simulation approach and showed the promise for MATHVC to benefit real-life students in the future.
GraSAME: Injecting Token-Level Structural Information to Pretrained Language Models via Graph-guided Self-Attention Mechanism
Authors: Shuzhou Yuan, Michael Färber
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.06911
Pdf link: https://arxiv.org/pdf/2404.06911
Abstract Pretrained Language Models (PLMs) benefit from external knowledge stored in graph structures for various downstream tasks. However, bridging the modality gap between graph structures and text remains a significant challenge. Traditional methods like linearizing graphs for PLMs lose vital graph connectivity, whereas Graph Neural Networks (GNNs) require cumbersome processes for integration into PLMs. In this work, we propose a novel graph-guided self-attention mechanism, GraSAME. GraSAME seamlessly incorporates token-level structural information into PLMs without necessitating additional alignment or concatenation efforts. As an end-to-end, lightweight multimodal module, GraSAME follows a multi-task learning strategy and effectively bridges the gap between graph and textual modalities, facilitating dynamic interactions between GNNs and PLMs. Our experiments on the graph-to-text generation task demonstrate that GraSAME outperforms baseline models and achieves results comparable to state-of-the-art (SOTA) models on WebNLG datasets. Furthermore, compared to SOTA models, GraSAME eliminates the need for extra pre-training tasks to adjust graph inputs and reduces the number of trainable parameters by over 100 million.
UMBRAE: Unified Multimodal Decoding of Brain Signals
Authors: Weihao Xia, Raoul de Charette, Cengiz Öztireli, Jing-Hao Xue
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.07202
Pdf link: https://arxiv.org/pdf/2404.07202
Abstract We address prevailing challenges of the brain-powered research, departing from the observation that the literature hardly recover accurate spatial information and require subject-specific models. To address these challenges, we propose UMBRAE, a unified multimodal decoding of brain signals. First, to extract instance-level conceptual and spatial details from neural signals, we introduce an efficient universal brain encoder for multimodal-brain alignment and recover object descriptions at multiple levels of granularity from subsequent multimodal large language model (MLLM). Second, we introduce a cross-subject training strategy mapping subject-specific features to a common feature space. This allows a model to be trained on multiple subjects without extra resources, even yielding superior results compared to subject-specific models. Further, we demonstrate this supports weakly-supervised adaptation to new subjects, with only a fraction of the total training data. Experiments demonstrate that UMBRAE not only achieves superior results in the newly introduced tasks but also outperforms methods in well established tasks. To assess our method, we construct and share with the community a comprehensive brain understanding benchmark BrainHub. Our code and benchmark are available at https://weihaox.github.io/UMBRAE.
Keyword: aligning

Global Contrastive Training for Multimodal Electronic Health Records with Language Supervision
Authors: Yingbo Ma, Suraj Kolla, Zhenhong Hu, Dhruv Kaliraman, Victoria Nolan, Ziyuan Guan, Yuanfang Ren, Brooke Armfield, Tezcan Ozrazgat-Baslanti, Jeremy A. Balch, Tyler J. Loftus, Parisa Rashidi, Azra Bihorac, Benjamin Shickel
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.06723
Pdf link: https://arxiv.org/pdf/2404.06723
Abstract Modern electronic health records (EHRs) hold immense promise in tracking personalized patient health trajectories through sequential deep learning, owing to their extensive breadth, scale, and temporal granularity. Nonetheless, how to effectively leverage multiple modalities from EHRs poses significant challenges, given its complex characteristics such as high dimensionality, multimodality, sparsity, varied recording frequencies, and temporal irregularities. To this end, this paper introduces a novel multimodal contrastive learning framework, specifically focusing on medical time series and clinical notes. To tackle the challenge of sparsity and irregular time intervals in medical time series, the framework integrates temporal cross-attention transformers with a dynamic embedding and tokenization scheme for learning multimodal feature representations. To harness the interconnected relationships between medical time series and clinical notes, the framework equips a global contrastive loss, aligning a patient's multimodal feature representations with the corresponding discharge summaries. Since discharge summaries uniquely pertain to individual patients and represent a holistic view of the patient's hospital stay, machine learning models are led to learn discriminative multimodal features via global contrasting. Extensive experiments with a real-world EHR dataset demonstrated that our framework outperformed state-of-the-art approaches on the exemplar task of predicting the occurrence of nine postoperative complications for more than 120,000 major inpatient surgeries using multimodal data from UF health system split among three hospitals (UF Health Gainesville, UF Health Jacksonville, and UF Health Jacksonville-North).
Keyword: align

WordDecipher: Enhancing Digital Workspace Communication with Explainable AI for Non-native English Speakers
Authors: Yuexi Chen, Zhicheng Liu
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.07005
Pdf link: https://arxiv.org/pdf/2404.07005
Abstract Non-native English speakers (NNES) face challenges in digital workspace communication (e.g., emails, Slack messages), often inadvertently translating expressions from their native languages, which can lead to awkward or incorrect usage. Current AI-assisted writing tools are equipped with fluency enhancement and rewriting suggestions; however, NNES may struggle to grasp the subtleties among various expressions, making it challenging to choose the one that accurately reflects their intent. Such challenges are exacerbated in high-stake text-based communications, where the absence of non-verbal cues heightens the risk of misinterpretation. By leveraging the latest advancements in large language models (LLM) and word embeddings, we propose WordDecipher, an explainable AI-assisted writing tool to enhance digital workspace communication for NNES. WordDecipher not only identifies the perceived social intentions detected in users' writing, but also generates rewriting suggestions aligned with users' intended messages, either numerically or by inferring from users' writing in their native language. Then, WordDecipher provides an overview of nuances to help NNES make selections. Through a usage scenario, we demonstrate how WordDecipher can significantly enhance an NNES's ability to communicate her request, showcasing its potential to transform workspace communication for NNES.
Keyword: vision language

There is no result

Keyword: vision-language

There is no result

Keyword: language-vision

There is no result

Keyword: phrase-grounding

There is no result

Keyword: phrase grounding

There is no result

Keyword: reference expression comprehension

There is no result

Keyword: chest

There is no result

Keyword: x-ray

There is no result

Keyword: clinical

Onco-Retriever: Generative Classifier for Retrieval of EHR Records in Oncology
Authors: Shashi Kant Gupta, Aditya Basu, Bradley Taylor, Anai Kothari, Hrituraj Singh
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.06680
Pdf link: https://arxiv.org/pdf/2404.06680
Abstract Retrieving information from EHR systems is essential for answering specific questions about patient journeys and improving the delivery of clinical care. Despite this fact, most EHR systems still rely on keyword-based searches. With the advent of generative large language models (LLMs), retrieving information can lead to better search and summarization capabilities. Such retrievers can also feed Retrieval-augmented generation (RAG) pipelines to answer any query. However, the task of retrieving information from EHR real-world clinical data contained within EHR systems in order to solve several downstream use cases is challenging due to the difficulty in creating query-document support pairs. We provide a blueprint for creating such datasets in an affordable manner using large language models. Our method results in a retriever that is 30-50 F-1 points better than propriety counterparts such as Ada and Mistral for oncology data elements. We further compare our model, called Onco-Retriever, against fine-tuned PubMedBERT model as well. We conduct an extensive manual evaluation on real-world EHR data along with latency analysis of the different models and provide a path forward for healthcare organizations to build domain-specific retrievers.
Keyword: biomedical

There is no result

Keyword: radiology

There is no result

Keyword: radiography

There is no result

Keyword: medical

There is no result

Keyword: active-learning

There is no result

Keyword: active learning

There is no result

Keyword: chexpert

There is no result

Keyword: vision

CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge
Authors: Yu Ying Chiu, Liwei Jiang, Maria Antoniak, Chan Young Park, Shuyue Stella Li, Mehar Bhatia, Sahithya Ravi, Yulia Tsvetkov, Vered Shwartz, Yejin Choi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2404.06664
Pdf link: https://arxiv.org/pdf/2404.06664
Abstract Frontier large language models (LLMs) are developed by researchers and practitioners with skewed cultural backgrounds and on datasets with skewed sources. However, LLMs' (lack of) multicultural knowledge cannot be effectively assessed with current methods for developing benchmarks. Existing multicultural evaluations primarily rely on expensive and restricted human annotations or potentially outdated internet resources. Thus, they struggle to capture the intricacy, dynamics, and diversity of cultural norms. LLM-generated benchmarks are promising, yet risk propagating the same biases they are meant to measure. To synergize the creativity and expert cultural knowledge of human annotators and the scalability and standardizability of LLM-based automation, we introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build truly challenging evaluation dataset for assessing the multicultural knowledge of LLMs, while improving annotators' capabilities and experiences. Our study reveals that CulturalTeaming's various modes of AI assistance support annotators in creating cultural questions, that modern LLMs fail at, in a gamified manner. Importantly, the increased level of AI assistance (e.g., LLM-generated revision hints) empowers users to create more difficult questions with enhanced perceived creativity of themselves, shedding light on the promises of involving heavier AI assistance in modern evaluation dataset creation procedures. Through a series of 1-hour workshop sessions, we gather CULTURALBENCH-V0.1, a compact yet high-quality evaluation dataset with users' red-teaming attempts, that different families of modern LLMs perform with accuracy ranging from 37.7% to 72.2%, revealing a notable gap in LLMs' multicultural proficiency.
DiffusionDialog: A Diffusion Model for Diverse Dialog Generation with Latent Space
Authors: Jianxiang Xiang, Zhenhua Liu, Haodong Liu, Yin Bai, Jia Cheng, Wenliang Chen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.06760
Pdf link: https://arxiv.org/pdf/2404.06760
Abstract In real-life conversations, the content is diverse, and there exists the one-to-many problem that requires diverse generation. Previous studies attempted to introduce discrete or Gaussian-based continuous latent variables to address the one-to-many problem, but the diversity is limited. Recently, diffusion models have made breakthroughs in computer vision, and some attempts have been made in natural language processing. In this paper, we propose DiffusionDialog, a novel approach to enhance the diversity of dialogue generation with the help of diffusion model. In our approach, we introduce continuous latent variables into the diffusion model. The problem of using latent variables in the dialog task is how to build both an effective prior of the latent space and an inferring process to obtain the proper latent given the context. By combining the encoder and latent-based diffusion model, we encode the response's latent representation in a continuous space as the prior, instead of fixed Gaussian distribution or simply discrete ones. We then infer the latent by denoising step by step with the diffusion model. The experimental results show that our model greatly enhances the diversity of dialog responses while maintaining coherence. Furthermore, in further analysis, we find that our diffusion model achieves high inference efficiency, which is the main challenge of applying diffusion models in natural language processing.
GoEX: Perspectives and Designs Towards a Runtime for Autonomous LLM Applications
Authors: Shishir G. Patil, Tianjun Zhang, Vivian Fang, Noppapon C., Roy Huang, Aaron Hao, Martin Casado, Joseph E. Gonzalez, Raluca Ada Popa, Ion Stoica
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.06921
Pdf link: https://arxiv.org/pdf/2404.06921
Abstract Large Language Models (LLMs) are evolving beyond their classical role of providing information within dialogue systems to actively engaging with tools and performing actions on real-world applications and services. Today, humans verify the correctness and appropriateness of the LLM-generated outputs (e.g., code, functions, or actions) before putting them into real-world execution. This poses significant challenges as code comprehension is well known to be notoriously difficult. In this paper, we study how humans can efficiently collaborate with, delegate to, and supervise autonomous LLMs in the future. We argue that in many cases, "post-facto validation" - verifying the correctness of a proposed action after seeing the output - is much easier than the aforementioned "pre-facto validation" setting. The core concept behind enabling a post-facto validation system is the integration of an intuitive undo feature, and establishing a damage confinement for the LLM-generated actions as effective strategies to mitigate the associated risks. Using this, a human can now either revert the effect of an LLM-generated output or be confident that the potential risk is bounded. We believe this is critical to unlock the potential for LLM agents to interact with applications and services with limited (post-facto) human involvement. We describe the design and implementation of our open-source runtime for executing LLM actions, Gorilla Execution Engine (GoEX), and present open research questions towards realizing the goal of LLMs and applications interacting with each other with minimal human supervision. We release GoEX at https://github.com/ShishirPatil/gorilla/.
From Model-centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-based Applications
Authors: Yongqiang Ma, Lizhi Qin, Jiawei Liu, Yangyang Kang, Yue Zhang, Wei Lu, Xiaozhong Liu, Qikai Cheng
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2404.07108
Pdf link: https://arxiv.org/pdf/2404.07108
Abstract Evaluating large language models (LLMs) is fundamental, particularly in the context of practical applications. Conventional evaluation methods, typically designed primarily for LLM development, yield numerical scores that ignore the user experience. Therefore, our study shifts the focus from model-centered to human-centered evaluation in the context of AI-powered writing assistance applications. Our proposed metric, termed Revision Distance,'' utilizes LLMs to suggest revision edits that mimic the human writing process. It is determined by counting the revision edits generated by LLMs. Benefiting from the generated revision edit details, our metric can provide a self-explained text evaluation result in a human-understandable manner beyond the context-independent score. Our results show that for the easy-writing task,Revision Distance'' is consistent with established metrics (ROUGE, Bert-score, and GPT-score), but offers more insightful, detailed feedback and better distinguishes between texts. Moreover, in the context of challenging academic writing tasks, our metric still delivers reliable evaluations where other metrics tend to struggle. Furthermore, our metric also holds significant potential for scenarios lacking reference texts.
Keyword: visual

Towards Robustness of Text-to-Visualization Translation against Lexical and Phrasal Variability
Authors: Jinwei Lu, Yuanfeng Song, Haodi Zhang, Chen Zhang, Raymond Chi-Wing Wong
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.07135
Pdf link: https://arxiv.org/pdf/2404.07135
Abstract Text-to-Vis is an emerging task in the natural language processing (NLP) area that aims to automatically generate data visualizations from natural language questions (NLQs). Despite their progress, existing text-to-vis models often heavily rely on lexical matching between words in the questions and tokens in data schemas. This overreliance on lexical matching may lead to a diminished level of model robustness against input variations. In this study, we thoroughly examine the robustness of current text-to-vis models, an area that has not previously been explored. In particular, we construct the first robustness dataset nvBench-Rob, which contains diverse lexical and phrasal variations based on the original text-to-vis benchmark nvBench. Then, we found that the performance of existing text-to-vis models on this new dataset dramatically drops, implying that these methods exhibit inadequate robustness overall. Finally, we propose a novel framework based on Retrieval-Augmented Generation (RAG) technique, named GRED, specifically designed to address input perturbations in these two variants. The framework consists of three parts: NLQ-Retrieval Generator, Visualization Query-Retrieval Retuner and Annotation-based Debugger, which are used to tackle the challenges posed by natural language variants, programming style differences and data schema variants, respectively. Extensive experimental evaluations show that, compared to the state-of-the-art model RGVisNet in the Text-to-Vis field, RGDR performs better in terms of model robustness, with a 32% increase in accuracy on the proposed nvBench-Rob dataset.
SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models
Authors: Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, Wenyuan Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2404.06666
Pdf link: https://arxiv.org/pdf/2404.06666
Abstract Text-to-image (T2I) models, such as Stable Diffusion, have exhibited remarkable performance in generating high-quality images from text descriptions in recent years. However, text-to-image models may be tricked into generating not-safe-for-work (NSFW) content, particularly in sexual scenarios. Existing countermeasures mostly focus on filtering inappropriate inputs and outputs, or suppressing improper text embeddings, which can block explicit NSFW-related content (e.g., naked or sexy) but may still be vulnerable to adversarial prompts inputs that appear innocent but are ill-intended. In this paper, we present SafeGen, a framework to mitigate unsafe content generation by text-to-image models in a text-agnostic manner. The key idea is to eliminate unsafe visual representations from the model regardless of the text input. In this way, the text-to-image model is resistant to adversarial prompts since unsafe visual representations are obstructed from within. Extensive experiments conducted on four datasets demonstrate SafeGen's effectiveness in mitigating unsafe content generation while preserving the high-fidelity of benign images. SafeGen outperforms eight state-of-the-art baseline methods and achieves 99.1% sexual content removal performance. Furthermore, our constructed benchmark of adversarial prompts provides a basis for future development and evaluation of anti-NSFW-generation methods.
Keyword: visio-linguistic

There is no result

Keyword: cross-modal

There is no result

Keyword: modality

There is no result

Keyword: modalities

There is no result

Keyword: multi-modal

Leveraging Interesting Facts to Enhance User Engagement with Conversational Interfaces
Authors: Nikhita Vedula, Giuseppe Castellucci, Eugene Agichtein, Oleg Rokhlenko, Shervin Malmasi
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.06659
Pdf link: https://arxiv.org/pdf/2404.06659
Abstract Conversational Task Assistants (CTAs) guide users in performing a multitude of activities, such as making recipes. However, ensuring that interactions remain engaging, interesting, and enjoyable for CTA users is not trivial, especially for time-consuming or challenging tasks. Grounded in psychological theories of human interest, we propose to engage users with contextual and interesting statements or facts during interactions with a multi-modal CTA, to reduce fatigue and task abandonment before a task is complete. To operationalize this idea, we train a high-performing classifier (82% F1-score) to automatically identify relevant and interesting facts for users. We use it to create an annotated dataset of task-specific interesting facts for the domain of cooking. Finally, we design and validate a dialogue policy to incorporate the identified relevant and interesting facts into a conversation, to improve user engagement and task completion. Live testing on a leading multi-modal voice assistant shows that 66% of the presented facts were received positively, leading to a 40% gain in the user satisfaction rating, and a 37% increase in conversation length. These findings emphasize that strategically incorporating interesting facts into the CTA experience can promote real-world user participation for guided task interactions.
Keyword: multimodal

There is no result

PanagiotisFytas / get-daily-arxiv-noti

14 New submissions for Thu, 11 Apr 24 #555

Keyword: alignment

Less is More for Improving Automatic Evaluation of Factual Consistency

MathVC: An LLM-Simulated Multi-Character Virtual Classroom for Mathematics Education

GraSAME: Injecting Token-Level Structural Information to Pretrained Language Models via Graph-guided Self-Attention Mechanism

UMBRAE: Unified Multimodal Decoding of Brain Signals

Keyword: aligning

Global Contrastive Training for Multimodal Electronic Health Records with Language Supervision

Keyword: align

WordDecipher: Enhancing Digital Workspace Communication with Explainable AI for Non-native English Speakers

Keyword: vision language

Keyword: vision-language

Keyword: language-vision

Keyword: phrase-grounding

Keyword: phrase grounding

Keyword: reference expression comprehension

Keyword: chest

Keyword: x-ray

Keyword: clinical

Onco-Retriever: Generative Classifier for Retrieval of EHR Records in Oncology

Keyword: biomedical

Keyword: radiology

Keyword: radiography

Keyword: medical

Keyword: active-learning

Keyword: active learning

Keyword: chexpert

Keyword: vision

CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge

DiffusionDialog: A Diffusion Model for Diverse Dialog Generation with Latent Space

GoEX: Perspectives and Designs Towards a Runtime for Autonomous LLM Applications

From Model-centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-based Applications

Keyword: visual

Towards Robustness of Text-to-Visualization Translation against Lexical and Phrasal Variability

SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models

Keyword: visio-linguistic

Keyword: cross-modal

Keyword: modality

Keyword: modalities

Keyword: multi-modal

Leveraging Interesting Facts to Enhance User Engagement with Conversational Interfaces

Keyword: multimodal