15 New submissions for Mon, 22 Jan 24

Keyword: alignment

Top in Chinese Data Processing: English Code Models

Authors: Linghan Zheng, Hui Liu, Xiaojun Lin, Jiayuan Dong, Yue Sheng, Gang Shi, Zhiwei Liu, Hongwei Chen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.10286
Pdf link: https://arxiv.org/pdf/2401.10286
Abstract While the alignment between tasks and training corpora is a fundamental consensus in the application of language models, our series of experiments and the metrics we designed reveal that code-based Large Language Models (LLMs) significantly outperform models trained on data that is closely matched to the tasks in non-coding Chinese tasks. Moreover, in tasks high sensitivity to Chinese hallucinations, models exhibiting fewer linguistic features of the Chinese language achieve better performance. Our experimental results can be easily replicated in Chinese data processing tasks, such as preparing data for Retrieval-Augmented Generation (RAG), by simply replacing the base model with a code-based model. Additionally, our research offers a distinct perspective for discussion on the philosophical "Chinese Room" thought experiment.
Bridging Cultural Nuances in Dialogue Agents through Cultural Value Surveys
Authors: Yong Cao, Min Chen, Daniel Hershcovich
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.10352
Pdf link: https://arxiv.org/pdf/2401.10352
Abstract The cultural landscape of interactions with dialogue agents is a compelling yet relatively unexplored territory. It's clear that various sociocultural aspects -- from communication styles and beliefs to shared metaphors and knowledge -- profoundly impact these interactions. To delve deeper into this dynamic, we introduce cuDialog, a first-of-its-kind benchmark for dialogue generation with a cultural lens. We also develop baseline models capable of extracting cultural attributes from dialogue exchanges, with the goal of enhancing the predictive accuracy and quality of dialogue agents. To effectively co-learn cultural understanding and multi-turn dialogue predictions, we propose to incorporate cultural dimensions with dialogue encoding features. Our experimental findings highlight that incorporating cultural value surveys boosts alignment with references and cultural markers, demonstrating its considerable influence on personalization and dialogue quality. To facilitate further exploration in this exciting domain, we publish our benchmark publicly accessible at https://github.com/yongcaoplus/cuDialog.
Learning High-Quality and General-Purpose Phrase Representations
Authors: Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.10407
Pdf link: https://arxiv.org/pdf/2401.10407
Abstract Phrase representations play an important role in data science and natural language processing, benefiting various tasks like Entity Alignment, Record Linkage, Fuzzy Joins, and Paraphrase Classification. The current state-of-the-art method involves fine-tuning pre-trained language models for phrasal embeddings using contrastive learning. However, we have identified areas for improvement. First, these pre-trained models tend to be unnecessarily complex and require to be pre-trained on a corpus with context sentences. Second, leveraging the phrase type and morphology gives phrase representations that are both more precise and more flexible. We propose an improved framework to learn phrase representations in a context-free fashion. The framework employs phrase type classification as an auxiliary task and incorporates character-level information more effectively into the phrase representation. Furthermore, we design three granularities of data augmentation to increase the diversity of training samples. Our experiments across a wide range of tasks show that our approach generates superior phrase embeddings compared to previous methods while requiring a smaller model size. The code is available at \faGithub~ \url{https://github.com/tigerchen52/PEARL} \end{abstract}
PHOENIX: Open-Source Language Adaption for Direct Preference Optimization
Authors: Matthias Uhlig, Sigurd Schacht, Sudarshan Kamath Barkur
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.10580
Pdf link: https://arxiv.org/pdf/2401.10580
Abstract Large language models have gained immense importance in recent years and have demonstrated outstanding results in solving various tasks. However, despite these achievements, many questions remain unanswered in the context of large language models. Besides the optimal use of the models for inference and the alignment of the results to the desired specifications, the transfer of models to other languages is still an underdeveloped area of research. The recent publication of models such as Llama-2 and Zephyr has provided new insights into architectural improvements and the use of human feedback. However, insights into adapting these techniques to other languages remain scarce. In this paper, we build on latest improvements and apply the Direct Preference Optimization(DPO) approach to the German language. The model is available at https://huggingface.co/DRXD1000/Phoenix.
Mitigating Hallucinations of Large Language Models via Knowledge Consistent Alignment
Authors: Fanqi Wan, Xinting Huang, Leyang Cui, Xiaojun Quan, Wei Bi, Shuming Shi
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.10768
Pdf link: https://arxiv.org/pdf/2401.10768
Abstract While Large Language Models (LLMs) have proven to be exceptional on a variety of tasks after alignment, they may still produce responses that contradict the context or world knowledge confidently, a phenomenon known as ``hallucination''. In this paper, we demonstrate that reducing the inconsistency between the external knowledge encapsulated in the training data and the intrinsic knowledge inherited in the pretraining corpus could mitigate hallucination in alignment. Specifically, we introduce a novel knowledge consistent alignment (KCA) approach, which involves automatically formulating examinations based on external knowledge for accessing the comprehension of LLMs. For data encompassing knowledge inconsistency, KCA implements several simple yet efficient strategies for processing. We illustrate the superior performance of the proposed KCA approach in mitigating hallucinations across six benchmarks using LLMs of different backbones and scales. Furthermore, we confirm the correlation between knowledge inconsistency and hallucination, signifying the effectiveness of reducing knowledge inconsistency in alleviating hallucinations. Our code, model weights, and data are public at \url{https://github.com/fanqiwan/KCA}.
Keyword: aligning

There is no result

Keyword: align

There is no result

Keyword: vision language

There is no result

Keyword: vision-language

There is no result

Keyword: language-vision

There is no result

Keyword: phrase-grounding

There is no result

Keyword: phrase grounding

There is no result

Keyword: reference expression comprehension

There is no result

Keyword: chest

There is no result

Keyword: x-ray

There is no result

Keyword: clinical

There is no result

Keyword: biomedical

Name Tagging Under Domain Shift via Metric Learning for Life Sciences
Authors: Hongyi Liu, Qingyun Wang, Payam Karisani, Heng Ji
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.10472
Pdf link: https://arxiv.org/pdf/2401.10472
Abstract Name tagging is a key component of Information Extraction (IE), particularly in scientific domains such as biomedicine and chemistry, where large language models (LLMs), e.g., ChatGPT, fall short. We investigate the applicability of transfer learning for enhancing a name tagging model trained in the biomedical domain (the source domain) to be used in the chemical domain (the target domain). A common practice for training such a model in a few-shot learning setting is to pretrain the model on the labeled source data, and then, to finetune it on a hand-full of labeled target examples. In our experiments we observed that such a model is prone to mis-labeling the source entities, which can often appear in the text, as the target entities. To alleviate this problem, we propose a model to transfer the knowledge from the source domain to the target domain, however, at the same time, to project the source entities and target entities into separate regions of the feature space. This diminishes the risk of mis-labeling the source entities as the target entities. Our model consists of two stages: 1) entity grouping in the source domain, which incorporates knowledge from annotated events to establish relations between entities, and 2) entity discrimination in the target domain, which relies on pseudo labeling and contrastive learning to enhance discrimination between the entities in the two domains. We carry out our extensive experiments across three source and three target datasets, and demonstrate that our method outperforms the baselines, in some scenarios by 5\% absolute value.
Keyword: radiology

There is no result

Keyword: radiography

There is no result

Keyword: medical

Advancements in eHealth Data Analytics through Natural Language Processing and Deep Learning
Authors: Elena-Simona Apostol, Ciprian-Octavian Truică
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.10850
Pdf link: https://arxiv.org/pdf/2401.10850
Abstract The healthcare environment is commonly referred to as "information-rich" but also "knowledge poor". Healthcare systems collect huge amounts of data from various sources: lab reports, medical letters, logs of medical tools or programs, medical prescriptions, etc. These massive sets of data can provide great knowledge and information that can improve the medical services, and overall the healthcare domain, such as disease prediction by analyzing the patient's symptoms or disease prevention, by facilitating the discovery of behavioral factors for diseases. Unfortunately, only a relatively small volume of the textual eHealth data is processed and interpreted, an important factor being the difficulty in efficiently performing Big Data operations. In the medical field, detecting domain-specific multi-word terms is a crucial task as they can define an entire concept with a few words. A term can be defined as a linguistic structure or a concept, and it is composed of one or more words with a specific meaning to a domain. All the terms of a domain create its terminology. This chapter offers a critical study of the current, most performant solutions for analyzing unstructured (image and textual) eHealth data. This study also provides a comparison of the current Natural Language Processing and Deep Learning techniques in the eHealth context. Finally, we examine and discuss some of the current issues, and we define a set of research directions in this area.
Keyword: active-learning

There is no result

Keyword: active learning

There is no result

Keyword: chexpert

There is no result

Keyword: vision

Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition
Authors: Yong Wang, Cheng Lu, Hailun Lian, Yan Zhao, Björn Schuller, Yuan Zong, Wenming Zheng
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.10536
Pdf link: https://arxiv.org/pdf/2401.10536
Abstract Swin-Transformer has demonstrated remarkable success in computer vision by leveraging its hierarchical feature representation based on Transformer. In speech signals, emotional information is distributed across different scales of speech features, e.\,g., word, phrase, and utterance. Drawing above inspiration, this paper presents a hierarchical speech Transformer with shifted windows to aggregate multi-scale emotion features for speech emotion recognition (SER), called Speech Swin-Transformer. Specifically, we first divide the speech spectrogram into segment-level patches in the time domain, composed of multiple frame patches. These segment-level patches are then encoded using a stack of Swin blocks, in which a local window Transformer is utilized to explore local inter-frame emotional information across frame patches of each segment patch. After that, we also design a shifted window Transformer to compensate for patch correlations near the boundaries of segment patches. Finally, we employ a patch merging operation to aggregate segment-level emotional features for hierarchical speech representation by expanding the receptive field of Transformer from frame-level to segment-level. Experimental results demonstrate that our proposed Speech Swin-Transformer outperforms the state-of-the-art methods.
LangBridge: Multilingual Reasoning Without Multilingual Supervision
Authors: Dongkeun Yoon, Joel Jang, Sungdong Kim, Seungone Kim, Sheikh Shafayat, Minjoon Seo
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.10695
Pdf link: https://arxiv.org/pdf/2401.10695
Abstract We introduce LangBridge, a zero-shot approach to adapt language models for multilingual reasoning tasks without multilingual supervision. LangBridge operates by bridging two models, each specialized in different aspects: (1) one specialized in understanding multiple languages (e.g., mT5 encoder) and (2) one specialized in reasoning (e.g., Orca 2). LangBridge connects the two models by introducing minimal trainable parameters between them. Despite utilizing only English data for training, LangBridge considerably enhances the performance of language models on low-resource languages across mathematical reasoning, coding, and logical reasoning. Our analysis suggests that the efficacy of LangBridge stems from the language-agnostic characteristics of multilingual representations. We publicly release our code and models.
Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering
Authors: Haibo Wang, Chenghang Lai, Yixuan Sun, Weifeng Ge
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.10711
Pdf link: https://arxiv.org/pdf/2401.10711
Abstract Video Question Answering (VideoQA) aims to answer natural language questions based on the information observed in videos. Despite the recent success of Large Multimodal Models (LMMs) in image-language understanding and reasoning, they deal with VideoQA insufficiently by simply taking uniformly sampled frames as visual inputs, which ignores question-relevant visual clues. Moreover, there are no human annotations for question-critical timestamps in existing VideoQA datasets. In light of this, we propose a novel weakly supervised framework to enforce the LMMs to reason out the answers with question-critical moments as visual inputs. Specifically, we fuse the question and answer pairs as event descriptions to find multiple keyframes as target moments, which will be pseudo-labels. With these pseudo-labels as additionally weak supervision, we devise a lightweight Gaussian-based Contrastive Grounding (GCG) module. GCG learns multiple Gaussian functions to characterize the temporal structure of the video, and sample question-critical frames as positive moments to be the visual inputs of LMMs. Extensive experiments on several VideoQA benchmarks verify the effectiveness of our framework, and we achieve substantial improvements compared to previous state-of-the-art methods.
Multimodal Sentiment Analysis with Missing Modality: A Knowledge-Transfer Approach
Authors: Weide Liu, Huijing Zhan, Hao Chen, Fengmao Lv
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2401.10747
Pdf link: https://arxiv.org/pdf/2401.10747
Abstract Multimodal sentiment analysis aims to identify the emotions expressed by individuals through visual, language, and acoustic cues. However, most of the existing research efforts assume that all modalities are available during both training and testing, making their algorithms susceptible to the missing modality scenario. In this paper, we propose a novel knowledge-transfer network to translate between different modalities to reconstruct the missing audio modalities. Moreover, we develop a cross-modality attention mechanism to retain the maximal information of the reconstructed and observed modalities for sentiment prediction. Extensive experiments on three publicly available datasets demonstrate significant improvements over baselines and achieve comparable results to the previous methods with complete multi-modality supervision.
Keyword: visual

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
Authors: Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, Furong Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.10529
Pdf link: https://arxiv.org/pdf/2401.10529
Abstract Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this challenge, this paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities. Mementos features 4,761 diverse image sequences with varying lengths. We also employ a GPT-4 assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects and their corresponding behaviors. Our quantitative analysis and case studies identify three key factors impacting MLLMs' sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of cooccurring behaviors, and the compounding impact of behavioral hallucinations. Our dataset is available at https://github.com/umd-huang-lab/Mementos.
Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge
Authors: Haibi Wang, Weifeng Ge
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.10712
Pdf link: https://arxiv.org/pdf/2401.10712
Abstract With the breakthrough of multi-modal large language models, answering complex visual questions that demand advanced reasoning abilities and world knowledge has become a much more important testbed for developing AI models than ever. However, equipping AI models with robust cross-modality reasoning ability remains challenging since the cognition scheme of humans has not been understood systematically. In this paper, we believe that if we can collect visual clues in the given image as much as possible, we will recognize the image more accurately, understand the question better, recall relevant knowledge more easily, and finally reason out the answer. We discover these rich visual clues by mining question-answer pairs in images and sending them into multi-modal large language models as prompts. We call the proposed method Q&A Prompts. Specifically, we first use the image-answer pairs and the corresponding questions in the training set as inputs and outputs to train a visual question generation model. Then, we use an image tagging model to identify various instances and send packaged image-tag pairs into the visual question generation model to generate relevant questions with the extracted image tags as answers. Finally, we encode these generated question-answer pairs as prompts with a visual-aware prompting module and send them into pre-trained multi-modal large language models to reason out the final answers. Experimental results show that, compared with state-of-the-art methods, our Q&A Prompts achieves substantial improvements on the challenging visual question answering datasets requiring reasoning over diverse world knowledge, such as OK-VQA and A-OKVQA.
Keyword: visio-linguistic

There is no result

Keyword: cross-modal

Large Language Models are Efficient Learners of Noise-Robust Speech Recognition
Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Chao Zhang, Pin-Yu Chen, EnSiong Chng
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2401.10446
Pdf link: https://arxiv.org/pdf/2401.10446
Abstract Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which leverages the rich linguistic knowledge and powerful reasoning ability of LLMs to improve recognition results. The latest work proposes a GER benchmark with HyPoradise dataset to learn the mapping from ASR N-best hypotheses to ground-truth transcription by efficient LLM finetuning, which shows great effectiveness but lacks specificity on noise-robust ASR. In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER just like what robust ASR do}, where one solution is introducing noise information as a conditioner into LLM. However, directly incorporating noise embeddings from audio encoder could harm the LLM tuning due to cross-modality gap. To this end, we propose to extract a language-space noise embedding from the N-best list to represent the noise conditions of source speech, which can promote the denoising process in GER. Furthermore, in order to enhance its representation ability of audio noise, we design a knowledge distillation (KD) approach via mutual information estimation to distill the real noise information in audio embeddings to our language embedding. Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate while with limited training data. Analysis shows that our language-space noise embedding can well represent the noise conditions of source speech, under which off-the-shelf LLMs show strong ability of language-space denoising.
Keyword: modality

A systematic review of geospatial location embedding approaches in large language models: A path to spatial AI systems
Authors: Sean Tucker
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.10279
Pdf link: https://arxiv.org/pdf/2401.10279
Abstract Geospatial Location Embedding (GLE) helps a Large Language Model (LLM) assimilate and analyze spatial data. GLE emergence in Geospatial Artificial Intelligence (GeoAI) is precipitated by the need for deeper geospatial awareness in our complex contemporary spaces and the success of LLMs in extracting deep meaning in Generative AI. We searched Google Scholar, Science Direct, and arXiv for papers on geospatial location embedding and LLM and reviewed articles focused on gaining deeper spatial "knowing" through LLMs. We screened 304 titles, 30 abstracts, and 18 full-text papers that reveal four GLE themes - Entity Location Embedding (ELE), Document Location Embedding (DLE), Sequence Location Embedding (SLE), and Token Location Embedding (TLE). Synthesis is tabular and narrative, including a dialogic conversation between "Space" and "LLM." Though GLEs aid spatial understanding by superimposing spatial data, they emphasize the need to advance in the intricacies of spatial modalities and generalized reasoning. GLEs signal the need for a Spatial Foundation/Language Model (SLM) that embeds spatial knowing within the model architecture. The SLM framework advances Spatial Artificial Intelligence Systems (SPAIS), establishing a Spatial Vector Space (SVS) that maps to physical space. The resulting spatially imbued Language Model is unique. It simultaneously represents actual space and an AI-capable space, paving the way for AI native geo storage, analysis, and multi-modality as the basis for Spatial Artificial Intelligence Systems (SPAIS).
Keyword: modalities

There is no result

Keyword: multi-modal

There is no result

Keyword: multimodal

There is no result

PanagiotisFytas / get-daily-arxiv-noti

15 New submissions for Mon, 22 Jan 24 #443

Keyword: alignment

Top in Chinese Data Processing: English Code Models

Bridging Cultural Nuances in Dialogue Agents through Cultural Value Surveys

Learning High-Quality and General-Purpose Phrase Representations

PHOENIX: Open-Source Language Adaption for Direct Preference Optimization

Mitigating Hallucinations of Large Language Models via Knowledge Consistent Alignment

Keyword: aligning

Keyword: align

Keyword: vision language

Keyword: vision-language

Keyword: language-vision

Keyword: phrase-grounding

Keyword: phrase grounding

Keyword: reference expression comprehension

Keyword: chest

Keyword: x-ray

Keyword: clinical

Keyword: biomedical

Name Tagging Under Domain Shift via Metric Learning for Life Sciences

Keyword: radiology

Keyword: radiography

Keyword: medical

Advancements in eHealth Data Analytics through Natural Language Processing and Deep Learning

Keyword: active-learning

Keyword: active learning

Keyword: chexpert

Keyword: vision

Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition

LangBridge: Multilingual Reasoning Without Multilingual Supervision

Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering

Multimodal Sentiment Analysis with Missing Modality: A Knowledge-Transfer Approach

Keyword: visual

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge

Keyword: visio-linguistic

Keyword: cross-modal

Large Language Models are Efficient Learners of Noise-Robust Speech Recognition

Keyword: modality

A systematic review of geospatial location embedding approaches in large language models: A path to spatial AI systems

Keyword: modalities

Keyword: multi-modal

Keyword: multimodal