7 New submissions for Mon, 15 Apr 24

Keyword: alignment

There is no result

Keyword: aligning

There is no result

Keyword: align

Synthetic Dataset Creation and Fine-Tuning of Transformer Models for Question Answering in Serbian

Authors: Aleksa Cvetanović, Predrag Tadić
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.08617
Pdf link: https://arxiv.org/pdf/2404.08617
Abstract In this paper, we focus on generating a synthetic question answering (QA) dataset using an adapted Translate-Align-Retrieve method. Using this method, we created the largest Serbian QA dataset of more than 87K samples, which we name SQuAD-sr. To acknowledge the script duality in Serbian, we generated both Cyrillic and Latin versions of the dataset. We investigate the dataset quality and use it to fine-tune several pre-trained QA models. Best results were obtained by fine-tuning the BERTić model on our Latin SQuAD-sr dataset, achieving 73.91% Exact Match and 82.97% F1 score on the benchmark XQuAD dataset, which we translated into Serbian for the purpose of evaluation. The results show that our model exceeds zero-shot baselines, but fails to go beyond human performance. We note the advantage of using a monolingual pre-trained model over multilingual, as well as the performance increase gained by using Latin over Cyrillic. By performing additional analysis, we show that questions about numeric values or dates are more likely to be answered correctly than other types of questions. Finally, we conclude that SQuAD-sr is of sufficient quality for fine-tuning a Serbian QA model, in the absence of a manually crafted and annotated dataset.
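
For readers unfamiliar with the two metrics quoted in this abstract, here is a minimal sketch of Exact Match and token-level F1, assuming the usual SQuAD-style definitions. The abstract does not say which evaluation script the authors used, so the normalization below (lowercasing, stripping ASCII punctuation, whitespace tokenization) and the function names are assumptions for illustration; the official SQuAD script additionally removes English articles, which is omitted here.

```python
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    cleaned = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return cleaned.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred == ref)
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A prediction that contains the reference plus one extra token, such as `token_f1("u Beogradu", "Beogradu")`, gets partial credit under F1 but scores zero under Exact Match, which is why the F1 figure is the higher of the two.
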
Keyword: vision language

There is no result

Keyword: vision-language

There is no result

Keyword: language-vision

There is no result

Keyword: phrase-grounding

There is no result

Keyword: phrase grounding

There is no result

Keyword: reference expression comprehension

There is no result

Keyword: chest

A Multi-Level Framework for Accelerating Training Transformer Models
Authors: Longwei Zou, Han Zhang, Yangdong Deng
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.07999
Pdf link: https://arxiv.org/pdf/2404.07999
Abstract The fast-growing capabilities of large-scale deep learning models, such as BERT, GPT and ViT, are revolutionizing the landscape of NLP, CV and many other domains. Training such models, however, poses an unprecedented demand for computing power, which incurs exponentially increasing energy cost and carbon dioxide emissions. It is thus critical to develop efficient training solutions to reduce the training costs. Motivated by a set of key observations of inter- and intra-layer similarities among feature maps and attentions that can be identified from typical training processes, we propose a multi-level framework for training acceleration. Specifically, the framework is based on three basic operators, Coalescing, De-coalescing and Interpolation, which can be orchestrated to build a multi-level training framework. The framework consists of a V-cycle training process, which progressively down- and up-scales the model size and projects the parameters between adjacent levels of models via coalescing and de-coalescing. The key idea is that a smaller model can be trained to convergence quickly, and its trained parameters provide high-quality intermediate solutions for the next-level larger network. The interpolation operator is designed to break the symmetry of neurons incurred by de-coalescing for better convergence performance. Our experiments on transformer-based language models (e.g. BERT, GPT) as well as a vision model (DeiT) show that the proposed framework reduces the computational cost by about 20% on training BERT/GPT-Base models and up to 51.6% on training the BERT-Large model while preserving the performance.
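
The three operators named in this abstract can be illustrated with a toy sketch on a single weight matrix. The abstract does not define the operators, so everything below is an assumption made for illustration: coalescing is modelled as averaging each 2x2 block of weights, de-coalescing as copying each weight back into a 2x2 block, and interpolation as a convex blend with mixing weight `alpha`. The function names and the even-dimension requirement are likewise hypothetical.

```python
Matrix = list[list[float]]


def coalesce(w: Matrix) -> Matrix:
    """Halve both dimensions by averaging each 2x2 block (assumes even sizes)."""
    rows, cols = len(w), len(w[0])
    return [
        [
            (w[i][j] + w[i][j + 1] + w[i + 1][j] + w[i + 1][j + 1]) / 4.0
            for j in range(0, cols, 2)
        ]
        for i in range(0, rows, 2)
    ]


def de_coalesce(w: Matrix) -> Matrix:
    """Double both dimensions by copying each weight into a 2x2 block."""
    out: Matrix = []
    for row in w:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))  # duplicated row: neurons are now symmetric
    return out


def interpolate(w_large: Matrix, w_projected: Matrix, alpha: float) -> Matrix:
    """Blend the earlier large-model weights with the projected ones."""
    return [
        [(1.0 - alpha) * a + alpha * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(w_large, w_projected)
    ]
```

In this toy version, `de_coalesce` always produces pairs of identical rows, and blending with the earlier large-model weights through `interpolate` makes those rows differ again. That mirrors, in spirit only, the symmetry-breaking role the abstract assigns to interpolation.
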
Keyword: x-ray

There is no result

Keyword: clinical

There is no result

Keyword: biomedical

There is no result

Keyword: radiology

There is no result

Keyword: radiography

There is no result

Keyword: medical

Improving Health Question Answering with Reliable and Time-Aware Evidence Retrieval
Authors: Juraj Vladika, Florian Matthes
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2404.08359
Pdf link: https://arxiv.org/pdf/2404.08359
Abstract In today's digital world, seeking answers to health questions on the Internet is a common practice. However, existing question answering (QA) systems often rely on using pre-selected and annotated evidence documents, thus making them inadequate for addressing novel questions. Our study focuses on the open-domain QA setting, where the key challenge is to first uncover relevant evidence in large knowledge bases. By utilizing the common retrieve-then-read QA pipeline and PubMed as a trustworthy collection of medical research documents, we answer health questions from three diverse datasets. We modify different retrieval settings to observe their influence on the QA pipeline's performance, including the number of retrieved documents, sentence selection process, the publication year of articles, and their number of citations. Our results reveal that cutting down on the number of retrieved documents and favoring more recent and highly cited documents can improve the final macro F1 score by up to 10%. We discuss the results, highlight interesting examples, and outline challenges for future research, like managing evidence disagreement and crafting user-friendly explanations.
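
The retrieval settings varied in this abstract (number of documents, publication year, citation count) can be pictured as a filter applied before the reader sees any evidence. The sketch below is only an illustration under assumed names: the `Doc` fields, the `select_evidence` function, and the hard thresholds are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Doc:
    doc_id: str      # hypothetical identifier, e.g. a PubMed ID
    score: float     # retriever relevance score, higher is better
    year: int        # publication year
    citations: int   # citation count


def select_evidence(
    docs: list[Doc],
    k: int,
    min_year: Optional[int] = None,
    min_citations: int = 0,
) -> list[Doc]:
    """Keep recent, sufficiently cited documents, then take the top k by score."""
    kept = [
        d
        for d in docs
        if (min_year is None or d.year >= min_year) and d.citations >= min_citations
    ]
    return sorted(kept, key=lambda d: d.score, reverse=True)[:k]
```

Lowering `k` or raising `min_year` and `min_citations` corresponds to the "fewer, more recent, more highly cited" direction that the abstract reports as beneficial.
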
Keyword: active-learning

There is no result

Keyword: active learning

SQBC: Active Learning using LLM-Generated Synthetic Data for Stance Detection in Online Political Discussions
Authors: Stefan Sylvius Wagner, Maike Behrendt, Marc Ziegele, Stefan Harmeling
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08078
Pdf link: https://arxiv.org/pdf/2404.08078
Abstract Stance detection is an important task for many applications that analyse or support online political discussions. Common approaches include fine-tuning transformer-based models. However, these models require a large amount of labelled data, which might not be available. In this work, we present two different ways to leverage LLM-generated synthetic data to train and improve stance detection agents for online political discussions: first, we show that augmenting a small fine-tuning dataset with synthetic data can improve the performance of the stance detection model. Second, we propose a new active learning method called SQBC based on the "Query-by-Committee" approach. The key idea is to use LLM-generated synthetic data as an oracle to identify the most informative unlabelled samples, which are then selected for manual labelling. Comprehensive experiments show that both ideas can improve the stance detection performance. Curiously, we observed that fine-tuning on actively selected samples can exceed the performance of using the full dataset.
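
As background for the Query-by-Committee approach this abstract builds on, here is a minimal sketch of generic vote-entropy selection. It is not the paper's SQBC method: how the committee is formed from synthetic data is not described in the abstract, so the votes are simply passed in, and the function names and the `budget` parameter are assumptions for illustration.

```python
import math
from collections import Counter


def vote_entropy(votes: list[str]) -> float:
    """Entropy of the committee's label votes; higher means more disagreement."""
    total = len(votes)
    entropy = 0.0
    for count in Counter(votes).values():
        p = count / total
        entropy -= p * math.log(p)
    return entropy


def select_for_labelling(committee_votes: dict[str, list[str]], budget: int) -> list[str]:
    """Return the `budget` sample ids whose committee votes disagree the most."""
    ranked = sorted(
        committee_votes,
        key=lambda sample_id: vote_entropy(committee_votes[sample_id]),
        reverse=True,
    )
    return ranked[:budget]
```

Samples on which every committee member agrees have zero entropy and are queried last, so the manual labelling budget goes to the samples the committee finds ambiguous.
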
Keyword: chexpert

There is no result

Keyword: vision

S3Editor: A Sparse Semantic-Disentangled Self-Training Framework for Face Video Editing
Authors: Guangzhi Wang, Tianyi Chen, Kamran Ghasedi, HsiangTao Wu, Tianyu Ding, Chris Nuesmeyer, Ilya Zharkov, Mohan Kankanhalli, Luming Liang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.08111
Pdf link: https://arxiv.org/pdf/2404.08111
Abstract Face attribute editing plays a pivotal role in various applications. However, existing methods encounter challenges in achieving high-quality results while preserving identity, editing faithfulness, and temporal consistency. These challenges are rooted in issues related to the training pipeline, including limited supervision, architecture design, and optimization strategy. In this work, we introduce S3Editor, a Sparse Semantic-disentangled Self-training framework for face video editing. S3Editor is a generic solution that comprehensively addresses these challenges with three key contributions. Firstly, S3Editor adopts a self-training paradigm to enhance the training process through semi-supervision. Secondly, we propose a semantic disentangled architecture with a dynamic routing mechanism that accommodates diverse editing requirements. Thirdly, we present a structured sparse optimization schema that identifies and deactivates malicious neurons to further disentangle impacts from non-target attributes. S3Editor is model-agnostic and compatible with various editing approaches. Our extensive qualitative and quantitative results affirm that our approach significantly enhances identity preservation, editing fidelity, as well as temporal consistency.
Keyword: visual

There is no result

Keyword: visio-linguistic

There is no result

Keyword: cross-modal

There is no result

Keyword: modality

There is no result

Keyword: modalities

There is no result

Keyword: multi-modal

There is no result

Keyword: multimodal

Multimodal Contextual Dialogue Breakdown Detection for Conversational AI Models
Authors: Md Messal Monem Miah, Ulie Schnaithmann, Arushi Raghuvanshi, Youngseo Son
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.08156
Pdf link: https://arxiv.org/pdf/2404.08156
Abstract Detecting dialogue breakdown in real time is critical for conversational AI systems, because it enables taking corrective action to successfully complete a task. In spoken dialogue systems, this breakdown can be caused by a variety of unexpected situations, including high levels of background noise that cause STT mistranscriptions, or unexpected user flows. In particular, industry settings like healthcare require high precision and high flexibility to navigate differently based on the conversation history and dialogue states. This makes it both more challenging and more critical to accurately detect dialogue breakdown. We found that accurately detecting breakdown requires processing audio inputs along with downstream NLP model inferences on transcribed text in real time. In this paper, we introduce a Multimodal Contextual Dialogue Breakdown (MultConDB) model. This model significantly outperforms other known best models by achieving an F1 of 69.27.

CATP: Cross-Attention Token Pruning for Accuracy Preserved Multimodal Model Inference
Authors: Ruqi Liao, Chuqing Zhao, Jin Li, Weiqi Feng
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.08567
Pdf link: https://arxiv.org/pdf/2404.08567
Abstract In response to the rising interest in large multimodal models, we introduce Cross-Attention Token Pruning (CATP), a precision-focused token pruning method. Our approach leverages cross-attention layers in multimodal models, exemplified by BLIP-2, to extract valuable information for token importance determination. CATP employs a refined voting strategy across model heads and layers. In evaluations, CATP achieves up to 12.1X higher accuracy compared to existing token pruning methods, addressing the trade-off between computational efficiency and model precision.
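
The idea of voting on token importance across attention heads can be sketched as follows. This is a simplified illustration rather than the CATP algorithm: the abstract does not specify how votes are cast or weighted, so the scheme below (each head votes for the tokens that receive the most total attention, and the tokens with the most votes are kept) and all names are assumptions.

```python
def prune_tokens(attn: list[list[list[float]]], keep: int) -> list[int]:
    """Pick `keep` token indices to retain.

    attn[h][q][k] is the cross-attention probability from query q to
    candidate token k in head h (layers can be flattened into the head axis).
    """
    num_tokens = len(attn[0][0])
    votes = [0] * num_tokens
    for head in attn:
        # Importance of a token within one head: total attention it receives.
        received = [sum(row[k] for row in head) for k in range(num_tokens)]
        top = sorted(range(num_tokens), key=lambda k: received[k], reverse=True)[:keep]
        for k in top:
            votes[k] += 1  # one vote per head for each of its top tokens
    # Keep the tokens with the most votes; ties fall to the lower index.
    kept = sorted(range(num_tokens), key=lambda k: votes[k], reverse=True)[:keep]
    return sorted(kept)
```

Voting across heads, rather than averaging raw attention, keeps a single head with very peaked attention from dominating the decision; whether CATP is motivated the same way is not stated in the abstract.
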
