Abstract
In this paper, we focus on generating a synthetic question answering (QA) dataset using an adapted Translate-Align-Retrieve method. Using this method, we created the largest Serbian QA dataset of more than 87K samples, which we name SQuAD-sr. To acknowledge the script duality in Serbian, we generated both Cyrillic and Latin versions of the dataset. We investigate the dataset quality and use it to fine-tune several pre-trained QA models. Best results were obtained by fine-tuning the BERTić model on our Latin SQuAD-sr dataset, achieving 73.91% Exact Match and 82.97% F1 score on the benchmark XQuAD dataset, which we translated into Serbian for the purpose of evaluation. The results show that our model exceeds zero-shot baselines, but fails to go beyond human performance. We note the advantage of using a monolingual pre-trained model over a multilingual one, as well as the performance increase gained by using Latin over Cyrillic. By performing additional analysis, we show that questions about numeric values or dates are more likely to be answered correctly than other types of questions. Finally, we conclude that SQuAD-sr is of sufficient quality for fine-tuning a Serbian QA model, in the absence of a manually crafted and annotated dataset.
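For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how Exact Match and token-level F1 are commonly computed for SQuAD-style answers. The normalization shown (lowercasing, stripping ASCII punctuation and English articles) follows the usual English SQuAD convention and is an assumption here; the abstract does not say how Serbian answers were normalized. The function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse spaces.

    Assumption: English SQuAD-style normalization; a Serbian evaluation may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores are typically the mean of these per-question values, taking the maximum over the available gold answers for each question.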
Keyword: vision language
There is no result
Keyword: vision-language
There is no result
Keyword: language-vision
There is no result
Keyword: phrase-grounding
There is no result
Keyword: phrase grounding
There is no result
Keyword: reference expression comprehension
There is no result
Keyword: chest
A Multi-Level Framework for Accelerating Training Transformer Models
Authors: Longwei Zou, Han Zhang, Yangdong Deng
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Abstract
The fast-growing capabilities of large-scale deep learning models, such as BERT, GPT and ViT, are revolutionizing the landscape of NLP, CV and many other domains. Training such models, however, poses an unprecedented demand for computing power, which incurs exponentially increasing energy cost and carbon dioxide emissions. It is thus critical to develop efficient training solutions to reduce the training costs. Motivated by a set of key observations of inter- and intra-layer similarities among feature maps and attentions that can be identified from typical training processes, we propose a multi-level framework for training acceleration. Specifically, the framework is based on three basic operators, Coalescing, De-coalescing and Interpolation, which can be orchestrated to build a multi-level training framework. The framework consists of a V-cycle training process, which progressively down- and up-scales the model size and projects the parameters between adjacent levels of models via coalescing and de-coalescing. The key idea is that a smaller model can be trained to converge quickly, and its trained parameters provide high-quality intermediate solutions for the next-level larger network. The interpolation operator is designed to break the symmetry of neurons incurred by de-coalescing for better convergence performance. Our experiments on transformer-based language models (e.g. BERT, GPT) as well as a vision model (e.g. DeiT) show that the proposed framework reduces the computational cost by about 20% on training BERT/GPT-Base models and up to 51.6% on training the BERT-Large model while preserving the performance.
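The three operators can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not define the operators, so the code assumes that coalescing averages adjacent pairs of neurons (rows of a weight matrix), de-coalescing duplicates each neuron, and interpolation blends the de-coalesced weights with the original larger model's weights. All function names and the `alpha` parameter are hypothetical.

```python
def coalesce(weights):
    """Halve the width by averaging adjacent pairs of neurons (rows).

    Assumption: an even number of rows; the paper's operator may differ.
    """
    return [
        [(a + b) / 2 for a, b in zip(weights[i], weights[i + 1])]
        for i in range(0, len(weights), 2)
    ]


def decoalesce(small):
    """Double the width by duplicating every neuron (row)."""
    return [list(row) for row in small for _ in range(2)]


def interpolate(original, restored, alpha):
    """Blend the restored weights with the original larger model's weights.

    Duplicated neurons are identical after de-coalescing; mixing in the
    original weights makes them differ again (breaks the symmetry).
    """
    return [
        [(1 - alpha) * o + alpha * r for o, r in zip(o_row, r_row)]
        for o_row, r_row in zip(original, restored)
    ]


large = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
small = coalesce(large)            # the smaller model would be trained here
restored = decoalesce(small)       # rows 0 and 1 are now identical
mixed = interpolate(large, restored, alpha=0.5)
```

In a V-cycle, the coalesced model would be trained for some steps between `coalesce` and `decoalesce`; that training step is omitted from the sketch.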
Keyword: x-ray
There is no result
Keyword: clinical
There is no result
Keyword: biomedical
There is no result
Keyword: radiology
There is no result
Keyword: radiography
There is no result
Keyword: medical
Improving Health Question Answering with Reliable and Time-Aware Evidence Retrieval
Authors: Juraj Vladika, Florian Matthes
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Abstract
In today's digital world, seeking answers to health questions on the Internet is a common practice. However, existing question answering (QA) systems often rely on using pre-selected and annotated evidence documents, thus making them inadequate for addressing novel questions. Our study focuses on the open-domain QA setting, where the key challenge is to first uncover relevant evidence in large knowledge bases. By utilizing the common retrieve-then-read QA pipeline and PubMed as a trustworthy collection of medical research documents, we answer health questions from three diverse datasets. We modify different retrieval settings to observe their influence on the QA pipeline's performance, including the number of retrieved documents, the sentence selection process, the publication year of articles, and their number of citations. Our results reveal that cutting down on the number of retrieved documents and favoring more recent and highly cited documents can improve the final macro F1 score by up to 10%. We discuss the results, highlight interesting examples, and outline challenges for future research, like managing evidence disagreement and crafting user-friendly explanations.
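The retrieval settings described above can be pictured as a filter applied before the reader model sees any evidence. The sketch below is an illustration under stated assumptions, not the authors' pipeline: the `Article` fields, the thresholds, and the idea of hard filtering (rather than re-weighting) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Article:
    pmid: str         # hypothetical identifier
    year: int         # publication year
    citations: int    # citation count
    score: float      # retriever relevance score


def select_evidence(candidates, top_k, min_year=None, min_citations=0):
    """Keep recent, sufficiently cited articles, then the top_k most relevant."""
    kept = [
        a for a in candidates
        if (min_year is None or a.year >= min_year)
        and a.citations >= min_citations
    ]
    kept.sort(key=lambda a: a.score, reverse=True)
    return kept[:top_k]


candidates = [
    Article("a", 2005, 300, 0.9),
    Article("b", 2021, 40, 0.8),
    Article("c", 2022, 2, 0.7),
    Article("d", 2019, 120, 0.6),
]
evidence = select_evidence(candidates, top_k=2, min_year=2015, min_citations=10)
```

Here article "a" is dropped for its age and "c" for its low citation count, so a smaller and more selective evidence set reaches the reader.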
Keyword: active-learning
There is no result
Keyword: active learning
SQBC: Active Learning using LLM-Generated Synthetic Data for Stance Detection in Online Political Discussions
Authors: Stefan Sylvius Wagner, Maike Behrendt, Marc Ziegele, Stefan Harmeling
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract
Stance detection is an important task for many applications that analyse or support online political discussions. Common approaches include fine-tuning transformer-based models. However, these models require a large amount of labelled data, which might not be available. In this work, we present two different ways to leverage LLM-generated synthetic data to train and improve stance detection agents for online political discussions: first, we show that augmenting a small fine-tuning dataset with synthetic data can improve the performance of the stance detection model. Second, we propose a new active learning method called SQBC based on the "Query-by-Committee" approach. The key idea is to use LLM-generated synthetic data as an oracle to identify the most informative unlabelled samples, which are then selected for manual labelling. Comprehensive experiments show that both ideas can improve the stance detection performance. Curiously, we observed that fine-tuning on actively selected samples can exceed the performance of using the full dataset.
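One plausible reading of "synthetic data as an oracle" is sketched below: the labelled synthetic samples nearest to an unlabelled sample act as a committee, and samples on which the committee disagrees most are sent for manual labelling. The abstract does not give the exact procedure, so the nearest-neighbour committee, the vote-entropy score, and every name here are assumptions made for illustration.

```python
import math
from collections import Counter


def vote_entropy(labels):
    """Entropy of the committee's label votes; 0 means full agreement."""
    total = len(labels)
    return -sum(
        (count / total) * math.log(count / total)
        for count in Counter(labels).values()
    )


def select_informative(unlabelled, synthetic, k=3, budget=1):
    """Return indices of the `budget` unlabelled points with most disagreement.

    unlabelled: list of embedding tuples.
    synthetic:  list of (embedding tuple, label) pairs from an LLM (assumed).
    """
    scored = []
    for idx, point in enumerate(unlabelled):
        committee = sorted(synthetic, key=lambda s: math.dist(point, s[0]))[:k]
        scored.append((vote_entropy([label for _, label in committee]), idx))
    # Highest entropy first; ties broken by original order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:budget]]


synthetic = [
    ((0.0, 0.0), "pro"), ((0.0, 1.0), "pro"), ((1.0, 0.0), "pro"),
    ((5.0, 5.0), "con"), ((5.0, 6.0), "con"), ((6.0, 5.0), "con"),
]
unlabelled = [(0.2, 0.2), (2.8, 2.8), (5.5, 5.5)]
chosen = select_informative(unlabelled, synthetic, k=3, budget=1)
```

The middle point lies between the two synthetic clusters, so its committee splits its votes and it is the one picked for labelling.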
Keyword: chexpert
There is no result
Keyword: vision
S3Editor: A Sparse Semantic-Disentangled Self-Training Framework for Face Video Editing
Abstract
Face attribute editing plays a pivotal role in various applications. However, existing methods encounter challenges in achieving high-quality results while preserving identity, editing faithfulness, and temporal consistency. These challenges are rooted in issues related to the training pipeline, including limited supervision, architecture design, and optimization strategy. In this work, we introduce S3Editor, a Sparse Semantic-disentangled Self-training framework for face video editing. S3Editor is a generic solution that comprehensively addresses these challenges with three key contributions. Firstly, S3Editor adopts a self-training paradigm to enhance the training process through semi-supervision. Secondly, we propose a semantic disentangled architecture with a dynamic routing mechanism that accommodates diverse editing requirements. Thirdly, we present a structured sparse optimization schema that identifies and deactivates malicious neurons to further disentangle impacts from non-target attributes. S3Editor is model-agnostic and compatible with various editing approaches. Our extensive qualitative and quantitative results affirm that our approach significantly enhances identity preservation, editing fidelity, and temporal consistency.
Keyword: visual
There is no result
Keyword: visio-linguistic
There is no result
Keyword: cross-modal
There is no result
Keyword: modality
There is no result
Keyword: modalities
There is no result
Keyword: multi-modal
There is no result
Keyword: multimodal
Multimodal Contextual Dialogue Breakdown Detection for Conversational AI Models
Abstract
Detecting dialogue breakdown in real time is critical for conversational AI systems, because it enables taking corrective action to successfully complete a task. In spoken dialogue systems, this breakdown can be caused by a variety of unexpected situations, including high levels of background noise that cause STT mistranscriptions, or unexpected user flows. In particular, industry settings like healthcare require high precision and high flexibility to navigate differently based on the conversation history and dialogue states. This makes it both more challenging and more critical to accurately detect dialogue breakdown. We found that accurate detection requires processing audio inputs along with downstream NLP model inferences on transcribed text in real time. In this paper, we introduce a Multimodal Contextual Dialogue Breakdown (MultConDB) model. This model significantly outperforms other known best models by achieving an F1 of 69.27.
CATP: Cross-Attention Token Pruning for Accuracy Preserved Multimodal Model Inference
Authors: Ruqi Liao, Chuqing Zhao, Jin Li, Weiqi Feng
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract
In response to the rising interest in large multimodal models, we introduce Cross-Attention Token Pruning (CATP), a precision-focused token pruning method. Our approach leverages cross-attention layers in multimodal models, exemplified by BLIP-2, to extract valuable information for token importance determination. CATP employs a refined voting strategy across model heads and layers. In evaluations, CATP achieves up to 12.1X higher accuracy compared to existing token pruning methods, addressing the trade-off between computational efficiency and model precision.
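The idea of voting across heads and layers can be sketched as follows. This is a simplified illustration, not CATP itself: the abstract does not specify the voting rule, so the code assumes each attention head casts one vote for each of its highest-scoring tokens and the tokens with the most votes are kept. The function name and parameters are hypothetical.

```python
from collections import Counter


def prune_tokens_by_vote(attention, keep, votes_per_head):
    """Return the sorted indices of the `keep` tokens with the most votes.

    attention: list of per-head score lists (heads from all layers flattened),
               each holding one cross-attention score per token (assumed).
    """
    num_tokens = len(attention[0])
    votes = Counter()
    for head_scores in attention:
        # Each head votes for its top-scoring tokens.
        ranked = sorted(range(num_tokens), key=lambda t: -head_scores[t])
        votes.update(ranked[:votes_per_head])
    # Most votes first; ties broken by token position.
    kept = sorted(range(num_tokens), key=lambda t: (-votes[t], t))[:keep]
    return sorted(kept)


attention = [
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.50, 0.40, 0.05],
    [0.70, 0.10, 0.15, 0.05],
]
kept = prune_tokens_by_vote(attention, keep=2, votes_per_head=2)
```

Counting votes rather than summing raw scores keeps a single head with very large attention values from dominating the decision, which is one reason a voting scheme might be preferred.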
Keyword: alignment
There is no result
Keyword: aligning
There is no result
Keyword: align
Synthetic Dataset Creation and Fine-Tuning of Transformer Models for Question Answering in Serbian
Keyword: vision language
There is no result
Keyword: vision-language
There is no result
Keyword: language-vision
There is no result
Keyword: phrase-grounding
There is no result
Keyword: phrase grounding
There is no result
Keyword: reference expression comprehension
There is no result
Keyword: chest
A Multi-Level Framework for Accelerating Training Transformer Models
Keyword: x-ray
There is no result
Keyword: clinical
There is no result
Keyword: biomedical
There is no result
Keyword: radiology
There is no result
Keyword: radiography
There is no result
Keyword: medical
Improving Health Question Answering with Reliable and Time-Aware Evidence Retrieval
Keyword: active-learning
There is no result
Keyword: active learning
SQBC: Active Learning using LLM-Generated Synthetic Data for Stance Detection in Online Political Discussions
Keyword: chexpert
There is no result
Keyword: vision
S3Editor: A Sparse Semantic-Disentangled Self-Training Framework for Face Video Editing
Keyword: visual
There is no result
Keyword: visio-linguistic
There is no result
Keyword: cross-modal
There is no result
Keyword: modality
There is no result
Keyword: modalities
There is no result
Keyword: multi-modal
There is no result
Keyword: multimodal
Multimodal Contextual Dialogue Breakdown Detection for Conversational AI Models
CATP: Cross-Attention Token Pruning for Accuracy Preserved Multimodal Model Inference