Abstract
Anomaly detection is critical for the secure and reliable operation of industrial control systems. As our reliance on such complex cyber-physical systems grows, it becomes paramount to have automated methods for detecting anomalies, preventing attacks, and responding intelligently. {This paper presents a novel deep generative model to meet this need. The proposed model follows a variational autoencoder architecture with a convolutional encoder and decoder to extract features from both spatial and temporal dimensions. Additionally, we incorporate an attention mechanism that directs focus towards specific regions, enhancing the representation of relevant features and improving anomaly detection accuracy. We also employ a dynamic threshold approach leveraging the reconstruction probability and make our source code publicly available to promote reproducibility and facilitate further research. Comprehensive experimental analysis is conducted on data from all six stages of the Secure Water Treatment (SWaT) testbed, and the experimental results demonstrate the superior performance of our approach compared to several state-of-the-art baseline techniques.
Title:
Challenges for Responsible AI Design and Workflow Integration in Healthcare: A Case Study of Automatic Feeding Tube Qualification in Radiology
Authors: Anja Thieme, Abhijith Rajamohan, Benjamin Cooper, Heather Groombridge, Robert Simister, Barney Wong, Nicholas Woznitza, Mark Ames Pinnock, Maria Teodora Wetscherek, Cecily Morrison, Hannah Richardson, Fernando Pérez-García, Stephanie L. Hyland, Shruthi Bannur, Daniel C. Castro, Kenza Bouzid, Anton Schwaighofer, Mercy Ranjit, Harshita Sharma, Matthew P. Lungren, Ozan Oktay, Javier Alvarez-Valle, Aditya Nori, Stephen Harris, Joseph Jacob
Abstract
Nasogastric tubes (NGTs) are feeding tubes that are inserted through the nose into the stomach to deliver nutrition or medication. If not placed correctly, they can cause serious harm, even death to patients. Recent AI developments demonstrate the feasibility of robustly detecting NGT placement from Chest X-ray images to reduce risks of sub-optimally or critically placed NGTs being missed or delayed in their detection, but gaps remain in clinical practice integration. In this study, we present a human-centered approach to the problem and describe insights derived following contextual inquiry and in-depth interviews with 15 clinical stakeholders. The interviews helped understand challenges in existing workflows, and how best to align technical capabilities with user needs and expectations. We discovered the trade-offs and complexities that need consideration when choosing suitable workflow stages, target users, and design configurations for different AI proposals. We explored how to balance AI benefits and risks for healthcare staff and patients within broader organizational and medical-legal constraints. We also identified data issues related to edge cases and data biases that affect model training and evaluation; how data documentation practices influence data preparation and labelling; and how to measure relevant AI outcomes reliably in future evaluations. We discuss how our work informs design and development of AI applications that are clinically useful, ethical, and acceptable in real-world healthcare services.
Title:
Mutual information and the encoding of contingency tables
Authors: Maximilian Jerdee, Alec Kirkley, M. E. J. Newman
Subjects: Subjects:
Social and Information Networks (cs.SI)
Abstract
Mutual information is commonly used as a measure of similarity between competing labelings of a given set of objects, for example to quantify performance in classification and community detection tasks. As argued recently, however, the mutual information as conventionally defined can return biased results because it neglects the information cost of the so-called contingency table, a crucial component of the similarity calculation. In principle the bias can be rectified by subtracting the appropriate information cost, leading to the modified measure known as the reduced mutual information, but in practice one can only ever compute an upper bound on this information cost, and the value of the reduced mutual information depends crucially on how good a bound is established. In this paper we describe an improved method for encoding contingency tables that gives a substantially better bound in typical use cases, and approaches the ideal value in the common case where the labelings are closely similar, as we demonstrate with extensive numerical results.
Title:
Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals
Authors: Joshua Clymer, Caden Juang, Severin Field
Subjects: Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract
Like a criminal under investigation, Large Language Models (LLMs) might pretend to be aligned while evaluated and misbehave when they have a good opportunity. Can current interpretability methods catch these 'alignment fakers?' To answer this question, we introduce a benchmark that consists of 324 pairs of LLMs fine-tuned to select actions in role-play scenarios. One model in each pair is consistently benign (aligned). The other model misbehaves in scenarios where it is unlikely to be caught (alignment faking). The task is to identify the alignment faking model using only inputs where the two models behave identically. We test five detection strategies, one of which identifies 98% of alignment-fakers.
Title:
PLLM-CS: Pre-trained Large Language Model (LLM) for Cyber Threat Detection in Satellite Networks
Authors: Mohammed Hassanin, Marwa Keshk, Sara Salim, Majid Alsubaie, Dharmendra Sharma
Subjects: Subjects:
Cryptography and Security (cs.CR)
Abstract
Satellite networks are vital in facilitating communication services for various critical infrastructures. These networks can seamlessly integrate with a diverse array of systems. However, some of these systems are vulnerable due to the absence of effective intrusion detection systems, which can be attributed to limited research and the high costs associated with deploying, fine-tuning, monitoring, and responding to security breaches. To address these challenges, we propose a pretrained Large Language Model for Cyber Security , for short PLLM-CS, which is a variant of pre-trained Transformers [1], which includes a specialized module for transforming network data into contextually suitable inputs. This transformation enables the proposed LLM to encode contextual information within the cyber data. To validate the efficacy of the proposed method, we conducted empirical experiments using two publicly available network datasets, UNSW_NB 15 and TON_IoT, both providing Internet of Things (IoT)-based traffic data. Our experiments demonstrate that proposed LLM method outperforms state-of-the-art techniques such as BiLSTM, GRU, and CNN. Notably, the PLLM-CS method achieves an outstanding accuracy level of 100% on the UNSW_NB 15 dataset, setting a new standard for benchmark performance in this domain.
Title:
HPPS: A Hierarchical Progressive Perception System for Luggage Trolley Detection and Localization at Airports
Authors: Zhirui Sun, Zhe Zhang, Jieting Zhao, Hanjing Ye, Jiankun Wang
Abstract
The robotic autonomous luggage trolley collection system employs robots to gather and transport scattered luggage trolleys at airports. However, existing methods for detecting and locating these luggage trolleys often fail when they are not fully visible. To address this, we introduce the Hierarchical Progressive Perception System (HPPS), which enhances the detection and localization of luggage trolleys under partial occlusion. The HPPS processes the luggage trolley's position and orientation separately, which requires only RGB images for labeling and training, eliminating the need for 3D coordinates and alignment. The HPPS can accurately determine the position of the luggage trolley with just one well-detected keypoint and estimate the luggage trolley's orientation when it is partially occluded. Once the luggage trolley's initial pose is detected, HPPS updates this information continuously to refine its accuracy until the robot begins grasping. The experiments on detection and localization demonstrate that HPPS is more reliable under partial occlusion compared to existing methods. Its effectiveness and robustness have also been confirmed through practical tests in actual luggage trolley collection tasks. A website about this work is available at HPPS.
Title:
The object detection model uses combined extraction with KNN and RF classification
Authors: Florentina Tatrin Kurniati, Daniel HF Manongga, Irwan Sembiring, Sutarto Wijono, Roy Rudolf Huizen
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Object detection plays an important role in various fields. Developing detection models for 2D objects that experience rotation and texture variations is a challenge. In this research, the initial stage of the proposed model integrates the gray-level co-occurrence matrix (GLCM) and local binary patterns (LBP) texture feature extraction to obtain feature vectors. The next stage is classifying features using k-nearest neighbors (KNN) and random forest (RF), as well as voting ensemble (VE). System testing used a dataset of 4,437 2D images, the results for KNN accuracy were 92.7% and F1-score 92.5%, while RF performance was lower. Although GLCM features improve performance on both algorithms, KNN is more consistent. The VE approach provides the best performance with an accuracy of 93.9% and an F1 score of 93.8%, this shows the effectiveness of the ensemble technique in increasing object detection accuracy. This study contributes to the field of object detection with a new approach combining GLCM and LBP as feature vectors as well as VE for classification
Title:
Towards Robust Physical-world Backdoor Attacks on Lane Detection
Abstract
Deep learning-based lane detection (LD) plays a critical role in autonomous driving systems, such as adaptive cruise control. However, it is vulnerable to backdoor attacks. Existing backdoor attack methods on LD exhibit limited effectiveness in dynamic real-world scenarios, primarily because they fail to consider dynamic scene factors, including changes in driving perspectives (e.g., viewpoint transformations) and environmental conditions (e.g., weather or lighting changes). To tackle this issue, this paper introduces BadLANE, a dynamic scene adaptation backdoor attack for LD designed to withstand changes in real-world dynamic scene factors. To address the challenges posed by changing driving perspectives, we propose an amorphous trigger pattern composed of shapeless pixels. This trigger design allows the backdoor to be activated by various forms or shapes of mud spots or pollution on the road or lens, enabling adaptation to changes in vehicle observation viewpoints during driving. To mitigate the effects of environmental changes, we design a meta-learning framework to train meta-generators tailored to different environmental conditions. These generators produce meta-triggers that incorporate diverse environmental information, such as weather or lighting conditions, as the initialization of the trigger patterns for backdoor implantation, thus enabling adaptation to dynamic environments. Extensive experiments on various commonly used LD models in both digital and physical domains validate the effectiveness of our attacks, outperforming other baselines significantly (+25.15\% on average in Attack Success Rate). Our codes will be available upon paper publication.
Title:
Privacy-Preserving Edge Federated Learning for Intelligent Mobile-Health Systems
Authors: Amin Aminifar, Matin Shokri, Amir Aminifar
Subjects: Subjects:
Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Abstract
Machine Learning (ML) algorithms are generally designed for scenarios in which all data is stored in one data center, where the training is performed. However, in many applications, e.g., in the healthcare domain, the training data is distributed among several entities, e.g., different hospitals or patients' mobile devices/sensors. At the same time, transferring the data to a central location for learning is certainly not an option, due to privacy concerns and legal issues, and in certain cases, because of the communication and computation overheads. Federated Learning (FL) is the state-of-the-art collaborative ML approach for training an ML model across multiple parties holding local data samples, without sharing them. However, enabling learning from distributed data over such edge Internet of Things (IoT) systems (e.g., mobile-health and wearable technologies, involving sensitive personal/medical data) in a privacy-preserving fashion presents a major challenge mainly due to their stringent resource constraints, i.e., limited computing capacity, communication bandwidth, memory storage, and battery lifetime. In this paper, we propose a privacy-preserving edge FL framework for resource-constrained mobile-health and wearable technologies over the IoT infrastructure. We evaluate our proposed framework extensively and provide the implementation of our technique on Amazon's AWS cloud platform based on the seizure detection application in epilepsy monitoring using wearable technologies.
Title:
Depth Awakens: A Depth-perceptual Attention Fusion Network for RGB-D Camouflaged Object Detection
Authors: Xinran Liua, Lin Qia, Yuxuan Songa, Qi Wen
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
Abstract
Camouflaged object detection (COD) presents a persistent challenge in accurately identifying objects that seamlessly blend into their surroundings. However, most existing COD models overlook the fact that visual systems operate within a genuine 3D environment. The scene depth inherent in a single 2D image provides rich spatial clues that can assist in the detection of camouflaged objects. Therefore, we propose a novel depth-perception attention fusion network that leverages the depth map as an auxiliary input to enhance the network's ability to perceive 3D information, which is typically challenging for the human eye to discern from 2D images. The network uses a trident-branch encoder to extract chromatic and depth information and their communications. Recognizing that certain regions of a depth map may not effectively highlight the camouflaged object, we introduce a depth-weighted cross-attention fusion module to dynamically adjust the fusion weights on depth and RGB feature maps. To keep the model simple without compromising effectiveness, we design a straightforward feature aggregation decoder that adaptively fuses the enhanced aggregated features. Experiments demonstrate the significant superiority of our proposed method over other states of the arts, which further validates the contribution of depth information in camouflaged object detection. The code will be available at this https URL.
Title:
An Automatic Prompt Generation System for Tabular Data Tasks
Abstract
Efficient processing of tabular data is important in various industries, especially when working with datasets containing a large number of columns. Large language models (LLMs) have demonstrated their ability on several tasks through carefully crafted prompts. However, creating effective prompts for tabular datasets is challenging due to the structured nature of the data and the need to manage numerous columns. This paper presents an innovative auto-prompt generation system suitable for multiple LLMs, with minimal training. It proposes two novel methods; 1) A Reinforcement Learning-based algorithm for identifying and sequencing task-relevant columns 2) Cell-level similarity-based approach for enhancing few-shot example selection. Our approach has been extensively tested across 66 datasets, demonstrating improved performance in three downstream tasks: data imputation, error detection, and entity matching using two distinct LLMs; Google flan-t5-xxl and Mixtral 8x7B.
Title:
ASGrasp: Generalizable Transparent Object Reconstruction and Grasping from RGB-D Active Stereo Camera
Authors: Jun Shi, Yong A, Yixiang Jin, Dingzhe Li, Haoyu Niu, Zhezhu Jin, He Wang
Subjects: Subjects:
Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
In this paper, we tackle the problem of grasping transparent and specular objects. This issue holds importance, yet it remains unsolved within the field of robotics due to failure of recover their accurate geometry by depth cameras. For the first time, we propose ASGrasp, a 6-DoF grasp detection network that uses an RGB-D active stereo camera. ASGrasp utilizes a two-layer learning-based stereo network for the purpose of transparent object reconstruction, enabling material-agnostic object grasping in cluttered environments. In contrast to existing RGB-D based grasp detection methods, which heavily depend on depth restoration networks and the quality of depth maps generated by depth cameras, our system distinguishes itself by its ability to directly utilize raw IR and RGB images for transparent object geometry reconstruction. We create an extensive synthetic dataset through domain randomization, which is based on GraspNet-1Billion. Our experiments demonstrate that ASGrasp can achieve over 90% success rate for generalizable transparent object grasping in both simulation and the real via seamless sim-to-real transfer. Our method significantly outperforms SOTA networks and even surpasses the performance upper bound set by perfect visible point cloud inputs.Project page: this https URL
Title:
Detecting Statements in Text: A Domain-Agnostic Few-Shot Solution
Authors: Sandrine Chausson, Björn Ross
Subjects: Subjects:
Computation and Language (cs.CL)
Abstract
Many tasks related to Computational Social Science and Web Content Analysis involve classifying pieces of text based on the claims they contain. State-of-the-art approaches usually involve fine-tuning models on large annotated datasets, which are costly to produce. In light of this, we propose and release a qualitative and versatile few-shot learning methodology as a common paradigm for any claim-based textual classification task. This methodology involves defining the classes as arbitrarily sophisticated taxonomies of claims, and using Natural Language Inference models to obtain the textual entailment between these and a corpus of interest. The performance of these models is then boosted by annotating a minimal sample of data points, dynamically sampled using the well-established statistical heuristic of Probabilistic Bisection. We illustrate this methodology in the context of three tasks: climate change contrarianism detection, topic/stance classification and depression-relates symptoms detection. This approach rivals traditional pre-train/fine-tune approaches while drastically reducing the need for data annotation.
Title:
Private Online Community Detection for Censored Block Models
Authors: Mohamed Seif, Liyan Xie, Andrea J. Goldsmith, H. Vincent Poor
Subjects: Subjects:
Social and Information Networks (cs.SI); Cryptography and Security (cs.CR); Information Theory (cs.IT)
Abstract
We study the private online change detection problem for dynamic communities, using a censored block model (CBM). Focusing on the notion of edge differential privacy (DP), we seek to understand the fundamental tradeoffs between the privacy budget, detection delay, and exact community recovery of community labels. We establish the theoretical lower bound on the delay in detecting changes privately and propose an algorithm capable of identifying changes in the community structure, while maintaining user privacy. Further, we provide theoretical guarantees for the effectiveness of our proposed method by showing necessary and sufficient conditions on change detection and exact recovery under edge DP. Simulation and real data examples are provided to validate the proposed method.
Abstract
In recent years, convolutional neural networks (CNNs) with channel-wise feature refining mechanisms have brought noticeable benefits to modelling channel dependencies. However, current attention paradigms fail to infer an optimal channel descriptor capable of simultaneously exploiting statistical and spatial relationships among feature maps. In this paper, to overcome this shortcoming, we present a novel channel-wise spatially autocorrelated (CSA) attention mechanism. Inspired by geographical analysis, the proposed CSA exploits the spatial relationships between channels of feature maps to produce an effective channel descriptor. To the best of our knowledge, this is the f irst time that the concept of geographical spatial analysis is utilized in deep CNNs. The proposed CSA imposes negligible learning parameters and light computational overhead to the deep model, making it a powerful yet efficient attention module of choice. We validate the effectiveness of the proposed CSA networks (CSA-Nets) through extensive experiments and analysis on ImageNet, and MS COCO benchmark datasets for image classification, object detection, and instance segmentation. The experimental results demonstrate that CSA-Nets are able to consistently achieve competitive performance and superior generalization than several state-of-the-art attention-based CNNs over different benchmark tasks and datasets.
Title:
To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems
Authors: Miquel Miró-Nicolau, Gabriel Moyà-Alcover, Antoni Jaume-i-Capó, Manuel González-Hidalgo, Maria Gemma Sempere Campello, Juan Antonio Palmer Sancho
Abstract
The increasing reliance on Deep Learning models, combined with their inherent lack of transparency, has spurred the development of a novel field of study known as eXplainable AI (XAI) methods. These methods seek to enhance the trust of end-users in automated systems by providing insights into the rationale behind their decisions. This paper presents a novel approach for measuring user trust in XAI systems, allowing their refinement. Our proposed metric combines both performance metrics and trust indicators from an objective perspective. To validate this novel methodology, we conducted a case study in a realistic medical scenario: the usage of XAI system for the detection of pneumonia from x-ray images.
Title:
Exploiting Autoencoder's Weakness to Generate Pseudo Anomalies
Authors: Marcella Astrid, Muhammad Zaigham Zaheer, Djamila Aouada, Seung-Ik Lee
Abstract
Due to the rare occurrence of anomalous events, a typical approach to anomaly detection is to train an autoencoder (AE) with normal data only so that it learns the patterns or representations of the normal training data. At test time, the trained AE is expected to well reconstruct normal but to poorly reconstruct anomalous data. However, contrary to the expectation, anomalous data is often well reconstructed as well. In order to further separate the reconstruction quality between normal and anomalous data, we propose creating pseudo anomalies from learned adaptive noise by exploiting the aforementioned weakness of AE, i.e., reconstructing anomalies too well. The generated noise is added to the normal data to create pseudo anomalies. Extensive experiments on Ped2, Avenue, ShanghaiTech, CIFAR-10, and KDDCUP datasets demonstrate the effectiveness and generic applicability of our approach in improving the discriminative capability of AEs for anomaly detection.
Title:
Deep Multi-Task Learning for Malware Image Classification
Authors: Ahmed Bensaoud, Jugal Kalita
Subjects: Subjects:
Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
Malicious software is a pernicious global problem. A novel multi-task learning framework is proposed in this paper for malware image classification for accurate and fast malware detection. We generate bitmap (BMP) and (PNG) images from malware features, which we feed to a deep learning classifier. Our state-of-the-art multi-task learning approach has been tested on a new dataset, for which we have collected approximately 100,000 benign and malicious PE, APK, Mach-o, and ELF examples. Experiments with seven tasks tested with 4 activation functions, ReLU, LeakyReLU, PReLU, and ELU separately demonstrate that PReLU gives the highest accuracy of more than 99.87% on all tasks. Our model can effectively detect a variety of obfuscation methods like packing, encryption, and instruction overlapping, strengthing the beneficial claims of our model, in addition to achieving the state-of-art methods in terms of accuracy.
Title:
Self-Supervised Learning of Time Series Representation via Diffusion Process and Imputation-Interpolation-Forecasting Mask
Authors: Zineb Senane, Lele Cao, Valentin Leonhard Buchner, Yusuke Tashiro, Lei You, Pawel Herman, Mats Nordahl, Ruibo Tu, Vilhelm von Ehrenheim
Abstract
Time Series Representation Learning (TSRL) focuses on generating informative representations for various Time Series (TS) modeling tasks. Traditional Self-Supervised Learning (SSL) methods in TSRL fall into four main categories: reconstructive, adversarial, contrastive, and predictive, each with a common challenge of sensitivity to noise and intricate data nuances. Recently, diffusion-based methods have shown advanced generative capabilities. However, they primarily target specific application scenarios like imputation and forecasting, leaving a gap in leveraging diffusion models for generic TSRL. Our work, Time Series Diffusion Embedding (TSDE), bridges this gap as the first diffusion-based SSL TSRL approach. TSDE segments TS data into observed and masked parts using an Imputation-Interpolation-Forecasting (IIF) mask. It applies a trainable embedding function, featuring dual-orthogonal Transformer encoders with a crossover mechanism, to the observed part. We train a reverse diffusion process conditioned on the embeddings, designed to predict noise added to the masked part. Extensive experiments demonstrate TSDE's superiority in imputation, interpolation, forecasting, anomaly detection, classification, and clustering. We also conduct an ablation study, present embedding visualizations, and compare inference speed, further substantiating TSDE's efficiency and validity in learning representations of TS data.
Keyword: face recognition
Title:
A Comprehensive Survey of Masked Faces: Recognition, Detection, and Unmasking
Authors: Mohamed Mahmoud, Mahmoud SalahEldin Kasem, Hyun-Soo Kang
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Masked face recognition (MFR) has emerged as a critical domain in biometric identification, especially by the global COVID-19 pandemic, which introduced widespread face masks. This survey paper presents a comprehensive analysis of the challenges and advancements in recognising and detecting individuals with masked faces, which has seen innovative shifts due to the necessity of adapting to new societal norms. Advanced through deep learning techniques, MFR, along with Face Mask Recognition (FMR) and Face Unmasking (FU), represent significant areas of focus. These methods address unique challenges posed by obscured facial features, from fully to partially covered faces. Our comprehensive review delves into the various deep learning-based methodologies developed for MFR, FMR, and FU, highlighting their distinctive challenges and the solutions proposed to overcome them. Additionally, we explore benchmark datasets and evaluation metrics specifically tailored for assessing performance in MFR research. The survey also discusses the substantial obstacles still facing researchers in this field and proposes future directions for the ongoing development of more robust and effective masked face recognition systems. This paper serves as an invaluable resource for researchers and practitioners, offering insights into the evolving landscape of face recognition technologies in the face of global health crises and beyond.
Keyword: augmentation
Title:
A Study on Cognitive Effects of Canvas Size for Augmenting Drawing Skill
Abstract
In recent years, the field of generative artificial intelligence, particularly in the domain of image generation, has exerted a profound influence on society. Despite the capability of AI to produce images of high quality, the augmentation of users' drawing abilities through the provision of drawing support systems emerges as a challenging issue. In this study, we propose that a cognitive factor, specifically, the size of the canvas, may exert a considerable influence on the outcomes of imitative drawing sketches when utilizing reference images. To investigate this hypothesis, a web based drawing interface was utilized, designed specifically to evaluate the effect of the canvas size's proportionality to the reference image on the fidelity of the drawings produced. The findings from our research lend credence to the hypothesis that a drawing interface, featuring a canvas whose dimensions closely match those of the reference image, markedly improves the precision of user-generated sketches.
Title:
LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation
Authors: Tianrui Guan, Yurou Yang, Harry Cheng, Muyuan Lin, Richard Kim, Rajasimman Madhivanan, Arnie Sen, Dinesh Manocha
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract
In this paper, we present LOC-ZSON, a novel Language-driven Object-Centric image representation for object navigation task within complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates for stability during training and zero-shot inference. We implement our method on Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. We show that our proposed method can achieve an improvement of 1.38 - 13.38% in terms of text-to-image recall on different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and real world, showing 5% and 16.67% improvement in terms of navigation success rate, respectively.
Title:
AFEN: Respiratory Disease Classification using Ensemble Learning
Abstract
We present AFEN (Audio Feature Ensemble Learning), a model that leverages Convolutional Neural Networks (CNN) and XGBoost in an ensemble learning fashion to perform state-of-the-art audio classification for a range of respiratory diseases. We use a meticulously selected mix of audio features which provide the salient attributes of the data and allow for accurate classification. The extracted features are then used as an input to two separate model classifiers 1) a multi-feature CNN classifier and 2) an XGBoost Classifier. The outputs of the two models are then fused with the use of soft voting. Thus, by exploiting ensemble learning, we achieve increased robustness and accuracy. We evaluate the performance of the model on a database of 920 respiratory sounds, which undergoes data augmentation techniques to increase the diversity of the data and generalizability of the model. We empirically verify that AFEN sets a new state-of-the-art using Precision and Recall as metrics, while decreasing training time by 60%.
Title:
Redefining Information Retrieval of Structured Database via Large Language Models
Abstract
Retrieval augmentation is critical when Language Models (LMs) exploit non-parametric knowledge related to the query through external knowledge bases before reasoning. The retrieved information is incorporated into LMs as context alongside the query, enhancing the reliability of responses towards factual questions. Prior researches in retrieval augmentation typically follow a retriever-generator paradigm. In this context, traditional retrievers encounter challenges in precisely and seamlessly extracting query-relevant information from knowledge bases. To address this issue, this paper introduces a novel retrieval augmentation framework called ChatLR that primarily employs the powerful semantic understanding ability of Large Language Models (LLMs) as retrievers to achieve precise and concise information retrieval. Additionally, we construct an LLM-based search and question answering system tailored for the financial domain by fine-tuning LLM on two tasks including Text2API and API-ID recognition. Experimental results demonstrate the effectiveness of ChatLR in addressing user queries, achieving an overall information retrieval accuracy exceeding 98.8\%.
Title:
Universal Adversarial Perturbations for Vision-Language Pre-trained Models
Authors: Peng-Fei Zhang, Zi Huang, Guangdong Bai
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Abstract
Vision-language pre-trained (VLP) models have been the foundation of numerous vision-language tasks. Given their prevalence, it becomes imperative to assess their adversarial robustness, especially when deploying them in security-crucial real-world applications. Traditionally, adversarial perturbations generated for this assessment target specific VLP models, datasets, and/or downstream tasks. This practice suffers from low transferability and additional computation costs when transitioning to new scenarios. In this work, we thoroughly investigate whether VLP models are commonly sensitive to imperceptible perturbations of a specific pattern for the image modality. To this end, we propose a novel black-box method to generate Universal Adversarial Perturbations (UAPs), which is so called the Effective and T ransferable Universal Adversarial Attack (ETU), aiming to mislead a variety of existing VLP models in a range of downstream tasks. The ETU comprehensively takes into account the characteristics of UAPs and the intrinsic cross-modal interactions to generate effective UAPs. Under this regime, the ETU encourages both global and local utilities of UAPs. This benefits the overall utility while reducing interactions between UAP units, improving the transferability. To further enhance the effectiveness and transferability of UAPs, we also design a novel data augmentation method named ScMix. ScMix consists of self-mix and cross-mix data transformations, which can effectively increase the multi-modal data diversity while preserving the semantics of the original data. Through comprehensive experiments on various downstream tasks, VLP models, and datasets, we demonstrate that the proposed method is able to achieve effective and transferrable universal adversarial attacks.
Title:
How Quality Affects Deep Neural Networks in Fine-Grained Image Classification
Authors: Joseph Smith, Zheming Zuo, Jonathan Stonehouse, Boguslaw Obara
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
In this paper, we propose a No-Reference Image Quality Assessment (NRIQA) guided cut-off point selection (CPS) strategy to enhance the performance of a fine-grained classification system. Scores given by existing NRIQA methods on the same image may vary and not be as independent of natural image augmentations as expected, which weakens their connection and explainability to fine-grained image classification. Taking the three most commonly adopted image augmentation configurations -- cropping, rotating, and blurring -- as the entry point, we formulate a two-step mechanism for selecting the most discriminative subset from a given image dataset by considering both the confidence of model predictions and the density distribution of image qualities over several NRIQA methods. Concretely, the cut-off points yielded by those methods are aggregated via majority voting to inform the process of image subset selection. The efficacy and efficiency of such a mechanism have been confirmed by comparing the models being trained on high-quality images against a combination of high- and low-quality ones, with a range of 0.7% to 4.2% improvement on a commercial product dataset in terms of mean accuracy through four deep neural classifiers. The robustness of the mechanism has been proven by the observations that all the selected high-quality images can work jointly with 70% low-quality images with 1.3% of classification precision sacrificed when using ResNet34 in an ablation study.
Title:
Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition
Abstract
In text recognition, self-supervised pre-training emerges as a good solution to reduce dependence on expansive annotated real data. Previous studies primarily focus on local visual representation by leveraging mask image modeling or sequence contrastive learning. However, they omit modeling the linguistic information in text images, which is crucial for recognizing text. To simultaneously capture local character features and linguistic information in visual space, we propose Symmetric Superimposition Modeling (SSM). The objective of SSM is to reconstruct the direction-specific pixel and feature signals from the symmetrically superimposed input. Specifically, we add the original image with its inverted views to create the symmetrically superimposed inputs. At the pixel level, we reconstruct the original and inverted images to capture character shapes and texture-level linguistic context. At the feature level, we reconstruct the feature of the same original image and inverted image with different augmentations to model the semantic-level linguistic context and the local character discrimination. In our design, we disrupt the character shape and linguistic rules. Consequently, the dual-level reconstruction facilitates understanding character shapes and linguistic information from the perspective of visual texture and feature semantics. Experiments on various text recognition benchmarks demonstrate the effectiveness and generality of SSM, with 4.1% average performance gains and 86.6% new state-of-the-art average word accuracy on Union14M benchmarks.
Title:
Theoretical Guarantees of Data Augmented Last Layer Retraining Methods
Abstract
Ensuring fair predictions across many distinct subpopulations in the training data can be prohibitive for large models. Recently, simple linear last layer retraining strategies, in combination with data augmentation methods such as upweighting, downsampling and mixup, have been shown to achieve state-of-the-art performance for worst-group accuracy, which quantifies accuracy for the least prevalent subpopulation. For linear last layer retraining and the abovementioned augmentations, we present the optimal worst-group accuracy when modeling the distribution of the latent representations (input to the last layer) as Gaussian for each subpopulation. We evaluate and verify our results for both synthetic and large publicly available datasets.
Title:
Distilling Diffusion Models into Conditional GANs
Authors: Minguk Kang, Richard Zhang, Connelly Barnes, Sylvain Paris, Suha Kwak, Jaesik Park, Eli Shechtman, Jun-Yan Zhu, Taesung Park
Abstract
We propose a method to distill a complex multistep diffusion model into a single-step conditional GAN student model, dramatically accelerating inference, while preserving image quality. Our approach interprets diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs of the diffusion model's ODE trajectory. For efficient regression loss computation, we propose E-LatentLPIPS, a perceptual loss operating directly in diffusion model's latent space, utilizing an ensemble of augmentations. Furthermore, we adapt a diffusion model to construct a multi-scale discriminator with a text alignment loss to build an effective conditional GAN-based formulation. E-LatentLPIPS converges more efficiently than many existing distillation methods, even accounting for dataset construction costs. We demonstrate that our one-step generator outperforms cutting-edge one-step diffusion distillation models - DMD, SDXL-Turbo, and SDXL-Lightning - on the zero-shot COCO benchmark.
Keyword: detection
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Keyword: face recognition
Title:
Keyword: augmentation
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title: