Abstract
Streaming data analysis is increasingly required in applications, e.g., IoT, cybersecurity, robotics, mechatronics or cyber-physical systems. Despite its relevance, it is still an emerging field with open challenges. SDO is a recent anomaly detection method designed to meet requirements of speed, interpretability and intuitive parameterization. In this work, we present SDOoop, which extends the capabilities of SDO's streaming version to retain temporal information of data structures. SDOoop spots contextual anomalies undetectable by traditional algorithms, while enabling the inspection of data geometries, clusters and temporal patterns. We used SDOoop to model real network communications in critical infrastructures and extract patterns that disclose their dynamics. Moreover, we evaluated SDOoop with data from intrusion detection and natural science domains and obtained performances equivalent or superior to state-of-the-art approaches. Our results show the high potential of new model-based methods to analyze and explain streaming data. Since SDOoop operates with constant per-sample space and time complexity, it is ideal for big data, being able to instantly process large volumes of information. SDOoop conforms to next-generation machine learning, which, in addition to accuracy and speed, is expected to provide highly interpretable and informative models.
Title:
CLUE: Concept-Level Uncertainty Estimation for Large Language Models
Authors: Yu-Hsiang Wang, Andrew Bai, Che-Ping Tsai, Cho-Jui Hsieh
Subjects: Subjects:
Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
Large Language Models (LLMs) have demonstrated remarkable proficiency in various natural language generation (NLG) tasks. Previous studies suggest that LLMs' generation process involves uncertainty. However, existing approaches to uncertainty estimation mainly focus on sequence-level uncertainty, overlooking individual pieces of information within sequences. These methods fall short in separately assessing the uncertainty of each component in a sequence. In response, we propose a novel framework for Concept-Level Uncertainty Estimation (CLUE) for LLMs. We leverage LLMs to convert output sequences into concept-level representations, breaking down sequences into individual concepts and measuring the uncertainty of each concept separately. We conduct experiments to demonstrate that CLUE can provide more interpretable uncertainty estimation results compared with sentence-level uncertainty, and could be a useful tool for various tasks such as hallucination detection and story generation.
Title:
Boundless: Generating Photorealistic Synthetic Data for Object Detection in Urban Streetscapes
Authors: Mehmet Kerem Turkcan, Ian Li, Chengbo Zang, Javad Ghaderi, Gil Zussman, Zoran Kostic
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
We introduce Boundless, a photo-realistic synthetic data generation system for enabling highly accurate object detection in dense urban streetscapes. Boundless can replace massive real-world data collection and manual ground-truth object annotation (labeling) with an automated and configurable process. Boundless is based on the Unreal Engine 5 (UE5) City Sample project with improvements enabling accurate collection of 3D bounding boxes across different lighting and scene variability conditions. We evaluate the performance of object detection models trained on the dataset generated by Boundless when used for inference on a real-world dataset acquired from medium-altitude cameras. We compare the performance of the Boundless-trained model against the CARLA-trained model and observe an improvement of 7.8 mAP. The results we achieved support the premise that synthetic data generation is a credible methodology for training/fine-tuning scalable object detection models for urban scenes.
Title:
NUMOSIM: A Synthetic Mobility Dataset with Anomaly Detection Benchmarks
Abstract
Collecting real-world mobility data is challenging. It is often fraught with privacy concerns, logistical difficulties, and inherent biases. Moreover, accurately annotating anomalies in large-scale data is nearly impossible, as it demands meticulous effort to distinguish subtle and complex patterns. These challenges significantly impede progress in geospatial anomaly detection research by restricting access to reliable data and complicating the rigorous evaluation, comparison, and benchmarking of methodologies. To address these limitations, we introduce a synthetic mobility dataset, NUMOSIM, that provides a controlled, ethical, and diverse environment for benchmarking anomaly detection techniques. NUMOSIM simulates a wide array of realistic mobility scenarios, encompassing both typical and anomalous behaviours, generated through advanced deep learning models trained on real mobility data. This approach allows NUMOSIM to accurately replicate the complexities of real-world movement patterns while strategically injecting anomalies to challenge and evaluate detection algorithms based on how effectively they capture the interplay between demographic, geospatial, and temporal factors. Our goal is to advance geospatial mobility analysis by offering a realistic benchmark for improving anomaly detection and mobility modeling techniques. To support this, we provide open access to the NUMOSIM dataset, along with comprehensive documentation, evaluation metrics, and benchmark results.
Title:
Can Your Generative Model Detect Out-of-Distribution Covariate Shift?
Authors: Christiaan Viviers, Amaan Valiuddin, Francisco Caetano, Lemar Abdi, Lena Filatova, Peter de With, Fons van der Sommen
Abstract
Detecting Out-of-Distribution~(OOD) sensory data and covariate distribution shift aims to identify new test examples with different high-level image statistics to the captured, normal and In-Distribution (ID) set. Existing OOD detection literature largely focuses on semantic shift with little-to-no consensus over covariate shift. Generative models capture the ID data in an unsupervised manner, enabling them to effectively identify samples that deviate significantly from this learned distribution, irrespective of the downstream task. In this work, we elucidate the ability of generative models to detect and quantify domain-specific covariate shift through extensive analyses that involves a variety of models. To this end, we conjecture that it is sufficient to detect most occurring sensory faults (anomalies and deviations in global signals statistics) by solely modeling high-frequency signal-dependent and independent details. We propose a novel method, CovariateFlow, for OOD detection, specifically tailored to covariate heteroscedastic high-frequency image-components using conditional Normalizing Flows (cNFs). Our results on CIFAR10 vs. CIFAR10-C and ImageNet200 vs. ImageNet200-C demonstrate the effectiveness of the method by accurately detecting OOD covariate shift. This work contributes to enhancing the fidelity of imaging systems and aiding machine learning models in OOD detection in the presence of covariate shift.
Title:
Oddballness: universal anomaly detection with language models
Authors: Filip Graliński, Ryszard Staruch, Krzysztof Jurkiewicz
Subjects: Subjects:
Computation and Language (cs.CL)
Abstract
We present a new method to detect anomalies in texts (in general: in sequences of any data), using language models, in a totally unsupervised manner. The method considers probabilities (likelihoods) generated by a language model, but instead of focusing on low-likelihood tokens, it considers a new metric introduced in this paper: oddballness. Oddballness measures how ``strange'' a given token is according to the language model. We demonstrate in grammatical error detection tasks (a specific case of text anomaly detection) that oddballness is better than just considering low-likelihood events, if a totally unsupervised setup is assumed.
Title:
Better Verified Explanations with Applications to Incorrectness and Out-of-Distribution Detection
Authors: Min Wu, Xiaofu Li, Haoze Wu, Clark Barrett
Abstract
Building on VeriX (Verified eXplainability, arXiv:2212.01051), a system for producing optimal verified explanations for machine learning model outputs, we present VeriX+, which significantly improves both the size and the generation time of verified explanations. We introduce a bound propagation-based sensitivity technique to improve the size, and a binary search-based traversal with confidence ranking for improving time -- the two techniques are orthogonal and can be used independently or together. We also show how to adapt the QuickXplain (Junker 2004) algorithm to our setting to provide a trade-off between size and time. Experimental evaluations on standard benchmarks demonstrate significant improvements on both metrics, e.g., a size reduction of 38% on the GTSRB dataset and a time reduction of 90% on MNIST. We also explore applications of our verified explanations and show that explanation size is a useful proxy for both incorrectness detection and out-of-distribution detection.
Title:
A Comparative Study of Offline Models and Online LLMs in Fake News Detection
Authors: Ruoyu Xu, Gaoxiang Li
Subjects: Subjects:
Social and Information Networks (cs.SI)
Abstract
Fake news detection remains a critical challenge in today's rapidly evolving digital landscape, where misinformation can spread faster than ever before. Traditional fake news detection models often rely on static datasets and auxiliary information, such as metadata or social media interactions, which limits their adaptability to real-time scenarios. Recent advancements in Large Language Models (LLMs) have demonstrated significant potential in addressing these challenges due to their extensive pre-trained knowledge and ability to analyze textual content without relying on auxiliary data. However, many of these LLM-based approaches are still rooted in static datasets, with limited exploration into their real-time processing capabilities. This paper presents a systematic evaluation of both traditional offline models and state-of-the-art LLMs for real-time fake news detection. We demonstrate the limitations of existing offline models, including their inability to adapt to dynamic misinformation patterns. Furthermore, we show that newer LLM models with online capabilities, such as GPT-4, Claude, and Gemini, are better suited for detecting emerging fake news in real-time contexts. Our findings emphasize the importance of transitioning from offline to online LLM models for real-time fake news detection. Additionally, the public accessibility of LLMs enhances their scalability and democratizes the tools needed to combat misinformation. By leveraging real-time data, our work marks a significant step toward more adaptive, effective, and scalable fake news detection systems.
Abstract
Generative models, such as GANs and diffusion models, have been used to augment training sets and boost performances in different tasks. We focus on generative models for cell detection instead, i.e., locating and classifying cells in given pathology images. One important information that has been largely overlooked is the spatial patterns of the cells. In this paper, we propose a spatial-pattern-guided generative model for cell layout generation. Specifically, a novel diffusion model guided by spatial features and generates realistic cell layouts has been proposed. We explore different density models as spatial features for the diffusion model. In downstream tasks, we show that the generated cell layouts can be used to guide the generation of high-quality pathology images. Augmenting with these images can significantly boost the performance of SOTA cell detection methods. The code is available at this https URL.
Title:
FIDAVL: Fake Image Detection and Attribution using Vision-Language Model
Abstract
We introduce FIDAVL: Fake Image Detection and Attribution using a Vision-Language Model. FIDAVL is a novel and efficient mul-titask approach inspired by the synergies between vision and language processing. Leveraging the benefits of zero-shot learning, FIDAVL exploits the complementarity between vision and language along with soft prompt-tuning strategy to detect fake images and accurately attribute them to their originating source models. We conducted extensive experiments on a comprehensive dataset comprising synthetic images generated by various state-of-the-art models. Our results demonstrate that FIDAVL achieves an encouraging average detection accuracy of 95.42% and F1-score of 95.47% while also obtaining noteworthy performance metrics, with an average F1-score of 92.64% and ROUGE-L score of 96.50% for attributing synthetic images to their respective source generation models. The source code of this work will be publicly released at this https URL.
Title:
Towards Autonomous Cybersecurity: An Intelligent AutoML Framework for Autonomous Intrusion Detection
Authors: Li Yang, Abdallah Shami
Subjects: Subjects:
Machine Learning (cs.LG); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
Abstract
The rapid evolution of mobile networks from 5G to 6G has necessitated the development of autonomous network management systems, such as Zero-Touch Networks (ZTNs). However, the increased complexity and automation of these networks have also escalated cybersecurity risks. Existing Intrusion Detection Systems (IDSs) leveraging traditional Machine Learning (ML) techniques have shown effectiveness in mitigating these risks, but they often require extensive manual effort and expert knowledge. To address these challenges, this paper proposes an Automated Machine Learning (AutoML)-based autonomous IDS framework towards achieving autonomous cybersecurity for next-generation networks. To achieve autonomous intrusion detection, the proposed AutoML framework automates all critical procedures of the data analytics pipeline, including data pre-processing, feature engineering, model selection, hyperparameter tuning, and model ensemble. Specifically, it utilizes a Tabular Variational Auto-Encoder (TVAE) method for automated data balancing, tree-based ML models for automated feature selection and base model learning, Bayesian Optimization (BO) for hyperparameter optimization, and a novel Optimized Confidence-based Stacking Ensemble (OCSE) method for automated model ensemble. The proposed AutoML-based IDS was evaluated on two public benchmark network security datasets, CICIDS2017 and 5G-NIDD, and demonstrated improved performance compared to state-of-the-art cybersecurity methods. This research marks a significant step towards fully autonomous cybersecurity in next-generation networks, potentially revolutionizing network security applications.
Title:
Addressing the Gaps in Early Dementia Detection: A Path Towards Enhanced Diagnostic Models through Machine Learning
Abstract
The rapid global aging trend has led to an increase in dementia cases, including Alzheimer's disease, underscoring the urgent need for early and accurate diagnostic methods. Traditional diagnostic techniques, such as cognitive tests, neuroimaging, and biomarker analysis, face significant limitations in sensitivity, accessibility, and cost, particularly in the early stages. This study explores the potential of machine learning (ML) as a transformative approach to enhance early dementia detection by leveraging ML models to analyze and integrate complex multimodal datasets, including cognitive assessments, neuroimaging, and genetic information. A comprehensive review of existing literature was conducted to evaluate various ML models, including supervised learning, deep learning, and advanced techniques such as ensemble learning and transformer models, assessing their accuracy, interpretability, and potential for clinical integration. The findings indicate that while ML models show significant promise in improving diagnostic precision and enabling earlier interventions, challenges remain in their generalizability, interpretability, and ethical deployment. This research concludes by outlining future directions aimed at enhancing the clinical utility of ML models in dementia detection, emphasizing interdisciplinary collaboration and ethically sound frameworks to improve early detection and intervention strategies for Alzheimer's disease and other forms of dementia.
Abstract
Neural networks (NN) classification models for Natural Language Processing (NLP) are vulnerable to the Universal Adversarial Triggers (UAT) attack that triggers a model to produce a specific prediction for any input. DARCY borrows the "honeypot" concept to bait multiple trapdoors, effectively detecting the adversarial examples generated by UAT. Unfortunately, we find a new UAT generation method, called IndisUAT, which produces triggers (i.e., tokens) and uses them to craft adversarial examples whose feature distribution is indistinguishable from that of the benign examples in a randomly-chosen category at the detection layer of DARCY. The produced adversarial examples incur the maximal loss of predicting results in the DARCY-protected models. Meanwhile, the produced triggers are effective in black-box models for text generation, text inference, and reading comprehension. Finally, the evaluation results under NN models for NLP tasks indicate that the IndisUAT method can effectively circumvent DARCY and penetrate other defenses. For example, IndisUAT can reduce the true positive rate of DARCY's detection by at least 40.8% and 90.6%, and drop the accuracy by at least 33.3% and 51.6% in the RNN and CNN models, respectively. IndisUAT reduces the accuracy of the BERT's adversarial defense model by at least 34.0%, and makes the GPT-2 language model spew racist outputs even when conditioned on non-racial context.
Title:
Active Fake: DeepFake Camouflage
Authors: Pu Sun, Honggang Qi, Yuezun Li
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
DeepFake technology has gained significant attention due to its ability to manipulate facial attributes with high realism, raising serious societal concerns. Face-Swap DeepFake is the most harmful among these techniques, which fabricates behaviors by swapping original faces with synthesized ones. Existing forensic methods, primarily based on Deep Neural Networks (DNNs), effectively expose these manipulations and have become important authenticity indicators. However, these methods mainly concentrate on capturing the blending inconsistency in DeepFake faces, raising a new security issue, termed Active Fake, emerges when individuals intentionally create blending inconsistency in their authentic videos to evade responsibility. This tactic is called DeepFake Camouflage. To achieve this, we introduce a new framework for creating DeepFake camouflage that generates blending inconsistencies while ensuring imperceptibility, effectiveness, and transferability. This framework, optimized via an adversarial learning strategy, crafts imperceptible yet effective inconsistencies to mislead forensic detectors. Extensive experiments demonstrate the effectiveness and robustness of our method, highlighting the need for further research in active fake detection.
Title:
Bi-capacity Choquet Integral for Sensor Fusion with Label Uncertainty
Abstract
Sensor fusion combines data from multiple sensor sources to improve reliability, robustness, and accuracy of data interpretation. The Fuzzy Integral (FI), in particular, the Choquet integral (ChI), is often used as a powerful nonlinear aggregator for fusion across multiple sensors. However, existing supervised ChI learning algorithms typically require precise training labels for each input data point, which can be difficult or impossible to obtain. Additionally, prior work on ChI fusion is often based only on the normalized fuzzy measures, which bounds the fuzzy measure values between [0, 1]. This can be limiting in cases where the underlying scales of input data sources are bipolar (i.e., between [-1, 1]). To address these challenges, this paper proposes a novel Choquet integral-based fusion framework, named Bi-MIChI (pronounced "bi-mi-kee"), which uses bi-capacities to represent the interactions between pairs of subsets of the input sensor sources on a bi-polar scale. This allows for extended non-linear interactions between the sensor sources and can lead to interesting fusion results. Bi-MIChI also addresses label uncertainty through Multiple Instance Learning, where training labels are applied to "bags" (sets) of data instead of per-instance. Our proposed Bi-MIChI framework shows effective classification and detection performance on both synthetic and real-world experiments for sensor fusion with label uncertainty. We also provide detailed analyses on the behavior of the fuzzy measures to demonstrate our fusion process.
Title:
Unveiling Context-Related Anomalies: Knowledge Graph Empowered Decoupling of Scene and Action for Human-Related Video Anomaly Detection
Abstract
Detecting anomalies in human-related videos is crucial for surveillance applications. Current methods primarily include appearance-based and action-based techniques. Appearance-based methods rely on low-level visual features such as color, texture, and shape. They learn a large number of pixel patterns and features related to known scenes during training, making them effective in detecting anomalies within these familiar contexts. However, when encountering new or significantly changed scenes, i.e., unknown scenes, they often fail because existing SOTA methods do not effectively capture the relationship between actions and their surrounding scenes, resulting in low generalization. In contrast, action-based methods focus on detecting anomalies in human actions but are usually less informative because they tend to overlook the relationship between actions and their scenes, leading to incorrect detection. For instance, the normal event of running on the beach and the abnormal event of running on the street might both be considered normal due to the lack of scene information. In short, current methods struggle to integrate low-level visual and high-level action features, leading to poor anomaly detection in varied and complex scenes. To address this challenge, we propose a novel decoupling-based architecture for human-related video anomaly detection (DecoAD). DecoAD significantly improves the integration of visual and action features through the decoupling and interweaving of scenes and actions, thereby enabling a more intuitive and accurate understanding of complex behaviors and scenes. DecoAD supports fully supervised, weakly supervised, and unsupervised settings.
Title:
YOLO-PPA based Efficient Traffic Sign Detection for Cruise Control in Autonomous Driving
Abstract
It is very important to detect traffic signs efficiently and accurately in autonomous driving systems. However, the farther the distance, the smaller the traffic signs. Existing object detection algorithms can hardly detect these small scaled this http URL addition, the performance of embedded devices on vehicles limits the scale of detection this http URL address these challenges, a YOLO PPA based traffic sign detection algorithm is proposed in this paper.The experimental results on the GTSDB dataset show that compared to the original YOLO, the proposed method improves inference efficiency by 11.2%. The mAP 50 is also improved by 93.2%, which demonstrates the effectiveness of the proposed YOLO PPA.
Title:
Training-free Conversion of Pretrained ANNs to SNNs for Low-Power and High-Performance Applications
Authors: Tong Bu, Maohua Li, Zhaofei Yu
Subjects: Subjects:
Neural and Evolutionary Computing (cs.NE)
Abstract
Spiking Neural Networks (SNNs) have emerged as a promising substitute for Artificial Neural Networks (ANNs) due to their advantages of fast inference and low power consumption. However, the lack of efficient training algorithms has hindered their widespread adoption. Existing supervised learning algorithms for SNNs require significantly more memory and time than their ANN counterparts. Even commonly used ANN-SNN conversion methods necessitate re-training of ANNs to enhance conversion efficiency, incurring additional computational costs. To address these challenges, we propose a novel training-free ANN-SNN conversion pipeline. Our approach directly converts pre-trained ANN models into high-performance SNNs without additional training. The conversion pipeline includes a local-learning-based threshold balancing algorithm, which enables efficient calculation of the optimal thresholds and fine-grained adjustment of threshold value by channel-wise scaling. We demonstrate the scalability of our framework across three typical computer vision tasks: image classification, semantic segmentation, and object detection. This showcases its applicability to both classification and regression tasks. Moreover, we have evaluated the energy consumption of the converted SNNs, demonstrating their superior low-power advantage compared to conventional ANNs. Our training-free algorithm outperforms existing methods, highlighting its practical applicability and efficiency. This approach simplifies the deployment of SNNs by leveraging open-source pre-trained ANN models and neuromorphic hardware, enabling fast, low-power inference with negligible performance reduction.
Title:
Fast Payload Calibration for Sensorless Contact Estimation Using Model Pre-training
Abstract
Force and torque sensing is crucial in robotic manipulation across both collaborative and industrial settings. Traditional methods for dynamics identification enable the detection and control of external forces and torques without the need for costly sensors. However, these approaches show limitations in scenarios where robot dynamics, particularly the end-effector payload, are subject to changes. Moreover, existing calibration techniques face trade-offs between efficiency and accuracy due to concerns over joint space coverage. In this paper, we introduce a calibration scheme that leverages pre-trained Neural Network models to learn calibrated dynamics across a wide range of joint space in advance. This offline learning strategy significantly reduces the need for online data collection, whether for selection of the optimal model or identification of payload features, necessitating merely a 4-second trajectory for online calibration. This method is particularly effective in tasks that require frequent dynamics recalibration for precise contact estimation. We further demonstrate the efficacy of this approach through applications in sensorless joint and task compliance, accounting for payload variability.
Title:
Reinforcement Learning Approach to Optimizing Profilometric Sensor Trajectories for Surface Inspection
Authors: Sara Roos-Hoefgeest, Mario Roos-Hoefgeest, Ignacio Alvarez, Rafael C. González
Abstract
High-precision surface defect detection in manufacturing is essential for ensuring quality control. Laser triangulation profilometric sensors are key to this process, providing detailed and accurate surface measurements over a line. To achieve a complete and precise surface scan, accurate relative motion between the sensor and the workpiece is required. It is crucial to control the sensor pose to maintain optimal distance and relative orientation to the surface. It is also important to ensure uniform profile distribution throughout the scanning process. This paper presents a novel Reinforcement Learning (RL) based approach to optimize robot inspection trajectories for profilometric sensors. Building upon the Boustrophedon scanning method, our technique dynamically adjusts the sensor position and tilt to maintain optimal orientation and distance from the surface, while also ensuring a consistent profile distance for uniform and high-quality scanning. Utilizing a simulated environment based on the CAD model of the part, we replicate real-world scanning conditions, including sensor noise and surface irregularities. This simulation-based approach enables offline trajectory planning based on CAD models. Key contributions include the modeling of the state space, action space, and reward function, specifically designed for inspection applications using profilometric sensors. We use Proximal Policy Optimization (PPO) algorithm to efficiently train the RL agent, demonstrating its capability to optimize inspection trajectories with profilometric sensors. To validate our approach, we conducted several experiments where a model trained on a specific training piece was tested on various parts in simulation. Also, we conducted a real-world experiment by executing the optimized trajectory, generated offline from a CAD model, to inspect a part using a UR3e robotic arm model.
Title:
UV-Mamba: A DCN-Enhanced State Space Model for Urban Village Boundary Identification in High-Resolution Remote Sensing Images
Authors: Lulin Li, Ben Chen, Xuechao Zou, Junliang Xing, Pin Tao
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Owing to the diverse geographical environments, intricate landscapes, and high-density settlements, the automatic identification of urban village boundaries using remote sensing images is a highly challenging task. This paper proposes a novel and efficient neural network model called UV-Mamba for accurate boundary detection in high-resolution remote sensing images. UV-Mamba mitigates the memory loss problem in long sequence modeling, which arises in state space model (SSM) with increasing image size, by incorporating deformable convolutions (DCN). Its architecture utilizes an encoder-decoder framework, includes an encoder with four deformable state space augmentation (DSSA) blocks for efficient multi-level semantic extraction and a decoder to integrate the extracted semantic information. We conducted experiments on the Beijing and Xi'an datasets, and the results show that UV-Mamba achieves state-of-the-art performance. Specifically, our model achieves 73.3% and 78.1% IoU on the Beijing and Xi'an datasets, respectively, representing improvements of 1.2% and 3.4% IoU over the previous best model, while also being 6x faster in inference speed and 40x smaller in parameter count. Source code and pre-trained models are available in the supplementary material.
Title:
An innovation-based cycle-slip, multipath estimation, detection and mitigation method for tightly coupled GNSS/INS/Vision navigation in urban areas
Authors: Bo Xu, Shoujian Zhang, Jingrong Wang, Jiancheng Li
Abstract
Precise, consistent, and reliable positioning is crucial for a multitude of uses. In order to achieve high precision global positioning services, multi-sensor fusion techniques, such as the Global Navigation Satellite System (GNSS)/Inertial Navigation System (INS)/Vision integration system, combine the strengths of various sensors. This technique is essential for localization in complex environments and has been widely used in the mass market. However, frequent signal deterioration and blocking in urban environments exacerbates the degradation of GNSS positioning and negatively impacts the performance of the multi-sensor integration system. For GNSS pseudorange and carrier phase observation data in the urban environment, we offer an innovation-based cycle slip/multipath estimation, detection, and mitigation (I-EDM) method to reduce the influence of multipath effects and cycle slips on location induced by obstruction in urban settings. The method obtains the innovations of GNSS observations with the cluster analysis method. Then the innovations are used to detect the cycle slips and multipath. Compared with the residual-based method, the innovation-based method avoids the residual overfitting caused by the least square method, resulting in better detection of outliers within the GNSS observations. The vehicle tests carried out in urban settings verify the proposed approach. Experimental results indicate that the accuracy of 0.23m, 0.11m, and 0.31m in the east, north and up components can be achieved by the GNSS/INS/Vision tightly coupled system with the I-EDM method, which has a maximum of 21.6% improvement when compared with the residual-based EDM (R-EDM) method.
Title:
LowFormer: Hardware Efficient Design for Convolutional Transformer Backbones
Authors: Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Research in efficient vision backbones is evolving into models that are a mixture of convolutions and transformer blocks. A smart combination of both, architecture-wise and component-wise is mandatory to excel in the speedaccuracy trade-off. Most publications focus on maximizing accuracy and utilize MACs (multiply accumulate operations) as an efficiency metric. The latter however often do not measure accurately how fast a model actually is due to factors like memory access cost and degree of parallelism. We analyzed common modules and architectural design choices for backbones not in terms of MACs, but rather in actual throughput and latency, as the combination of the latter two is a better representation of the efficiency of models in real applications. We applied the conclusions taken from that analysis to create a recipe for increasing hardware-efficiency in macro design. Additionally we introduce a simple slimmed-down version of MultiHead Self-Attention, that aligns with our analysis. We combine both macro and micro design to create a new family of hardware-efficient backbone networks called LowFormer. LowFormer achieves a remarkable speedup in terms of throughput and latency, while achieving similar or better accuracy than current state-of-the-art efficient backbones. In order to prove the generalizability of our hardware-efficient design, we evaluate our method on GPU, mobile GPU and ARM CPU. We further show that the downstream tasks object detection and semantic segmentation profit from our hardware-efficient architecture. Code and models are available at this https URL altair199797/LowFormer.
Title:
Characterizing Massive Activations of Attention Mechanism in Graph Neural Networks
Authors: Lorenzo Bini, Marco Sorbi, Stephane Marchand-Maillet
Abstract
Graph Neural Networks (GNNs) have become increasingly popular for effectively modeling data with graph structures. Recently, attention mechanisms have been integrated into GNNs to improve their ability to capture complex patterns. This paper presents the first comprehensive study revealing a critical, unexplored consequence of this integration: the emergence of Massive Activations (MAs) within attention layers. We introduce a novel method for detecting and analyzing MAs, focusing on edge features in different graph transformer architectures. Our study assesses various GNN models using benchmark datasets, including ZINC, TOX21, and PROTEINS. Key contributions include (1) establishing the direct link between attention mechanisms and MAs generation in GNNs, (2) developing a robust definition and detection method for MAs based on activation ratio distributions, (3) introducing the Explicit Bias Term (EBT) as a potential countermeasure and exploring it as an adversarial framework to assess models robustness based on the presence or absence of MAs. Our findings highlight the prevalence and impact of attention-induced MAs across different architectures, such as GraphTransformer, GraphiT, and SAN. The study reveals the complex interplay between attention mechanisms, model architecture, dataset characteristics, and MAs emergence, providing crucial insights for developing more robust and reliable graph models.
Title:
Improving Uncertainty-Error Correspondence in Deep Bayesian Medical Image Segmentation
Authors: Prerak Mody, Nicolas F. Chaves-de-Plaza, Chinmay Rao, Eleftheria Astrenidou, Mischa de Ridder, Nienke Hoekstra, Klaus Hildebrandt, Marius Staring
Abstract
Increased usage of automated tools like deep learning in medical image segmentation has alleviated the bottleneck of manual contouring. This has shifted manual labour to quality assessment (QA) of automated contours which involves detecting errors and correcting them. A potential solution to semi-automated QA is to use deep Bayesian uncertainty to recommend potentially erroneous regions, thus reducing time spent on error detection. Previous work has investigated the correspondence between uncertainty and error, however, no work has been done on improving the "utility" of Bayesian uncertainty maps such that it is only present in inaccurate regions and not in the accurate ones. Our work trains the FlipOut model with the Accuracy-vs-Uncertainty (AvU) loss which promotes uncertainty to be present only in inaccurate regions. We apply this method on datasets of two radiotherapy body sites, c.f. head-and-neck CT and prostate MR scans. Uncertainty heatmaps (i.e. predictive entropy) are evaluated against voxel inaccuracies using Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves. Numerical results show that when compared to the Bayesian baseline the proposed method successfully suppresses uncertainty for accurate voxels, with similar presence of uncertainty for inaccurate voxels. Code to reproduce experiments is available at this https URL
Title:
Prediction Accuracy & Reliability: Classification and Object Localization under Distribution Shift
Abstract
Natural distribution shift causes a deterioration in the perception performance of convolutional neural networks (CNNs). This comprehensive analysis for real-world traffic data addresses: 1) investigating the effect of natural distribution shift and weather augmentations on both detection quality and confidence estimation, 2) evaluating model performance for both classification and object localization, and 3) benchmarking two common uncertainty quantification methods - Ensembles and different variants of Monte-Carlo (MC) Dropout - under natural and close-to-natural distribution shift. For this purpose, a novel dataset has been curated from publicly available autonomous driving datasets. The in-distribution (ID) data is based on cutouts of a single object, for which both class and bounding box annotations are available. The six distribution-shift datasets cover adverse weather scenarios, simulated rain and fog, corner cases, and out-of-distribution data. A granular analysis of CNNs under distribution shift allows to quantize the impact of different types of shifts on both, task performance and confidence estimation: ConvNeXt-Tiny is more robust than EfficientNet-B0; heavy rain degrades classification stronger than localization, contrary to heavy fog; integrating MC-Dropout into selected layers only has the potential to enhance task performance and confidence estimation, whereby the identification of these layers depends on the type of distribution shift and the considered task.
Title:
CTMBIDS: Convolutional Tsetlin Machine Based Intrusion Detection System for DDoS attacks in an SDN environment
Authors: Rasoul Jafari Gohari, Laya Aliahmadipour, Marjan Kuchaki Rafsanjani
Subjects: Subjects:
Cryptography and Security (cs.CR)
Abstract
Software Defined Networks (SDN) face many security challenges today. A great deal of research has been done within the field of Intrusion Detection Systems (IDS) in these networks. Yet, numerous approaches still rely on deep learning algorithms. These algorithms suffer from complexity in implementation, high processing power and high memory consumption. In addition to security issues, firstly, the number of datasets that are based on SDN protocols are very small. Secondly, the ones that are available encompass numerous attacks in the network and do not focus on a single attack. For this reason, to introduce an SDN-based IDS with a focus on Distributed Denial of Service (DDoS) attacks, it is necessary to generate a DDoS-oriented dataset whose features can train a high-quality IDS. In this work, in order to address two important challenges in SDNs, initially, we generate three DDoS attack datasets based on three common and different network topologies. In the second step, using the Convolutional Tsetlin Machine (CTM), we introduce a lightweight IDS for DDoS attack dubbed CTMBIDS. The lightweight nature of the CTMBIDS stems from its low memory consumption and also its interpretability compared to the existing complex deep learning models. The low usage of system resources for the CTMBIDS makes it an ideal choice for an optimal software that consumes the SDN controllers least amount of memory. Also, in order to ascertain the quality of the generated datasets, we compare the CTMBIDS empirical results with the DDoS attacks of the KDDCup99 benchmark dataset as well. Since the main focus of this work is on a lightweight IDS, the results show the CTMBIDS performs much more efficiently than deep learning based approaches. Furthermore, the results also show in most datasets, the proposed method has relatively equal or better accuracy and also consumes much less memory than the existing methods.
Title:
A Silicon Photonic Neural Network for Chromatic Dispersion Compensation in 20 Gbps PAM4 Signal at 125 km and Its Scalability up to 100 Gbps
Authors: Emiliano Staffoli, Gianpietro Maddinelli, Lorenzo Pavesi
Abstract
A feed-forward photonic neural network (PNN) is tested for chromatic dispersion compensation in Intensity Modulation/Direct Detection optical links. The PNN is based on a sequence of linear and nonlinear transformations. The linear stage is constituted by an 8-tap time-delayed complex perceptron implemented on a Silicon-On-insulator platform and acting as a tunable optical filter. The nonlinear stage is provided by the square modulus of the electrical field applied at the end-of-line photodetector. The training maximizes the separation between the optical levels (i.e. the eye diagram aperture), with consequent reduction of the Bit Error Rate. Effective equalization is experimentally demonstrated for 20 Gbps 4-level Pulse Amplitude Modulated signal up to 125 km. An evolutionary algorithm and a gradient-based approach are tested for the training and then compared in terms of repeatability and convergence time. The optimal weights resulting from the training are interpreted in light of the theoretical transfer function of the optical fiber. Finally, a simulative study proves the scalability of the layout to larger bandwidths, up to 100 Gbps.
Title:
Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Cord Paralysis
Authors: Yucong Zhang, Xin Zou, Jinshan Yang, Wenjun Chen, Faya Liang, Ming Li
Abstract
This paper presents the Multimodal Analyzing System for Laryngoscope (MASL), a system that combines audio and video data to automatically extract key segments and metrics from laryngeal videostroboscopic videos for clinical assessment. MASL integrates glottis detection with keyword spotting to analyze patient vocalizations and refine video highlights for better inspection of vocal cord movements. The system includes a strobing video extraction module that identifies frames by analyzing hue, saturation, and value fluctuations. MASL also provides effective metrics for vocal cord paralysis detection, employing a two-stage glottis segmentation process using U-Net followed by diffusion-based refinement to reduce false positives. Instead of glottal area waveforms, MASL estimates anterior glottic angle waveforms (AGAW) from glottis masks, evaluating both left and right vocal cords to detect unilateral vocal cord paralysis (UVFP). By comparing AGAW variances, MASL distinguishes between left and right paralysis. Ablation studies and experiments on public and real-world datasets validate MASL's segmentation module and demonstrate its ability to provide reliable metrics for UVFP diagnosis.
Title:
CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation
Authors: Bin Wang, Fan Wu, Linke Ouyang, Zhuangcheng Gu, Rui Zhang, Renqiu Xia, Bo Zhang, Conghui He
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Abstract
Formula recognition presents significant challenges due to the complicated structure and varied notation of mathematical expressions. Despite continuous advancements in formula recognition models, the evaluation metrics employed by these models, such as BLEU and Edit Distance, still exhibit notable limitations. They overlook the fact that the same formula has diverse representations and is highly sensitive to the distribution of training data, thereby causing the unfairness in formula recognition evaluation. To this end, we propose a Character Detection Matching (CDM) metric, ensuring the evaluation objectivity by designing a image-level rather than LaTex-level metric score. Specifically, CDM renders both the model-predicted LaTeX and the ground-truth LaTeX formulas into image-formatted formulas, then employs visual feature extraction and localization techniques for precise character-level matching, incorporating spatial position information. Such a spatially-aware and character-matching method offers a more accurate and equitable evaluation compared with previous BLEU and Edit Distance metrics that rely solely on text-based character matching. Experimentally, we evaluated various formula recognition models using CDM, BLEU, and ExpRate metrics. Their results demonstrate that the CDM aligns more closely with human evaluation standards and provides a fairer comparison across different models by eliminating discrepancies caused by diverse formula representations.
Title:
Unsupervised Anomaly Detection and Localization with Generative Adversarial Networks
Abstract
We propose a novel unsupervised anomaly detection approach using generative adversarial networks and SOP-derived spectrograms. Demonstrating remarkable efficacy, our method achieves over 97% accuracy on SOP datasets from both submarine and terrestrial fiber links, all achieved without the need for labelled data.
Title:
Wind turbine condition monitoring based on intra- and inter-farm federated learning
Authors: Albin Grataloup, Stefan Jonas, Angela Meyer
Subjects: Subjects:
Machine Learning (cs.LG); Systems and Control (eess.SY)
Abstract
As wind energy adoption is growing, ensuring the efficient operation and maintenance of wind turbines becomes essential for maximizing energy production and minimizing costs and downtime. Many AI applications in wind energy, such as in condition monitoring and power forecasting, may benefit from using operational data not only from individual wind turbines but from multiple turbines and multiple wind farms. Collaborative distributed AI which preserves data privacy holds a strong potential for these applications. Federated learning has emerged as a privacy-preserving distributed machine learning approach in this context. We explore federated learning in wind turbine condition monitoring, specifically for fault detection using normal behaviour models. We investigate various federated learning strategies, including collaboration across different wind farms and turbine models, as well as collaboration restricted to the same wind farm and turbine model. Our case study results indicate that federated learning across multiple wind turbines consistently outperforms models trained on a single turbine, especially when training data is scarce. Moreover, the amount of historical data necessary to train an effective model can be significantly reduced by employing a collaborative federated learning strategy. Finally, our findings show that extending the collaboration to multiple wind farms may result in inferior performance compared to restricting learning within a farm, specifically when faced with statistical heterogeneity and imbalanced datasets.
Keyword: face recognition
Title:
Vec2Face: Scaling Face Dataset Generation with Loosely Constrained Vectors
Authors: Haiyu Wu, Jaskirat Singh, Sicong Tian, Liang Zheng, Kevin W. Bowyer
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
This paper studies how to synthesize face images of non-existent persons, to create a dataset that allows effective training of face recognition (FR) models. Two important goals are (1) the ability to generate a large number of distinct identities (inter-class separation) with (2) a wide variation in appearance of each identity (intra-class variation). However, existing works 1) are typically limited in how many well-separated identities can be generated and 2) either neglect or use a separate editing model for attribute augmentation. We propose Vec2Face, a holistic model that uses only a sampled vector as input and can flexibly generate and control face images and their attributes. Composed of a feature masked autoencoder and a decoder, Vec2Face is supervised by face image reconstruction and can be conveniently used in inference. Using vectors with low similarity among themselves as inputs, Vec2Face generates well-separated identities. Randomly perturbing an input identity vector within a small range allows Vec2Face to generate faces of the same identity with robust variation in face attributes. It is also possible to generate images with designated attributes by adjusting vector values with a gradient descent method. Vec2Face has efficiently synthesized as many as 300K identities with 15 million total images, whereas 60K is the largest number of identities created in the previous works. FR models trained with the generated HSFace datasets, from 10k to 300k identities, achieve state-of-the-art accuracy, from 92% to 93.52%, on five real-world test sets. For the first time, our model created using a synthetic training set achieves higher accuracy than the model created using a same-scale training set of real face images (on the CALFW test set).
Title:
TCDiff: Triple Condition Diffusion Model with 3D Constraints for Stylizing Synthetic Faces
Authors: Bernardo Biesseck, Pedro Vidal, Luiz Coelho, Roger Granada, David Menotti|
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
A robust face recognition model must be trained using datasets that include a large number of subjects and numerous samples per subject under varying conditions (such as pose, expression, age, noise, and occlusion). Due to ethical and privacy concerns, large-scale real face datasets have been discontinued, such as MS1MV3, and synthetic face generators have been proposed, utilizing GANs and Diffusion Models, such as SYNFace, SFace, DigiFace-1M, IDiff-Face, DCFace, and GANDiffFace, aiming to supply this demand. Some of these methods can produce high-fidelity realistic faces, but with low intra-class variance, while others generate high-variance faces with low identity consistency. In this paper, we propose a Triple Condition Diffusion Model (TCDiff) to improve face style transfer from real to synthetic faces through 2D and 3D facial constraints, enhancing face identity consistency while keeping the necessary high intra-class variance. Face recognition experiments using 1k, 2k, and 5k classes of our new dataset for training outperform state-of-the-art synthetic datasets in real face benchmarks such as LFW, CFP-FP, AgeDB, and BUPT. Our source code is available at: this https URL.
Keyword: augmentation
Title:
Vec2Face: Scaling Face Dataset Generation with Loosely Constrained Vectors
Authors: Haiyu Wu, Jaskirat Singh, Sicong Tian, Liang Zheng, Kevin W. Bowyer
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
This paper studies how to synthesize face images of non-existent persons, to create a dataset that allows effective training of face recognition (FR) models. Two important goals are (1) the ability to generate a large number of distinct identities (inter-class separation) with (2) a wide variation in appearance of each identity (intra-class variation). However, existing works 1) are typically limited in how many well-separated identities can be generated and 2) either neglect or use a separate editing model for attribute augmentation. We propose Vec2Face, a holistic model that uses only a sampled vector as input and can flexibly generate and control face images and their attributes. Composed of a feature masked autoencoder and a decoder, Vec2Face is supervised by face image reconstruction and can be conveniently used in inference. Using vectors with low similarity among themselves as inputs, Vec2Face generates well-separated identities. Randomly perturbing an input identity vector within a small range allows Vec2Face to generate faces of the same identity with robust variation in face attributes. It is also possible to generate images with designated attributes by adjusting vector values with a gradient descent method. Vec2Face has efficiently synthesized as many as 300K identities with 15 million total images, whereas 60K is the largest number of identities created in the previous works. FR models trained with the generated HSFace datasets, from 10k to 300k identities, achieve state-of-the-art accuracy, from 92% to 93.52%, on five real-world test sets. For the first time, our model created using a synthetic training set achieves higher accuracy than the model created using a same-scale training set of real face images (on the CALFW test set).
Title:
Online Scheduling via Gradient Descent for Weighted Flow Time Minimization
Authors: Qingyun Chen, Sungjin Im, Aditya Petety
Subjects: Subjects:
Data Structures and Algorithms (cs.DS)
Abstract
In this paper, we explore how a natural generalization of Shortest Remaining Processing Time (SRPT) can be a powerful \emph{meta-algorithm} for online scheduling. The meta-algorithm processes jobs to maximally reduce the objective of the corresponding offline scheduling problem of the remaining jobs: minimizing the total weighted completion time of them (the residual optimum). We show that it achieves scalability for minimizing total weighted flow time when the residual optimum exhibits \emph{supermodularity}. Scalability here means it is $O(1)$-competitive with an arbitrarily small speed augmentation advantage over the adversary, representing the best possible outcome achievable for various scheduling problems. Thanks to this finding, our approach does not require the residual optimum to have a closed mathematical form. Consequently, we can obtain the schedule by solving a linear program, which makes our approach readily applicable to a rich body of applications. Furthermore, by establishing a novel connection to \emph{substitute valuations in Walrasian markets}, we show how to achieve supermodularity, thereby obtaining scalable algorithms for various scheduling problems, such as matroid scheduling, generalized network flow, and generalized arbitrary speed-up curves, etc., and this is the \emph{first} non-trivial or scalable algorithm for many of them.
Title:
PEPL: Precision-Enhanced Pseudo-Labeling for Fine-Grained Image Classification in Semi-Supervised Learning
Abstract
Fine-grained image classification has witnessed significant advancements with the advent of deep learning and computer vision technologies. However, the scarcity of detailed annotations remains a major challenge, especially in scenarios where obtaining high-quality labeled data is costly or time-consuming. To address this limitation, we introduce Precision-Enhanced Pseudo-Labeling(PEPL) approach specifically designed for fine-grained image classification within a semi-supervised learning framework. Our method leverages the abundance of unlabeled data by generating high-quality pseudo-labels that are progressively refined through two key phases: initial pseudo-label generation and semantic-mixed pseudo-label generation. These phases utilize Class Activation Maps (CAMs) to accurately estimate the semantic content and generate refined labels that capture the essential details necessary for fine-grained classification. By focusing on semantic-level information, our approach effectively addresses the limitations of standard data augmentation and image-mixing techniques in preserving critical fine-grained features. We achieve state-of-the-art performance on benchmark datasets, demonstrating significant improvements over existing semi-supervised strategies, with notable boosts in accuracy and robustness.Our code has been open sourced at this https URL.
Title:
An Effective Deployment of Diffusion LM for Data Augmentation in Low-Resource Sentiment Classification
Abstract
Sentiment classification (SC) often suffers from low-resource challenges such as domain-specific contexts, imbalanced label distributions, and few-shot scenarios. The potential of the diffusion language model (LM) for textual data augmentation (DA) remains unexplored, moreover, textual DA methods struggle to balance the diversity and consistency of new samples. Most DA methods either perform logical modifications or rephrase less important tokens in the original sequence with the language model. In the context of SC, strong emotional tokens could act critically on the sentiment of the whole sequence. Therefore, contrary to rephrasing less important context, we propose DiffusionCLS to leverage a diffusion LM to capture in-domain knowledge and generate pseudo samples by reconstructing strong label-related tokens. This approach ensures a balance between consistency and diversity, avoiding the introduction of noise and augmenting crucial features of datasets. DiffusionCLS also comprises a Noise-Resistant Training objective to help the model generalize. Experiments demonstrate the effectiveness of our method in various low-resource scenarios including domain-specific and domain-general problems. Ablation studies confirm the effectiveness of our framework's modules, and visualization studies highlight optimal deployment conditions, reinforcing our conclusions.
Title:
Labeled-to-Unlabeled Distribution Alignment for Partially-Supervised Multi-Organ Medical Image Segmentation
Abstract
Partially-supervised multi-organ medical image segmentation aims to develop a unified semantic segmentation model by utilizing multiple partially-labeled datasets, with each dataset providing labels for a single class of organs. However, the limited availability of labeled foreground organs and the absence of supervision to distinguish unlabeled foreground organs from the background pose a significant challenge, which leads to a distribution mismatch between labeled and unlabeled pixels. Although existing pseudo-labeling methods can be employed to learn from both labeled and unlabeled pixels, they are prone to performance degradation in this task, as they rely on the assumption that labeled and unlabeled pixels have the same distribution. In this paper, to address the problem of distribution mismatch, we propose a labeled-to-unlabeled distribution alignment (LTUDA) framework that aligns feature distributions and enhances discriminative capability. Specifically, we introduce a cross-set data augmentation strategy, which performs region-level mixing between labeled and unlabeled organs to reduce distribution discrepancy and enrich the training set. Besides, we propose a prototype-based distribution alignment method that implicitly reduces intra-class variation and increases the separation between the unlabeled foreground and background. This can be achieved by encouraging consistency between the outputs of two prototype classifiers and a linear classifier. Extensive experimental results on the AbdomenCT-1K dataset and a union of four benchmark datasets (including LiTS, MSD-Spleen, KiTS, and NIH82) demonstrate that our method outperforms the state-of-the-art partially-supervised methods by a considerable margin, and even surpasses the fully-supervised methods. The source code is publicly available at this https URL.
Title:
Few-Shot Continual Learning for Activity Recognition in Classroom Surveillance Images
Abstract
The application of activity recognition in the "AI + Education" field is gaining increasing attention. However, current work mainly focuses on the recognition of activities in manually captured videos and a limited number of activity types, with little attention given to recognizing activities in surveillance images from real classrooms. In real classroom settings, normal teaching activities such as reading, account for a large proportion of samples, while rare non-teaching activities such as eating, continue to appear. This requires a model that can learn non-teaching activities from few samples without forgetting the normal teaching activities, which necessitates fewshot continual learning (FSCL) capability. To address this gap, we constructed a continual learning dataset focused on classroom surveillance image activity recognition called ARIC (Activity Recognition in Classroom). The dataset has advantages such as multiple perspectives, a wide variety of activities, and real-world scenarios, but it also presents challenges like similar activities and imbalanced sample distribution. To overcome these challenges, we designed a few-shot continual learning method that combines supervised contrastive learning (SCL) and an adaptive covariance classifier (ACC). During the base phase, we proposed a SCL approach based on feature augmentation to enhance the model's generalization ability. In the incremental phase, we employed an ACC to more accurately describe the distribution of new classes. Experimental results demonstrate that our method outperforms other existing methods on the ARIC dataset.
Title:
UV-Mamba: A DCN-Enhanced State Space Model for Urban Village Boundary Identification in High-Resolution Remote Sensing Images
Authors: Lulin Li, Ben Chen, Xuechao Zou, Junliang Xing, Pin Tao
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Owing to the diverse geographical environments, intricate landscapes, and high-density settlements, the automatic identification of urban village boundaries using remote sensing images is a highly challenging task. This paper proposes a novel and efficient neural network model called UV-Mamba for accurate boundary detection in high-resolution remote sensing images. UV-Mamba mitigates the memory loss problem in long sequence modeling, which arises in state space model (SSM) with increasing image size, by incorporating deformable convolutions (DCN). Its architecture utilizes an encoder-decoder framework, includes an encoder with four deformable state space augmentation (DSSA) blocks for efficient multi-level semantic extraction and a decoder to integrate the extracted semantic information. We conducted experiments on the Beijing and Xi'an datasets, and the results show that UV-Mamba achieves state-of-the-art performance. Specifically, our model achieves 73.3% and 78.1% IoU on the Beijing and Xi'an datasets, respectively, representing improvements of 1.2% and 3.4% IoU over the previous best model, while also being 6x faster in inference speed and 40x smaller in parameter count. Source code and pre-trained models are available in the supplementary material.
Title:
Towards Data-Centric Face Anti-Spoofing: Improving Cross-domain Generalization via Physics-based Data Synthesis
Authors: Rizhao Cai, Cecelia Soh, Zitong Yu, Haoliang Li, Wenhan Yang, Alex Kot
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Face Anti-Spoofing (FAS) research is challenged by the cross-domain problem, where there is a domain gap between the training and testing data. While recent FAS works are mainly model-centric, focusing on developing domain generalization algorithms for improving cross-domain performance, data-centric research for face anti-spoofing, improving generalization from data quality and quantity, is largely ignored. Therefore, our work starts with data-centric FAS by conducting a comprehensive investigation from the data perspective for improving cross-domain generalization of FAS models. More specifically, at first, based on physical procedures of capturing and recapturing, we propose task-specific FAS data augmentation (FAS-Aug), which increases data diversity by synthesizing data of artifacts, such as printing noise, color distortion, moiré pattern, \textit{etc}. Our experiments show that using our FAS augmentation can surpass traditional image augmentation in training FAS models to achieve better cross-domain performance. Nevertheless, we observe that models may rely on the augmented artifacts, which are not environment-invariant, and using FAS-Aug may have a negative effect. As such, we propose Spoofing Attack Risk Equalization (SARE) to prevent models from relying on certain types of artifacts and improve the generalization performance. Last but not least, our proposed FAS-Aug and SARE with recent Vision Transformer backbones can achieve state-of-the-art performance on the FAS cross-domain generalization protocols. The implementation is available at this https URL.
Title:
Prediction Accuracy & Reliability: Classification and Object Localization under Distribution Shift
Abstract
Natural distribution shift causes a deterioration in the perception performance of convolutional neural networks (CNNs). This comprehensive analysis for real-world traffic data addresses: 1) investigating the effect of natural distribution shift and weather augmentations on both detection quality and confidence estimation, 2) evaluating model performance for both classification and object localization, and 3) benchmarking two common uncertainty quantification methods - Ensembles and different variants of Monte-Carlo (MC) Dropout - under natural and close-to-natural distribution shift. For this purpose, a novel dataset has been curated from publicly available autonomous driving datasets. The in-distribution (ID) data is based on cutouts of a single object, for which both class and bounding box annotations are available. The six distribution-shift datasets cover adverse weather scenarios, simulated rain and fog, corner cases, and out-of-distribution data. A granular analysis of CNNs under distribution shift allows to quantize the impact of different types of shifts on both, task performance and confidence estimation: ConvNeXt-Tiny is more robust than EfficientNet-B0; heavy rain degrades classification stronger than localization, contrary to heavy fog; integrating MC-Dropout into selected layers only has the potential to enhance task performance and confidence estimation, whereby the identification of these layers depends on the type of distribution shift and the considered task.
Title:
View-Invariant Policy Learning via Zero-Shot Novel View Synthesis
Abstract
Large-scale visuomotor policy learning is a promising approach toward developing generalizable manipulation systems. Yet, policies that can be deployed on diverse embodiments, environments, and observational modalities remain elusive. In this work, we investigate how knowledge from large-scale visual data of the world may be used to address one axis of variation for generalizable manipulation: observational viewpoint. Specifically, we study single-image novel view synthesis models, which learn 3D-aware scene-level priors by rendering images of the same scene from alternate camera viewpoints given a single input image. For practical application to diverse robotic data, these models must operate zero-shot, performing view synthesis on unseen tasks and environments. We empirically analyze view synthesis models within a simple data-augmentation scheme that we call View Synthesis Augmentation (VISTA) to understand their capabilities for learning viewpoint-invariant policies from single-viewpoint demonstration data. Upon evaluating the robustness of policies trained with our method to out-of-distribution camera viewpoints, we find that they outperform baselines in both simulated and real-world manipulation tasks. Videos and additional visualizations are available at this https URL.
Keyword: detection
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Keyword: face recognition
Title:
Title:
Keyword: augmentation
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title: