Abstract
In developing medical interventions using untethered milli- and microrobots, ensuring safety and effectiveness relies on robust methods for detection, real-time tracking, and precise localization within the body. However, the inherent non-transparency of the human body poses a significant obstacle, limiting robot detection primarily to specialized imaging systems such as X-ray fluoroscopy, which often lack crucial anatomical details. Consequently, the robot operator (human or machine) would encounter severe challenges in accurately determining the location of the robot and steering its motion. This study explores the feasibility of circumventing this challenge by creating a simulation environment that contains the precise digital replica (virtual twin) of a model microrobot operational workspace. Synchronizing coordinate systems between the virtual and real worlds and continuously integrating microrobot position data from the image stream into the virtual twin allows the microrobot operator to control navigation in the virtual world. We validate this concept by demonstrating the tracking and steering of a mobile magnetic robot in confined phantoms with high temporal resolution (< 100 ms, with an average of ~20 ms) visual feedback. Additionally, our object detection-based localization approach offers the potential to reduce overall patient exposure to X-ray doses during continuous microrobot tracking without compromising tracking accuracy. Ultimately, we address a critical gap in developing image-guided remote interventions with untethered medical microrobots, particularly for near-future applications in animal models and human patients.
Title:
A Surveillance Game between a Differential Drive Robot and an Omnidirectional Agent: The Case of a Faster Evader
Abstract
A fundamental task in mobile robotics is to keep an agent under surveillance using an autonomous robotic platform equipped with a sensing device. Using differential game theory, we study a particular setup of this problem. A Differential Drive Robot (DDR) equipped with a bounded-range sensor wants to keep an Omnidirectional Agent (OA) under surveillance. The goal of the DDR is to maintain the OA inside its detection region for as much time as possible, while the OA, having the opposite goal, wants to leave the region as soon as possible. We formulate the problem as a zero-sum differential game, and we compute the time-optimal motion strategies of the players to achieve their goals. We focus on the case where the OA is faster than the DDR. Given the OA's speed advantage, a winning strategy for the OA is always to move radially outward from the DDR's position. However, this work shows that even though this strategy could be optimal in some cases, more complex motion strategies emerge based on the players' speed ratio. In particular, we show that four classes of singular surfaces may appear in this game: Dispersal, Transition, Universal, and Focal surfaces. Each of those surfaces implies a particular motion strategy for the players.
Title:
An Intent Modeling and Inference Framework for Autonomous and Remotely Piloted Aerial Systems
Abstract
An intent modelling and inference framework is presented to assist the defense planning for protecting a geo-fence against unauthorized flights. First, a novel mathematical definition for the intent of an uncrewed aircraft system (UAS) is presented. The concepts of critical waypoints and critical waypoint patterns are introduced and associated with a motion process to fully characterize an intent. This modelling framework consists of representations of a UAS mission planner, used to plan the aircraft's motion sequence, as well as a defense planner, defined to protect the geo-fence. It is applicable to autonomous, semi-autonomous, and piloted systems in 2D and 3D environments with obstacles. The framework is illustrated by defining a library of intents for a security application. Detection and tracking of the target are presumed for formulating the intent inference problem. Multiple formulations of the decision maker's objective are discussed as part of a deep-learning-based methodology. Further, a multi-modal dynamic model for characterizing the UAS flight is discussed. This is later utilized to extract features using the interacting multiple model (IMM) filter for training the intent classifier. Finally, as part of the simulation study, an attention-based bi-directional long short-term memory (Bi-LSTM) network for intent inference is presented. The simulation experiments illustrate various aspects of the framework, including trajectory generation, radar measurement simulation, etc., in 2D and 3D environments.
Title:
A BERT-Based Summarization approach for depression detection
Abstract
Depression is a globally prevalent mental disorder with potentially severe repercussions if not addressed, especially in individuals with recurrent episodes. Prior research has shown that early intervention has the potential to mitigate or alleviate symptoms of depression. However, implementing such interventions in a real-world setting may pose considerable challenges. A promising strategy involves leveraging machine learning and artificial intelligence to autonomously detect depression indicators from diverse data sources. One of the most widely available and informative data sources is text, which can reveal a person's mood, thoughts, and feelings. In this context, virtual agents programmed to conduct interviews using clinically validated questionnaires, such as those found in the DAIC-WOZ dataset, offer a robust means for depression detection through linguistic analysis. Utilizing BERT-based models, which are powerful and versatile yet use fewer resources than contemporary large language models, to convert text into numerical representations significantly enhances the precision of depression diagnosis. These models adeptly capture complex semantic and syntactic nuances, improving the detection accuracy of depressive symptoms. Given the inherent limitations of these models concerning text length, our study proposes text summarization as a preprocessing technique to reduce the length and intricacy of input texts. Implementing this method within our uniquely developed framework for feature extraction and classification yielded an F1-score of 0.67 on the test set, surpassing all prior benchmarks, and 0.81 on the validation set, exceeding most previous results on the DAIC-WOZ dataset. Furthermore, we have devised a depression lexicon to assess summary quality and relevance. This lexicon constitutes a valuable asset for ongoing research in depression detection.
Title:
Identifying Human Indoor Daily Life Behavior employing Thermal Sensor Arrays (TSAs)
Authors: Dina E. Abdelaleem, Hassan M. Ahmed, M. Sami Soliman, Tarek M. Said
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Medical Physics (physics.med-ph)
Abstract
Daily activity monitoring systems used in households provide vital information on health status, particularly for aging residents. Multiple approaches have been introduced to achieve this goal, typically classed as obtrusive or non-obtrusive. Obtrusive approaches include wearable devices, while non-obtrusive approaches include movement detection systems such as motion sensors and thermal sensor arrays (TSAs). TSA systems are advantageous because they preserve a person's privacy while capturing their precise spatial location. In this study, human daily living activities were monitored day and night using a TSA system, and the corresponding activity time series and spatial probability distribution were constructed. The monitored activities are classified into two categories: sleeping and daily activity. Results showed that the two classes can be distinguished regardless of day or night. The obtained sleep activity duration was compared with previous research using the same raw data. Results showed that, on average, sleep activity lasted 9 hours/day and daily life activity lasted 7 hours/day. The person's spatial probability distribution was determined using the bivariate distribution for the monitored location. In conclusion, the results showed that sleeping activity was dominant. Our study showed that TSAs were the optimum choice when monitoring human activity. Our proposed approach tackled limitations encountered by previous human activity monitoring systems, such as preserving a person's privacy while still determining their precise spatial location.
Title:
Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection
Abstract
Open-vocabulary detection (OVD) aims to detect objects beyond a predefined set of categories. As a pioneering model incorporating the YOLO series into OVD, YOLO-World is well-suited for scenarios prioritizing speed and efficiency. However, its performance is hindered by its neck feature fusion mechanism, which causes quadratic complexity and limited guided receptive fields. To address these limitations, we present Mamba-YOLO-World, a novel YOLO-based OVD model employing the proposed MambaFusion Path Aggregation Network (MambaFusion-PAN) as its neck architecture. Specifically, we introduce an innovative State Space Model-based feature fusion mechanism consisting of a Parallel-Guided Selective Scan algorithm and a Serial-Guided Selective Scan algorithm with linear complexity and globally guided receptive fields. It leverages multi-modal input sequences and mamba hidden states to guide the selective scanning process. Experiments demonstrate that our model outperforms the original YOLO-World on the COCO and LVIS benchmarks in both zero-shot and fine-tuning settings while maintaining comparable parameters and FLOPs. Additionally, it surpasses existing state-of-the-art OVD methods with fewer parameters and FLOPs.
Title:
Fast Comparative Analysis of Merge Trees Using Locality Sensitive Hashing
Authors: Weiran Lyu, Raghavendra Sridharamurthy, Jeff M. Phillips, Bei Wang
Abstract
Scalar field comparison is a fundamental task in scientific visualization. In topological data analysis, we compare topological descriptors of scalar fields -- such as persistence diagrams and merge trees -- because they provide succinct and robust abstract representations. Several similarity measures for topological descriptors seem to be both asymptotically and practically efficient with polynomial time algorithms, but they do not scale well when handling large-scale, time-varying scientific data and ensembles. In this paper, we propose a new framework to facilitate the comparative analysis of merge trees, inspired by tools from locality sensitive hashing (LSH). LSH hashes similar objects into the same hash buckets with high probability. We propose two new similarity measures for merge trees that can be computed via LSH, using new extensions to Recursive MinHash and subpath signature, respectively. Our similarity measures are extremely efficient to compute and closely resemble the results of existing measures such as merge tree edit distance or geometric interleaving distance. Our experiments demonstrate the utility of our LSH framework in applications such as shape matching, clustering, key event detection, and ensemble summarization.
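To make the LSH idea in this abstract concrete, the following is a minimal sketch of plain MinHash on sets, which estimates Jaccard similarity from short signatures. It is not the Recursive MinHash or subpath signature extension the authors propose; the label sets `tree_a` and `tree_b` and the function names are hypothetical stand-ins for features extracted from two merge trees.

```python
import hashlib


def minhash_signature(items, num_hashes=64):
    """Return a MinHash signature: one minimum hash value per seeded hash function.

    `items` must be a non-empty collection of hashable, printable elements.
    """
    signature = []
    for seed in range(num_hashes):
        # Each seed acts as a different hash function over the same items.
        min_val = min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(),
                "big",
            )
            for item in items
        )
        signature.append(min_val)
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical label sets (assumption: not taken from the paper).
tree_a = {"n1", "n2", "n3", "n4", "n5", "n6"}
tree_b = {"n1", "n2", "n3", "n4", "n7", "n8"}

sig_a = minhash_signature(tree_a)
sig_b = minhash_signature(tree_b)
similarity = estimated_jaccard(sig_a, sig_b)
print(f"estimated Jaccard similarity: {similarity:.2f}")
```

The exact Jaccard similarity of the two example sets is 4/8 = 0.5; the estimate approaches that value as `num_hashes` grows.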
Title:
MAPX: An explainable model-agnostic framework for the detection of false information on social media networks
Authors: Sarah Condran, Michael Bewong, Selasi Kwashie, Md Zahidul Islam, Irfan Altas, Joshua Condran
Subjects:
Social and Information Networks (cs.SI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
The automated detection of false information has become a fundamental task in combating the spread of "fake news" on online social media networks (OSMN) as it reduces the need for manual discernment by individuals. In the literature, leveraging various content or context features of OSMN documents has been found useful. However, most existing detection models utilise these features in isolation, without regard to the temporal and dynamic changes often seen in reality, thus limiting the robustness of the models. Furthermore, there has been little to no consideration of the impact of the quality of documents' features on the trustworthiness of the final prediction. In this paper, we introduce a novel model-agnostic framework, called MAPX, which allows evidence-based aggregation of predictions from existing models in an explainable manner. Indeed, the developed aggregation method is adaptive, dynamic and considers the quality of OSMN document features. Further, we perform extensive experiments on benchmarked fake news datasets to demonstrate the effectiveness of MAPX using various real-world data quality scenarios. Our empirical results show that the proposed framework consistently outperforms all state-of-the-art models evaluated. For reproducibility, a demo of MAPX is available at this https URL.
Title:
1D-CNN-IDS: 1D CNN-based Intrusion Detection System for IIoT
Authors: Muhammad Arslan, Muhammad Mubeen, Muhammad Bilal, Saadullah Farooq Abbasi
Subjects:
Cryptography and Security (cs.CR)
Abstract
Demand for the Internet of Things (IoT) has grown exponentially. This progress has been made possible by technological advancements in artificial intelligence, cloud computing, and edge computing. However, these advancements bring multiple challenges, including cyber threats, security and privacy concerns, and the risk of potential financial losses. For this reason, this study developed a computationally inexpensive one-dimensional convolutional neural network (1DCNN) algorithm for cyber-attack classification. The proposed approach achieved an accuracy of 99.90% in classifying nine cyber-attacks. Multiple other performance metrics were evaluated to validate the efficacy of the proposed scheme. In addition, a comparison was made with existing state-of-the-art schemes. The findings of this study can significantly contribute to the development of secure intrusion detection for IIoT systems.
Title:
Hybrid-TTA: Continual Test-time Adaptation via Dynamic Domain Shift Detection
Authors: Hyewon Park, Hyejin Park, Jueun Ko, Dongbo Min
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Continual Test Time Adaptation (CTTA) has emerged as a critical approach for bridging the domain gap between controlled training environments and real-world scenarios, enhancing model adaptability and robustness. Existing CTTA methods, typically categorized into Full-Tuning (FT) and Efficient-Tuning (ET), struggle to address domain shifts effectively. To overcome these challenges, we propose Hybrid-TTA, a holistic approach that dynamically selects the tuning method on an instance-wise basis for optimal adaptation. Our approach introduces the Dynamic Domain Shift Detection (DDSD) strategy, which identifies domain shifts by leveraging temporal correlations in input sequences and dynamically switches between FT and ET to adapt to varying domain shifts effectively. Additionally, the Masked Image Modeling based Adaptation (MIMA) framework is integrated to ensure domain-agnostic robustness with minimal computational overhead. Our Hybrid-TTA achieves a notable 1.6%p improvement in mIoU on the Cityscapes-to-ACDC benchmark dataset, surpassing previous state-of-the-art methods and offering a robust solution for real-world continual adaptation challenges.
Title:
Learning Short Codes for Fading Channels with No or Receiver-Only Channel State Information
Authors: Rishabh Sharad Pomaje, Rajshekhar V Bhat
Subjects:
Information Theory (cs.IT); Machine Learning (cs.LG)
Abstract
In next-generation wireless networks, low latency often necessitates short-length codewords that either do not use channel state information (CSI) or rely solely on CSI at the receiver (CSIR). Gaussian codes that achieve capacity for AWGN channels may be unsuitable for these no-CSI and CSIR-only cases. In this work, we design short-length codewords for these cases using an autoencoder architecture. From the designed codes, we observe the following: In the no-CSI case, the learned codes are mutually orthogonal when the distribution of the real and imaginary parts of the fading random variable has support over the entire real line. However, when the support is limited to the non-negative real line, the codes are not mutually orthogonal. For the CSIR-only case, deep learning-based codes designed for AWGN channels perform worse in fading channels with optimal coherent detection compared to codes specifically designed for fading channels with CSIR, where the autoencoder jointly learns encoding, coherent combining, and decoding. In both no-CSI and CSIR-only cases, the codes perform at least as well as or better than classical codes of the same block length.
Title:
ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning
Authors: Pei Deng, Wenqian Zhou, Hanlin Wu
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Remote sensing (RS) change analysis is vital for monitoring Earth's dynamic processes by detecting alterations in images over time. Traditional change detection excels at identifying pixel-level changes but lacks the ability to contextualize these alterations. While recent advancements in change captioning offer natural language descriptions of changes, they do not support interactive, user-specific queries. To address these limitations, we introduce ChangeChat, the first bitemporal vision-language model (VLM) designed specifically for RS change analysis. ChangeChat utilizes multimodal instruction tuning, allowing it to handle complex queries such as change captioning, category-specific quantification, and change localization. To enhance the model's performance, we developed the ChangeChat-87k dataset, which was generated using a combination of rule-based methods and GPT-assisted techniques. Experiments show that ChangeChat offers a comprehensive, interactive solution for RS change analysis, achieving performance comparable to or even better than state-of-the-art (SOTA) methods on specific tasks, and significantly surpassing the latest general-domain model, GPT-4. Code and pre-trained weights are available at this https URL.
Title:
TapToTab: Video-Based Guitar Tabs Generation using AI and Audio Analysis
Abstract
The automation of guitar tablature generation from video inputs holds significant promise for enhancing music education, transcription accuracy, and performance analysis. Existing methods face challenges with consistency and completeness, particularly in detecting fretboards and accurately identifying notes. To address these issues, this paper introduces an advanced approach leveraging deep learning, specifically YOLO models for real-time fretboard detection, and Fourier Transform-based audio analysis for precise note identification. Experimental results demonstrate substantial improvements in detection accuracy and robustness compared to traditional techniques. This paper outlines the development, implementation, and evaluation of these methodologies, aiming to revolutionize guitar instruction by automating the creation of guitar tabs from video recordings.
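As an illustration of Fourier Transform-based note identification, the sketch below finds the strongest frequency bin of a short signal with a naive discrete Fourier transform and maps it to the nearest equal-tempered note name. It is a simplified stand-in, not the paper's pipeline: the synthetic 110 Hz sine (the pitch of a guitar's open A string), the sample rate, and the function names are all assumptions made for the example.

```python
import cmath
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest DFT bin, ignoring the DC term."""
    n = len(samples)
    best_bin, best_mag = 1, 0.0
    for k in range(1, n // 2):
        # Naive O(n^2) DFT coefficient for bin k; fine for short clips.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)
        )
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate / n


def frequency_to_note(freq_hz):
    """Map a frequency to the nearest note name, using A4 = 440 Hz (MIDI 69)."""
    midi = round(69 + 12 * math.log2(freq_hz / 440.0))
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"


# Synthetic test tone (assumption): 0.1 s of a pure 110 Hz sine at 4 kHz.
sample_rate = 4000
samples = [math.sin(2 * math.pi * 110 * t / sample_rate) for t in range(400)]

detected_hz = dominant_frequency(samples, sample_rate)
note = frequency_to_note(detected_hz)
print(detected_hz, note)
```

With 400 samples at 4 kHz the bin spacing is 10 Hz, so real recordings would need longer windows or peak interpolation to separate adjacent low notes.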
Title:
Sybil Detection using Graph Neural Networks
Authors: Stuart Heeb, Andreas Plesner, Roger Wattenhofer
Subjects:
Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Abstract
This paper presents SYBILGAT, a novel approach to Sybil detection in social networks using Graph Attention Networks (GATs). Traditional methods for Sybil detection primarily leverage structural properties of networks; however, they tend to struggle with a large number of attack edges and are often unable to simultaneously utilize both known Sybil and honest nodes. Our proposed method addresses these limitations by dynamically assigning attention weights to different nodes during aggregations, enhancing detection performance. We conducted extensive experiments in various scenarios, including pretraining in sampled subgraphs, synthetic networks, and networks under targeted attacks. The results show that SYBILGAT significantly outperforms the state-of-the-art algorithms, particularly in scenarios with high attack complexity and when the number of attack edges increases. Our approach shows robust performance across different network models and sizes, even as the detection task becomes more challenging. We successfully applied the model to a real-world Twitter graph with more than 269k nodes and 6.8M edges. The flexibility and generalizability of SYBILGAT make it a promising tool to defend against Sybil attacks in online social networks with only structural information.
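The attention-weighted aggregation mentioned in this abstract can be sketched with a single graph-attention head in the style of the standard GAT formulation: a score per neighbor, a LeakyReLU, and a softmax over the neighborhood. This is a generic illustration rather than the SYBILGAT architecture; the feature vectors, the attention vector, and the node names are hypothetical, and the learned linear projection that normally precedes scoring is omitted for brevity.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_weights(node, neighbors, features, attn_vector):
    """Softmax-normalized attention weights of `node` over its neighbors."""
    scores = []
    for nb in neighbors:
        # Score the concatenation [h_node || h_neighbor] against the attention vector.
        concat = features[node] + features[nb]
        raw = sum(a * v for a, v in zip(attn_vector, concat))
        scores.append(leaky_relu(raw))
    max_s = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_s) for s in scores]
    total = sum(exps)
    return {nb: e / total for nb, e in zip(neighbors, exps)}


def aggregate(node, neighbors, features, attn_vector):
    """Attention-weighted sum of neighbor features."""
    weights = attention_weights(node, neighbors, features, attn_vector)
    dim = len(features[node])
    return [sum(weights[nb] * features[nb][d] for nb in neighbors) for d in range(dim)]


# Hypothetical 2-dimensional node features and attention vector (assumptions).
features = {"u": [1.0, 0.0], "v1": [0.9, 0.1], "v2": [0.0, 1.0]}
attn_vector = [0.0, 0.0, 1.0, -1.0]

weights = attention_weights("u", ["v1", "v2"], features, attn_vector)
new_u = aggregate("u", ["v1", "v2"], features, attn_vector)
print(weights, new_u)
```

In this toy setting the neighbor `v1` receives the larger weight, so the aggregated representation of `u` is dominated by `v1`'s features.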
Title:
Training Gradient Boosted Decision Trees on Tabular Data Containing Label Noise for Classification Tasks
Authors: Anita Eisenbürger, Daniel Otten, Anselm Hudde, Frank Hopfgartner
Abstract
Label noise refers to the phenomenon where instances in a data set are assigned the wrong label. Label noise is harmful to classifier performance, increases model complexity, and impairs feature selection. Addressing label noise is crucial, yet current research primarily focuses on image and text data using deep neural networks. This leaves a gap in the study of tabular data and gradient-boosted decision trees (GBDTs), the leading algorithm for tabular data. Different methods have already been developed that either try to filter label noise, model label noise while simultaneously training a classifier, or use learning algorithms that remain effective even if label noise is present. This study aims to further investigate the effects of label noise on gradient-boosted decision trees and methods to mitigate those effects. Through comprehensive experiments and analysis, the implemented methods demonstrate state-of-the-art noise detection performance on the Adult dataset and achieve the highest classification precision and recall on the Adult and Breast Cancer datasets, respectively. In summary, this paper enhances the understanding of the impact of label noise on GBDTs and lays the groundwork for future research in noise detection and correction methods.
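To make the notion of label noise concrete, the sketch below injects symmetric (uniform) label noise into a clean label list by reassigning a fixed fraction of instances to a different class. The abstract does not state which noise model the study uses, so the uniform flipping scheme, the function name, and the toy labels are assumptions for illustration only.

```python
import random


def inject_label_noise(labels, noise_rate, classes, seed=0):
    """Flip a `noise_rate` fraction of labels to a different, randomly chosen class.

    Returns the noisy labels and the set of indices that were flipped.
    """
    rng = random.Random(seed)
    n_flip = round(noise_rate * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    noisy = []
    for i, y in enumerate(labels):
        if i in flip_idx:
            # Always pick a class other than the true one, so every flip is a real error.
            noisy.append(rng.choice([c for c in classes if c != y]))
        else:
            noisy.append(y)
    return noisy, flip_idx


# Toy binary labels (assumption): 100 instances, 20% noise.
clean_labels = [0, 1] * 50
noisy_labels, flipped = inject_label_noise(clean_labels, noise_rate=0.2, classes=[0, 1])
num_wrong = sum(a != b for a, b in zip(clean_labels, noisy_labels))
print(num_wrong, "of", len(clean_labels), "labels are now wrong")
```

Keeping the flipped indices makes it possible to score a noise detection method against the known ground truth.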
Title:
Precision Aquaculture: An Integrated Computer Vision and IoT Approach for Optimized Tilapia Feeding
Authors: Rania Hossam, Ahmed Heakl, Walid Gomaa
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Abstract
Traditional fish farming practices often lead to inefficient feeding, resulting in environmental issues and reduced productivity. We developed an innovative system combining computer vision and IoT technologies for precise Tilapia feeding. Our solution uses real-time IoT sensors to monitor water quality parameters and computer vision algorithms to analyze fish size and count, determining optimal feed amounts. A mobile app enables remote monitoring and control. We utilized YOLOv8 for keypoint detection to measure Tilapia weight from length, achieving 94% precision on 3,500 annotated images. Pixel-based measurements were converted to centimeters using depth estimation for accurate feeding calculations. Our method, with data collection mirroring inference conditions, significantly improved results. Preliminary estimates suggest this approach could increase production up to 58 times compared to traditional farms. Our models, code, and dataset are open-source (the code, dataset, and models are available upon reasonable request).
Title:
Personalized Weight Loss Management through Wearable Devices and Artificial Intelligence
Authors: Sergio Romero-Tapiador, Ruben Tolosana, Aythami Morales, Blanca Lacruz-Pleguezuelos, Sofia Bosch Pastor, Laura Judith Marcos-Zambrano, Guadalupe X. Bazán, Gala Freixer, Ruben Vera-Rodriguez, Julian Fierrez, Javier Ortega-Garcia, Isabel Espinosa-Salinas, Enrique Carrillo de Santa Pau
Abstract
Early detection of chronic and Non-Communicable Diseases (NCDs) is crucial for effective treatment during the initial stages. This study explores the application of wearable devices and Artificial Intelligence (AI) to predict weight loss changes in overweight and obese individuals. Using wearable data from a 1-month trial involving around 100 subjects from the AI4FoodDB database, including biomarkers, vital signs, and behavioral data, we identify key differences between those achieving weight loss (>= 2% of their initial weight) and those who do not. Feature selection techniques and classification algorithms reveal promising results, with the Gradient Boosting classifier achieving 84.44% Area Under the Curve (AUC). The integration of multiple data sources (e.g., vital signs, physical activity, and sleep activity) enhances performance, suggesting the potential of wearable devices and AI in personalized healthcare.
Title:
Uncertainty Estimation by Density Aware Evidential Deep Learning
Abstract
Evidential deep learning (EDL) has shown remarkable success in uncertainty estimation. However, there is still room for improvement, particularly in out-of-distribution (OOD) detection and classification tasks. The limited OOD detection performance of EDL arises from its inability to reflect the distance between the testing example and training data when quantifying uncertainty, while its limited classification performance stems from its parameterization of the concentration parameters. To address these limitations, we propose a novel method called Density Aware Evidential Deep Learning (DAEDL). DAEDL integrates the feature space density of the testing example with the output of EDL during the prediction stage, while using a novel parameterization that resolves the issues in the conventional parameterization. We prove that DAEDL enjoys a number of favorable theoretical properties. DAEDL demonstrates state-of-the-art performance across diverse downstream tasks related to uncertainty estimation and classification.
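For context on what conventional EDL computes, the sketch below turns per-class evidence into Dirichlet concentration parameters, expected class probabilities, and a scalar uncertainty, following the commonly used parameterization alpha_k = e_k + 1 with uncertainty K / sum(alpha). It shows the baseline only, not DAEDL's density-aware prediction or its new parameterization; the evidence values and function name are hypothetical.

```python
def edl_outputs(evidence):
    """Convert non-negative per-class evidence into probabilities and uncertainty.

    Uses the conventional EDL parameterization: alpha_k = evidence_k + 1.
    """
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)  # Dirichlet strength S
    probs = [a / strength for a in alpha]  # expected class probabilities
    uncertainty = num_classes / strength  # equals 1.0 when there is no evidence
    return probs, uncertainty


# Hypothetical evidence vectors (assumptions, not from the paper).
confident_probs, confident_u = edl_outputs([18.0, 1.0, 1.0])
vacuous_probs, vacuous_u = edl_outputs([0.0, 0.0, 0.0])

print(confident_probs, confident_u)
print(vacuous_probs, vacuous_u)
```

Note that the uncertainty here depends only on the total evidence, which is the property the abstract identifies as limiting: nothing in this computation reflects how far a test example lies from the training data.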
Title:
Energy Consumption Trends in Sound Event Detection Systems
Abstract
Deep learning systems have become increasingly energy- and computation-intensive, raising concerns about their environmental impact. As organizers of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, we recognize the importance of addressing this issue. For the past three years, we have integrated energy consumption metrics into the evaluation of sound event detection (SED) systems. In this paper, we analyze the impact of this energy criterion on the challenge results and explore the evolution of system complexity and energy consumption over the years. We highlight a shift towards more energy-efficient approaches during training without compromising performance, while the number of operations and system complexity continue to grow. Through this analysis, we hope to promote more environmentally friendly practices within the SED community.
Title:
Reading ability detection using eye-tracking data with LSTM-based few-shot learning
Abstract
Reading ability detection is important in the modern educational field. In this paper, a method for predicting reading ability scores is proposed, using the eye-tracking data of a few subjects (e.g., 68 subjects). The proposed method builds a regression model for score prediction by combining Long Short-Term Memory (LSTM) and lightweight neural networks. Experiments show that, with a few-shot learning strategy, the proposed method achieves higher accuracy than previous methods of score prediction in reading ability detection. The code can later be downloaded at this https URL.
Title:
Direct-CP: Directed Collaborative Perception for Connected and Autonomous Vehicles via Proactive Attention
Abstract
Collaborative perception (CP) leverages visual data from connected and autonomous vehicles (CAV) to enhance an ego vehicle's field of view (FoV). Despite recent progress, current CP methods expand the ego vehicle's 360-degree perceptual range almost equally, which faces two key challenges. Firstly, in areas with uneven traffic distribution, focusing on directions with little traffic offers limited benefits. Secondly, under limited communication budgets, allocating excessive bandwidth to less critical directions lowers the perception accuracy in more vital areas. To address these issues, we propose Direct-CP, a proactive and direction-aware CP system that aims to improve CP in specific directions. Our key idea is to enable an ego vehicle to proactively signal its directions of interest and readjust its attention to enhance local directional CP performance. To achieve this, we first propose an RSU-aided direction masking mechanism that assists an ego vehicle in identifying vital directions. Additionally, we design a direction-aware selective attention module to wisely aggregate pertinent features based on the ego vehicle's directional priorities, communication budget, and the positional data of CAVs. Moreover, we introduce a direction-weighted detection loss (DWLoss) to capture the divergence between directional CP outcomes and the ground truth, facilitating effective model training. Extensive experiments on the V2X-Sim 2.0 dataset demonstrate that our approach achieves 19.8% higher local perception accuracy in directions of interest and 2.5% higher overall perception accuracy than the state-of-the-art methods in collaborative 3D object detection tasks.
Title:
DeCLIP: Decoding CLIP representations for deepfake localization
Authors: Stefan Smeu, Elisabeta Oneata, Dan Oneata
Abstract
Generative models can create entirely new images, but they can also partially modify real images in ways that are undetectable to the human eye. In this paper, we address the challenge of automatically detecting such local manipulations. One of the most pressing problems in deepfake detection remains the ability of models to generalize to different classes of generators. In the case of fully manipulated images, representations extracted from large self-supervised models (such as CLIP) provide a promising direction towards more robust detectors. Here, we introduce DeCLIP, a first attempt to leverage such large pretrained features for detecting local manipulations. We show that, when combined with a reasonably large convolutional decoder, pretrained self-supervised representations are able to perform localization and improve generalization capabilities over existing methods. Unlike previous work, our approach is able to perform localization on the challenging case of latent diffusion models, where the entire image is affected by the fingerprint of the generator. Moreover, we observe that this type of data, which combines local semantic information with a global fingerprint, provides more stable generalization than other categories of generative methods.
Title:
Detect Fake with Fake: Leveraging Synthetic Data-driven Representation for Synthetic Image Detection
Authors: Hina Otake, Yoshihiro Fukuhara, Yoshiki Kubotani, Shigeo Morishima
Abstract
Are general-purpose visual representations acquired solely from synthetic data useful for detecting fake images? In this work, we show the effectiveness of synthetic data-driven representations for synthetic image detection. Upon analysis, we find that vision transformers trained by the latest visual representation learners with synthetic data can effectively distinguish fake from real images without seeing any real images during pre-training. Notably, using SynCLR as the backbone in a state-of-the-art detection method demonstrates a performance improvement of +10.32 mAP and +4.73% accuracy over the widely used CLIP, when tested on previously unseen GAN models. Code is available at this https URL.
Title:
Interactive Masked Image Modeling for Multimodal Object Detection in Remote Sensing
Abstract
Object detection in remote sensing imagery plays a vital role in various Earth observation applications. However, unlike object detection in natural scene images, this task is particularly challenging due to the abundance of small, often barely visible objects across diverse terrains. To address these challenges, multimodal learning can be used to integrate features from different data modalities, thereby improving detection accuracy. Nonetheless, the performance of multimodal learning is often constrained by the limited size of labeled datasets. In this paper, we propose to use Masked Image Modeling (MIM) as a pre-training technique, leveraging self-supervised learning on unlabeled data to enhance detection performance. However, conventional MIM such as MAE, which uses masked tokens without any contextual information, struggles to capture fine-grained details due to a lack of interactions with other parts of the image. To address this, we propose a new interactive MIM method that can establish interactions between different tokens, which is particularly beneficial for object detection in remote sensing. Extensive ablation studies and evaluation demonstrate the effectiveness of our approach.
Title:
E2MoCase: A Dataset for Emotional, Event and Moral Observations in News Articles on High-impact Legal Cases
Authors: Candida M. Greco, Lorenzo Zangari, Davide Picca, Andrea Tagarelli
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Physics and Society (physics.soc-ph)
Abstract
The way media reports on legal cases can significantly shape public opinion, often embedding subtle biases that influence societal views on justice and morality. Analyzing these biases requires a holistic approach that captures the emotional tone, moral framing, and specific events within the narratives. In this work we introduce E2MoCase, a novel dataset designed to facilitate the integrated analysis of emotions, moral values, and events within legal narratives and media coverage. By leveraging advanced models for emotion detection, moral value identification, and event extraction, E2MoCase offers a multi-dimensional perspective on how legal cases are portrayed in news articles.
Title:
An Efficient and Streaming Audio Visual Active Speaker Detection System
Authors: Arnav Kundu, Yanzi Jin, Mohammad Sekhavat, Max Horton, Danny Tormoen, Devang Naik
Abstract
This paper delves into the challenging task of Active Speaker Detection (ASD), where the system needs to determine in real-time whether a person is speaking or not in a series of video frames. While previous works have made significant strides in improving network architectures and learning effective representations for ASD, a critical gap exists in the exploration of real-time system deployment. Existing models often suffer from high latency and memory usage, rendering them impractical for immediate applications. To bridge this gap, we present two scenarios that address the key challenges posed by real-time constraints. First, we introduce a method to limit the number of future context frames utilized by the ASD model. By doing so, we alleviate the need for processing the entire sequence of future frames before a decision is made, significantly reducing latency. Second, we propose a more stringent constraint that limits the total number of past frames the model can access during inference. This tackles the persistent memory issues associated with running streaming ASD systems. Beyond these theoretical frameworks, we conduct extensive experiments to validate our approach. Our results demonstrate that constrained transformer models can achieve performance comparable to or even better than state-of-the-art recurrent models, such as uni-directional GRUs, with a significantly reduced number of context frames. Moreover, we shed light on the temporal memory requirements of ASD systems, revealing that larger past context has a more profound impact on accuracy than future context. When profiling on a CPU we find that our efficient architecture is memory bound by the amount of past context it can use and that the compute cost is negligible as compared to the memory cost.
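The two constraints described above (a cap on future context frames and a cap on past frames) can be pictured as a band-shaped attention mask. The following is a minimal sketch under that assumption, not the authors' implementation; the function name `context_mask` and the parameters `past` and `future` are hypothetical.

```python
def context_mask(num_frames, past, future):
    """Return mask[t][s], True when frame t may attend to frame s.

    Assumption for illustration: frame t sees at most `past` earlier
    frames and at most `future` later frames, plus itself.
    """
    return [
        [t - past <= s <= t + future for s in range(num_frames)]
        for t in range(num_frames)
    ]


# Hypothetical setting: 6 frames, 2 frames of past context, 1 of future.
mask = context_mask(num_frames=6, past=2, future=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In this picture, a smaller `future` lowers latency because fewer frames must arrive before a decision, while a smaller `past` bounds how many frames have to be kept in memory while streaming.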
Keyword: face recognition
Title:
SIG: A Synthetic Identity Generation Pipeline for Generating Evaluation Datasets for Face Recognition
Authors: Kassi Nzalasse, Rishav Raj, Eli Laird, Corey Clark
Abstract
As Artificial Intelligence applications expand, the evaluation of models faces heightened scrutiny. Ensuring public readiness requires evaluation datasets, which differ from training data by being disjoint and ethically sourced in compliance with privacy regulations. The performance and fairness of face recognition systems depend significantly on the quality and representativeness of these evaluation datasets. This data is sometimes scraped from the internet without users' consent, causing ethical concerns that can prohibit its use without proper releases. In rare cases, data is collected in a controlled environment with consent; however, this process is time-consuming, expensive, and logistically difficult to execute. This creates a barrier for those unable to muster the immense resources required to gather ethically sourced evaluation datasets. To address these challenges, we introduce the Synthetic Identity Generation pipeline, or SIG, that allows for the targeted creation of ethical, balanced datasets for face recognition evaluation. Our proposed and demonstrated pipeline generates high-quality images of synthetic identities with controllable pose, facial features, and demographic attributes, such as race, gender, and age. We also release an open-source evaluation dataset named ControlFace10k, consisting of 10,008 face images of 3,336 unique synthetic identities balanced across race, gender, and age, generated using the proposed SIG pipeline. We analyze ControlFace10k along with a non-synthetic BUPT dataset using state-of-the-art face recognition algorithms to demonstrate its effectiveness as an evaluation tool. This analysis highlights the dataset's characteristics and its utility in assessing algorithmic bias across different demographic groups.
Title:
DiffFAS: Face Anti-Spoofing via Generative Diffusion Models
Authors: Xinxu Ge, Xin Liu, Zitong Yu, Jingang Shi, Chun Qi, Jie Li, Heikki Kälviäinen
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Face anti-spoofing (FAS) plays a vital role in protecting face recognition (FR) systems from presentation attacks. Today, FAS systems face the challenge of domain shift, which degrades the generalization performance of existing FAS methods. In this paper, we rethink the nature of domain shift and deconstruct it into two factors: image style and image quality. Quality influences the purity of the presentation of spoof information, while style affects the manner in which spoof information is presented. Based on our analysis, we propose the DiffFAS framework, which quantifies quality as prior information input into the network to counter image quality shift, and performs diffusion-based high-fidelity cross-domain and cross-attack-type generation to counter image style shift. DiffFAS transforms easily collectible live faces into high-fidelity attack faces with precise labels while maintaining consistency between live and spoof face identities, which can also alleviate the scarcity of labeled data for the novel attack types faced by today's FAS systems. We demonstrate the effectiveness of our framework on challenging cross-domain and cross-attack FAS datasets, achieving state-of-the-art performance. Available at this https URL.
Abstract
Meta-learning has emerged as a powerful approach for leveraging knowledge from previous tasks to solve new tasks. The mainstream methods focus on training a well-generalized model initialization, which is then adapted to different tasks with limited data and updates. However, this pushes the model to overfit on the training tasks. Previous methods mainly attributed this to the lack of data and used augmentations to address the issue, but they were limited by the need for sufficient training and effective augmentation strategies. In this work, we focus on the more fundamental "learning to learn" strategy of meta-learning to explore what causes errors and how to eliminate these errors without changing the environment. Specifically, we first rethink the algorithmic procedure of meta-learning from a "learning" lens. Through theoretical and empirical analyses, we find that (i) this paradigm faces the risk of both overfitting and underfitting and (ii) the models adapted to different tasks promote each other, with a stronger effect when the tasks are more similar. Based on this insight, we propose using task relations to calibrate the optimization process of meta-learning and propose a plug-and-play method called Task Relation Learner (TRLearner) to achieve this goal. Specifically, it first obtains task relation matrices from the extracted task-specific meta-data. Then, it uses the obtained matrices with relation-aware consistency regularization to guide optimization. Extensive theoretical and empirical analyses demonstrate the effectiveness of TRLearner.
Title:
Exploiting Supervised Poison Vulnerability to Strengthen Self-Supervised Defense
Authors: Jeremy Styborski, Mingzhi Lyu, Yi Huang, Adams Kong
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Availability poisons exploit supervised learning (SL) algorithms by introducing class-related shortcut features in images such that models trained on poisoned data are useless for real-world datasets. Self-supervised learning (SSL), which utilizes augmentations to learn instance discrimination, is regarded as a strong defense against poisoned data. However, by extending the study of SSL across multiple poisons on the CIFAR-10 and ImageNet-100 datasets, we demonstrate that it often performs poorly, far below that of training on clean data. Leveraging the vulnerability of SL to poison attacks, we introduce adversarial training (AT) on SL to obfuscate poison features and guide robust feature learning for SSL. Our proposed defense, designated VESPR (Vulnerability Exploitation of Supervised Poisoning for Robust SSL), surpasses the performance of six previous defenses across seven popular availability poisons. VESPR displays superior performance over all previous defenses, boosting the minimum and average ImageNet-100 test accuracies of poisoned models by 16% and 9%, respectively. Through analysis and ablation studies, we elucidate the mechanisms by which VESPR learns robust class features.
Title:
Test-time Training for Hyperspectral Image Super-resolution
Authors: Ke Li, Luc Van Gool, Dengxin Dai
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
The progress on Hyperspectral image (HSI) super-resolution (SR) is still lagging behind the research of RGB image SR. HSIs usually have a high number of spectral bands, so accurately modeling spectral band interaction for HSI SR is hard. Also, training data for HSI SR is hard to obtain so the dataset is usually rather small. In this work, we propose a new test-time training method to tackle this problem. Specifically, a novel self-training framework is developed, where more accurate pseudo-labels and more accurate LR-HR relationships are generated so that the model can be further trained with them to improve performance. In order to better support our test-time training method, we also propose a new network architecture to learn HSI SR without modeling spectral band interaction and propose a new data augmentation method Spectral Mixup to increase the diversity of the training data at test time. We also collect a new HSI dataset with a diverse set of images of interesting objects ranging from food to vegetation, to materials, and to general scenes. Extensive experiments on multiple datasets show that our method can improve the performance of pre-trained models significantly after test-time training and outperform competing methods significantly for HSI SR.
Title:
GenMapping: Unleashing the Potential of Inverse Perspective Mapping for Robust Online HD Map Construction
Authors: Siyu Li, Kailun Yang, Hao Shi, Song Wang, You Yao, Zhiyong Li
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Abstract
Online High-Definition (HD) maps have emerged as the preferred option for autonomous driving, overshadowing the counterpart offline HD maps due to flexible update capability and lower maintenance costs. However, contemporary online HD map models embed parameters of visual sensors into training, resulting in a significant decrease in generalization performance when applied to visual sensors with different parameters. Inspired by the inherent potential of Inverse Perspective Mapping (IPM), where camera parameters are decoupled from the training process, we have designed a universal map generation framework, GenMapping. The framework is established with a triadic synergy architecture, including principal and dual auxiliary branches. When faced with a coarse road image with local distortion translated via IPM, the principal branch learns robust global features under the state space models. The two auxiliary branches are a dense perspective branch and a sparse prior branch. The former exploits the correlation information between static and moving objects, whereas the latter introduces the prior knowledge of OpenStreetMap (OSM). The triple-enhanced merging module is crafted to synergistically integrate the unique spatial features from all three branches. To further improve generalization capabilities, a Cross-View Map Learning (CVML) scheme is leveraged to realize joint learning within the common space. Additionally, a Bidirectional Data Augmentation (BiDA) module is introduced to mitigate reliance on datasets concurrently. A thorough array of experimental results shows that the proposed model surpasses current state-of-the-art methods in both semantic mapping and vectorized mapping, while also maintaining a rapid inference speed. The source code will be publicly available at this https URL.
Title:
Exploring the Impact of Data Quantity on ASR in Extremely Low-resource Languages
Authors: Yao-Fei Cheng, Li-Wei Chen, Hung-Shin Lee, Hsin-Min Wang
Subjects:
Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract
This study investigates the efficacy of data augmentation techniques for low-resource automatic speech recognition (ASR), focusing on two endangered Austronesian languages, Amis and Seediq. Recognizing the potential of self-supervised learning (SSL) in low-resource settings, we explore the impact of data volume on the continued pre-training of SSL models. We propose a novel data-selection scheme leveraging a multilingual corpus to augment the limited target language data. This scheme utilizes a language classifier to extract utterance embeddings and employs one-class classifiers to identify utterances phonetically and phonologically proximate to the target languages. Utterances are ranked and selected based on their decision scores, ensuring the inclusion of highly relevant data in the SSL-ASR pipeline. Our experimental results demonstrate the effectiveness of this approach, yielding substantial improvements in ASR performance for both Amis and Seediq. These findings underscore the feasibility and promise of data augmentation through cross-lingual transfer learning for low-resource language ASR.
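The ranking-and-selection step described above can be illustrated with a small sketch. The abstract specifies embeddings from a language classifier and one-class classifiers; the code below is an assumption-laden stand-in that scores candidates by negative distance to the centroid of the target-language embeddings. All names (`centroid`, `decision_score`, `select_utterances`) and the toy vectors are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def decision_score(embedding, center):
    """Stand-in one-class score: closer to the target centroid is higher."""
    return -math.dist(embedding, center)


def select_utterances(target_embeddings, candidates, top_k):
    """Rank candidate utterances by decision score and keep the top_k ids."""
    center = centroid(target_embeddings)
    ranked = sorted(
        candidates.items(),
        key=lambda item: decision_score(item[1], center),
        reverse=True,
    )
    return [utt_id for utt_id, _ in ranked[:top_k]]


# Toy 2-D embeddings; real utterance embeddings would be much larger.
target = [[1.0, 0.0], [0.8, 0.2]]
pool = {"utt_a": [0.9, 0.1], "utt_b": [-1.0, 0.5], "utt_c": [0.7, 0.4]}
selected = select_utterances(target, pool, top_k=2)
print(selected)
```

Swapping the centroid score for a trained one-class classifier would change only `decision_score`; the rank-then-truncate logic stays the same.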
Keyword: detection
Keyword: face recognition
Keyword: augmentation