Abstract
The pattern of state changes in a biomedical time series can be related to health or disease. This work presents a principled approach for selecting a changepoint detection algorithm for a specific task, such as disease classification. Eight key algorithms were compared, and the performance of each algorithm was evaluated as a function of temporal tolerance, noise, and abnormal conduction (ectopy) on realistic artificial cardiovascular time series data. All algorithms were applied to real data (cardiac time series of 22 patients with REM-behavior disorder (RBD) and 15 healthy controls) using the parameters selected on artificial data. Finally, features were derived from the detected changepoints to distinguish RBD patients from healthy controls using a K-Nearest Neighbors approach. On artificial data, the Modified Bayesian Changepoint Detection algorithm provided superior positive predictive value for state change identification, while Recursive Mean Difference Maximization (RMDM) achieved the highest true positive rate. For the classification task, features derived from the RMDM algorithm provided the highest leave-one-out cross-validated accuracy of 0.89 and true positive rate of 0.87. Automatically detected changepoints provide useful information about a subject's physiological state, which cannot be directly observed. However, the choice of changepoint detection algorithm depends on the nature of the underlying data and the downstream application, such as a classification task. This work represents the first time changepoint detection algorithms have been compared in a meaningful way and utilized in a classification task, which demonstrates the effect of changepoint algorithm choice on application performance.
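To make the idea of splitting a series at mean shifts and scoring detections within a temporal tolerance concrete, here is a minimal sketch. It is not the RMDM algorithm evaluated in the paper, whose details the abstract does not give; it is a generic recursive mean-difference split, and the function names, `threshold`, and `min_size` are assumptions chosen for illustration.

```python
from statistics import mean

def best_split(x, min_size):
    """Return (index, gap) of the split maximizing |mean(left) - mean(right)|."""
    best_idx, best_gap = None, 0.0
    for i in range(min_size, len(x) - min_size + 1):
        gap = abs(mean(x[:i]) - mean(x[i:]))
        if gap > best_gap:
            best_idx, best_gap = i, gap
    return best_idx, best_gap

def detect_changepoints(x, threshold, min_size=3, offset=0):
    """Recursively split x wherever the best mean difference reaches threshold.

    Hypothetical illustration only: threshold and min_size are assumed
    parameters, not values or rules taken from the paper.
    """
    idx, gap = best_split(x, min_size)
    if idx is None or gap < threshold:
        return []
    left = detect_changepoints(x[:idx], threshold, min_size, offset)
    right = detect_changepoints(x[idx:], threshold, min_size, offset + idx)
    return left + [offset + idx] + right

def matched(detected, truth, tolerance):
    """Count true changepoints with a detection within +/- tolerance samples."""
    return sum(any(abs(d - t) <= tolerance for d in detected) for t in truth)

# Noise-free toy series with level shifts at samples 10 and 20.
series = [60] * 10 + [80] * 10 + [65] * 10
print(detect_changepoints(series, threshold=5))  # [10, 20]
```

Dividing `matched(...)` by the number of true changepoints gives a true positive rate at a given tolerance; widening the tolerance can only keep or raise that count.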
Spot-Compose: A Framework for Open-Vocabulary Object Retrieval and Drawer Manipulation in Point Clouds
Authors: Oliver Lemke, Zuria Bauer, René Zurbrügg, Marc Pollefeys, Francis Engelmann, Hermann Blum
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
In recent years, modern techniques in deep learning and large-scale datasets have led to impressive progress in 3D instance segmentation, grasp pose estimation, and robotics. This allows for accurate detection directly in 3D scenes, object- and environment-aware grasp prediction, as well as robust and repeatable robotic manipulation. This work aims to integrate these recent methods into a comprehensive framework for robotic interaction and manipulation in human-centric environments. Specifically, we leverage 3D reconstructions from a commodity 3D scanner for open-vocabulary instance segmentation, alongside grasp pose estimation, to demonstrate dynamic picking of objects, and opening of drawers. We show the performance and robustness of our model in two sets of real-world experiments including dynamic object retrieval and drawer opening, reporting a 51% and 82% success rate respectively. Code of our framework as well as videos are available on: https://spot-compose.github.io/.
A visualization method for data domain changes in CNN networks and the optimization method for selecting thresholds in classification tasks
Authors: Minzhe Huang, Changwei Nie, Weihong Zhong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
In recent years, Face Anti-Spoofing (FAS) has played a crucial role in preserving the security of face recognition technology. With the rise of counterfeit face generation techniques, the challenge posed by digitally edited faces to face anti-spoofing is escalating. Existing FAS technologies primarily focus on intercepting physically forged faces and lack a robust solution for cross-domain FAS challenges. Moreover, determining an appropriate threshold to achieve optimal deployment results remains an issue for intra-domain FAS. To address these issues, we propose a visualization method that intuitively reflects the training outcomes of models by visualizing the prediction results on datasets. Additionally, we demonstrate that employing data augmentation techniques, such as downsampling and Gaussian blur, can effectively enhance performance on cross-domain tasks. Building upon our data visualization approach, we also introduce a methodology for setting threshold values based on the distribution of the training dataset. Ultimately, our methods secured us second place in both the Unified Physical-Digital Face Attack Detection competition and the Snapshot Spectral Imaging Face Anti-spoofing contest. The training code is available at https://github.com/SeaRecluse/CVPRW2024.
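As an illustration of deriving a decision threshold from a score distribution, the sketch below picks the lowest threshold that keeps the false acceptance rate on labelled scores at or below a target. This is a generic approach under stated assumptions, not the method proposed in the paper: the score convention (higher means more likely live), the function names, and the `target_far` parameter are all hypothetical.

```python
import math

def error_rates(live_scores, spoof_scores, threshold):
    """Return (FAR, FRR) when samples with score >= threshold are accepted as live."""
    far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
    frr = sum(s < threshold for s in live_scores) / len(live_scores)
    return far, frr

def pick_threshold(live_scores, spoof_scores, target_far):
    """Lowest observed score whose FAR does not exceed target_far.

    Assumed convention: higher score = more likely a live face.
    Returns math.inf (reject everything) if no observed score qualifies.
    """
    for t in sorted(set(live_scores) | set(spoof_scores)):
        far, _ = error_rates(live_scores, spoof_scores, t)
        if far <= target_far:
            return t
    return math.inf

# Toy scores, invented for the example.
live = [0.9, 0.8, 0.7, 0.6]
spoof = [0.1, 0.2, 0.3, 0.65]
t = pick_threshold(live, spoof, target_far=0.0)
print(t, error_rates(live, spoof, t))  # 0.7 (0.0, 0.25)
```

Loosening `target_far` lowers the threshold and trades false acceptances for fewer false rejections.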
ELEV-VISION-SAM: Integrated Vision Language and Foundation Model for Automated Estimation of Building Lowest Floor Elevation
Authors: Yu-Hsuan Ho, Longxiang Li, Ali Mostafavi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Street view imagery, aided by advancements in image quality and accessibility, has emerged as a valuable resource for urban analytics research. Recent studies have explored its potential for estimating lowest floor elevation (LFE), offering a scalable alternative to traditional on-site measurements, crucial for assessing properties' flood risk and damage extent. While existing methods rely on object detection, the introduction of image segmentation has broadened street view images' utility for LFE estimation, although challenges remain in segmentation quality and the capability to distinguish front doors from other doors. To address these challenges in LFE estimation, this study integrates the Segment Anything model, a segmentation foundation model, with vision language models to conduct text-prompt image segmentation on street view images for LFE estimation. By evaluating various vision language models, integration methods, and text prompts, we identify the most suitable model for street view image analytics and LFE estimation tasks, thereby improving the availability of the current LFE estimation model based on image segmentation from 33% to 56% of properties. Remarkably, our proposed method significantly enhances the availability of LFE estimation to almost all properties in which the front door is visible in the street view image. The findings also present the first baseline and comparison of various vision models for street view image-based LFE estimation. The model and findings not only contribute to advancing street view image segmentation for urban analytics but also provide a novel approach to image segmentation for other civil engineering and infrastructure analytics tasks.
Greedy Detection and Exclusion of Multiple Faults using Euclidean Distance Matrices
Abstract
Numerous methods have been proposed for global navigation satellite system (GNSS) receivers to detect faulty GNSS signals. One such fault detection and exclusion (FDE) method is based on the mathematical concept of Euclidean distance matrices (EDMs). This paper outlines a greedy approach that uses an improved Euclidean distance matrix-based fault detection and exclusion algorithm. The novel greedy EDM FDE method implements a new fault detection test statistic and fault exclusion strategy that drastically reduce the complexity of the algorithm compared with previous work. To validate the novel greedy EDM FDE algorithm, we created a simulated dataset using receiver locations from around the globe. The simulated dataset allows us to verify our results on 2,601 different satellite geometries. Additionally, we tested the greedy EDM FDE algorithm using a real-world dataset from seven different Android phones. Across both the simulated and real-world datasets, the Python implementation of the greedy EDM FDE algorithm is shown to run an order of magnitude faster than a comparable greedy residual FDE method while obtaining similar fault exclusion accuracy. We provide discussion on the comparative time complexities of greedy EDM FDE, greedy residual FDE, and solution separation. We also explain potential modifications to greedy residual FDE that can be added to alter performance characteristics.
SkelFormer: Markerless 3D Pose and Shape Estimation using Skeletal Transformers
Authors: Vandad Davoodnia, Saeed Ghorbani, Alexandre Messier, Ali Etemad
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
We introduce SkelFormer, a novel markerless motion capture pipeline for multi-view human pose and shape estimation. Our method first uses off-the-shelf 2D keypoint estimators, pre-trained on large-scale in-the-wild data, to obtain 3D joint positions. Next, we design a regression-based inverse-kinematic skeletal transformer that maps the joint positions to pose and shape representations from heavily noisy observations. This module integrates prior knowledge about pose space and infers the full pose state at runtime. Separating the 3D keypoint detection and inverse-kinematic problems, along with the expressive representations learned by our skeletal transformer, enhances the generalization of our method to unseen noisy data. We evaluate our method on three public datasets in both in-distribution and out-of-distribution settings, and observe strong performance with respect to prior works. Moreover, ablation experiments demonstrate the impact of each of the modules of our architecture. Finally, we study the performance of our method in dealing with noise and heavy occlusions and find considerable robustness with respect to other solutions.
AED-PADA: Improving Generalizability of Adversarial Example Detection via Principal Adversarial Domain Adaptation
Abstract
Adversarial example detection, which can be conveniently applied in many scenarios, is important in the area of adversarial defense. Unfortunately, existing detection methods suffer from poor generalization performance, because their training process usually relies on examples generated from a single known adversarial attack and there is a large discrepancy between the training examples and unseen testing adversarial examples. To address this issue, we propose a novel method, named Adversarial Example Detection via Principal Adversarial Domain Adaptation (AED-PADA). Specifically, our approach identifies the Principal Adversarial Domains (PADs), i.e., a combination of features of the adversarial examples from different attacks, which possesses large coverage of the entire adversarial feature space. Then, we are the first to exploit multi-source domain adaptation in adversarial example detection, with PADs as source domains. Experiments demonstrate the superior generalization ability of our proposed AED-PADA. Note that this superiority is particularly pronounced in challenging scenarios that employ the minimal magnitude constraint for the perturbations.
Emerging NGSO Constellations: Spectral Coexistence with GSO Satellite Communication Systems
Authors: Flor Ortiz, Eva Lagunas, Almoatssimbillah Saifaldawla, Mahdis Jalali, Luis Emiliani, Symeon Chatzinotas
Abstract
Global communications have undergone a paradigm shift with the rapid expansion of low-earth orbit (LEO) satellite constellations, offering a new space era of reduced latency and ubiquitous, high-speed broadband internet access. However, the fast developments in LEO orbits pose significant challenges, particularly the coexistence with geostationary earth orbit (GEO) satellite systems. This article presents an overview of the regulatory aspects that cover the spectrum sharing in the bands allocated to the Fixed Satellite Service between geostationary networks (GSO) and non-geostationary systems (NGSO), as well as the main interference mitigation techniques for their coexistence. Our work highlights the increased potential for inter-system interference. It explores the regulatory landscape following the World Radio Conference (WRC-23). We discuss the different interference management strategies proposed for the GSO-NGSO spectral coexistence, including on-board and ground-based approaches and more advanced mitigation techniques based on beamforming. Moving onto operational aspects related to the sharing of spectrum, we introduce recent work on interference detection, identification, and mitigation and provide our vision of the emerging role of artificial intelligence (AI) in the aforementioned tasks.
SOS-1K: A Fine-grained Suicide Risk Classification Dataset for Chinese Social Media Analysis
Abstract
On social media, users frequently express personal emotions, a subset of which may indicate potential suicidal tendencies. The implicit and varied forms of expression in internet language complicate accurate and rapid identification of suicidal intent on social media, thus creating challenges for timely intervention efforts. The development of deep learning models for suicide risk detection is a promising solution, but there is a notable lack of relevant datasets, especially in the Chinese context. To address this gap, this study presents a Chinese social media dataset designed for fine-grained suicide risk classification, focusing on indicators such as expressions of suicide intent, methods of suicide, and urgency of timing. Seven pre-trained models were evaluated in two tasks: high versus low suicide risk classification, and fine-grained suicide risk classification on a scale of 0 to 10. In our experiments, deep learning models show good performance in distinguishing between high and low suicide risk, with the best model achieving an F1 score of 88.39%. However, the results for fine-grained suicide risk classification were still unsatisfactory, with a weighted F1 score of 50.89%. To address the issues of data imbalance and limited dataset size, we investigated both traditional and advanced, large language model based data augmentation techniques, demonstrating that data augmentation can enhance model performance by up to 4.65 percentage points in F1 score. Notably, the Chinese MentalBERT model, which was pre-trained on psychological domain data, shows superior performance in both tasks. This study provides valuable insights for automatic identification of suicidal individuals, facilitating timely psychological intervention on social media platforms. The source code and data are publicly available.
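Because the abstract reports a weighted F1 score for the imbalanced fine-grained task, a short sketch of how that metric is usually computed may help: per-class F1 values are averaged with weights proportional to each class's support. The function names and toy labels are illustrative assumptions; the paper's evaluation code is not shown in the abstract.

```python
from collections import Counter

def f1_per_class(y_true, y_pred):
    """Per-class F1 = 2*TP / (2*TP + FP + FN); 0.0 when the denominator is zero."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def weighted_f1(y_true, y_pred):
    """Average of per-class F1, weighted by the number of true samples per class."""
    support = Counter(y_true)
    scores = f1_per_class(y_true, y_pred)
    return sum(scores[c] * n for c, n in support.items()) / len(y_true)

# Toy labels, invented for the example: class 0 is three times as frequent as class 1.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(f1_per_class(y_true, y_pred))
print(weighted_f1(y_true, y_pred))
```

In this toy case class 0 scores 0.8 and class 1 scores 2/3, so the weighted value (about 0.767) sits closer to the majority class than the unweighted mean (about 0.733) does.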
Detecting Out-Of-Distribution Earth Observation Images with Diffusion Models
Authors: Georges Le Bellier (CEDRIC - VERTIGO, CNAM), Nicolas Audebert (CEDRIC - VERTIGO, CNAM, IGN)
Abstract
Earth Observation imagery can capture rare and unusual events, such as disasters and major landscape changes, whose visual appearance contrasts with the usual observations. Deep models trained on common remote sensing data will output drastically different features for these out-of-distribution samples, compared to those closer to their training dataset. Detecting them could therefore help anticipate changes in the observations, either geographical or environmental. In this work, we show that the reconstruction error of diffusion models can effectively serve as an unsupervised out-of-distribution detector for remote sensing images when used as a plausibility score. Moreover, we introduce ODEED, a novel reconstruction-based scorer using the probability-flow ODE of diffusion models. We validate it experimentally on SpaceNet 8 with various scenarios, such as classical OOD detection with geographical shift and near-OOD setups: pre/post-flood and non-flooded/flooded image recognition. We show that our ODEED scorer significantly outperforms other diffusion-based and discriminative baselines on the more challenging near-OOD scenarios of flood image detection, where OOD images are close to the distribution tail. We aim to pave the way towards better use of generative models for anomaly detection in remote sensing.
Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model
Authors: Jihao Dong, Renjie Pan, Hua Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Human-Object Interaction (HOI) detection aims to localize human-object pairs and comprehend their interactions. Recently, two-stage transformer-based methods have demonstrated competitive performance. However, these methods frequently focus on object appearance features and ignore global contextual information. In addition, the vision-language model CLIP, which effectively aligns visual and text embeddings, has shown great potential in zero-shot HOI detection. Based on these observations, we introduce a novel HOI detector named ISA-HOI, which extensively leverages knowledge from CLIP, aligning interactive semantics between visual and textual features. We first extract the global context of the image and the local features of objects to Improve interaction Features in images (IF). We then propose a Verb Semantic Improvement (VSI) module to enhance textual features of verb labels via cross-modal fusion. Ultimately, our method achieves competitive results on the HICO-DET and V-COCO benchmarks with far fewer training epochs, and outperforms the state-of-the-art under zero-shot settings.
VoxAtnNet: A 3D Point Clouds Convolutional Neural Network for Generalizable Face Presentation Attack Detection
Authors: Raghavendra Ramachandra, Narayan Vetrekar, Sushma Venkatesh, Savita Nageshker, Jag Mohan Singh, R. S. Gad
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Abstract
Facial biometrics are an essential component of smartphones to ensure reliable and trustworthy authentication. However, face biometric systems are vulnerable to Presentation Attacks (PAs), and the availability of more sophisticated presentation attack instruments (PAIs) such as 3D silicone face masks will allow attackers to deceive face recognition systems easily. In this work, we propose a novel Presentation Attack Detection (PAD) algorithm based on 3D point clouds captured using the frontal camera of a smartphone to detect presentation attacks. The proposed PAD algorithm, VoxAtnNet, voxelizes the 3D point clouds to preserve their spatial structure. Then, a novel convolutional attention network is trained on the voxelized 3D samples to detect PAs on the smartphone. Extensive experiments were carried out on a newly constructed 3D face point cloud dataset comprising bona fide samples and two different 3D PAIs (3D silicone face mask and wrap photo mask), resulting in 3480 samples. The performance of the proposed method was compared with existing methods to benchmark the detection performance using three different evaluation protocols. The experimental results demonstrate the improved performance of the proposed method in detecting both known and unknown face presentation attacks.
Modeling Multi-Granularity Context Information Flow for Pavement Crack Detection
Abstract
Crack detection has become an indispensable, interesting yet challenging task in the computer vision community. Specifically, pavement cracks have a highly complex spatial structure, a low-contrast background and weak spatial continuity, posing a significant challenge to effective crack detection. In this paper, we address these problems from a view that utilizes the contexts of the cracks and propose an end-to-end deep learning method to model the context information flow. To precisely localize cracks in an image, it is critical to effectively extract and aggregate multi-granularity context, including the fine-grained local context around the cracks (at the spatial level) and the coarse-grained semantics (at the segment level). Concretely, in a Convolutional Neural Network (CNN), low-level features extracted by the shallow layers represent the local information, while the deep layers extract the semantic features. A second main insight in this work is that the semantic context should guide the local context features. Following these insights, the proposed method first applies dilated convolution as the backbone feature extractor to model local context, and then builds a context guidance module that leverages semantic context to guide local feature extraction at multiple stages. To handle label alignment between stages, we apply the Multiple Instance Learning (MIL) strategy to align the high-level features to the low-level ones in the stage-wise context flow. In addition, we release the Bitumen Pavement Crack (BPC) dataset, which is, to the best of our knowledge, the largest, most complex and most challenging compared with existing public crack datasets. The experimental results on the three crack datasets demonstrate that the proposed method performs well and outperforms the current state-of-the-art methods.
uTRAND: Unsupervised Anomaly Detection in Traffic Trajectories
Authors: Giacomo D'Amicantonio, Egor Bondarau, Peter H.N. de With
Abstract
Deep learning-based approaches have achieved significant improvements on public video anomaly datasets, but often do not perform well in real-world applications. This paper addresses two issues: the lack of labeled data and the difficulty of explaining the predictions of a neural network. To this end, we present a framework called uTRAND that shifts the problem of anomalous trajectory prediction from the pixel space to a semantic-topological domain. The framework detects and tracks all types of traffic agents in bird's-eye-view videos of traffic cameras mounted at an intersection. By conceptualizing the intersection as a patch-based graph, it is shown that the framework learns and models the normal behaviour of traffic agents without costly manual labeling. Furthermore, uTRAND allows simple rules to be formulated to classify anomalous trajectories in a way suited for human interpretation. We show that uTRAND outperforms other state-of-the-art approaches on a dataset of anomalous trajectories collected in a real-world setting, while producing explainable detection results.
Energy Conserved Failure Detection for NS-IoT Systems
Authors: Guojin Liu, Jianhong Zhou, Hang Su, Biaohong Xiong, Xianhua Niu
Subjects: Networking and Internet Architecture (cs.NI)
Abstract
Network slicing (NS) technology has gained widespread adoption within Internet of Things (IoT) systems to meet diverse customized requirements. In NS-based IoT systems, the detection of equipment failures necessitates comprehensive equipment monitoring, which leads to significant resource utilization, particularly within large-scale IoT ecosystems. Thus, the imperative task of reducing failure rates while optimizing monitoring costs has emerged. In this paper, we propose a monitor application function (MAF) based dynamic dormancy monitoring mechanism for the novel NS-IoT system, which is based on a network data analysis function (NWDAF) framework defined in Rel-17. Within the NS-IoT system, all nodes are organized into groups, and multiple MAFs are deployed to monitor each group of nodes. We also propose a dormancy monitor mechanism to reduce the monitoring energy consumption by placing MAFs that are monitoring non-failure devices in a dormant state. We propose a reinforcement learning based PPO algorithm to guide the dynamic dormancy of MAFs. Simulation results demonstrate that our dynamic dormancy strategy maximizes energy conservation, while the proposed algorithm outperforms alternatives in terms of efficiency and stability.
REXEL: An End-to-end Model for Document-Level Relation Extraction and Entity Linking
Authors: Nacime Bouziani, Shubhi Tyagi, Joseph Fisher, Jens Lehmann, Andrea Pierleoni
Abstract
Extracting structured information from unstructured text is critical for many downstream NLP applications and is traditionally achieved by closed information extraction (cIE). However, existing approaches for cIE suffer from two limitations: (i) they are often pipelines which makes them prone to error propagation, and/or (ii) they are restricted to sentence level which prevents them from capturing long-range dependencies and results in expensive inference time. We address these limitations by proposing REXEL, a highly efficient and accurate model for the joint task of document level cIE (DocIE). REXEL performs mention detection, entity typing, entity disambiguation, coreference resolution and document-level relation classification in a single forward pass to yield facts fully linked to a reference knowledge graph. It is on average 11 times faster than competitive existing approaches in a similar setting and performs competitively both when optimised for any of the individual subtasks and a variety of combinations of different joint tasks, surpassing the baselines by an average of more than 6 F1 points. The combination of speed and accuracy makes REXEL an accurate cost-efficient system for extracting structured information at web-scale. We also release an extension of the DocRED dataset to enable benchmarking of future work on DocIE, which is available at https://github.com/amazon-science/e2e-docie.
A Point-Based Approach to Efficient LiDAR Multi-Task Perception
Authors: Christopher Lang, Alexander Braun, Lars Schillingmann, Abhinav Valada
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Multi-task networks can potentially improve performance and computational efficiency compared to single-task networks, facilitating online deployment. However, current multi-task architectures in point cloud perception combine multiple task-specific point cloud representations, each requiring a separate feature encoder and making the network structures bulky and slow. We propose PAttFormer, an efficient multi-task architecture for joint semantic segmentation and object detection in point clouds that only relies on a point-based representation. The network builds on transformer-based feature encoders using neighborhood attention and grid-pooling and a query-based detection decoder using a novel 3D deformable-attention detection head design. Unlike other LiDAR-based multi-task architectures, our proposed PAttFormer does not require separate feature encoders for multiple task-specific point cloud representations, resulting in a network that is 3x smaller and 1.4x faster while achieving competitive performance on the nuScenes and KITTI benchmarks for autonomous driving perception. Our extensive evaluations show substantial gains from multi-task learning, improving LiDAR semantic segmentation by +1.7% in mIoU and 3D object detection by +1.7% in mAP on the nuScenes benchmark compared to the single-task models.
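The segmentation gain above is reported in mean intersection-over-union. As a reminder of what that metric measures, here is a minimal per-point sketch; the function name and toy labels are assumptions for illustration, and official benchmark toolkits apply their own class lists and ignore rules, which are not reproduced here.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean over classes of |pred AND true| / |pred OR true|, computed per point.

    Classes that occur in neither the labels nor the predictions are skipped
    (an assumed convention; benchmarks may handle absent classes differently).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        union = sum(t == c or p == c for t, p in zip(y_true, y_pred))
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Five labelled points, three classes; one point of class 0 is mislabelled as class 1.
labels = [0, 0, 1, 1, 2]
preds = [0, 1, 1, 1, 2]
print(mean_iou(labels, preds, num_classes=3))
```

Here the per-class IoUs are 1/2, 2/3 and 1, so a single wrong point lowers two classes at once: it is a missed point for class 0 and a spurious point for class 1.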
360° phase detector cell for measurement systems based on switched dual multipliers
Authors: Baltasar Pérez, Víctor A. Araña, Javier Perez-Mato, Francisco Cabrera
Abstract
This letter presents a 360° phase detector cell for performing phase-shift measurements on multiple output systems. An analog phase detector, capable of detecting a maximum range of ±90°, has been used to perform a double multiplication of two signals, both in-phase and phase-shifted. The proposed solution broadens the frequency range beyond other solutions that require the quadrature condition to be fulfilled. Subsequently, the possibility of reaching the theoretical limit of phase shift within a hybrid coupler (Φ < 90° ± 90°) is discussed by using four straight-line equations to characterize the phase detector response. The proposed solution extends the phase detection range up to 360° and provides increased immunity with respect to both impedance mismatching and phase deviations within the hybrid coupler. To demonstrate the feasibility of the proposed design, a phase detector cell prototype has been implemented using a commercial hybrid coupler with a phase shift of 92.5° ± 0.5° at 3.1-5.9 GHz, an external switch and a microcontroller with 2 kB of memory. Measurements show a range of detection of 360° (±180°) across the tested frequency band of 2.7-6 GHz.
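Why a second, phase-shifted multiplication widens the detection range can be seen from a textbook idealization: the DC term of one product of two sinusoids is proportional to cos(φ), which cannot distinguish +φ from -φ, while a second product taken with a 90° shift is proportional to sin(φ) and supplies the missing sign. The sketch below assumes exact quadrature and unit gain, which is precisely the condition the letter aims to relax, so it illustrates the principle only and not the proposed circuit or its four straight-line characterization. All function names are hypothetical.

```python
import math

def multiplier_outputs(phase_deg, amplitude=1.0):
    """Idealized DC outputs of two multipliers: in-phase product and 90-degree-shifted product."""
    phi = math.radians(phase_deg)
    return amplitude * math.cos(phi), amplitude * math.sin(phi)

def phase_from_single(i_out, amplitude=1.0):
    """One multiplier alone: only |phase| in [0, 180] degrees is recoverable."""
    ratio = max(-1.0, min(1.0, i_out / amplitude))
    return math.degrees(math.acos(ratio))

def phase_from_pair(i_out, q_out):
    """Both outputs together resolve the full (-180, 180] degree range."""
    return math.degrees(math.atan2(q_out, i_out))

i_out, q_out = multiplier_outputs(-30.0)
print(phase_from_single(i_out))       # sign is lost: reports about +30
print(phase_from_pair(i_out, q_out))  # sign is kept: reports about -30
```

In this idealization the amplitude cancels in `atan2`, whereas the single-multiplier estimate depends on knowing it.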
ECOR: Explainable CLIP for Object Recognition
Authors: Ali Rasekh, Sepehr Kazemi Ranjbar, Milad Heidari, Wolfgang Nejdl
Abstract
Large Vision Language Models (VLMs), such as CLIP, have significantly contributed to various computer vision tasks, including object recognition and object detection. Their open vocabulary feature enhances their value. However, their black-box nature and lack of explainability in predictions make them less trustworthy in critical domains. Recently, some work has been done to force VLMs to provide reasonable rationales for object recognition, but this often comes at the expense of classification accuracy. In this paper, we first propose a mathematical definition of explainability in the object recognition task based on the joint probability distribution of categories and rationales, then leverage this definition to fine-tune CLIP in an explainable manner. Through evaluations of different datasets, our method demonstrates state-of-the-art performance in explainable classification. Notably, it excels in zero-shot settings, showcasing its adaptability. This advancement improves explainable object recognition, enhancing trust across diverse applications. The code will be made available online upon publication.
Ransomware Detection and Classification Using Random Forest: A Case Study with the UGRansome2024 Dataset
Authors: Peace Azugo, Hein Venter, Mike Wa Nkongolo
Abstract
Cybersecurity faces challenges in identifying and mitigating ransomware, which is important for protecting critical infrastructures. The absence of datasets for distinguishing normal versus abnormal network behaviour hinders the development of proactive detection strategies against ransomware. An obstacle in proactive prevention methods is the absence of comprehensive datasets for contrasting normal versus abnormal network behaviours. The dataset enabling such contrasts would significantly expedite threat anomaly mitigation. In this study, we introduce UGRansome2024, an optimised dataset for ransomware detection in network traffic. This dataset is derived from the UGRansome data using an intuitionistic feature engineering approach that considers only relevant patterns in network behaviour analysis. The study presents an analysis of ransomware detection using the UGRansome2024 dataset and the Random Forest algorithm. Through encoding and feature relevance determination, the Random Forest achieved a classification accuracy of 96% and effectively identified unusual ransomware transactions. Findings indicate that certain ransomware variants, such as those utilising Encrypt Decrypt Algorithms (EDA) and Globe ransomware, have the highest financial impact. These insights have significant implications for real-world cybersecurity practices, highlighting the importance of machine learning in ransomware detection and mitigation. Further research is recommended to expand datasets, explore alternative detection methods, and address limitations in current approaches.
Language-Driven Active Learning for Diverse Open-Set 3D Object Detection
Authors: Ross Greer, Bjørk Antoniussen, Andreas Møgelmose, Mohan Trivedi
Abstract
Object detection is crucial for ensuring safe autonomous driving. However, data-driven approaches face challenges when encountering minority or novel objects in the 3D driving scene. In this paper, we propose VisLED, a language-driven active learning framework for diverse open-set 3D Object Detection. Our method leverages active learning techniques to query diverse and informative data samples from an unlabeled pool, enhancing the model's ability to detect underrepresented or novel objects. Specifically, we introduce the Vision-Language Embedding Diversity Querying (VisLED-Querying) algorithm, which operates in both open-world exploring and closed-world mining settings. In open-world exploring, VisLED-Querying selects data points most novel relative to existing data, while in closed-world mining, it mines new instances of known classes. We evaluate our approach on the nuScenes dataset and demonstrate its effectiveness compared to random sampling and entropy-querying methods. Our results show that VisLED-Querying consistently outperforms random sampling and offers competitive performance compared to entropy-querying despite the latter's model-optimality, highlighting the potential of VisLED for improving object detection in autonomous driving scenarios.
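The open-world setting described above, selecting the samples most novel relative to existing data, can be illustrated with a greedy farthest-first selection over embedding vectors. This is a generic diversity-sampling sketch, not the VisLED-Querying algorithm itself, whose details the abstract does not specify; the function names, the Euclidean distance, and the toy 2D embeddings are all assumptions.

```python
import math

def novelty(candidate, known):
    """Distance from a candidate embedding to its nearest already-known embedding."""
    return min(math.dist(candidate, e) for e in known)

def query_most_novel(unlabeled, labeled, k):
    """Greedily pick k indices into `unlabeled`, each farthest from everything known so far.

    Hypothetical illustration: each picked sample joins the known set so that
    later picks are pushed away from it. `labeled` must be non-empty.
    """
    known = list(labeled)
    remaining = dict(enumerate(unlabeled))
    picked = []
    for _ in range(min(k, len(remaining))):
        idx = max(remaining, key=lambda i: novelty(remaining[i], known))
        picked.append(idx)
        known.append(remaining.pop(idx))
    return picked

# Toy 2D embeddings, invented for the example.
labeled = [(0.0, 0.0)]
unlabeled = [(0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (-3.0, 0.0)]
print(query_most_novel(unlabeled, labeled, k=2))  # [2, 3]
```

The second pick is index 3 rather than index 1 because, once (5.1, 5.0) is selected, its near-duplicate (5.0, 5.0) is no longer novel; picking the top-k by a single initial ranking would have selected both.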
Robust CLIP-Based Detector for Exposing Diffusion Model-Generated Images
Authors: Santosh, Li Lin, Irene Amerini, Xin Wang, Shu Hu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Abstract
Diffusion models (DMs) have revolutionized image generation, producing high-quality images with applications spanning various fields. However, their ability to create hyper-realistic images poses significant challenges in distinguishing between real and synthetic content, raising concerns about digital authenticity and potential misuse in creating deepfakes. This work introduces a robust detection framework that integrates image and text features extracted by the CLIP model with a Multilayer Perceptron (MLP) classifier. We propose a novel loss that can improve the detector's robustness and handle imbalanced datasets. Additionally, we flatten the loss landscape during model training to improve the detector's generalization capabilities. The effectiveness of our method, which outperforms traditional detection techniques, is demonstrated through extensive experiments, underscoring its potential to set a new state-of-the-art approach in DM-generated image detection. The code is available at https://github.com/Purdue-M2/Robust_DM_Generated_Image_Detection.
Cross-cultural Inspiration Detection and Analysis in Real and LLM-generated Social Media Data
Authors: Oana Ignat, Gayathri Ganesh Lakshmy, Rada Mihalcea
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract
Inspiration is linked to various positive outcomes, such as increased creativity, productivity, and happiness. Although inspiration has great potential, there has been limited effort toward identifying content that is inspiring, as opposed to just engaging or positive. Additionally, most research has concentrated on Western data, with little attention paid to other cultures. This work is the first to study cross-cultural inspiration through machine learning methods. We aim to identify and analyze real and AI-generated cross-cultural inspiring posts. To this end, we compile and make publicly available the InspAIred dataset, which consists of 2,000 real inspiring posts, 2,000 real non-inspiring posts, and 2,000 generated inspiring posts evenly distributed across India and the UK. The real posts are sourced from Reddit, while the generated posts are created using the GPT-4 model. Using this dataset, we conduct extensive computational linguistic analyses to (1) compare inspiring content across cultures, (2) compare AI-generated inspiring posts to real inspiring posts, and (3) determine if detection models can accurately distinguish between inspiring content across cultures and data sources.
MAiDE-up: Multilingual Deception Detection of GPT-generated Hotel Reviews
Authors: Oana Ignat, Xiaomeng Xu, Rada Mihalcea
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract
Deceptive reviews are becoming increasingly common, especially given the increase in performance and the prevalence of LLMs. While work to date has addressed the development of models to differentiate between truthful and deceptive human reviews, much less is known about the distinction between real reviews and AI-authored fake reviews. Moreover, most of the research so far has focused primarily on English, with very little work dedicated to other languages. In this paper, we compile and make publicly available the MAiDE-up dataset, consisting of 10,000 real and 10,000 AI-generated fake hotel reviews, balanced across ten languages. Using this dataset, we conduct extensive linguistic analyses to (1) compare the AI fake hotel reviews to real hotel reviews, and (2) identify the factors that influence the deception detection model performance. We explore the effectiveness of several models for deception detection in hotel reviews across three main dimensions: sentiment, location, and language. We find that these dimensions influence how well we can detect AI-generated fake reviews.
Visualizing Intelligent Tutor Interactions for Responsive Pedagogy
Authors: Grace Guo, Aishwarya Mudgal Sunil Kumar, Adit Gupta, Adam Coscia, Chris MacLellan, Alex Endert
Abstract
Intelligent tutoring systems leverage AI models of expert learning and student knowledge to deliver personalized tutoring to students. While these intelligent tutors have demonstrated improved student learning outcomes, it is still unclear how teachers might integrate them into curriculum and course planning to support responsive pedagogy. In this paper, we conducted a design study with five teachers who have deployed Apprentice Tutors, an intelligent tutoring platform, in their classes. We characterized their challenges around analyzing student interaction data from intelligent tutoring systems and built VisTA (Visualizations for Tutor Analytics), a visual analytics system that shows detailed provenance data across multiple coordinated views. We evaluated VisTA with the same five teachers, and found that the visualizations helped them better interpret intelligent tutor data, gain insights into student problem-solving provenance, and decide on necessary follow-up actions - such as providing students with further support or reviewing skills in the classroom. Finally, we discuss potential extensions of VisTA into sequence query and detection, as well as the potential for the visualizations to be useful for encouraging self-directed learning in students.
Enhancing Generalization in Audio Deepfake Detection: A Neural Collapse based Sampling and Training Approach
Authors: Mohammed Yousif, Jonat John Mathew, Huzaifa Pallan, Agamjeet Singh Padda, Syed Daniyal Shah, Sara Adamski, Madhu Reddiboina, Arjun Pankajakshan
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Abstract
Generalization in audio deepfake detection presents a significant challenge, with models trained on specific datasets often struggling to detect deepfakes generated under varying conditions and unknown algorithms. While collectively training a model using diverse datasets can enhance its generalization ability, it comes with high computational costs. To address this, we propose a neural collapse-based sampling approach applied to pre-trained models trained on distinct datasets to create a new training database. Using the ASVspoof 2019 dataset as a proof-of-concept, we implement pre-trained models with ResNet and ConvNeXt architectures. Our approach demonstrates comparable generalization on unseen data while being computationally efficient, requiring less training data. Evaluation is conducted using the In-the-wild dataset.
Keyword: face recognition
A visualization method for data domain changes in CNN networks and the optimization method for selecting thresholds in classification tasks
Authors: Minzhe Huang, Changwei Nie, Weihong Zhong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
In recent years, Face Anti-Spoofing (FAS) has played a crucial role in preserving the security of face recognition technology. With the rise of counterfeit face generation techniques, the challenge posed by digitally edited faces to face anti-spoofing is escalating. Existing FAS technologies primarily focus on intercepting physically forged faces and lack a robust solution for cross-domain FAS challenges. Moreover, determining an appropriate threshold to achieve optimal deployment results remains an issue for intra-domain FAS. To address these issues, we propose a visualization method that intuitively reflects the training outcomes of models by visualizing the prediction results on datasets. Additionally, we demonstrate that employing data augmentation techniques, such as downsampling and Gaussian blur, can effectively enhance performance on cross-domain tasks. Building upon our data visualization approach, we also introduce a methodology for setting threshold values based on the distribution of the training dataset. Ultimately, our methods secured us second place in both the Unified Physical-Digital Face Attack Detection competition and the Snapshot Spectral Imaging Face Anti-spoofing contest. The training code is available at https://github.com/SeaRecluse/CVPRW2024.
Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table Representations
Authors: Sibei Chen, Yeye He, Weiwei Cui, Ju Fan, Song Ge, Haidong Zhang, Dongmei Zhang, Surajit Chaudhuri
Subjects: Databases (cs.DB); Computation and Language (cs.CL); Programming Languages (cs.PL)
Abstract
Spreadsheets are widely recognized as the most popular end-user programming tools, which blend the power of formula-based computation with an intuitive table-based interface. Today, spreadsheets are used by billions of users to manipulate tables, most of whom are neither database experts nor professional programmers. Despite the success of spreadsheets, authoring complex formulas remains challenging, as non-technical users need to look up and understand non-trivial formula syntax. To address this pain point, we leverage the observation that there is often an abundance of similar-looking spreadsheets in the same organization, which not only have similar data, but also share similar computation logic encoded as formulas. We develop an Auto-Formula system that can accurately predict formulas that users want to author in a target spreadsheet cell, by learning and adapting formulas that already exist in similar spreadsheets, using contrastive-learning techniques inspired by "similar-face recognition" from computer vision. Extensive evaluations on over 2K test formulas extracted from real enterprise spreadsheets show the effectiveness of Auto-Formula over alternatives. Our benchmark data is available at https://github.com/microsoft/Auto-Formula to facilitate future research.
MLSD-GAN -- Generating Strong High Quality Face Morphing Attacks using Latent Semantic Disentanglement
Abstract
Face-morphing attacks are a growing concern for biometric researchers, as they can be used to fool face recognition systems (FRS). These attacks can be generated at the image level (supervised) or representation level (unsupervised). Previous unsupervised morphing attacks have relied on generative adversarial networks (GANs). More recently, researchers have used linear interpolation of StyleGAN-encoded images to generate morphing attacks. In this paper, we propose a new method for generating high-quality morphing attacks using StyleGAN disentanglement. Our approach, called MLSD-GAN, spherically interpolates the disentangled latents to produce realistic and diverse morphing attacks. We evaluate the vulnerability of MLSD-GAN on two deep-learning-based FRS techniques. The results show that MLSD-GAN poses a significant threat to FRS, as it can generate morphing attacks that are highly effective at fooling these systems.
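Spherical interpolation, as opposed to the linear interpolation used in earlier StyleGAN-based morphing, can be illustrated with the standard slerp formula. The sketch below operates on plain Python lists; how MLSD-GAN obtains and disentangles its latents is not described in the abstract, so the toy vectors and the function name `slerp` are assumptions for illustration only.

```python
import math

def slerp(z0, z1, t):
    """Spherical linear interpolation between two non-zero vectors.

    t = 0 returns z0 and t = 1 returns z1; intermediate values follow
    the arc between the two directions instead of the straight chord.
    """
    dot = sum(a * b for a, b in zip(z0, z1))
    norm0 = math.sqrt(sum(a * a for a in z0))
    norm1 = math.sqrt(sum(b * b for b in z1))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        # Vectors are (anti)parallel: the arc is ill-defined,
        # so fall back to plain linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]

# Halfway between two orthogonal unit vectors stays on the unit circle.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

For unit-norm inputs the interpolated vector keeps unit norm, whereas the linear midpoint of the same two vectors would have norm of about 0.707.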
VoxAtnNet: A 3D Point Clouds Convolutional Neural Network for Generalizable Face Presentation Attack Detection
Authors: Raghavendra Ramachandra, Narayan Vetrekar, Sushma Venkatesh, Savita Nageshker, Jag Mohan Singh, R. S. Gad
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Abstract
Facial biometrics are an essential component of smartphones to ensure reliable and trustworthy authentication. However, face biometric systems are vulnerable to Presentation Attacks (PAs), and the availability of more sophisticated presentation attack instruments such as 3D silicone face masks will allow attackers to deceive face recognition systems easily. In this work, we propose a novel Presentation Attack Detection (PAD) algorithm based on 3D point clouds captured using the frontal camera of a smartphone to detect presentation attacks. The proposed PAD algorithm, VoxAtnNet, voxelizes the 3D point clouds to preserve their spatial structure. Then, a novel convolutional attention network is trained on the voxelized 3D samples to detect PAs on the smartphone. Extensive experiments were carried out on the newly constructed 3D face point cloud dataset comprising bona fide and two different 3D PAIs (3D silicone face mask and wrap photo mask), resulting in 3480 samples. The performance of the proposed method was compared with existing methods to benchmark the detection performance using three different evaluation protocols. The experimental results demonstrate the improved performance of the proposed method in detecting both known and unknown face presentation attacks.
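Voxelization in general maps each 3D point to the grid cell that contains it, which keeps spatial neighbourhoods intact for a convolutional network. The sketch below shows a simple binary occupancy version; the voxel size, grid dimension, and the choice to anchor the grid at the per-axis minimum are assumptions, since the abstract does not give VoxAtnNet's actual settings.

```python
def voxelize(points, voxel_size, grid_dim):
    """Return the set of occupied voxel indices for a list of (x, y, z) points.

    The grid origin is placed at the per-axis minimum of the cloud
    (an assumed convention); points that fall outside the
    grid_dim x grid_dim x grid_dim volume are discarded.
    """
    mins = [min(p[axis] for p in points) for axis in range(3)]
    occupied = set()
    for p in points:
        index = tuple(int((p[axis] - mins[axis]) // voxel_size)
                      for axis in range(3))
        if all(0 <= k < grid_dim for k in index):
            occupied.add(index)
    return occupied

# Two nearby points share a voxel, one lands two cells along x,
# and a distant outlier is dropped because it lies outside the grid.
cloud = [(0.0, 0.0, 0.0), (0.05, 0.05, 0.05), (0.25, 0.0, 0.0), (5.0, 5.0, 5.0)]
grid = voxelize(cloud, voxel_size=0.1, grid_dim=4)
```

A dense tensor for a 3D convolution could then be filled by setting the listed indices to one and all others to zero.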
Keyword: augmentation
UIClip: A Data-driven Model for Assessing User Interface Design
Authors: Jason Wu, Yi-Hao Peng, Amanda Li, Amanda Swearngin, Jeffrey P. Bigham, Jeffrey Nichols
Abstract
User interface (UI) design is a difficult yet important task for ensuring the usability, accessibility, and aesthetic qualities of applications. In our paper, we develop a machine-learned model, UIClip, for assessing the design quality and visual relevance of a UI given its screenshot and natural language description. To train UIClip, we used a combination of automated crawling, synthetic augmentation, and human ratings to construct a large-scale dataset of UIs, collated by description and ranked by design quality. Through training on the dataset, UIClip implicitly learns properties of good and bad designs by i) assigning a numerical score that represents a UI design's relevance and quality and ii) providing design suggestions. In an evaluation that compared the outputs of UIClip and other baselines to UIs rated by 12 human designers, we found that UIClip achieved the highest agreement with ground-truth rankings. Finally, we present three example applications that demonstrate how UIClip can facilitate downstream applications that rely on instantaneous assessment of UI design quality: i) UI code generation, ii) UI design tips generation, and iii) quality-aware UI example search.
A visualization method for data domain changes in CNN networks and the optimization method for selecting thresholds in classification tasks
Authors: Minzhe Huang, Changwei Nie, Weihong Zhong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
In recent years, Face Anti-Spoofing (FAS) has played a crucial role in preserving the security of face recognition technology. With the rise of counterfeit face generation techniques, the challenge posed by digitally edited faces to face anti-spoofing is escalating. Existing FAS technologies primarily focus on intercepting physically forged faces and lack a robust solution for cross-domain FAS challenges. Moreover, determining an appropriate threshold to achieve optimal deployment results remains an issue for intra-domain FAS. To address these issues, we propose a visualization method that intuitively reflects the training outcomes of models by visualizing the prediction results on datasets. Additionally, we demonstrate that employing data augmentation techniques, such as downsampling and Gaussian blur, can effectively enhance performance on cross-domain tasks. Building upon our data visualization approach, we also introduce a methodology for setting threshold values based on the distribution of the training dataset. Ultimately, our methods secured us second place in both the Unified Physical-Digital Face Attack Detection competition and the Snapshot Spectral Imaging Face Anti-spoofing contest. The training code is available at https://github.com/SeaRecluse/CVPRW2024.
SOS-1K: A Fine-grained Suicide Risk Classification Dataset for Chinese Social Media Analysis
Abstract
On social media, users frequently express personal emotions, a subset of which may indicate potential suicidal tendencies. The implicit and varied forms of expression in internet language complicate accurate and rapid identification of suicidal intent on social media, thus creating challenges for timely intervention efforts. The development of deep learning models for suicide risk detection is a promising solution, but there is a notable lack of relevant datasets, especially in the Chinese context. To address this gap, this study presents a Chinese social media dataset designed for fine-grained suicide risk classification, focusing on indicators such as expressions of suicide intent, methods of suicide, and urgency of timing. Seven pre-trained models were evaluated in two tasks: high and low suicide risk, and fine-grained suicide risk classification on a scale of 0 to 10. In our experiments, deep learning models show good performance in distinguishing between high and low suicide risk, with the best model achieving an F1 score of 88.39%. However, the results for fine-grained suicide risk classification were still unsatisfactory, with a weighted F1 score of 50.89%. To address the issues of data imbalance and limited dataset size, we investigated both traditional and advanced, large language model based data augmentation techniques, demonstrating that data augmentation can enhance model performance by up to 4.65 percentage points in F1-score. Notably, the Chinese MentalBERT model, which was pre-trained on psychological domain data, shows superior performance in both tasks. This study provides valuable insights for automatic identification of suicidal individuals, facilitating timely psychological intervention on social media platforms. The source code and data are publicly available.
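For readers unfamiliar with the weighted F1 score reported for the fine-grained task, the sketch below computes it under the common support-weighted definition: a per-class F1 averaged with weights proportional to each class's share of the true labels. The abstract does not state which variant was used, so this definition and the toy labels are assumptions.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        f1 = 2 * tp / denom if denom else 0.0
        score += (count / total) * f1
    return score

# Toy two-class example: class 0 has F1 = 2/3, class 1 has F1 = 4/5,
# and both classes carry equal weight because each has two true samples.
example_score = weighted_f1([0, 0, 1, 1], [0, 1, 1, 1])
```

Because the weights follow class frequency, frequent risk levels dominate this metric, which is one reason imbalanced datasets are often also reported with macro-averaged scores.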
The Solution for the CVPR2024 NICE Image Captioning Challenge
Authors: Longfei Huang, Shupeng Zhong, Xiangyu Wu, Ruoxuan Li, Qingguo Chen, Yang Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
This report introduces a solution to Topic 1, Zero-shot Image Captioning, of NICE 2024: New frontiers for zero-shot Image Captioning Evaluation. In contrast to the NICE 2023 datasets, this challenge involves new annotations by humans with significant differences in caption style and content. Therefore, we enhance image captions effectively through retrieval augmentation and caption grading methods. At the data level, we utilize high-quality captions generated by image caption models as training data to address the gap in text styles. At the model level, we employ OFA (a large-scale visual-language pre-training model based on handcrafted templates) to perform the image captioning task. Subsequently, we propose a caption-level strategy for the high-quality caption data generated by the image caption models and integrate them with a retrieval augmentation strategy into the template to compel the model to generate higher quality, more matching, and semantically enriched captions based on the retrieval augmentation prompts. Our approach ranks first on the leaderboard, achieving a CIDEr score of 234.11 and 1st in all other metrics.
What We Augment When We Augment Visualizations: A Design Elicitation Study of How We Visually Express Data Relationships
Authors: Grace Guo, John Stasko, Alex Endert
Abstract
Visual augmentations are commonly added to charts and graphs in order to convey richer and more nuanced information about relationships in the data. However, many design spaces proposed for categorizing augmentations were defined in a top-down manner, based on expert heuristics or from surveys of published visualizations. Less well understood are user preferences and intuitions when designing augmentations. In this paper, we address the gap by conducting a design elicitation study, where study participants were asked to draw the different ways they would visually express the meaning of ten different prompts. We obtained 364 drawings from the study, and identified the emergent categories of augmentations used by participants. The contributions of this paper are: (i) a user-defined design space of visualization augmentations, (ii) a repository of hand drawn augmentations made by study participants, and (iii) a discussion of insights into participant considerations, and connections between our study and existing design guidelines.
Keyword: detection
Benchmarking changepoint detection algorithms on cardiac time series
Spot-Compose: A Framework for Open-Vocabulary Object Retrieval and Drawer Manipulation in Point Clouds
A visualization method for data domain changes in CNN networks and the optimization method for selecting thresholds in classification tasks
ELEV-VISION-SAM: Integrated Vision Language and Foundation Model for Automated Estimation of Building Lowest Floor Elevation
Greedy Detection and Exclusion of Multiple Faults using Euclidean Distance Matrices
SkelFormer: Markerless 3D Pose and Shape Estimation using Skeletal Transformers
AED-PADA: Improving Generalizability of Adversarial Example Detection via Principal Adversarial Domain Adaptation
Emerging NGSO Constellations: Spectral Coexistence with GSO Satellite Communication Systems
SOS-1K: A Fine-grained Suicide Risk Classification Dataset for Chinese Social Media Analysis
Detecting Out-Of-Distribution Earth Observation Images with Diffusion Models
Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model
VoxAtnNet: A 3D Point Clouds Convolutional Neural Network for Generalizable Face Presentation Attack Detection
Modeling Multi-Granularity Context Information Flow for Pavement Crack Detection
uTRAND: Unsupervised Anomaly Detection in Traffic Trajectories
Energy Conserved Failure Detection for NS-IoT Systems
REXEL: An End-to-end Model for Document-Level Relation Extraction and Entity Linking
A Point-Based Approach to Efficient LiDAR Multi-Task Perception
360° phase detector cell for measurement systems based on switched dual multipliers
ECOR: Explainable CLIP for Object Recognition
Ransomware Detection and Classification Using Random Forest: A Case Study with the UGRansome2024 Dataset
Language-Driven Active Learning for Diverse Open-Set 3D Object Detection
Robust CLIP-Based Detector for Exposing Diffusion Model-Generated Images
Cross-cultural Inspiration Detection and Analysis in Real and LLM-generated Social Media Data
MAiDE-up: Multilingual Deception Detection of GPT-generated Hotel Reviews
Visualizing Intelligent Tutor Interactions for Responsive Pedagogy
Enhancing Generalization in Audio Deepfake Detection: A Neural Collapse based Sampling and Training Approach
Keyword: face recognition
A visualization method for data domain changes in CNN networks and the optimization method for selecting thresholds in classification tasks
Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table Representations
MLSD-GAN -- Generating Strong High Quality Face Morphing Attacks using Latent Semantic Disentanglement
VoxAtnNet: A 3D Point Clouds Convolutional Neural Network for Generalizable Face Presentation Attack Detection
Keyword: augmentation
UIClip: A Data-driven Model for Assessing User Interface Design
A visualization method for data domain changes in CNN networks and the optimization method for selecting thresholds in classification tasks
SOS-1K: A Fine-grained Suicide Risk Classification Dataset for Chinese Social Media Analysis
The Solution for the CVPR2024 NICE Image Captioning Challenge
What We Augment When We Augment Visualizations: A Design Elicitation Study of How We Visually Express Data Relationships