New submissions for Mon, 15 Apr 24

Keyword: detection

Enhanced Cooperative Perception for Autonomous Vehicles Using Imperfect Communication

Authors: Ahmad Sarlak, Hazim Alzorgan, Sayed Pedram Haeri Boroujeni, Abolfazl Razi, Rahul Amin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08013
Pdf link: https://arxiv.org/pdf/2404.08013
Abstract Sharing and joint processing of camera feeds and sensor measurements, known as Cooperative Perception (CP), has emerged as a new technique to achieve higher perception qualities. CP can enhance the safety of Autonomous Vehicles (AVs) where their individual visual perception quality is compromised by adverse weather conditions (such as haze or fog), low illumination, winding roads, and crowded traffic. To cover the limitations of former methods, in this paper, we propose a novel approach to realize an optimized CP under constrained communications. At the core of our approach is recruiting the best helper from the available list of front vehicles to augment the visual range and enhance the Object Detection (OD) accuracy of the ego vehicle. In this two-step process, we first select the helper vehicles that contribute the most to CP based on their visual range and lowest motion blur. Next, we implement a radio block optimization among the candidate vehicles to further improve communication efficiency. We specifically focus on pedestrian detection as an exemplary scenario. To validate our approach, we used the CARLA simulator to create a dataset of annotated videos for different driving scenarios where pedestrian detection is challenging for an AV with compromised vision. Our results demonstrate the efficacy of our two-step optimization process in improving the overall performance of cooperative perception in challenging scenarios, substantially improving driving safety under adverse conditions. Finally, we note that the networking assumptions are adopted from LTE Release 14 Mode 4 side-link communication, commonly used for Vehicle-to-Vehicle (V2V) communication. Nonetheless, our method is flexible and applicable to arbitrary V2V communications.
The OxMat dataset: a multimodal resource for the development of AI-driven technologies in maternal and newborn child health
Authors: M. Jaleed Khan, Ioana Duta, Beth Albert, William Cooke, Manu Vatish, Gabriel Davis Jones
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08024
Pdf link: https://arxiv.org/pdf/2404.08024
Abstract The rapid advancement of Artificial Intelligence (AI) in healthcare presents a unique opportunity for advancements in obstetric care, particularly through the analysis of cardiotocography (CTG) for fetal monitoring. However, the effectiveness of such technologies depends upon the availability of large, high-quality datasets that are suitable for machine learning. This paper introduces the Oxford Maternity (OxMat) dataset, the world's largest curated dataset of CTGs, featuring raw time series CTG data and extensive clinical data for both mothers and babies, which is ideally placed for machine learning. The OxMat dataset addresses the critical gap in women's health data by providing over 177,211 unique CTG recordings from 51,036 pregnancies, carefully curated and reviewed since 1991. The dataset also comprises over 200 antepartum, intrapartum and postpartum clinical variables, ensuring near-complete data for crucial outcomes such as stillbirth and acidaemia. While this dataset also covers the intrapartum stage, around 94% of the constituent CTGs are antepartum. This allows for a unique focus on the underserved antepartum period, in which early detection of at-risk fetuses can significantly improve health outcomes. Our comprehensive review of existing datasets reveals the limitations of current datasets: primarily, their lack of sufficient volume, detailed clinical data and antepartum data. The OxMat dataset lays a foundation for future AI-driven prenatal care, offering a robust resource for developing and testing algorithms aimed at improving maternal and fetal health outcomes.
SQBC: Active Learning using LLM-Generated Synthetic Data for Stance Detection in Online Political Discussions
Authors: Stefan Sylvius Wagner, Maike Behrendt, Marc Ziegele, Stefan Harmeling
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08078
Pdf link: https://arxiv.org/pdf/2404.08078
Abstract Stance detection is an important task for many applications that analyse or support online political discussions. Common approaches include fine-tuning transformer-based models. However, these models require a large amount of labelled data, which might not be available. In this work, we present two different ways to leverage LLM-generated synthetic data to train and improve stance detection agents for online political discussions: first, we show that augmenting a small fine-tuning dataset with synthetic data can improve the performance of the stance detection model. Second, we propose a new active learning method called SQBC based on the "Query-by-Committee" approach. The key idea is to use LLM-generated synthetic data as an oracle to identify the most informative unlabelled samples, which are then selected for manual labelling. Comprehensive experiments show that both ideas can improve the stance detection performance. Curiously, we observed that fine-tuning on actively selected samples can exceed the performance of using the full dataset.
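As a rough illustration of the generic Query-by-Committee idea named in the abstract above (not the authors' SQBC method, whose details the abstract does not give), the sketch below ranks unlabelled samples by vote entropy, a common disagreement measure. The committee votes, label names, and function names are all hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's label votes for one sample.

    Higher values mean the committee members disagree more.
    """
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def select_most_informative(committee_votes, k):
    """Return the indices of the k samples with the highest vote entropy."""
    ranked = sorted(
        range(len(committee_votes)),
        key=lambda i: vote_entropy(committee_votes[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical votes from a three-member committee on four unlabelled comments.
votes = [
    ["favor", "favor", "favor"],        # full agreement
    ["favor", "against", "favor"],      # partial disagreement
    ["favor", "against", "neutral"],    # every member disagrees
    ["against", "against", "against"],  # full agreement
]
picked = select_most_informative(votes, 2)
print(picked)
```

The samples picked this way would be the ones sent for manual labelling; how SQBC builds its committee from synthetic data is not specified here and is left out of the sketch.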
Real-Time Detection and Analysis of Vehicles and Pedestrians using Deep Learning
Authors: Md Nahid Sadik, Tahmim Hossain, Faisal Sayeed
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.08081
Pdf link: https://arxiv.org/pdf/2404.08081
Abstract Computer vision, particularly vehicle and pedestrian identification, is critical to the evolution of autonomous driving, artificial intelligence, and video surveillance. Current traffic monitoring systems face major difficulty in recognizing small objects and pedestrians effectively in real time, posing a serious risk to public safety and contributing to traffic inefficiency. Recognizing these difficulties, our project focuses on the creation and validation of an advanced deep-learning framework capable of processing complex visual input for precise, real-time recognition of cars and people in a variety of environmental situations. On a dataset representing complicated urban settings, we trained and evaluated different versions of the YOLOv8 and RT-DETR models. The YOLOv8 Large version proved to be the most effective, especially in pedestrian recognition, with great precision and robustness. The results, which include Mean Average Precision and recall rates, demonstrate the model's ability to dramatically improve traffic monitoring and safety. This study makes an important addition to real-time, reliable detection in computer vision, establishing new benchmarks for traffic management systems.
Visual Context-Aware Person Fall Detection
Authors: Aleksander Nagaj, Zenjie Li, Dim P. Papadopoulos, Kamal Nasrollahi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.08088
Pdf link: https://arxiv.org/pdf/2404.08088
Abstract As the global population ages, the number of fall-related incidents is on the rise. Effective fall detection systems, specifically in the healthcare sector, are crucial to mitigate the risks associated with such events. This study evaluates the role of visual context, including background objects, on the accuracy of fall detection classifiers. We present a segmentation pipeline to semi-automatically separate individuals and objects in images. Well-established models like ResNet-18, EfficientNetV2-S, and Swin-Small are trained and evaluated. During training, pixel-based transformations are applied to segmented objects, and the models are then evaluated on raw images without segmentation. Our findings highlight the significant influence of visual context on fall detection. The application of Gaussian blur to the image background notably improves the performance and generalization capabilities of all models. Background objects such as beds, chairs, or wheelchairs can challenge fall detection systems, leading to false positive alarms. However, we demonstrate that object-specific contextual transformations during training effectively mitigate this challenge. Further analysis using saliency maps supports our observation that visual context is crucial in classification tasks. We create both a dataset processing API and a segmentation pipeline, available at https://github.com/A-NGJ/image-segmentation-cli.
Adapting CNNs for Fisheye Cameras without Retraining
Authors: Ryan Griffiths, Donald G. Dansereau
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2404.08187
Pdf link: https://arxiv.org/pdf/2404.08187
Abstract The majority of image processing approaches assume images are in or can be rectified to a perspective projection. However, in many applications it is beneficial to use non conventional cameras, such as fisheye cameras, that have a larger field of view (FOV). The issue arises that these large-FOV images can't be rectified to a perspective projection without significant cropping of the original image. To address this issue we propose Rectified Convolutions (RectConv); a new approach for adapting pre-trained convolutional networks to operate with new non-perspective images, without any retraining. Replacing the convolutional layers of the network with RectConv layers allows the network to see both rectified patches and the entire FOV. We demonstrate RectConv adapting multiple pre-trained networks to perform segmentation and detection on fisheye imagery from two publicly available datasets. Our approach requires no additional data or training, and operates directly on the native image as captured from the camera. We believe this work is a step toward adapting the vast resources available for perspective images to operate across a broad range of camera geometries.
HCL-MTSAD: Hierarchical Contrastive Consistency Learning for Accurate Detection of Industrial Multivariate Time Series Anomalies
Authors: Haili Sun, Yan Huang, Lansheng Han, Cai Fu, Chunjie Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Theory (cs.IT); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2404.08224
Pdf link: https://arxiv.org/pdf/2404.08224
Abstract Multivariate Time Series (MTS) anomaly detection focuses on pinpointing samples that diverge from standard operational patterns, which is crucial for ensuring the safety and security of industrial applications. The primary challenge in this domain is to develop representations capable of discerning anomalies effectively. The prevalent methods for anomaly detection in the literature are predominantly reconstruction-based and predictive in nature. However, they typically concentrate on a single-dimensional instance level, thereby not fully harnessing the complex associations inherent in industrial MTS. To address this issue, we propose a novel self-supervised hierarchical contrastive consistency learning method for detecting anomalies in MTS, named HCL-MTSAD. It innovatively leverages data consistency at multiple levels inherent in industrial MTS, systematically capturing consistent associations across four latent levels: measurement, sample, channel, and process. By developing a multi-layer contrastive loss, HCL-MTSAD can extensively mine data consistency and spatio-temporal association, resulting in more informative representations. Subsequently, an anomaly discrimination module, grounded in self-supervised hierarchical contrastive learning, is designed to detect timestamp-level anomalies by calculating multi-scale data consistency. Extensive experiments conducted on six diverse MTS datasets retrieved from real cyber-physical systems and server machines, in comparison with 20 baselines, indicate that HCL-MTSAD's anomaly detection capability outperforms the state-of-the-art benchmark models by an average of 1.8% in terms of F1 score.
Relational Prompt-based Pre-trained Language Models for Social Event Detection
Authors: Pu Li, Xiaoyan Yu, Hao Peng, Yantuan Xian, Linqin Wang, Li Sun, Jingyun Zhang, Philip S. Yu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2404.08263
Pdf link: https://arxiv.org/pdf/2404.08263
Abstract Social Event Detection (SED) aims to identify significant events from social streams, and has a wide application ranging from public opinion analysis to risk management. In recent years, Graph Neural Network (GNN) based solutions have achieved state-of-the-art performance. However, GNN-based methods often struggle with noisy and missing edges between messages, affecting the quality of learned message embedding. Moreover, these methods statically initialize node embedding before training, which, in turn, limits the ability to learn from message texts and relations simultaneously. In this paper, we approach social event detection from a new perspective based on Pre-trained Language Models (PLMs), and present RPLM_SED (Relational prompt-based Pre-trained Language Models for Social Event Detection). We first propose a new pairwise message modeling strategy to construct social messages into message pairs with multi-relational sequences. Secondly, a new multi-relational prompt-based pairwise message learning mechanism is proposed to learn more comprehensive message representation from message pairs with multi-relational prompts using PLMs. Thirdly, we design a new clustering constraint to optimize the encoding process by enhancing intra-cluster compactness and inter-cluster dispersion, making the message representation more distinguishable. We evaluate the RPLM_SED on three real-world datasets, demonstrating that the RPLM_SED model achieves state-of-the-art performance in offline, online, low-resource, and long-tail distribution scenarios for social event detection tasks.
Guided Masked Self-Distillation Modeling for Distributed Multimedia Sensor Event Analysis
Authors: Masahiro Yasuda, Noboru Harada, Yasunori Ohishi, Shoichiro Saito, Akira Nakayama, Nobutaka Ono
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2404.08264
Pdf link: https://arxiv.org/pdf/2404.08264
Abstract Observations with distributed sensors are essential in analyzing a series of human and machine activities (referred to as 'events' in this paper) in complex and extensive real-world environments. This is because the information obtained from a single sensor is often missing or fragmented in such an environment; observations from multiple locations and modalities should be integrated to analyze events comprehensively. However, a learning method has yet to be established to extract joint representations that effectively combine such distributed observations. Therefore, we propose Guided Masked sELf-Distillation modeling (Guided-MELD) for inter-sensor relationship modeling. The basic idea of Guided-MELD is to learn to supplement the information from the masked sensor with information from other sensors needed to detect the event. Guided-MELD is expected to enable the system to effectively distill the fragmented or redundant target event information obtained by the sensors without being overly dependent on any specific sensors. To validate the effectiveness of the proposed method in novel tasks of distributed multimedia sensor event analysis, we recorded two new datasets that fit the problem setting: MM-Store and MM-Office. These datasets consist of human activities in a convenience store and an office, recorded using distributed cameras and microphones. Experimental results on these datasets show that the proposed Guided-MELD improves event tagging and detection performance and outperforms conventional inter-sensor relationship modeling methods. Furthermore, the proposed method performed robustly even when sensors were reduced.
Overcoming Scene Context Constraints for Object Detection in wild using Defilters
Authors: Vamshi Krishna Kancharla, Neelam Sinha
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.08293
Pdf link: https://arxiv.org/pdf/2404.08293
Abstract This paper focuses on improving object detection performance by addressing the issue of image distortions, commonly encountered in uncontrolled acquisition environments. High-level computer vision tasks such as object detection, recognition, and segmentation are particularly sensitive to image distortion. To address this issue, we propose a novel approach employing an image defilter to rectify image distortion prior to object detection. This method enhances object detection accuracy, as models perform optimally when trained on non-distorted images. Our experiments demonstrate that utilizing defiltered images significantly improves mean average precision compared to training object detection models on distorted images. Consequently, our proposed method offers considerable benefits for real-world applications plagued by image distortion. To our knowledge, the contribution lies in employing distortion-removal paradigm for object detection on images captured in natural settings. We achieved an improvement of 0.562 and 0.564 of mean Average precision on validation and test data.
OTFS Channel Estimation and Detection for Channels with Very Large Delay Spread
Authors: Preety Priya, Yi Hong, Emanuele Viterbo
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2404.08333
Pdf link: https://arxiv.org/pdf/2404.08333
Abstract In low latency applications and in general, for overspread channels, channel delay spread is a large percentage of the transmission frame duration. In this paper, we consider OTFS in an overspread channel exhibiting a delay spread that exceeds the block duration in a frame, where traditional channel estimation (CE) fails. We propose a two-stage CE method based on a delay-Doppler (DD) training frame, consisting of a dual chirp converted from time domain and a higher power pilot. The first stage employs a DD domain embedded pilot CE to estimate the aliased delays (due to modulo operation) and Doppler shifts, followed by identifying all the underspread paths not coinciding with any overspread path. The second stage utilizes time domain dual chirp correlation to estimate the actual delays and Doppler shifts of the remaining paths. This stage also resolves ambiguity in estimating delays and Doppler shifts for paths sharing same aliased delay. Furthermore, we present a modified low-complexity maximum ratio combining (MRC) detection algorithm for OTFS in overspread channels. Finally, we evaluate performance of OTFS using the proposed CE and the modified MRC detection in terms of normalized mean square error (NMSE) and bit error rate (BER).
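The delay aliasing mentioned in the abstract above (delays observed only modulo the block length) can be illustrated with a toy sketch. This is only an illustration of the wrap-around ambiguity, not the proposed two-stage estimator; it assumes delays are expressed as integer bins, and the block length `M` and the path delays are hypothetical values.

```python
def aliased_delay(true_delay_bin, num_delay_bins):
    """Delay bin observed when delays wrap around modulo the block length."""
    return true_delay_bin % num_delay_bins


# Hypothetical setup: M = 16 delay bins per block, three propagation paths.
M = 16
true_delays = [3, 19, 35]  # 19 and 35 exceed the block length (overspread paths)
observed = [aliased_delay(d, M) for d in true_delays]

# Group the true delays by their aliased value to expose the ambiguity.
groups = {}
for true_bin, alias_bin in zip(true_delays, observed):
    groups.setdefault(alias_bin, []).append(true_bin)

print(observed)
print(groups)
```

In this toy case all three paths share the same aliased delay, which is the kind of ambiguity the abstract says the second, time-domain stage is meant to resolve.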
Counterfactual Explanations for Face Forgery Detection via Adversarial Removal of Artifacts
Authors: Yang Li, Songlin Yang, Wei Wang, Ziwen He, Bo Peng, Jing Dong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.08341
Pdf link: https://arxiv.org/pdf/2404.08341
Abstract Highly realistic AI-generated face forgeries, known as deepfakes, have raised serious social concerns. Although DNN-based face forgery detection models have achieved good performance, they are vulnerable to adversarial attacks and to the latest generative methods, which leave fewer forgery traces. This limitation of generalization and robustness hinders the credibility of detection results and requires more explanations. In this work, we provide counterfactual explanations for face forgery detection from an artifact removal perspective. Specifically, we first invert the forgery images into the StyleGAN latent space, and then adversarially optimize their latent representations with the discrimination supervision from the target detection model. We verify the effectiveness of the proposed explanations from two aspects: (1) Counterfactual Trace Visualization: the enhanced forgery images are useful to reveal artifacts by visually contrasting the original images and two different visualization methods; (2) Transferable Adversarial Attacks: the adversarial forgery images generated by attacking the detection model are able to mislead other detection models, implying the removed artifacts are general. Extensive experiments demonstrate that our method achieves over 90% attack success rate and superior attack transferability. Compared with naive adversarial noise methods, our method adopts both generative and discriminative model priors, and optimizes the latent representations in a synthesis-by-analysis way, which forces the search of counterfactual explanations on the natural face manifold. Thus, more general counterfactual traces can be found and better adversarial attack transferability can be achieved.
MambaDFuse: A Mamba-based Dual-phase Model for Multi-modality Image Fusion
Authors: Zhe Li, Haiwei Pan, Kejia Zhang, Yuhua Wang, Fengming Yu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.08406
Pdf link: https://arxiv.org/pdf/2404.08406
Abstract Multi-modality image fusion (MMIF) aims to integrate complementary information from different modalities into a single fused image to represent the imaging scene and facilitate downstream visual tasks comprehensively. In recent years, significant progress has been made in MMIF tasks due to advances in deep neural networks. However, existing methods cannot effectively and efficiently extract modality-specific and modality-fused features constrained by the inherent local reductive bias (CNN) or quadratic computational complexity (Transformers). To overcome this issue, we propose a Mamba-based Dual-phase Fusion (MambaDFuse) model. Firstly, a dual-level feature extractor is designed to capture long-range features from single-modality images by extracting low and high-level features from CNN and Mamba blocks. Then, a dual-phase feature fusion module is proposed to obtain fusion features that combine complementary information from different modalities. It uses the channel exchange method for shallow fusion and the enhanced Multi-modal Mamba (M3) blocks for deep fusion. Finally, the fused image reconstruction module utilizes the inverse transformation of the feature extraction to generate the fused result. Through extensive experiments, our approach achieves promising fusion results in infrared-visible image fusion and medical image fusion. Additionally, in a unified benchmark, MambaDFuse has also demonstrated improved performance in downstream tasks such as object detection. Code with checkpoints will be available after the peer-review process.
Seismic First Break Picking in a Higher Dimension Using Deep Graph Learning
Authors: Hongtao Wang, Li Long, Jiangshe Zhang, Xiaoli Wei, Chunxia Zhang, Zhenbo Guo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Geophysics (physics.geo-ph)
Arxiv link: https://arxiv.org/abs/2404.08408
Pdf link: https://arxiv.org/pdf/2404.08408
Abstract Contemporary automatic first break (FB) picking methods typically analyze 1D signals, 2D source gathers, or 3D source-receiver gathers. Utilizing higher-dimensional data, such as 2D or 3D, incorporates global features, improving the stability of local picking. Despite the benefits, high-dimensional data requires structured input and increases computational demands. Addressing this, we propose a novel approach using deep graph learning called DGL-FB, constructing a large graph to efficiently extract information. In this graph, each seismic trace is represented as a node, connected by edges that reflect similarities. To manage the size of the graph, we develop a subgraph sampling technique to streamline model training and inference. Our proposed framework, DGL-FB, leverages deep graph learning for FB picking. It encodes subgraphs into global features using a deep graph encoder. Subsequently, the encoded global features are combined with local node signals and fed into a ResUNet-based 1D segmentation network for FB detection. Field survey evaluations of DGL-FB show superior accuracy and stability compared to a 2D U-Net-based benchmark method.
Joint Physical-Digital Facial Attack Detection Via Simulating Spoofing Clues
Authors: Xianhua He, Dashuang Liang, Song Yang, Zhanlong Hao, Hui Ma, Binjie Mao, Xi Li, Yao Wang, Pengfei Yan, Ajian Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.08450
Pdf link: https://arxiv.org/pdf/2404.08450
Abstract Face recognition systems are frequently subjected to a variety of physical and digital attacks of different types. Previous methods have achieved satisfactory performance in scenarios that address physical attacks and digital attacks, respectively. However, few methods are considered to integrate a model that simultaneously addresses both physical and digital attacks, implying the necessity to develop and maintain multiple models. To jointly detect physical and digital attacks within a single model, we propose an innovative approach that can adapt to any network architecture. Our approach mainly contains two types of data augmentation, which we call Simulated Physical Spoofing Clues augmentation (SPSC) and Simulated Digital Spoofing Clues augmentation (SDSC). SPSC and SDSC augment live samples into simulated attack samples by simulating spoofing clues of physical and digital attacks, respectively, which significantly improve the capability of the model to detect "unseen" attack types. Extensive experiments show that SPSC and SDSC can achieve state-of-the-art generalization in Protocols 2.1 and 2.2 of the UniAttackData dataset, respectively. Our method won first place in "Unified Physical-Digital Face Attack Detection" of the 5th Face Anti-spoofing Challenge@CVPR2024. Our final submission obtains 3.75% APCER, 0.93% BPCER, and 2.34% ACER, respectively. Our code is available at https://github.com/Xianhua-He/cvpr2024-face-anti-spoofing-challenge.
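The three error rates reported in the abstract above are related: in the standard presentation attack detection metrics, ACER is the mean of APCER and BPCER. A minimal check of the reported figures follows; the function and variable names are illustrative only.

```python
def acer(apcer, bpcer):
    """Average Classification Error Rate: the mean of APCER and BPCER."""
    return (apcer + bpcer) / 2


reported_apcer = 3.75  # percent, as stated in the abstract
reported_bpcer = 0.93  # percent, as stated in the abstract
computed_acer = acer(reported_apcer, reported_bpcer)

print(round(computed_acer, 2))
```

The result agrees with the 2.34% ACER quoted in the abstract.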
MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection
Authors: Chenqi Kong, Anwei Luo, Song Xia, Yi Yu, Haoliang Li, Alex C. Kot
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.08452
Pdf link: https://arxiv.org/pdf/2404.08452
Abstract Deepfakes have recently raised significant trust issues and security concerns among the public. Compared to CNN face forgery detectors, ViT-based methods take advantage of the expressivity of transformers, achieving superior detection performance. However, these approaches still exhibit the following limitations: (1). Fully fine-tuning ViT-based models from ImageNet weights demands substantial computational and storage resources; (2). ViT-based methods struggle to capture local forgery clues, leading to model bias and limited generalizability. To tackle these challenges, this work introduces Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD), a generalized yet parameter-efficient ViT-based approach. MoE-FFD only updates lightweight Low-Rank Adaptation (LoRA) and Adapter layers while keeping the ViT backbone frozen, thereby achieving parameter-efficient training. Moreover, MoE-FFD leverages the expressivity of transformers and local priors of CNNs to simultaneously extract global and local forgery clues. Additionally, novel MoE modules are designed to scale the model's capacity and select optimal forgery experts, further enhancing forgery detection performance. The proposed MoE learning scheme can be seamlessly adapted to various transformer backbones in a plug-and-play manner. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art face forgery detection performance with reduced parameter overhead. The code will be released upon acceptance.
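To give a sense of why updating only Low-Rank Adaptation (LoRA) layers is parameter-efficient, the sketch below compares the parameter count of one dense weight matrix with that of its low-rank update, where the update is the product of a d_out x r matrix and an r x d_in matrix. The dimensions and rank are assumptions chosen for illustration (the abstract gives neither), and the Adapter and MoE components of MoE-FFD are not counted.

```python
def dense_params(d_out, d_in):
    """Trainable parameters when the full weight matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters of a rank-r update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    return rank * (d_out + d_in)


# Hypothetical sizes: a 768 x 768 projection with a rank-8 update.
d_model = 768
rank = 8
full = dense_params(d_model, d_model)
low_rank = lora_params(d_model, d_model, rank)
ratio = full / low_rank

print(full, low_rank, ratio)
```

Under these assumed sizes the low-rank update trains 48 times fewer parameters than the full matrix; the actual savings in the paper depend on its chosen ranks and layers.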
TSLANet: Rethinking Transformers for Time Series Representation Learning
Authors: Emadeldeen Eldele, Mohamed Ragab, Zhenghua Chen, Min Wu, Xiaoli Li
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2404.08472
Pdf link: https://arxiv.org/pdf/2404.08472
Abstract Time series data, characterized by its intrinsic long and short-range dependencies, poses a unique challenge across analytical applications. While Transformer-based models excel at capturing long-range dependencies, they face limitations in noise sensitivity, computational efficiency, and overfitting with smaller datasets. In response, we introduce a novel Time Series Lightweight Adaptive Network (TSLANet), as a universal convolutional model for diverse time series tasks. Specifically, we propose an Adaptive Spectral Block, harnessing Fourier analysis to enhance feature representation and to capture both long-term and short-term interactions while mitigating noise via adaptive thresholding. Additionally, we introduce an Interactive Convolution Block and leverage self-supervised learning to refine the capacity of TSLANet for decoding complex temporal patterns and improve its robustness on different datasets. Our comprehensive experiments demonstrate that TSLANet outperforms state-of-the-art models in various tasks spanning classification, forecasting, and anomaly detection, showcasing its resilience and adaptability across a spectrum of noise levels and data sizes. The code is available at https://github.com/emadeldeen24/TSLANet
Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection
Authors: Zhiwei Yang, Jing Liu, Peng Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.08531
Pdf link: https://arxiv.org/pdf/2404.08531
Abstract Weakly supervised video anomaly detection (WSVAD) is a challenging task. Generating fine-grained pseudo-labels based on weak labels and then self-training a classifier is currently a promising solution. However, existing methods use only the RGB visual modality and neglect category text information, thus limiting the generation of more accurate pseudo-labels and affecting the performance of self-training. Inspired by the manual labeling process based on the event description, in this paper, we propose a novel pseudo-label generation and self-training framework based on Text Prompt with Normality Guidance (TPWNG) for WSVAD. Our idea is to transfer the rich language-visual knowledge of the contrastive language-image pre-training (CLIP) model for aligning the video event description text and corresponding video frames to generate pseudo-labels. Specifically, we first fine-tune the CLIP for domain adaptation by designing two ranking losses and a distributional inconsistency loss. Further, we propose a learnable text prompt mechanism with the assistance of a normality visual prompt to further improve the matching accuracy of video event description text and video frames. Then, we design a pseudo-label generation module based on the normality guidance to infer reliable frame-level pseudo-labels. Finally, we introduce a temporal context self-adaptive learning module to learn the temporal dependencies of different video events more flexibly and accurately. Extensive experiments show that our method achieves state-of-the-art performance on two benchmark datasets, UCF-Crime and XD-Viole
Analyzing Decades-Long Environmental Changes in Namibia Using Archival Aerial Photography and Deep Learning
Authors: Girmaw Abebe Tadesse, Caleb Robinson, Gilles Quentin Hacheme, Akram Zaytar, Rahul Dodhia, Tsering Wangyal Shawa, Juan M. Lavista Ferres, Emmanuel H. Kreike
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.08544
Pdf link: https://arxiv.org/pdf/2404.08544
Abstract This study explores object detection in historical aerial photographs of Namibia to identify long-term environmental changes. Specifically, we aim to identify key objects -- Waterholes, Omuti homesteads, and Big trees -- around Oshikango in Namibia using sub-meter gray-scale aerial imagery from 1943 and 1972. In this work, we propose a workflow for analyzing historical aerial imagery using a deep semantic segmentation model on sparse hand-labels. To this end, we employ a number of strategies including class-weighting, pseudo-labeling and empirical p-value-based filtering to balance skewed and sparse representations of objects in the ground truth data. Results demonstrate the benefits of these different training strategies resulting in an average $F_1=0.661$ and $F_1=0.755$ over the three objects of interest for the 1943 and 1972 imagery, respectively. We also identified that the average size of Waterhole and Big trees increased while the average size of Omutis decreased between 1943 and 1972 reflecting some of the local effects of the massive post-Second World War economic, agricultural, demographic, and environmental changes. This work also highlights the untapped potential of historical aerial photographs in understanding long-term environmental changes beyond Namibia (and Africa). With the lack of adequate satellite technology in the past, archival aerial photography offers a great alternative to uncover decades-long environmental changes.
Dynamic Neural Control Flow Execution: An Agent-Based Deep Equilibrium Approach for Binary Vulnerability Detection
Authors: Litao Li, Steven H. H. Ding, Andrew Walenstein, Philippe Charland, Benjamin C. M. Fung
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08562
Pdf link: https://arxiv.org/pdf/2404.08562
Abstract Software vulnerabilities are a challenge in cybersecurity. Manual security patches are often difficult and slow to deploy, while new vulnerabilities are created. Binary code vulnerability detection is less studied and more complex compared to source code, and this has important practical implications. Deep learning has become an efficient and powerful tool in the security domain, where it provides end-to-end and accurate prediction. Modern deep learning approaches learn the program semantics through sequence and graph neural networks, using various intermediate representations of programs, such as abstract syntax trees (AST) or control flow graphs (CFG). Due to the complex nature of program execution, the output of an execution depends on the many program states and inputs. Also, a CFG generated from static analysis can be an overestimation of the true program flow. Moreover, the size of programs often does not allow a graph neural network with fixed layers to aggregate global information. To address these issues, we propose DeepEXE, an agent-based implicit neural network that mimics the execution path of a program. We use reinforcement learning to enhance the branching decision at every program state transition and create a dynamic environment to learn the dependency between a vulnerability and certain program states. An implicitly defined neural network enables nearly infinite state transitions until convergence, which captures the structural information at a higher level. The experiments are conducted on two semi-synthetic and two real-world datasets. We show that DeepEXE is an accurate and efficient method and outperforms the state-of-the-art vulnerability detection methods.
FashionFail: Addressing Failure Cases in Fashion Object Detection and Segmentation
Authors: Riza Velioglu, Robin Chan, Barbara Hammer
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.08582
Pdf link: https://arxiv.org/pdf/2404.08582
Abstract In the realm of fashion object detection and segmentation for online shopping images, existing state-of-the-art fashion parsing models encounter limitations, particularly when exposed to non-model-worn apparel and close-up shots. To address these failures, we introduce FashionFail, a new fashion dataset with e-commerce images for object detection and segmentation. The dataset is efficiently curated using our novel annotation tool that leverages recent foundation models. The primary objective of FashionFail is to serve as a test bed for evaluating the robustness of models. Our analysis reveals the shortcomings of leading models, such as Attribute-Mask R-CNN and Fashionformer. Additionally, we propose a baseline approach using naive data augmentation to mitigate common failure cases and improve model robustness. Through this work, we aim to inspire and support further research in fashion item detection and segmentation for industrial applications. The dataset, annotation tool, code, and models are available at https://rizavelioglu.github.io/fashionfail/.
Pathological Primitive Segmentation Based on Visual Foundation Model with Zero-Shot Mask Generation
Authors: Abu Bakor Hayat Arnob, Xiangxue Wang, Yiping Jiao, Xiao Gan, Wenlong Ming, Jun Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.08584
Pdf link: https://arxiv.org/pdf/2404.08584
Abstract Medical image processing usually requires a model trained with carefully crafted datasets due to unique image characteristics and domain-specific challenges, especially in pathology. Primitive detection and segmentation in digitized tissue samples are essential for objective and automated diagnosis and prognosis of cancer. SAM (Segment Anything Model) has recently been developed to segment general objects from natural images with high accuracy, but it requires human prompts to generate masks. In this work, we present a novel approach that adapts pre-trained natural image encoders of SAM for detection-based region proposals. Regions proposed by a pre-trained encoder are sent to cascaded feature propagation layers for projection. Then, local semantic and global context are aggregated across multiple scales for bounding box localization and classification. Finally, the SAM decoder uses the identified bounding boxes as essential prompts to generate a comprehensive primitive segmentation map. The entire base framework, SAM, requires no additional training or fine-tuning but can produce an end-to-end result for two fundamental segmentation tasks in pathology. Our method is comparable to state-of-the-art models in F1 score for nuclei detection and in binary/multiclass panoptic quality (bPQ/mPQ) and mask quality (Dice) for segmentation on the PanNuke dataset, while offering end-to-end efficiency. Our model also achieves remarkable Average Precision (+4.5%) on the secondary dataset (HuBMAP Kidney) compared to Faster RCNN. The code is publicly available at https://github.com/learner-codec/autoprom_sam.
Training-free Boost for Open-Vocabulary Object Detection with Confidence Aggregation
Authors: Yanhao Zheng, Kai Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.08603
Pdf link: https://arxiv.org/pdf/2404.08603
Abstract Open-vocabulary object detection (OVOD) aims at localizing and recognizing visual objects from novel classes unseen at the training time. However, empirical studies reveal that advanced detectors generally assign lower scores to those novel instances, which are inadvertently suppressed during inference by commonly adopted greedy strategies like Non-Maximum Suppression (NMS), leading to sub-optimal detection performance for novel classes. This paper systematically investigates this problem with the commonly-adopted two-stage OVOD paradigm. Specifically, in the region-proposal stage, proposals that contain novel instances showcase lower objectness scores, since they are treated as background proposals during the training phase. Meanwhile, in the object-classification stage, novel objects exhibit lower region-text similarities (i.e., classification scores) due to the biased visual-language alignment by seen training samples. To alleviate this problem, this paper introduces two advanced measures to adjust confidence scores and conserve erroneously dismissed objects: (1) a class-agnostic localization quality estimate via overlap degree of region/object proposals, and (2) a text-guided visual similarity estimate with proxy prototypes for novel classes. Integrated with adjusting techniques specifically designed for the region-proposal and object-classification stages, this paper derives the aggregated confidence estimate for the open-vocabulary object detection paradigm (AggDet). Our AggDet is a generic and training-free post-processing scheme, which consistently bolsters open-vocabulary detectors across model scales and architecture designs. For instance, AggDet achieves 3.3% and 1.5% gains on OV-COCO and OV-LVIS benchmarks respectively, without any training cost.
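Digest note: the suppression effect described above comes from how greedy NMS works, which can be sketched as follows. This is a generic textbook-style NMS, not the AggDet method; the function names, the `(x1, y1, x2, y2)` box format, and the 0.5 IoU threshold are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, visiting boxes from highest to lowest score.

    A box is dropped when it overlaps an already-kept box by more than
    iou_threshold, regardless of which object it actually covers.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: box 1 overlaps box 0 heavily but has a lower score.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.4, 0.3]
kept = greedy_nms(boxes, scores)
```

Because the decision depends only on score order and overlap, a well-localized box with a low confidence score loses to any higher-scored overlapping box, which is why adjusting confidence scores before this step can change which detections survive.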
Automatic Quantification of Serial PET/CT Images for Pediatric Hodgkin Lymphoma Patients Using a Longitudinally-Aware Segmentation Network
Authors: Xin Tie, Muheon Shin, Changhee Lee, Scott B. Perlman, Zachary Huemann, Amy J. Weisman, Sharon M. Castellino, Kara M. Kelly, Kathleen M. McCarten, Adina L. Alazraki, Junjie Hu, Steve Y. Cho, Tyler J. Bradshaw
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
Arxiv link: https://arxiv.org/abs/2404.08611
Pdf link: https://arxiv.org/pdf/2404.08611
Abstract Purpose: Automatic quantification of longitudinal changes in PET scans for lymphoma patients has proven challenging, as residual disease in interim-therapy scans is often subtle and difficult to detect. Our goal was to develop a longitudinally-aware segmentation network (LAS-Net) that can quantify serial PET/CT images for pediatric Hodgkin lymphoma patients. Materials and Methods: This retrospective study included baseline (PET1) and interim (PET2) PET/CT images from 297 patients enrolled in two Children's Oncology Group clinical trials (AHOD1331 and AHOD0831). LAS-Net incorporates longitudinal cross-attention, allowing relevant features from PET1 to inform the analysis of PET2. Model performance was evaluated using Dice coefficients for PET1 and detection F1 scores for PET2. Additionally, we extracted and compared quantitative PET metrics, including metabolic tumor volume (MTV) and total lesion glycolysis (TLG) in PET1, as well as qPET and ΔSUVmax in PET2, against physician measurements. We quantified their agreement using Spearman's ρ correlations and employed bootstrap resampling for statistical analysis. Results: LAS-Net detected residual lymphoma in PET2 with an F1 score of 0.606 (precision/recall: 0.615/0.600), outperforming all comparator methods (P<0.01). For baseline segmentation, LAS-Net achieved a mean Dice score of 0.772. In PET quantification, LAS-Net's measurements of qPET, ΔSUVmax, MTV and TLG were strongly correlated with physician measurements, with Spearman's ρ of 0.78, 0.80, 0.93 and 0.96, respectively. The performance remained high, with a slight decrease, in an external testing cohort. Conclusion: LAS-Net achieved high performance in quantifying PET metrics across serial scans, highlighting the value of longitudinal awareness in evaluating multi-time-point imaging datasets.
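Digest note: for readers unfamiliar with the Dice coefficient used above to score segmentation, a minimal sketch of the standard definition, 2|A∩B| / (|A| + |B|), is given here. The function name, the flat 0/1 mask format, and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the abstract does not describe how the authors computed it.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


# Hypothetical 4-voxel masks: one voxel is shared, each mask marks two voxels.
predicted_mask = [1, 1, 0, 0]
reference_mask = [1, 0, 1, 0]
score = dice_coefficient(predicted_mask, reference_mask)
```

A score of 1.0 means the masks coincide exactly and 0.0 means they share no foreground voxel.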
Probing the 3D Awareness of Visual Foundation Models
Authors: Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, Varun Jampani
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.08636
Pdf link: https://arxiv.org/pdf/2404.08636
Abstract Recent advances in large-scale pretraining have yielded visual foundation models with strong capabilities. Not only can recent models generalize to arbitrary images for their training task, but their intermediate representations are also useful for other visual tasks such as detection and segmentation. Given that such models can classify, delineate, and localize objects in 2D, we ask whether they also represent their 3D structure. In this work, we analyze the 3D awareness of visual foundation models. We posit that 3D awareness implies that representations (1) encode the 3D structure of the scene and (2) consistently represent the surface across views. We conduct a series of experiments using task-specific probes and zero-shot inference procedures on frozen features. Our experiments reveal several limitations of the current models. Our code and analysis can be found at https://github.com/mbanani/probe3d.
COCONut: Modernizing COCO Segmentation
Authors: Xueqing Deng, Qihang Yu, Peng Wang, Xiaohui Shen, Liang-Chieh Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.08639
Pdf link: https://arxiv.org/pdf/2404.08639
Abstract In recent decades, the vision community has witnessed remarkable progress in visual recognition, partially owing to advancements in dataset benchmarks. Notably, the established COCO benchmark has propelled the development of modern detection and segmentation systems. However, the COCO segmentation benchmark has seen comparatively slow improvement over the last decade. Originally equipped with coarse polygon annotations for thing instances, it gradually incorporated coarse superpixel annotations for stuff regions, which were subsequently heuristically amalgamated to yield panoptic segmentation annotations. These annotations, executed by different groups of raters, have resulted not only in coarse segmentation masks but also in inconsistencies between segmentation types. In this study, we undertake a comprehensive reevaluation of the COCO segmentation annotations. By enhancing the annotation quality and expanding the dataset to encompass 383K images with more than 5.18M panoptic masks, we introduce COCONut, the COCO Next Universal segmenTation dataset. COCONut harmonizes segmentation annotations across semantic, instance, and panoptic segmentation with meticulously crafted high-quality masks, and establishes a robust benchmark for all segmentation tasks. To our knowledge, COCONut stands as the inaugural large-scale universal segmentation dataset, verified by human raters. We anticipate that the release of COCONut will significantly contribute to the community's ability to assess the progress of novel neural networks.
Keyword: face recognition

FaceFilterSense: A Filter-Resistant Face Recognition and Facial Attribute Analysis Framework
Authors: Shubham Tiwari, Yash Sethia, Ritesh Kumar, Ashwani Tanwar, Rudresh Dwivedi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.08277
Pdf link: https://arxiv.org/pdf/2404.08277
Abstract With the advent of social media, fun selfie filters have come into tremendous mainstream use, affecting the functioning of facial biometric systems as well as image recognition systems. These filters vary from beautification filters and Augmented Reality (AR)-based filters to filters that modify facial landmarks. Hence, there is a need to assess the impact of such filters on the performance of existing face recognition systems. Existing solutions focus mainly on beautification filters. However, AR-based filters and filters that distort facial key points have recently come into vogue and make faces highly unrecognizable even to the naked eye. Also, the filters considered are mostly obsolete with limited variations. To mitigate these limitations, we aim to perform a holistic impact analysis of the latest filters and propose a user recognition model for filtered images. We have utilized a benchmark dataset for baseline images, and applied the latest filters over them to generate a beautified/filtered dataset. Next, we have introduced a model, FaceFilterNet, for beautified user recognition. In this framework, we also utilize our model to comment on various attributes of the person including age, gender, and ethnicity. In addition, we have also presented a filter-wise impact analysis on face recognition, age estimation, gender, and ethnicity prediction. The proposed method affirms the efficacy of our dataset with an accuracy of 87.25% and an optimal accuracy for facial attribute analysis.
Joint Physical-Digital Facial Attack Detection Via Simulating Spoofing Clues
Authors: Xianhua He, Dashuang Liang, Song Yang, Zhanlong Hao, Hui Ma, Binjie Mao, Xi Li, Yao Wang, Pengfei Yan, Ajian Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.08450
Pdf link: https://arxiv.org/pdf/2404.08450
Abstract Face recognition systems are frequently subjected to a variety of physical and digital attacks of different types. Previous methods have achieved satisfactory performance in scenarios that address physical attacks and digital attacks separately. However, few methods integrate both into a single model, so multiple models must be developed and maintained. To jointly detect physical and digital attacks within a single model, we propose an innovative approach that can adapt to any network architecture. Our approach mainly contains two types of data augmentation, which we call Simulated Physical Spoofing Clues augmentation (SPSC) and Simulated Digital Spoofing Clues augmentation (SDSC). SPSC and SDSC augment live samples into simulated attack samples by simulating spoofing clues of physical and digital attacks, respectively, which significantly improves the capability of the model to detect "unseen" attack types. Extensive experiments show that SPSC and SDSC can achieve state-of-the-art generalization in Protocols 2.1 and 2.2 of the UniAttackData dataset, respectively. Our method won first place in "Unified Physical-Digital Face Attack Detection" of the 5th Face Anti-spoofing Challenge@CVPR2024. Our final submission obtains 3.75% APCER, 0.93% BPCER, and 2.34% ACER, respectively. Our code is available at https://github.com/Xianhua-He/cvpr2024-face-anti-spoofing-challenge.
Keyword: augmentation

AI-Guided Feature Segmentation Techniques to Model Features from Single Crystal Diamond Growth
Authors: Rohan Reddy Mekala, Elias Garratt, Matthias Muehle, Arjun Srinivasan, Adam Porter, Mikael Lindvall
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.08017
Pdf link: https://arxiv.org/pdf/2404.08017
Abstract Process refinement to consistently produce high-quality material over a large area of the grown crystal, enabling various applications from optics crystals to quantum detectors, has long been a goal for diamond growth. Machine learning offers a promising path toward this goal, but faces challenges such as the complexity of features within datasets, their time-dependency, and the volume of data produced per growth run. Accurate spatial feature extraction from image to image for real-time monitoring of diamond growth is crucial yet complicated due to the low-volume and high feature complexity nature of the datasets. This paper compares various traditional and machine learning-driven approaches for feature extraction in the diamond growth domain, proposing a novel deep learning-driven semantic segmentation approach to isolate and classify accurate pixel masks of geometric features like diamond, pocket holder, and background, along with their derivative features based on shape and size. Using an annotation-focused human-in-the-loop software architecture for training datasets, with modules for selective data labeling using active learning, data augmentations, and model-assisted labeling, our approach achieves effective annotation accuracy and drastically reduces labeling time and cost. Deep learning algorithms prove highly efficient in accurately learning complex representations from datasets with many features. Our top-performing model, based on the DeeplabV3plus architecture, achieves outstanding accuracy in classifying features of interest, with accuracies of 96.31% for pocket holder, 98.60% for diamond top, and 91.64% for diamond side features.
Data-Augmentation-Based Dialectal Adaptation for LLMs
Authors: Fahim Faisal, Antonios Anastasopoulos
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.08092
Pdf link: https://arxiv.org/pdf/2404.08092
Abstract This report presents GMUNLP's participation in the Dialect-Copa shared task at VarDial 2024, which focuses on evaluating the commonsense reasoning capabilities of large language models (LLMs) on South Slavic micro-dialects. The task aims to assess how well LLMs can handle non-standard dialectal varieties, as their performance on standard languages is already well-established. We propose an approach that combines the strengths of different types of language models and leverages data augmentation techniques to improve task performance on three South Slavic dialects: Chakavian, Cherkano, and Torlak. We conduct experiments using a language-family-focused encoder-based model (BERTi\'c) and a domain-agnostic multilingual model (AYA-101). Our results demonstrate that the proposed data augmentation techniques lead to substantial performance gains across all three test datasets in the open-source model category. This work highlights the practical utility of data augmentation and the potential of LLMs in handling non-standard dialectal varieties, contributing to the broader goal of advancing natural language understanding in low-resource and dialectal settings. Code: https://github.com/ffaisal93/dialect_copa
Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies
Authors: Zichao Li, Cihang Xie, Ekin Dogus Cubuk
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.08197
Pdf link: https://arxiv.org/pdf/2404.08197
Abstract This paper investigates the performance of the Contrastive Language-Image Pre-training (CLIP) when scaled down to limited computation budgets. We explore CLIP along three dimensions: data, architecture, and training strategies. With regards to data, we demonstrate the significance of high-quality training data and show that a smaller dataset of high-quality data can outperform a larger dataset with lower quality. We also examine how model performance varies with different dataset sizes, suggesting that smaller ViT models are better suited for smaller datasets, while larger models perform better on larger datasets with fixed compute. Additionally, we provide guidance on when to choose a CNN-based architecture or a ViT-based architecture for CLIP training. We compare four CLIP training strategies - SLIP, FLIP, CLIP, and CLIP+Data Augmentation - and show that the choice of training strategy depends on the available compute resource. Our analysis reveals that CLIP+Data Augmentation can achieve comparable performance to CLIP using only half of the training data. This work provides practical insights into how to effectively train and deploy CLIP models, making them more accessible and affordable for practical use in various applications.
ASR advancements for indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana
Authors: Monica Romero, Sandra Gomez, Iván G. Torre
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.08368
Pdf link: https://arxiv.org/pdf/2404.08368
Abstract Indigenous languages are a fundamental legacy in the development of human communication, embodying the unique identity and culture of local communities of America. The Second AmericasNLP Competition Track 1 of NeurIPS 2022 proposed developing automatic speech recognition (ASR) systems for five indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana. In this paper, we propose a reliable ASR model for each target language by crawling speech corpora spanning diverse sources and applying data augmentation methods that resulted in the winning approach in this competition. To achieve this, we systematically investigated the impact of different hyperparameters by a Bayesian search on the performance of the language models, specifically focusing on the variants of the Wav2vec2.0 XLS-R model: 300M and 1B parameters. Moreover, we performed a global sensitivity analysis to assess the contribution of various hyperparametric configurations to the performances of our best models. Importantly, our results show that freeze fine-tuning updates and dropout rate are more vital parameters than the total number of epochs of lr. Additionally, we release our best models -- with no other ASR model reported until now for two of the languages, Wa'ikhana and Kotiria -- and the many experiments performed, to pave the way for other researchers to continue improving ASR in minority languages. This insight opens up interesting avenues for future work, allowing for the advancement of ASR techniques in the preservation of minority indigenous languages and acknowledging the complexities involved in this important endeavour.
Graph data augmentation with Gromow-Wasserstein Barycenters
Authors: Andrea Ponti
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.08376
Pdf link: https://arxiv.org/pdf/2404.08376
Abstract Graphs are ubiquitous in various fields, and deep learning methods have been successfully applied in graph classification tasks. However, building large and diverse graph datasets for training can be expensive. While augmentation techniques exist for structured data like images or numerical data, the augmentation of graph data remains challenging. This is primarily due to the complex and non-Euclidean nature of graph data. In this paper, a novel augmentation strategy for graphs that operates in a non-Euclidean space is proposed. This approach leverages graphon estimation, which models the generative mechanism of network sequences. Computational results demonstrate the effectiveness of the proposed augmentation framework in improving the performance of graph classification models. Additionally, using a non-Euclidean distance, specifically the Gromow-Wasserstein distance, results in better approximations of the graphon. This framework also provides a means to validate different graphon estimation approaches, particularly in real-world scenarios where the true graphon is unknown.
Joint Physical-Digital Facial Attack Detection Via Simulating Spoofing Clues
Authors: Xianhua He, Dashuang Liang, Song Yang, Zhanlong Hao, Hui Ma, Binjie Mao, Xi Li, Yao Wang, Pengfei Yan, Ajian Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.08450
Pdf link: https://arxiv.org/pdf/2404.08450
Abstract Face recognition systems are frequently subjected to a variety of physical and digital attacks of different types. Previous methods have achieved satisfactory performance in scenarios that address physical attacks and digital attacks separately. However, few methods integrate both into a single model, so multiple models must be developed and maintained. To jointly detect physical and digital attacks within a single model, we propose an innovative approach that can adapt to any network architecture. Our approach mainly contains two types of data augmentation, which we call Simulated Physical Spoofing Clues augmentation (SPSC) and Simulated Digital Spoofing Clues augmentation (SDSC). SPSC and SDSC augment live samples into simulated attack samples by simulating spoofing clues of physical and digital attacks, respectively, which significantly improves the capability of the model to detect "unseen" attack types. Extensive experiments show that SPSC and SDSC can achieve state-of-the-art generalization in Protocols 2.1 and 2.2 of the UniAttackData dataset, respectively. Our method won first place in "Unified Physical-Digital Face Attack Detection" of the 5th Face Anti-spoofing Challenge@CVPR2024. Our final submission obtains 3.75% APCER, 0.93% BPCER, and 2.34% ACER, respectively. Our code is available at https://github.com/Xianhua-He/cvpr2024-face-anti-spoofing-challenge.
Masked Image Modeling as a Framework for Self-Supervised Learning across Eye Movements
Authors: Robin Weiler, Matthias Brucklacher, Cyriel M. A. Pennartz, Sander M. Bohté
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.08526
Pdf link: https://arxiv.org/pdf/2404.08526
Abstract To make sense of their surroundings, intelligent systems must transform complex sensory inputs to structured codes that are reduced to task-relevant information such as object category. Biological agents achieve this in a largely autonomous manner, presumably via self-supervised learning. Whereas previous attempts to model the underlying mechanisms were largely discriminative in nature, there is ample evidence that the brain employs a generative model of the world. Here, we propose that eye movements, in combination with the focused nature of primate vision, constitute a generative, self-supervised task of predicting and revealing visual information. We construct a proof-of-principle model starting from the framework of masked image modeling (MIM), a common approach in deep representation learning. To do so, we analyze how core components of MIM such as masking technique and data augmentation influence the formation of category-specific representations. This allows us not only to better understand the principles behind MIM, but to then reassemble a MIM more in line with the focused nature of biological perception. From a theoretical angle, we find that MIM disentangles neurons in latent space, a property that has been suggested to structure visual representations in primates, without explicit regulation. Together with previous findings of invariance learning, this highlights an interesting connection of MIM to latent regularization approaches for self-supervised learning. The source code is available under https://github.com/RobinWeiler/FocusMIM
FashionFail: Addressing Failure Cases in Fashion Object Detection and Segmentation
Authors: Riza Velioglu, Robin Chan, Barbara Hammer
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.08582
Pdf link: https://arxiv.org/pdf/2404.08582
Abstract In the realm of fashion object detection and segmentation for online shopping images, existing state-of-the-art fashion parsing models encounter limitations, particularly when exposed to non-model-worn apparel and close-up shots. To address these failures, we introduce FashionFail, a new fashion dataset with e-commerce images for object detection and segmentation. The dataset is efficiently curated using our novel annotation tool that leverages recent foundation models. The primary objective of FashionFail is to serve as a test bed for evaluating the robustness of models. Our analysis reveals the shortcomings of leading models, such as Attribute-Mask R-CNN and Fashionformer. Additionally, we propose a baseline approach using naive data augmentation to mitigate common failure cases and improve model robustness. Through this work, we aim to inspire and support further research in fashion item detection and segmentation for industrial applications. The dataset, annotation tool, code, and models are available at https://rizavelioglu.github.io/fashionfail/.
Generating Synthetic Time Series Data for Cyber-Physical Systems
Authors: Alexander Sommers, Somayeh Bakhtiari Ramezani, Logan Cummins, Sudip Mittal, Shahram Rahimi, Maria Seale, Joseph Jaboure
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08601
Pdf link: https://arxiv.org/pdf/2404.08601
Abstract Data augmentation is an important facilitator of deep learning applications in the time series domain. A gap is identified in the literature, demonstrating sparse exploration of the transformer, the dominant sequence model, for data augmentation in time series. An architecture hybridizing several successful priors is put forth and tested using a powerful time domain similarity metric. Results suggest the challenge of this domain, and several valuable directions for future work.

LeeKyungwook / get-arxiv-noti

New submissions for Mon, 15 Apr 24 #1063

Keyword: detection

Enhanced Cooperative Perception for Autonomous Vehicles Using Imperfect Communication

The OxMat dataset: a multimodal resource for the development of AI-driven technologies in maternal and newborn child health

SQBC: Active Learning using LLM-Generated Synthetic Data for Stance Detection in Online Political Discussions

Real-Time Detection and Analysis of Vehicles and Pedestrians using Deep Learning

Visual Context-Aware Person Fall Detection

Adapting CNNs for Fisheye Cameras without Retraining

HCL-MTSAD: Hierarchical Contrastive Consistency Learning for Accurate Detection of Industrial Multivariate Time Series Anomalies

Relational Prompt-based Pre-trained Language Models for Social Event Detection

Guided Masked Self-Distillation Modeling for Distributed Multimedia Sensor Event Analysis

Overcoming Scene Context Constraints for Object Detection in wild using Defilters

OTFS Channel Estimation and Detection for Channels with Very Large Delay Spread

Counterfactual Explanations for Face Forgery Detection via Adversarial Removal of Artifacts

MambaDFuse: A Mamba-based Dual-phase Model for Multi-modality Image Fusion

Seismic First Break Picking in a Higher Dimension Using Deep Graph Learning

Joint Physical-Digital Facial Attack Detection Via Simulating Spoofing Clues

MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection

TSLANet: Rethinking Transformers for Time Series Representation Learning

Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection

Analyzing Decades-Long Environmental Changes in Namibia Using Archival Aerial Photography and Deep Learning

Dynamic Neural Control Flow Execution: An Agent-Based Deep Equilibrium Approach for Binary Vulnerability Detection

FashionFail: Addressing Failure Cases in Fashion Object Detection and Segmentation

Pathological Primitive Segmentation Based on Visual Foundation Model with Zero-Shot Mask Generation

Training-free Boost for Open-Vocabulary Object Detection with Confidence Aggregation

Automatic Quantification of Serial PET/CT Images for Pediatric Hodgkin Lymphoma Patients Using a Longitudinally-Aware Segmentation Network

Probing the 3D Awareness of Visual Foundation Models

COCONut: Modernizing COCO Segmentation

Keyword: face recognition

FaceFilterSense: A Filter-Resistant Face Recognition and Facial Attribute Analysis Framework

Joint Physical-Digital Facial Attack Detection Via Simulating Spoofing Clues

Keyword: augmentation

AI-Guided Feature Segmentation Techniques to Model Features from Single Crystal Diamond Growth

Data-Augmentation-Based Dialectal Adaptation for LLMs

Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies

ASR advancements for indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana

Graph data augmentation with Gromow-Wasserstein Barycenters

Joint Physical-Digital Facial Attack Detection Via Simulating Spoofing Clues

Masked Image Modeling as a Framework for Self-Supervised Learning across Eye Movements

FashionFail: Addressing Failure Cases in Fashion Object Detection and Segmentation

Generating Synthetic Time Series Data for Cyber-Physical Systems