Showing new listings for Wednesday, 25 September 2024

Keyword: detection

Title:

      Deciphering Cardiac Destiny: Unveiling Future Risks Through Cutting-Edge Machine Learning Approaches

Authors: G.Divya, M.Naga SravanKumar, T.JayaDharani, B.Pavan, K.Praveen
Subjects: Subjects: Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Cardiac arrest remains a leading cause of death worldwide, necessitating proactive measures for early detection and intervention. This project aims to develop and assess predictive models for the timely identification of cardiac arrest incidents, utilizing a comprehensive dataset of clinical parameters and patient histories. Employing machine learning (ML) algorithms like XGBoost, Gradient Boosting, and Naive Bayes, alongside a deep learning (DL) approach with Recurrent Neural Networks (RNNs), we aim to enhance early detection capabilities. Rigorous experimentation and validation revealed the superior performance of the RNN model, which effectively captures complex temporal dependencies within the data. Our findings highlight the efficacy of these models in accurately predicting cardiac arrest likelihood, emphasizing the potential for improved patient care through early risk stratification and personalized interventions. By leveraging advanced analytics, healthcare providers can proactively mitigate cardiac arrest risk, optimize resource allocation, and improve patient outcomes. This research highlights the transformative potential of machine learning and deep learning techniques in managing cardiovascular risk and advances the field of predictive healthcare analytics.
Title:
```
  Global Context Enhanced Anomaly Detection of Cyber Attacks via Decoupled Graph Neural Networks
```
Authors: Ahmad Hafez
Subjects: Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Recently, there has been a substantial amount of interest in GNN-based anomaly detection. Existing efforts have focused on simultaneously mastering the node representations and the classifier necessary for identifying abnormalities with relatively shallow models to create an embedding. Therefore, the existing state-of-the-art models are incapable of capturing nonlinear network information and producing suboptimal outcomes. In this thesis, we deploy decoupled GNNs to overcome this issue. Specifically, we decouple the essential node representations and classifier for detecting anomalies. In addition, for node representation learning, we develop a GNN architecture with two modules for aggregating node feature information to produce the final node embedding. Finally, we conduct empirical experiments to verify the effectiveness of our proposed approach. The findings demonstrate that decoupled training along with the global context enhanced representation of the nodes is superior to the state-of-the-art models in terms of AUC and introduces a novel way of capturing the node information.
Title:
```
  WISDOM: An AI-powered framework for emerging research detection using weak signal analysis and advanced topic modeling
```
Authors: Ashkan Ebadi, Alain Auger, Yvan Gauthier
Subjects: Subjects: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The landscape of science and technology is characterized by its dynamic and evolving nature, constantly reshaped by new discoveries, innovations, and paradigm shifts. Moreover, science is undergoing a remarkable shift towards increasing interdisciplinary collaboration, where the convergence of diverse fields fosters innovative solutions to complex problems. Detecting emerging scientific topics is paramount as it enables industries, policymakers, and innovators to adapt their strategies, investments, and regulations proactively. As the common approach for detecting emerging technologies, despite being useful, bibliometric analyses may suffer from oversimplification and/or misinterpretation of complex interdisciplinary trends. In addition, relying solely on domain experts to pinpoint emerging technologies from science and technology trends might restrict the ability to systematically analyze extensive information and introduce subjective judgments into the interpretations. To overcome these drawbacks, in this work, we present an automated artificial intelligence-enabled framework, called WISDOM, for detecting emerging research themes using advanced topic modeling and weak signal analysis. The proposed approach can assist strategic planners and domain experts in more effectively recognizing and tracking trends related to emerging topics by swiftly processing and analyzing vast volumes of data, uncovering hidden cross-disciplinary patterns, and offering unbiased insights, thereby enhancing the efficiency and objectivity of the detection process. As the case technology, we assess WISDOM's performance in identifying emerging research as well as their trends, in the field of underwater sensing technologies using scientific papers published between 2004 and 2021.
Title:
```
  Damage detection in an uncertain nonlinear beam based on stochastic Volterra series
```
Authors: Luis Gustavo Giacon Villani, Samuel da Silva, Americo Cunha Jr
Subjects: Subjects: Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Probability (math.PR); Applications (stat.AP)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The damage detection problem in mechanical systems, using vibration measurements, is commonly called Structural Health Monitoring (SHM). Many tools are able to detect damages by changes in the vibration pattern, mainly, when damages induce nonlinear behavior. However, a more difficult problem is to detect structural variation associated with damage, when the mechanical system has nonlinear behavior even in the reference condition. In these cases, more sophisticated methods are required to detect if the changes in the response are based on some structural variation or changes in the vibration regime, because both can generate nonlinearities. Among the many ways to solve this problem, the use of the Volterra series has several favorable points, because they are a generalization of the linear convolution, allowing the separation of linear and nonlinear contributions by input filtering through the Volterra kernels. On the other hand, the presence of uncertainties in mechanical systems, due to noise, geometric imperfections, manufacturing irregularities, environmental conditions, and others, can also change the responses, becoming more difficult the damage detection procedure. An approach based on a stochastic version of Volterra series is proposed to be used in the detection of a breathing crack in a beam vibrating in a nonlinear regime of motion, even in reference condition (without crack). The system uncertainties are simulated by the variation imposed in the linear stiffness and damping coefficient. The results show, that the nonlinear analysis done, considering the high order Volterra kernels, allows the approach to detect the crack with a small propagation and probability confidence, even in the presence of uncertainties.
Title:
```
  Trajectory Anomaly Detection with Language Models
```
Authors: Jonathan Mbuya, Dieter Pfoser, Antonios Anastasopoulos
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract This paper presents a novel approach for trajectory anomaly detection using an autoregressive causal-attention model, termed LM-TAD. This method leverages the similarities between language statements and trajectories, both of which consist of ordered elements requiring coherence through external rules and contextual variations. By treating trajectories as sequences of tokens, our model learns the probability distributions over trajectories, enabling the identification of anomalous locations with high precision. We incorporate user-specific tokens to account for individual behavior patterns, enhancing anomaly detection tailored to user context. Our experiments demonstrate the effectiveness of LM-TAD on both synthetic and real-world datasets. In particular, the model outperforms existing methods on the Pattern of Life (PoL) dataset by detecting user-contextual anomalies and achieves competitive results on the Porto taxi dataset, highlighting its adaptability and robustness. Additionally, we introduce the use of perplexity and surprisal rate metrics for detecting outliers and pinpointing specific anomalous locations within trajectories. The LM-TAD framework supports various trajectory representations, including GPS coordinates, staypoints, and activity types, proving its versatility in handling diverse trajectory data. Moreover, our approach is well-suited for online trajectory anomaly detection, significantly reducing computational latency by caching key-value states of the attention mechanism, thereby avoiding repeated computations.
Title:
```
  Uncovering Coordinated Cross-Platform Information Operations Threatening the Integrity of the 2024 U.S. Presidential Election Online Discussion
```
Authors: Marco Minici, Luca Luceri, Federico Cinus, Emilio Ferrara
Subjects: Subjects: Social and Information Networks (cs.SI); Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Information Operations (IOs) pose a significant threat to the integrity of democratic processes, with the potential to influence election-related online discourse. In anticipation of the 2024 U.S. presidential election, we present a study aimed at uncovering the digital traces of coordinated IOs on $\mathbb{X}$ (formerly Twitter). Using our machine learning framework for detecting online coordination, we analyze a dataset comprising election-related conversations on $\mathbb{X}$ from May 2024. This reveals a network of coordinated inauthentic actors, displaying notable similarities in their link-sharing behaviors. Our analysis shows concerted efforts by these accounts to disseminate misleading, redundant, and biased information across the Web through a coordinated cross-platform information operation: The links shared by this network frequently direct users to other social media platforms or suspicious websites featuring low-quality political content and, in turn, promoting the same $\mathbb{X}$ and YouTube accounts. Members of this network also shared deceptive images generated by AI, accompanied by language attacking political figures and symbolic imagery intended to convey power and dominance. While $\mathbb{X}$ has suspended a subset of these accounts, more than 75% of the coordinated network remains active. Our findings underscore the critical role of developing computational models to scale up the detection of threats on large social media platforms, and emphasize the broader implications of these techniques to detect IOs across the wider Web.
Title:
```
  VLMine: Long-Tail Data Mining with Vision Language Models
```
Authors: Mao Ye, Gregory P. Meyer, Zaiwei Zhang, Dennis Park, Siva Karthik Mustikovela, Yuning Chai, Eric M Wolff
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Ensuring robust performance on long-tail examples is an important problem for many real-world applications of machine learning, such as autonomous driving. This work focuses on the problem of identifying rare examples within a corpus of unlabeled data. We propose a simple and scalable data mining approach that leverages the knowledge contained within a large vision language model (VLM). Our approach utilizes a VLM to summarize the content of an image into a set of keywords, and we identify rare examples based on keyword frequency. We find that the VLM offers a distinct signal for identifying long-tail examples when compared to conventional methods based on model uncertainty. Therefore, we propose a simple and general approach for integrating signals from multiple mining algorithms. We evaluate the proposed method on two diverse tasks: 2D image classification, in which inter-class variation is the primary source of data diversity, and on 3D object detection, where intra-class variation is the main concern. Furthermore, through the detection task, we demonstrate that the knowledge extracted from 2D images is transferable to the 3D domain. Our experiments consistently show large improvements (between 10\% and 50\%) over the baseline techniques on several representative benchmarks: ImageNet-LT, Places-LT, and the Waymo Open Dataset.
Title:
```
  AgriNeRF: Neural Radiance Fields for Agriculture in Challenging Lighting Conditions
```
Authors: Samarth Chopra, Fernando Cladera, Varun Murali, Vijay Kumar
Subjects: Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Neural Radiance Fields (NeRFs) have shown significant promise in 3D scene reconstruction and novel view synthesis. In agricultural settings, NeRFs can serve as digital twins, providing critical information about fruit detection for yield estimation and other important metrics for farmers. However, traditional NeRFs are not robust to challenging lighting conditions, such as low-light, extreme bright light and varying lighting. To address these issues, this work leverages three different sensors: an RGB camera, an event camera and a thermal camera. Our RGB scene reconstruction shows an improvement in PSNR and SSIM by +2.06 dB and +8.3% respectively. Our cross-spectral scene reconstruction enhances downstream fruit detection by +43.0% in mAP50 and +61.1% increase in mAP50-95. The integration of additional sensors leads to a more robust and informative NeRF. We demonstrate that our multi-modal system yields high quality photo-realistic reconstructions under various tree canopy covers and at different times of the day. This work results in the development of a resilient NeRF, capable of performing well in visibly degraded scenarios, as well as a learnt cross-spectral representation, that is used for automated fruit detection.
Title:
```
  Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs
```
Authors: Angelos Mavrogiannis, Dehao Yuan, Yiannis Aloimonos
Subjects: Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract There has been a lot of interest in grounding natural language to physical entities through visual context. While Vision Language Models (VLMs) can ground linguistic instructions to visual sensory information, they struggle with grounding non-visual attributes, like the weight of an object. Our key insight is that non-visual attribute detection can be effectively achieved by active perception guided by visual reasoning. To this end, we present a perception-action programming API that consists of VLMs and Large Language Models (LLMs) as backbones, together with a set of robot control functions. When prompted with this API and a natural language query, an LLM generates a program to actively identify attributes given an input image. Offline testing on the Odd-One-Out dataset demonstrates that our framework outperforms vanilla VLMs in detecting attributes like relative object location, size, and weight. Online testing in realistic household scenes on AI2-THOR and a real robot demonstration on a DJI RoboMaster EP robot highlight the efficacy of our approach.
Title:
```
  VaLID: Verification as Late Integration of Detections for LiDAR-Camera Fusion
```
Authors: Vanshika Vats, Marzia Binta Nizam, James Davis
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Vehicle object detection is possible using both LiDAR and camera data. Methods using LiDAR generally outperform those using cameras only. The highest accuracy methods utilize both of these modalities through data fusion. In our study, we propose a model-independent late fusion method, VaLID, which validates whether each predicted bounding box is acceptable or not. Our method verifies the higher-performing, yet overly optimistic LiDAR model detections using camera detections that are obtained from either specially trained, general, or open-vocabulary models. VaLID uses a simple multi-layer perceptron trained with a high recall bias to reduce the false predictions made by the LiDAR detector, while still preserving the true ones. Evaluating with multiple combinations of LiDAR and camera detectors on the KITTI dataset, we reduce false positives by an average of 63.9%, thus outperforming the individual detectors on 2D average precision (2DAP). Our approach is model-agnostic and demonstrates state-of-the-art competitive performance even when using generic camera detectors that were not trained specifically for this dataset.
Title:
```
  A positive meshless finite difference scheme for scalar conservation laws with adaptive artificial viscosity driven by fault detection
```
Authors: Cesare Bracco, Oleg Davydov, Carlotta Giannelli, Alessandra Sestini
Subjects: Subjects: Numerical Analysis (math.NA)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract We present a meshless finite difference method for multivariate scalar conservation laws that generates positive schemes satisfying a local maximum principle on irregular nodes and relies on artificial viscosity for shock capturing. Coupling two different numerical differentiation formulas and adaptive selection of the sets of influence allows to meet a local CFL condition without any a priori time step restriction. Artificial viscosity term is chosen in an adaptive way by applying it only in the vicinity of the sharp features of the solution identified by an algorithm for fault detection on scattered data. Numerical tests demonstrate a robust performance of the method on irregular nodes and advantages of adaptive artificial viscosity. The accuracy of the obtained solutions is comparable to that for standard monotone methods available only on Cartesian grids.
Title:
```
  Mixing Data-driven and Geometric Models for Satellite Docking Port State Estimation using an RGB or Event Camera
```
Authors: Cedric Le Gentil, Jack Naylor, Nuwan Munasinghe, Jasprabhjit Mehami, Benny Dai, Mikhail Asavkin, Donald G. Dansereau, Teresa Vidal-Calleja
Subjects: Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In-orbit automated servicing is a promising path towards lowering the cost of satellite operations and reducing the amount of orbital debris. For this purpose, we present a pipeline for automated satellite docking port detection and state estimation using monocular vision data from standard RGB sensing or an event camera. Rather than taking snapshots of the environment, an event camera has independent pixels that asynchronously respond to light changes, offering advantages such as high dynamic range, low power consumption and latency, etc. This work focuses on satellite-agnostic operations (only a geometric knowledge of the actual port is required) using the recently released Lockheed Martin Mission Augmentation Port (LM-MAP) as the target. By leveraging shallow data-driven techniques to preprocess the incoming data to highlight the LM-MAP's reflective navigational aids and then using basic geometric models for state estimation, we present a lightweight and data-efficient pipeline that can be used independently with either RGB or event cameras. We demonstrate the soundness of the pipeline and perform a quantitative comparison of the two modalities based on data collected with a photometrically accurate test bench that includes a robotic arm to simulate the target satellite's uncontrolled motion.
Title:
```
  Identified-and-Targeted: The First Early Evidence of the Privacy-Invasive Use of Browser Fingerprinting for Online Tracking
```
Authors: Zengrui Liu, Jimmy Dani, Shujiang Wu, Yinzhi Cao, Nitesh Saxena
Subjects: Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract While advertising has become commonplace in today's online interactions, there is a notable dearth of research investigating the extent to which browser fingerprinting is harnessed for user tracking and targeted advertising. Prior studies only measured whether fingerprinting-related scripts are being run on the websites but that in itself does not necessarily mean that fingerprinting is being used for the privacy-invasive purpose of online tracking because fingerprinting might be deployed for the defensive purposes of bot/fraud detection and user authentication. It is imperative to address the mounting concerns regarding the utilization of browser fingerprinting in the realm of online advertising. To understand the privacy-invasive use of fingerprinting for user tracking, this paper introduces a new framework ``FPTrace'' (fingerprinting-based tracking assessment and comprehensive evaluation framework) designed to identify alterations in advertisements resulting from adjustments in browser fingerprinting settings. Our approach involves emulating genuine user interactions, capturing advertiser bid data, and closely monitoring HTTP information. Using FPTrace we conduct a large-scale measurement study to identify whether browser fingerprinting is being used for the purpose of user tracking and ad targeting. The results we have obtained provide robust evidence supporting the utilization of browser fingerprinting for the purposes of advertisement tracking and targeting. This is substantiated by significant disparities in bid values and a reduction in HTTP records subsequent to changes in fingerprinting. In conclusion, our research unveils the widespread employment of browser fingerprinting in online advertising, prompting critical considerations regarding user privacy and data security within the digital advertising landscape.
Title:
```
  PDT: Uav Target Detection Dataset for Pests and Diseases Tree
```
Authors: Mingle Zhou, Rui Xing, Delong Han, Zhiyong Qi, Gang Li
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract UAVs emerge as the optimal carriers for visual weed iden?tification and integrated pest and disease management in crops. How?ever, the absence of specialized datasets impedes the advancement of model development in this domain. To address this, we have developed the Pests and Diseases Tree dataset (PDT dataset). PDT dataset repre?sents the first high-precision UAV-based dataset for targeted detection of tree pests and diseases, which is collected in real-world operational environments and aims to fill the gap in available datasets for this field. Moreover, by aggregating public datasets and network data, we further introduced the Common Weed and Crop dataset (CWC dataset) to ad?dress the challenge of inadequate classification capabilities of test models within datasets for this field. Finally, we propose the YOLO-Dense Pest (YOLO-DP) model for high-precision object detection of weed, pest, and disease crop images. We re-evaluate the state-of-the-art detection models with our proposed PDT dataset and CWC dataset, showing the completeness of the dataset and the effectiveness of the YOLO-DP. The proposed PDT dataset, CWC dataset, and YOLO-DP model are pre?sented at this https URL.
Title:
```
  A Comprehensive Evaluation of Large Language Models on Mental Illnesses
```
Authors: Abdelrahman Hanafi, Mohammed Saad, Noureldin Zahran, Radwa J. Hanafy, Mohammed E. Fouda
Subjects: Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Large language models have shown promise in various domains, including healthcare. In this study, we conduct a comprehensive evaluation of LLMs in the context of mental health tasks using social media data. We explore the zero-shot (ZS) and few-shot (FS) capabilities of various LLMs, including GPT-4, Llama 3, Gemini, and others, on tasks such as binary disorder detection, disorder severity evaluation, and psychiatric knowledge assessment. Our evaluation involved 33 models testing 9 main prompt templates across the tasks. Key findings revealed that models like GPT-4 and Llama 3 exhibited superior performance in binary disorder detection, with accuracies reaching up to 85% on certain datasets. Moreover, prompt engineering played a crucial role in enhancing model performance. Notably, the Mixtral 8x22b model showed an improvement of over 20%, while Gemma 7b experienced a similar boost in performance. In the task of disorder severity evaluation, we observed that FS learning significantly improved the model's accuracy, highlighting the importance of contextual examples in complex assessments. Notably, the Phi-3-mini model exhibited a substantial increase in performance, with balanced accuracy improving by over 6.80% and mean average error dropping by nearly 1.3 when moving from ZS to FS learning. In the psychiatric knowledge task, recent models generally outperformed older, larger counterparts, with the Llama 3.1 405b achieving an accuracy of 91.2%. Despite promising results, our analysis identified several challenges, including variability in performance across datasets and the need for careful prompt engineering. Furthermore, the ethical guards imposed by many LLM providers hamper the ability to accurately evaluate their performance, due to tendency to not respond to potentially sensitive queries.
Title:
```
  A Survey of Stance Detection on Social Media: New Directions and Perspectives
```
Authors: Bowen Zhang, Genan Dai, Fuqiang Niu, Nan Yin, Xiaomao Fan, Hu Huang
Subjects: Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In modern digital environments, users frequently express opinions on contentious topics, providing a wealth of information on prevailing attitudes. The systematic analysis of these opinions offers valuable insights for decision-making in various sectors, including marketing and politics. As a result, stance detection has emerged as a crucial subfield within affective computing, enabling the automatic detection of user stances in social media conversations and providing a nuanced understanding of public sentiment on complex issues. Recent years have seen a surge of research interest in developing effective stance detection methods, with contributions from multiple communities, including natural language processing, web science, and social computing. This paper provides a comprehensive survey of stance detection techniques on social media, covering task definitions, datasets, approaches, and future works. We review traditional stance detection models, as well as state-of-the-art methods based on large language models, and discuss their strengths and limitations. Our survey highlights the importance of stance detection in understanding public opinion and sentiment, and identifies gaps in current research. We conclude by outlining potential future directions for stance detection on social media, including the need for more robust and generalizable models, and the importance of addressing emerging challenges such as multi-modal stance detection and stance detection in low-resource languages.
Title:
```
  Real-Time Pedestrian Detection on IoT Edge Devices: A Lightweight Deep Learning Approach
```
Authors: Muhammad Dany Alfikri, Rafael Kaliski
Subjects: Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Artificial intelligence (AI) has become integral to our everyday lives. Computer vision has advanced to the point where it can play the safety critical role of detecting pedestrians at road intersections in intelligent transportation systems and alert vehicular traffic as to potential collisions. Centralized computing analyzes camera feeds and generates alerts for nearby vehicles. However, real-time applications face challenges such as latency, limited data transfer speeds, and the risk of life loss. Edge servers offer a potential solution for real-time applications, providing localized computing and storage resources and lower response times. Unfortunately, edge servers have limited processing power. Lightweight deep learning (DL) techniques enable edge servers to utilize compressed deep neural network (DNN) models. The research explores implementing a lightweight DL model on Artificial Intelligence of Things (AIoT) edge devices. An optimized You Only Look Once (YOLO) based DL model is deployed for real-time pedestrian detection, with detection events transmitted to the edge server using the Message Queuing Telemetry Transport (MQTT) protocol. The simulation results demonstrate that the optimized YOLO model can achieve real-time pedestrian detection, with a fast inference speed of 147 milliseconds, a frame rate of 2.3 frames per second, and an accuracy of 78%, representing significant improvements over baseline models.
Title:
```
  Automated Assessment of Multimodal Answer Sheets in the STEM domain
```
Authors: Rajlaxmi Patil, Aditya Ashutosh Kulkarni, Ruturaj Ghatage, Sharvi Endait, Geetanjali Kale, Raviraj Joshi
Subjects: Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In the domain of education, the integration of,technology has led to a transformative era, reshaping traditional,learning paradigms. Central to this evolution is the automation,of grading processes, particularly within the STEM domain encompassing Science, Technology, Engineering, and Mathematics.,While efforts to automate grading have been made in subjects,like Literature, the multifaceted nature of STEM assessments,presents unique challenges, ranging from quantitative analysis,to the interpretation of handwritten diagrams. To address these,challenges, this research endeavors to develop efficient and reliable grading methods through the implementation of automated,assessment techniques using Artificial Intelligence (AI). Our,contributions lie in two key areas: firstly, the development of a,robust system for evaluating textual answers in STEM, leveraging,sample answers for precise comparison and grading, enabled by,advanced algorithms and natural language processing techniques.,Secondly, a focus on enhancing diagram evaluation, particularly,flowcharts, within the STEM context, by transforming diagrams,into textual representations for nuanced assessment using a,Large Language Model (LLM). By bridging the gap between,visual representation and semantic meaning, our approach ensures accurate evaluation while minimizing manual intervention.,Through the integration of models such as CRAFT for text,extraction and YoloV5 for object detection, coupled with LLMs,like Mistral-7B for textual evaluation, our methodology facilitates,comprehensive assessment of multimodal answer sheets. This,paper provides a detailed account of our methodology, challenges,encountered, results, and implications, emphasizing the potential,of AI-driven approaches in revolutionizing grading practices in,STEM education.
Title:
```
  A Multi-Level Approach for Class Imbalance Problem in Federated Learning for Remote Industry 4.0 Applications
```
Authors: Razin Farhan Hussain, Mohsen Amini Salehi
Subjects: Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Deep neural network (DNN) models are effective solutions for industry 4.0 applications (\eg oil spill detection, fire detection, anomaly detection). However, training a DNN network model needs a considerable amount of data collected from various sources and transferred to the central cloud server that can be expensive and sensitive to privacy. For instance, in the remote offshore oil field where network connectivity is vulnerable, a federated fog environment can be a potential computing platform. Hence it is feasible to perform computation within the federation. On the contrary, performing a DNN model training using fog systems poses a security issue that the federated learning (FL) technique can resolve. In this case, the new challenge is the class imbalance problem that can be inherited in local data sets and can degrade the performance of the global model. Therefore, FL training needs to be performed considering the class imbalance problem locally. In addition, an efficient technique to select the relevant worker model needs to be adopted at the global level to increase the robustness of the global model. Accordingly, we utilize one of the suitable loss functions addressing the class imbalance in workers at the local level. In addition, we employ a dynamic threshold mechanism with user-defined worker's weight to efficiently select workers for aggregation that improve the global model's robustness. Finally, we perform an extensive empirical evaluation to explore the benefits of our solution and find up to 3-5% performance improvement than baseline federated learning methods.
Title:
```
  A Computer Vision Approach for Autonomous Cars to Drive Safe at Construction Zone
```
Authors: Abu Shad Ahammed, Md Shahi Amran Hossain, Roman Obermaisser
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract To build a smarter and safer city, a secure, efficient, and sustainable transportation system is a key requirement. The autonomous driving system (ADS) plays an important role in the development of smart transportation and is considered one of the major challenges facing the automotive sector in recent decades. A car equipped with an autonomous driving system (ADS) comes with various cutting-edge functionalities such as adaptive cruise control, collision alerts, automated parking, and more. A primary area of research within ADAS involves identifying road obstacles in construction zones regardless of the driving environment. This paper presents an innovative and highly accurate road obstacle detection model utilizing computer vision technology that can be activated in construction zones and functions under diverse drift conditions, ultimately contributing to build a safer road transportation system. The model developed with the YOLO framework achieved a mean average precision exceeding 94\% and demonstrated an inference time of 1.6 milliseconds on the validation dataset, underscoring the robustness of the methodology applied to mitigate hazards and risks for autonomous vehicles.
Title:
```
  Deep Learning Techniques for Automatic Lateral X-ray Cephalometric Landmark Detection: Is the Problem Solved?
```
Authors: Hongyuan Zhang, Ching-Wei Wang, Hikam Muzakky, Juan Dai, Xuguang Li, Chenglong Ma, Qian Wu, Xianan Cui, Kunlun Xu, Pengfei He, Dongqian Guo, Xianlong Wang, Hyunseok Lee, Zhangnan Zhong, Zhu Zhu, Bingsheng Huang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Localization of the craniofacial landmarks from lateral cephalograms is a fundamental task in cephalometric analysis. The automation of the corresponding tasks has thus been the subject of intense research over the past decades. In this paper, we introduce the "Cephalometric Landmark Detection (CL-Detection)" dataset, which is the largest publicly available and comprehensive dataset for cephalometric landmark detection. This multi-center and multi-vendor dataset includes 600 lateral X-ray images with 38 landmarks acquired with different equipment from three medical centers. The overarching objective of this paper is to measure how far state-of-the-art deep learning methods can go for cephalometric landmark detection. Following the 2023 MICCAI CL-Detection Challenge, we report the results of the top ten research groups using deep learning methods. Results show that the best methods closely approximate the expert analysis, achieving a mean detection rate of 75.719% and a mean radial error of 1.518 mm. While there is room for improvement, these findings undeniably open the door to highly accurate and fully automatic location of craniofacial landmarks. We also identify scenarios for which deep learning methods are still failing. Both the dataset and detailed results are publicly available online, while the platform will remain open for the community to benchmark future algorithm developments at this https URL.
Title:
```
  Zero-Shot Detection of AI-Generated Images
```
Authors: Davide Cozzolino, Giovanni Poggi, Matthias Nießner, Luisa Verdoliva
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Detecting AI-generated images has become an extraordinarily difficult challenge as new generative architectures emerge on a daily basis with more and more capabilities and unprecedented realism. New versions of many commercial tools, such as DALLE, Midjourney, and Stable Diffusion, have been released recently, and it is impractical to continually update and retrain supervised forensic detectors to handle such a large variety of models. To address this challenge, we propose a zero-shot entropy-based detector (ZED) that neither needs AI-generated training data nor relies on knowledge of generative architectures to artificially synthesize their artifacts. Inspired by recent works on machine-generated text detection, our idea is to measure how surprising the image under analysis is compared to a model of real images. To this end, we rely on a lossless image encoder that estimates the probability distribution of each pixel given its context. To ensure computational efficiency, the encoder has a multi-resolution architecture and contexts comprise mostly pixels of the lower-resolution version of the image.Since only real images are needed to learn the model, the detector is independent of generator architectures and synthetic training data. Using a single discriminative feature, the proposed detector achieves state-of-the-art performance. On a wide variety of generative models it achieves an average improvement of more than 3% over the SoTA in terms of accuracy. Code is available at this https URL.
Title:
```
  Symmetries and Expressive Requirements for Learning General Policies
```
Authors: Dominik Drexler, Simon Ståhlberg, Blai Bonet, Hector Geffner
Subjects: Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract State symmetries play an important role in planning and generalized planning. In the first case, state symmetries can be used to reduce the size of the search; in the second, to reduce the size of the training set. In the case of general planning, however, it is also critical to distinguish non-symmetric states, i.e., states that represent non-isomorphic relational structures. However, while the language of first-order logic distinguishes non-symmetric states, the languages and architectures used to represent and learn general policies do not. In particular, recent approaches for learning general policies use state features derived from description logics or learned via graph neural networks (GNNs) that are known to be limited by the expressive power of C_2, first-order logic with two variables and counting. In this work, we address the problem of detecting symmetries in planning and generalized planning and use the results to assess the expressive requirements for learning general policies over various planning domains. For this, we map planning states to plain graphs, run off-the-shelf algorithms to determine whether two states are isomorphic with respect to the goal, and run coloring algorithms to determine if C_2 features computed logically or via GNNs distinguish non-isomorphic states. Symmetry detection results in more effective learning, while the failure to detect non-symmetries prevents general policies from being learned at all in certain domains.
Title:
```
  Konstruktor: A Strong Baseline for Simple Knowledge Graph Question Answering
```
Authors: Maria Lysyuk, Mikhail Salnikov, Pavel Braslavski, Alexander Panchenko
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract While being one of the most popular question types, simple questions such as "Who is the author of Cinderella?", are still not completely solved. Surprisingly, even the most powerful modern Large Language Models are prone to errors when dealing with such questions, especially when dealing with rare entities. At the same time, as an answer may be one hop away from the question entity, one can try to develop a method that uses structured knowledge graphs (KGs) to answer such questions. In this paper, we introduce Konstruktor - an efficient and robust approach that breaks down the problem into three steps: (i) entity extraction and entity linking, (ii) relation prediction, and (iii) querying the knowledge graph. Our approach integrates language models and knowledge graphs, exploiting the power of the former and the interpretability of the latter. We experiment with two named entity recognition and entity linking methods and several relation detection techniques. We show that for relation detection, the most challenging step of the workflow, a combination of relation classification/generation and ranking outperforms other methods. We report Konstruktor's strong results on four datasets.
Title:
```
  DepMamba: Progressive Fusion Mamba for Multimodal Depression Detection
```
Authors: Jiaxin Ye, Junping Zhang, Hongming Shan
Subjects: Subjects: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Depression is a common mental disorder that affects millions of people worldwide. Although promising, current multimodal methods hinge on aligned or aggregated multimodal fusion, suffering two significant limitations: (i) inefficient long-range temporal modeling, and (ii) sub-optimal multimodal fusion between intermodal fusion and intramodal processing. In this paper, we propose an audio-visual progressive fusion Mamba for multimodal depression detection, termed DepMamba. DepMamba features two core designs: hierarchical contextual modeling and progressive multimodal fusion. On the one hand, hierarchical modeling introduces convolution neural networks and Mamba to extract the local-to-global features within long-range sequences. On the other hand, the progressive fusion first presents a multimodal collaborative State Space Model (SSM) extracting intermodal and intramodal information for each modality, and then utilizes a multimodal enhanced SSM for modality cohesion. Extensive experimental results on two large-scale depression datasets demonstrate the superior performance of our DepMamba over existing state-of-the-art methods. Code is available at this https URL.
Title:
```
  ASD-Diffusion: Anomalous Sound Detection with Diffusion Models
```
Authors: Fengrun Zhang, Xiang Xie, Kai Guo
Subjects: Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Unsupervised Anomalous Sound Detection (ASD) aims to design a generalizable method that can be used to detect anomalies when only normal sounds are given. In this paper, Anomalous Sound Detection based on Diffusion Models (ASD-Diffusion) is proposed for ASD in real-world factories. In our pipeline, the anomalies in acoustic features are reconstructed from their noisy corrupted features into their approximate normal pattern. Secondly, a post-processing anomalies filter algorithm is proposed to detect anomalies that exhibit significant deviation from the original input after reconstruction. Furthermore, denoising diffusion implicit model is introduced to accelerate the inference speed by a longer sampling interval of the denoising process. The proposed method is innovative in the application of diffusion models as a new scheme. Experimental results on the development set of DCASE 2023 challenge task 2 outperform the baseline by 7.75%, demonstrating the effectiveness of the proposed method.
Title:
```
  Leveraging Unsupervised Learning for Cost-Effective Visual Anomaly Detection
```
Authors: Yunbo Long, Zhengyang Ling, Sam Brook, Duncan McFarlane, Alexandra Brintrup
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Traditional machine learning-based visual inspection systems require extensive data collection and repetitive model training to improve accuracy. These systems typically require expensive camera, computing equipment and significant machine learning ex- pertise, which can substantially burden small and medium-sized enterprises. This study explores leveraging unsupervised learning methods with pre-trained models and low-cost hardware to create a cost-effective visual anomaly detection system. The research aims to develop a low-cost visual anomaly detection solution that uses minimal data for model training while maintaining general- izability and scalability. The system utilises unsupervised learning models from Anomalib and is deployed on affordable Raspberry Pi hardware through openVINO. The results show that this cost-effective system can complete anomaly defection training and inference on a Raspberry Pi in just 90 seconds using only 10 normal product images, achieving an F1 macro score exceeding 0.95. While the system is slightly sensitive to environmental changes like lighting, product positioning, or background, it remains a swift and economical method for factory automation inspection for small and medium-sized manufacturers
Title:
```
  Exploring the Impact of Outlier Variability on Anomaly Detection Evaluation Metrics
```
Authors: Minjae Ok, Simon Klüttermann, Emmanuel Müller
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Anomaly detection is a dynamic field, in which the evaluation of models plays a critical role in understanding their effectiveness. The selection and interpretation of the evaluation metrics are pivotal, particularly in scenarios with varying amounts of anomalies. This study focuses on examining the behaviors of three widely used anomaly detection metrics under different conditions: F1 score, Receiver Operating Characteristic Area Under Curve (ROC AUC), and Precision-Recall Curve Area Under Curve (AUCPR). Our study critically analyzes the extent to which these metrics provide reliable and distinct insights into model performance, especially considering varying levels of outlier fractions and contamination thresholds in datasets. Through a comprehensive experimental setup involving widely recognized algorithms for anomaly detection, we present findings that challenge the conventional understanding of these metrics and reveal nuanced behaviors under varying conditions. We demonstrated that while the F1 score and AUCPR are sensitive to outlier fractions, the ROC AUC maintains consistency and is unaffected by such variability. Additionally, under conditions of a fixed outlier fraction in the test set, we observe an alignment between ROC AUC and AUCPR, indicating that the choice between these two metrics may be less critical in such scenarios. The results of our study contribute to a more refined understanding of metric selection and interpretation in anomaly detection, offering valuable insights for both researchers and practitioners in the field.
Title:
```
  Towards Robust Object Detection: Identifying and Removing Backdoors via Module Inconsistency Analysis
```
Authors: Xianda Zhang, Siyuan Liang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Object detection models, widely used in security-critical applications, are vulnerable to backdoor attacks that cause targeted misclassifications when triggered by specific patterns. Existing backdoor defense techniques, primarily designed for simpler models like image classifiers, often fail to effectively detect and remove backdoors in object detectors. We propose a backdoor defense framework tailored to object detection models, based on the observation that backdoor attacks cause significant inconsistencies between local modules' behaviors, such as the Region Proposal Network (RPN) and classification head. By quantifying and analyzing these inconsistencies, we develop an algorithm to detect backdoors. We find that the inconsistent module is usually the main source of backdoor behavior, leading to a removal method that localizes the affected module, resets its parameters, and fine-tunes the model on a small clean dataset. Extensive experiments with state-of-the-art two-stage object detectors show our method achieves a 90% improvement in backdoor removal rate over fine-tuning baselines, while limiting clean data accuracy loss to less than 4%. To the best of our knowledge, this work presents the first approach that addresses both the detection and removal of backdoors in two-stage object detection models, advancing the field of securing these complex systems against backdoor attacks.
Title:
```
  Open-World Object Detection with Instance Representation Learning
```
Authors: Sunoh Lee, Minsik Jeon, Jihong Min, Junwon Seo
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract While humans naturally identify novel objects and understand their relationships, deep learning-based object detectors struggle to detect and relate objects that are not observed during training. To overcome this issue, Open World Object Detection(OWOD) has been introduced to enable models to detect unknown objects in open-world scenarios. However, OWOD methods fail to capture the fine-grained relationships between detected objects, which are crucial for comprehensive scene understanding and applications such as class discovery and tracking. In this paper, we propose a method to train an object detector that can both detect novel objects and extract semantically rich features in open-world conditions by leveraging the knowledge of Vision Foundation Models(VFM). We first utilize the semantic masks from the Segment Anything Model to supervise the box regression of unknown objects, ensuring accurate localization. By transferring the instance-wise similarities obtained from the VFM features to the detector's instance embeddings, our method then learns a semantically rich feature space of these embeddings. Extensive experiments show that our method learns a robust and generalizable feature space, outperforming other OWOD-based feature extraction methods. Additionally, we demonstrate that the enhanced feature from our model increases the detector's applicability to tasks such as open-world tracking.
Title:
```
  Leveraging Mixture of Experts for Improved Speech Deepfake Detection
```
Authors: Viola Negroni, Davide Salvi, Alessandro Ilic Mezza, Paolo Bestagini, Stefano Tubaro
Subjects: Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Speech deepfakes pose a significant threat to personal security and content authenticity. Several detectors have been proposed in the literature, and one of the primary challenges these systems have to face is the generalization over unseen data to identify fake signals across a wide range of datasets. In this paper, we introduce a novel approach for enhancing speech deepfake detection performance using a Mixture of Experts architecture. The Mixture of Experts framework is well-suited for the speech deepfake detection task due to its ability to specialize in different input types and handle data variability efficiently. This approach offers superior generalization and adaptability to unseen data compared to traditional single models or ensemble methods. Additionally, its modular structure supports scalable updates, making it more flexible in managing the evolving complexity of deepfake techniques while maintaining high detection accuracy. We propose an efficient, lightweight gating mechanism to dynamically assign expert weights for each input, optimizing detection performance. Experimental results across multiple datasets demonstrate the effectiveness and potential of our proposed approach.
Title:
```
  GS-Net: Global Self-Attention Guided CNN for Multi-Stage Glaucoma Classification
```
Authors: Dipankar Das, Deepak Ranjan Nayak
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Glaucoma is a common eye disease that leads to irreversible blindness unless timely detected. Hence, glaucoma detection at an early stage is of utmost importance for a better treatment plan and ultimately saving the vision. The recent literature has shown the prominence of CNN-based methods to detect glaucoma from retinal fundus images. However, such methods mainly focus on solving binary classification tasks and have not been thoroughly explored for the detection of different glaucoma stages, which is relatively challenging due to minute lesion size variations and high inter-class similarities. This paper proposes a global self-attention based network called GS-Net for efficient multi-stage glaucoma classification. We introduce a global self-attention module (GSAM) consisting of two parallel attention modules, a channel attention module (CAM) and a spatial attention module (SAM), to learn global feature dependencies across channel and spatial dimensions. The GSAM encourages extracting more discriminative and class-specific features from the fundus images. The experimental results on a publicly available dataset demonstrate that our GS-Net outperforms state-of-the-art methods. Also, the GSAM achieves competitive performance against popular attention modules.
Title:
```
  Neuromorphic Drone Detection: an Event-RGB Multimodal Approach
```
Authors: Gabriele Magrini, Federico Becattini, Pietro Pala, Alberto Del Bimbo, Antonio Porta
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In recent years, drone detection has quickly become a subject of extreme interest: the potential for fast-moving objects of contained dimensions to be used for malicious intents or even terrorist attacks has posed attention to the necessity for precise and resilient systems for detecting and identifying such elements. While extensive literature and works exist on object detection based on RGB data, it is also critical to recognize the limits of such modality when applied to UAVs detection. Detecting drones indeed poses several challenges such as fast-moving objects and scenes with a high dynamic range or, even worse, scarce illumination levels. Neuromorphic cameras, on the other hand, can retain precise and rich spatio-temporal information in situations that are challenging for RGB cameras. They are resilient to both high-speed moving objects and scarce illumination settings, while prone to suffer a rapid loss of information when the objects in the scene are static. In this context, we present a novel model for integrating both domains together, leveraging multimodal data to take advantage of the best of both worlds. To this end, we also release NeRDD (Neuromorphic-RGB Drone Detection), a novel spatio-temporally synchronized Event-RGB Drone detection dataset of more than 3.5 hours of multimodal annotated recordings.
Title:
```
  VisioPhysioENet: Multimodal Engagement Detection using Visual and Physiological Signals
```
Authors: Alakhsimar Singh, Nischay Verma, Kanav Goyal, Amritpal Singh, Puneet Kumar, Xiaobai Li
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract This paper presents VisioPhysioENet, a novel multimodal system that leverages visual cues and physiological signals to detect learner engagement. It employs a two-level approach for visual feature extraction using the Dlib library for facial landmark extraction and the OpenCV library for further estimations. This is complemented by extracting physiological signals using the plane-orthogonal-to-skin method to assess cardiovascular activity. These features are integrated using advanced machine learning classifiers, enhancing the detection of various engagement levels. We rigorously evaluate VisioPhysioENet on the DAiSEE dataset, where it achieves an accuracy of 63.09%, demonstrating a superior ability to discern various levels of engagement compared to existing methodologies. The proposed system's code can be accessed at this https URL.
Title:
```
  HA-FGOVD: Highlighting Fine-grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection
```
Authors: Yuqi Ma, Mengyin Liu, Chao Zhu, Xu-Cheng Yin
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Open-vocabulary object detection (OVD) models are considered to be Large Multi-modal Models (LMM), due to their extensive training data and a large number of parameters. Mainstream OVD models prioritize object coarse-grained category rather than focus on their fine-grained attributes, e.g., colors or materials, thus failed to identify objects specified with certain attributes. However, OVD models are pretrained on large-scale image-text pairs with rich attribute words, whose latent feature space can represent the global text feature as a linear composition of fine-grained attribute tokens without highlighting them. Therefore, we propose in this paper a universal and explicit approach for frozen mainstream OVD models that boosts their attribute-level detection capabilities by highlighting fine-grained attributes in explicit linear space. Firstly, a LLM is leveraged to highlight attribute words within the input text as a zero-shot prompted task. Secondly, by strategically adjusting the token masks, the text encoders of OVD models extract both global text and attribute-specific features, which are then explicitly composited as two vectors in linear space to form the new attribute-highlighted feature for detection tasks, where corresponding scalars are hand-crafted or learned to reweight both two vectors. Notably, these scalars can be seamlessly transferred among different OVD models, which proves that such an explicit linear composition is universal. Empirical evaluation on the FG-OVD dataset demonstrates that our proposed method uniformly improves fine-grained attribute-level OVD of various mainstream models and achieves new state-of-the-art performance.
Title:
```
  Seeing Faces in Things: A Model and Dataset for Pareidolia
```
Authors: Mark Hamilton, Simon Stent, Vasha DuTell, Anne Harrington, Jennifer Corbett, Ruth Rosenholtz, William T. Freeman
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The human visual system is well-tuned to detect faces of all shapes and sizes. While this brings obvious survival advantages, such as a better chance of spotting unknown predators in the bush, it also leads to spurious face detections. Face pareidolia'' describes the perception of face-like structure among otherwise random stimuli: seeing faces in coffee stains or clouds in the sky. In this paper, we study face pareidolia from a computer vision perspective. We present an image dataset ofFaces in Things'', consisting of five thousand web images with human-annotated pareidolic faces. Using this dataset, we examine the extent to which a state-of-the-art human face detector exhibits pareidolia, and find a significant behavioral gap between humans and machines. We find that the evolutionary need for humans to detect animal faces, as well as human faces, may explain some of this gap. Finally, we propose a simple statistical model of pareidolia in images. Through studies on human subjects and our pareidolic face detectors we confirm a key prediction of our model regarding what image conditions are most likely to induce pareidolia. Dataset and Website: this https URL
Title:
```
  ComiCap: A VLMs pipeline for dense captioning of Comic Panels
```
Authors: Emanuele Vivoli, Niccolò Biondi, Marco Bertini, Dimosthenis Karatzas
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The comic domain is rapidly advancing with the development of single- and multi-page analysis and synthesis models. Recent benchmarks and datasets have been introduced to support and assess models' capabilities in tasks such as detection (panels, characters, text), linking (character re-identification and speaker identification), and analysis of comic elements (e.g., dialog transcription). However, to provide a comprehensive understanding of the storyline, a model must not only extract elements but also understand their relationships and generate highly informative captions. In this work, we propose a pipeline that leverages Vision-Language Models (VLMs) to obtain dense, grounded captions. To construct our pipeline, we introduce an attribute-retaining metric that assesses whether all important attributes are identified in the caption. Additionally, we created a densely annotated test set to fairly evaluate open-source VLMs and select the best captioning model according to our metric. Our pipeline generates dense captions with bounding boxes that are quantitatively and qualitatively superior to those produced by specifically trained models, without requiring any additional training. Using this pipeline, we annotated over 2 million panels across 13,000 books, which will be available on the project page this https URL.
Title:
```
  Segmentation Strategies in Deep Learning for Prostate Cancer Diagnosis: A Comparative Study of Mamba, SAM, and YOLO
```
Authors: Ali Badiezadeh, Amin Malekmohammadi, Seyed Mostafa Mirhassani, Parisa Gifani, Majid Vafaeezadeh
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Accurate segmentation of prostate cancer histopathology images is crucial for diagnosis and treatment planning. This study presents a comparative analysis of three deep learning-based methods, Mamba, SAM, and YOLO, for segmenting prostate cancer histopathology images. We evaluated the performance of these models on two comprehensive datasets, Gleason 2019 and SICAPv2, using Dice score, precision, and recall metrics. Our results show that the High-order Vision Mamba UNet (H-vmunet) model outperforms the other two models, achieving the highest scores across all metrics on both datasets. The H-vmunet model's advanced architecture, which integrates high-order visual state spaces and 2D-selective-scan operations, enables efficient and sensitive lesion detection across different scales. Our study demonstrates the potential of the H-vmunet model for clinical applications and highlights the importance of robust validation and comparison of deep learning-based methods for medical image analysis. The findings of this study contribute to the development of accurate and reliable computer-aided diagnosis systems for prostate cancer. The code is available at this http URL.
Title:
```
  LLMCount: Enhancing Stationary mmWave Detection with Multimodal-LLM
```
Authors: Boyan Li, Shengyi Ding, Deen Ma, Yixuan Wu, Hongjie Liao, Kaiyuan Hu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Millimeter wave sensing provides people with the capability of sensing the surrounding crowds in a non-invasive and privacy-preserving manner, which holds huge application potential. However, detecting stationary crowds remains challenging due to several factors such as minimal movements (like breathing or casual fidgets), which can be easily treated as noise clusters during data collection and consequently filtered in the following processing procedures. Additionally, the uneven distribution of signal power due to signal power attenuation and interferences resulting from external reflectors or absorbers further complicates accurate detection. To address these challenges and enable stationary crowd detection across various application scenarios requiring specialized domain adaption, we introduce LLMCount, the first system to harness the capabilities of large-language models (LLMs) to enhance crowd detection performance. By exploiting the decision-making capability of LLM, we can successfully compensate the signal power to acquire a uniform distribution and thereby achieve a detection with higher accuracy. To assess the system's performance, comprehensive evaluations are conducted under diversified scenarios like hall, meeting room, and cinema. The evaluation results show that our proposed approach reaches high detection accuracy with lower overall latency compared with previous methods.
Title:
```
  Tiny Robotics Dataset and Benchmark for Continual Object Detection
```
Authors: Francesco Pasti, Riccardo De Monte, Davide Dalle Pezze, Gian Antonio Susto, Nicola Bellotto
Subjects: Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Detecting objects in mobile robotics is crucial for numerous applications, from autonomous navigation to inspection. However, robots are often required to perform tasks in different domains with respect to the training one and need to adapt to these changes. Tiny mobile robots, subject to size, power, and computational constraints, encounter even more difficulties in running and adapting these algorithms. Such adaptability, though, is crucial for real-world deployment, where robots must operate effectively in dynamic and unpredictable settings. In this work, we introduce a novel benchmark to evaluate the continual learning capabilities of object detection systems in tiny robotic platforms. Our contributions include: (i) Tiny Robotics Object Detection (TiROD), a comprehensive dataset collected using a small mobile robot, designed to test the adaptability of object detectors across various domains and classes; (ii) an evaluation of state-of-the-art real-time object detectors combined with different continual learning strategies on this dataset, providing detailed insights into their performance and limitations; and (iii) we publish the data and the code to replicate the results to foster continuous advancements in this field. Our benchmark results indicate key challenges that must be addressed to advance the development of robust and efficient object detection systems for tiny robotics.
Title:
```
  VideoPatchCore: An Effective Method to Memorize Normality for Video Anomaly Detection
```
Authors: Sunghyun Ahn, Youngwan Jo, Kijung Lee, Sanghyun Park
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Video anomaly detection (VAD) is a crucial task in video analysis and surveillance within computer vision. Currently, VAD is gaining attention with memory techniques that store the features of normal frames. The stored features are utilized for frame reconstruction, identifying an abnormality when a significant difference exists between the reconstructed and input frames. However, this approach faces several challenges due to the simultaneous optimization required for both the memory and encoder-decoder model. These challenges include increased optimization difficulty, complexity of implementation, and performance variability depending on the memory size. To address these challenges,we propose an effective memory method for VAD, called VideoPatchCore. Inspired by PatchCore, our approach introduces a structure that prioritizes memory optimization and configures three types of memory tailored to the characteristics of video data. This method effectively addresses the limitations of existing memory-based methods, achieving good performance comparable to state-of-the-art methods. Furthermore, our method requires no training and is straightforward to implement, making VAD tasks more accessible. Our code is available online at this http URL.
Title:
```
  Low-degree Security of the Planted Random Subgraph Problem
```
Authors: Andrej Bogdanov, Chris Jones, Alon Rosen, Ilias Zadik
Subjects: Subjects: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The planted random subgraph detection conjecture of Abram et al. (TCC 2023) asserts the pseudorandomness of a pair of graphs $(H, G)$, where $G$ is an Erdos-Renyi random graph on $n$ vertices, and $H$ is a random induced subgraph of $G$ on $k$ vertices. Assuming the hardness of distinguishing these two distributions (with two leaked vertices), Abram et al. construct communication-efficient, computationally secure (1) 2-party private simultaneous messages (PSM) and (2) secret sharing for forbidden graph structures. We prove the low-degree hardness of detecting planted random subgraphs all the way up to $k\leq n^{1 - \Omega(1)}$. This improves over Abram et al.'s analysis for $k \leq n^{1/2 - \Omega(1)}$. The hardness extends to $r$-uniform hypergraphs for constant $r$. Our analysis is tight in the distinguisher's degree, its advantage, and in the number of leaked vertices. Extending the constructions of Abram et al, we apply the conjecture towards (1) communication-optimal multiparty PSM protocols for random functions and (2) bit secret sharing with share size $(1 + \epsilon)\log n$ for any $\epsilon > 0$ in which arbitrary minimal coalitions of up to $r$ parties can reconstruct and secrecy holds against all unqualified subsets of up to $\ell = o(\epsilon \log n)^{1/(r-1)}$ parties.
Title:
```
  Predicting Deterioration in Mild Cognitive Impairment with Survival Transformers, Extreme Gradient Boosting and Cox Proportional Hazard Modelling
```
Authors: Henry Musto, Daniel Stamate, Doina Logofatu, Daniel Stahl
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The paper proposes a novel approach of survival transformers and extreme gradient boosting models in predicting cognitive deterioration in individuals with mild cognitive impairment (MCI) using metabolomics data in the ADNI cohort. By leveraging advanced machine learning and transformer-based techniques applied in survival analysis, the proposed approach highlights the potential of these techniques for more accurate early detection and intervention in Alzheimer's dementia disease. This research also underscores the importance of non-invasive biomarkers and innovative modelling tools in enhancing the accuracy of dementia risk assessments, offering new avenues for clinical practice and patient care. A comprehensive Monte Carlo simulation procedure consisting of 100 repetitions of a nested cross-validation in which models were trained and evaluated, indicates that the survival machine learning models based on Transformer and XGBoost achieved the highest mean C-index performances, namely 0.85 and 0.8, respectively, and that they are superior to the conventional survival analysis Cox Proportional Hazards model which achieved a mean C-Index of 0.77. Moreover, based on the standard deviations of the C-Index performances obtained in the Monte Carlo simulation, we established that both survival machine learning models above are more stable than the conventional statistical model.
Keyword: face recognition

Title:
```
  Adversarial Watermarking for Face Recognition
```
Authors: Yuguang Yao, Anil Jain, Sijia Liu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Watermarking is an essential technique for embedding an identifier (i.e., watermark message) within digital images to assert ownership and monitor unauthorized alterations. In face recognition systems, watermarking plays a pivotal role in ensuring data integrity and security. However, an adversary could potentially interfere with the watermarking process, significantly impairing recognition performance. We explore the interaction between watermarking and adversarial attacks on face recognition models. Our findings reveal that while watermarking or input-level perturbation alone may have a negligible effect on recognition accuracy, the combined effect of watermarking and perturbation can result in an adversarial watermarking attack, significantly degrading recognition performance. Specifically, we introduce a novel threat model, the adversarial watermarking attack, which remains stealthy in the absence of watermarking, allowing images to be correctly recognized initially. However, once watermarking is applied, the attack is activated, causing recognition failures. Our study reveals a previously unrecognized vulnerability: adversarial perturbations can exploit the watermark message to evade face recognition systems. Evaluated on the CASIA-WebFace dataset, our proposed adversarial watermarking attack reduces face matching accuracy by 67.2% with an $\ell_\infty$ norm-measured perturbation strength of ${2}/{255}$ and by 95.9% with a strength of ${4}/{255}$.
Title:
```
  From Pixels to Words: Leveraging Explainability in Face Recognition through Interactive Natural Language Processing
```
Authors: Ivan DeAndres-Tame, Muhammad Faisal, Ruben Tolosana, Rouqaiah Al-Refai, Ruben Vera-Rodriguez, Philipp Terhörst
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Face Recognition (FR) has advanced significantly with the development of deep learning, achieving high accuracy in several applications. However, the lack of interpretability of these systems raises concerns about their accountability, fairness, and reliability. In the present study, we propose an interactive framework to enhance the explainability of FR models by combining model-agnostic Explainable Artificial Intelligence (XAI) and Natural Language Processing (NLP) techniques. The proposed framework is able to accurately answer various questions of the user through an interactive chatbot. In particular, the explanations generated by our proposed method are in the form of natural language text and visual representations, which for example can describe how different facial regions contribute to the similarity measure between two faces. This is achieved through the automatic analysis of the output's saliency heatmaps of the face images and a BERT question-answering model, providing users with an interface that facilitates a comprehensive understanding of the FR decisions. The proposed approach is interactive, allowing the users to ask questions to get more precise information based on the user's background knowledge. More importantly, in contrast to previous studies, our solution does not decrease the face recognition performance. We demonstrate the effectiveness of the method through different experiments, highlighting its potential to make FR systems more interpretable and user-friendly, especially in sensitive applications where decision-making transparency is crucial.
Keyword: augmentation

Title:
```
  Revisiting the Solution of Meta KDD Cup 2024: CRAG
```
Authors: Jie Ouyang, Yucong Luo, Mingyue Cheng, Daoyu Wang, Shuo Yu, Qi Liu, Enhong Chen
Subjects: Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract This paper presents the solution of our team APEX in the Meta KDD CUP 2024: CRAG Comprehensive RAG Benchmark Challenge. The CRAG benchmark addresses the limitations of existing QA benchmarks in evaluating the diverse and dynamic challenges faced by Retrieval-Augmented Generation (RAG) systems. It provides a more comprehensive assessment of RAG performance and contributes to advancing research in this field. We propose a routing-based domain and dynamic adaptive RAG pipeline, which performs specific processing for the diverse and dynamic nature of the question in all three stages: retrieval, augmentation, and generation. Our method achieved superior performance on CRAG and ranked 2nd for Task 2&3 on the final competition leaderboard. Our implementation is available at this link: this https URL.
Title:
```
  ControlMath: Controllable Data Generation Promotes Math Generalist Models
```
Authors: Nuo Chen, Ning Wu, Jianhui Chang, Jia Li
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Utilizing large language models (LLMs) for data augmentation has yielded encouraging results in mathematical reasoning. However, these approaches face constraints in problem diversity, potentially restricting them to in-domain/distribution data generation. To this end, we propose ControlMath, an iterative method involving an equation-generator module and two LLM-based agents. The module creates diverse equations, which the Problem-Crafter agent then transforms into math word problems. The Reverse-Agent filters and selects high-quality data, adhering to the "less is more" principle, achieving better results with fewer data points. This approach enables the generation of diverse math problems, not limited to specific domains or distributions. As a result, we collect ControlMathQA, which involves 190k math word problems. Extensive results prove that combining our dataset with in-domain datasets like GSM8K can help improve the model's mathematical ability to generalize, leading to improved performances both within and beyond specific domains.
Title:
```
  Mixing Data-driven and Geometric Models for Satellite Docking Port State Estimation using an RGB or Event Camera
```
Authors: Cedric Le Gentil, Jack Naylor, Nuwan Munasinghe, Jasprabhjit Mehami, Benny Dai, Mikhail Asavkin, Donald G. Dansereau, Teresa Vidal-Calleja
Subjects: Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In-orbit automated servicing is a promising path towards lowering the cost of satellite operations and reducing the amount of orbital debris. For this purpose, we present a pipeline for automated satellite docking port detection and state estimation using monocular vision data from standard RGB sensing or an event camera. Rather than taking snapshots of the environment, an event camera has independent pixels that asynchronously respond to light changes, offering advantages such as high dynamic range, low power consumption and latency, etc. This work focuses on satellite-agnostic operations (only a geometric knowledge of the actual port is required) using the recently released Lockheed Martin Mission Augmentation Port (LM-MAP) as the target. By leveraging shallow data-driven techniques to preprocess the incoming data to highlight the LM-MAP's reflective navigational aids and then using basic geometric models for state estimation, we present a lightweight and data-efficient pipeline that can be used independently with either RGB or event cameras. We demonstrate the soundness of the pipeline and perform a quantitative comparison of the two modalities based on data collected with a photometrically accurate test bench that includes a robotic arm to simulate the target satellite's uncontrolled motion.
Title:
```
  Data Augmentation for Sparse Multidimensional Learning Performance Data Using Generative AI
```
Authors: Liang Zhang, Jionghao Lin, John Sabatini, Conrad Borchers, Daniel Weitekamp, Meng Cao, John Hollander, Xiangen Hu, Arthur C. Graesser
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Learning performance data describe correct and incorrect answers or problem-solving attempts in adaptive learning, such as in intelligent tutoring systems (ITSs). Learning performance data tend to be highly sparse (80\%(\sim)90\% missing observations) in most real-world applications due to adaptive item selection. This data sparsity presents challenges to using learner models to effectively predict future performance explore new hypotheses about learning. This article proposes a systematic framework for augmenting learner data to address data sparsity in learning performance data. First, learning performance is represented as a three-dimensional tensor of learners' questions, answers, and attempts, capturing longitudinal knowledge states during learning. Second, a tensor factorization method is used to impute missing values in sparse tensors of collected learner data, thereby grounding the imputation on knowledge tracing tasks that predict missing performance values based on real observations. Third, a module for generating patterns of learning is used. This study contrasts two forms of generative Artificial Intelligence (AI), including Generative Adversarial Networks (GANs) and Generate Pre-Trained Transformers (GPT) to generate data associated with different clusters of learner data. We tested this approach on an adult literacy dataset from AutoTutor lessons developed for Adult Reading Comprehension (ARC). We found that: (1) tensor factorization improved the performance in tracing and predicting knowledge mastery compared with other knowledge tracing techniques without data augmentation, showing higher relative fidelity for this imputation method, and (2) the GAN-based simulation showed greater overall stability and less statistical bias based on a divergence evaluation with varying simulation sample sizes compared to GPT.
Title:
```
  3D-JEPA: A Joint Embedding Predictive Architecture for 3D Self-Supervised Representation Learning
```
Authors: Naiwen Hu, Haozhe Cheng, Yifan Xie, Shiqi Li, Jihua Zhu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Invariance-based and generative methods have shown a conspicuous performance for 3D self-supervised representation learning (SSRL). However, the former relies on hand-crafted data augmentations that introduce bias not universally applicable to all downstream tasks, and the latter indiscriminately reconstructs masked regions, resulting in irrelevant details being saved in the representation space. To solve the problem above, we introduce 3D-JEPA, a novel non-generative 3D SSRL framework. Specifically, we propose a multi-block sampling strategy that produces a sufficiently informative context block and several representative target blocks. We present the context-aware decoder to enhance the reconstruction of the target blocks. Concretely, the context information is fed to the decoder continuously, facilitating the encoder in learning semantic modeling rather than memorizing the context information related to target blocks. Overall, 3D-JEPA predicts the representation of target blocks from a context block using the encoder and context-aware decoder architecture. Various downstream tasks on different datasets demonstrate 3D-JEPA's effectiveness and efficiency, achieving higher accuracy with fewer pretraining epochs, e.g., 88.65% accuracy on PB_T50_RS with 150 pretraining epochs.
Title:
```
  Twin Network Augmentation: A Novel Training Strategy for Improved Spiking Neural Networks and Efficient Weight Quantization
```
Authors: Lucas Deckers, Benjamin Vandersmissen, Ing Jyh Tsang, Werner Van Leekwijck, Steven Latré
Subjects: Subjects: Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The proliferation of Artificial Neural Networks (ANNs) has led to increased energy consumption, raising concerns about their sustainability. Spiking Neural Networks (SNNs), which are inspired by biological neural systems and operate using sparse, event-driven spikes to communicate information between neurons, offer a potential solution due to their lower energy requirements. An alternative technique for reducing a neural network's footprint is quantization, which compresses weight representations to decrease memory usage and energy consumption. In this study, we present Twin Network Augmentation (TNA), a novel training framework aimed at improving the performance of SNNs while also facilitating an enhanced compression through low-precision quantization of weights. TNA involves co-training an SNN with a twin network, optimizing both networks to minimize their cross-entropy losses and the mean squared error between their output logits. We demonstrate that TNA significantly enhances classification performance across various vision datasets and in addition is particularly effective when applied when reducing SNNs to ternary weight precision. Notably, during inference , only the ternary SNN is retained, significantly reducing the network in number of neurons, connectivity and weight size representation. Our results show that TNA outperforms traditional knowledge distillation methods and achieves state-of-the-art performance for the evaluated network architecture on benchmark datasets, including CIFAR-10, CIFAR-100, and CIFAR-10-DVS. This paper underscores the effectiveness of TNA in bridging the performance gap between SNNs and ANNs and suggests further exploration into the application of TNA in different network architectures and datasets.
Title:
```
  Adversarial Backdoor Defense in CLIP
```
Authors: Junhao Kuang, Siyuan Liang, Jiawei Liang, Kuanrong Liu, Xiaochun Cao
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Multimodal contrastive pretraining, exemplified by models like CLIP, has been found to be vulnerable to backdoor attacks. While current backdoor defense methods primarily employ conventional data augmentation to create augmented samples aimed at feature alignment, these methods fail to capture the distinct features of backdoor samples, resulting in suboptimal defense performance. Observations reveal that adversarial examples and backdoor samples exhibit similarities in the feature space within the compromised models. Building on this insight, we propose Adversarial Backdoor Defense (ABD), a novel data augmentation strategy that aligns features with meticulously crafted adversarial examples. This approach effectively disrupts the backdoor association. Our experiments demonstrate that ABD provides robust defense against both traditional uni-modal and multimodal backdoor attacks targeting CLIP. Compared to the current state-of-the-art defense method, CleanCLIP, ABD reduces the attack success rate by 8.66% for BadNet, 10.52% for Blended, and 53.64% for BadCLIP, while maintaining a minimal average decrease of just 1.73% in clean accuracy.
Title:
```
  Unleashing the Potential of Synthetic Images: A Study on Histopathology Image Classification
```
Authors: Leire Benito-Del-Valle, Aitor Alvarez-Gila, Itziar Eguskiza, Cristina L. Saratxaga
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Histopathology image classification is crucial for the accurate identification and diagnosis of various diseases but requires large and diverse datasets. Obtaining such datasets, however, is often costly and time-consuming due to the need for expert annotations and ethical constraints. To address this, we examine the suitability of different generative models and image selection approaches to create realistic synthetic histopathology image patches conditioned on class labels. Our findings highlight the importance of selecting an appropriate generative model type and architecture to enhance performance. Our experiments over the PCam dataset show that diffusion models are effective for transfer learning, while GAN-generated samples are better suited for augmentation. Additionally, transformer-based generative models do not require image filtering, in contrast to those derived from Convolutional Neural Networks (CNNs), which benefit from realism score-based selection. Therefore, we show that synthetic images can effectively augment existing datasets, ultimately improving the performance of the downstream histopathology image classification task.
Title:
```
  Machine learning approaches for automatic defect detection in photovoltaic systems
```
Authors: Swayam Rajat Mohanty, Moin Uddin Maruf, Vaibhav Singh, Zeeshan Ahmad
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Applied Physics (physics.app-ph)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Solar photovoltaic (PV) modules are prone to damage during manufacturing, installation and operation which reduces their power conversion efficiency. This diminishes their positive environmental impact over the lifecycle. Continuous monitoring of PV modules during operation via unmanned aerial vehicles is essential to ensure that defective panels are promptly replaced or repaired to maintain high power conversion efficiencies. Computer vision provides an automatic, non-destructive and cost-effective tool for monitoring defects in large-scale PV plants. We review the current landscape of deep learning-based computer vision techniques used for detecting defects in solar modules. We compare and evaluate the existing approaches at different levels, namely the type of images used, data collection and processing method, deep learning architectures employed, and model interpretability. Most approaches use convolutional neural networks together with data augmentation or generative adversarial network-based techniques. We evaluate the deep learning approaches by performing interpretability analysis on classification tasks. This analysis reveals that the model focuses on the darker regions of the image to perform the classification. We find clear gaps in the existing approaches while also laying out the groundwork for mitigating these challenges when building new models. We conclude with the relevant research gaps that need to be addressed and approaches for progress in this field: integrating geometric deep learning with existing approaches for building more robust and reliable models, leveraging physics-based neural networks that combine domain expertise of physical laws to build more domain-aware deep learning models, and incorporating interpretability as a factor for building models that can be trusted. The review points towards a clear roadmap for making this technology commercially relevant.
Title:
```
  TabEBM: A Tabular Data Augmentation Method with Distinct Class-Specific Energy-Based Models
```
Authors: Andrei Margeloiu, Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Data collection is often difficult in critical fields such as medicine, physics, and chemistry. As a result, classification methods usually perform poorly with these small datasets, leading to weak predictive performance. Increasing the training set with additional synthetic data, similar to data augmentation in images, is commonly believed to improve downstream classification performance. However, current tabular generative methods that learn either the joint distribution $ p(\mathbf{x}, y) $ or the class-conditional distribution $ p(\mathbf{x} \mid y) $ often overfit on small datasets, resulting in poor-quality synthetic data, usually worsening classification performance compared to using real data alone. To solve these challenges, we introduce TabEBM, a novel class-conditional generative method using Energy-Based Models (EBMs). Unlike existing methods that use a shared model to approximate all class-conditional densities, our key innovation is to create distinct EBM generative models for each class, each modelling its class-specific data distribution individually. This approach creates robust energy landscapes, even in ambiguous class distributions. Our experiments show that TabEBM generates synthetic data with higher quality and better statistical fidelity than existing methods. When used for data augmentation, our synthetic data consistently improves the classification performance across diverse datasets of various sizes, especially small ones.
Title:
```
  Label-Augmented Dataset Distillation
```
Authors: Seoungyoon Kang, Youngsun Lim, Hyunjung Shim
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Traditional dataset distillation primarily focuses on image representation while often overlooking the important role of labels. In this study, we introduce Label-Augmented Dataset Distillation (LADD), a new dataset distillation framework enhancing dataset distillation with label augmentations. LADD sub-samples each synthetic image, generating additional dense labels to capture rich semantics. These dense labels require only a 2.5% increase in storage (ImageNet subsets) with significant performance benefits, providing strong learning signals. Our label generation strategy can complement existing dataset distillation methods for significantly enhancing their training efficiency and performance. Experimental results demonstrate that LADD outperforms existing methods in terms of computational overhead and accuracy. With three high-performance dataset distillation algorithms, LADD achieves remarkable gains by an average of 14.9% in accuracy. Furthermore, the effectiveness of our method is proven across various datasets, distillation hyperparameters, and algorithms. Finally, our method improves the cross-architecture robustness of the distilled dataset, which is important in the application scenario.
Title:
```
  Self-Supervised Any-Point Tracking by Contrastive Random Walks
```
Authors: Ayush Shrivastava, Andrew Owens
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract We present a simple, self-supervised approach to the Tracking Any Point (TAP) problem. We train a global matching transformer to find cycle consistent tracks through video via contrastive random walks, using the transformer's attention-based global matching to define the transition matrices for a random walk on a space-time graph. The ability to perform "all pairs" comparisons between points allows the model to obtain high spatial precision and to obtain a strong contrastive learning signal, while avoiding many of the complexities of recent approaches (such as coarse-to-fine matching). To do this, we propose a number of design decisions that allow global matching architectures to be trained through self-supervision using cycle consistency. For example, we identify that transformer-based methods are sensitive to shortcut solutions, and propose a data augmentation scheme to address them. Our method achieves strong performance on the TapVid benchmarks, outperforming previous self-supervised tracking methods, such as DIFT, and is competitive with several supervised methods.

LeeKyungwook / get-arxiv-noti

Showing new listings for Wednesday, 25 September 2024 #1288

Keyword: detection

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Keyword: face recognition

Title:

Title:

Keyword: augmentation

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title: