New submissions for Thursday, 29 August 2024 (showing 325 of 325 entries )

Keyword: detection

Title:

      An Edge AI System Based on FPGA Platform for Railway Fault Detection

Authors: Jiale Li, Yulin Fu, Dongwei Yan, Sean Longyu Ma, Chiu-Wing Sham
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract As the demands for railway transportation safety increase, traditional methods of rail track inspection no longer meet the needs of modern railway systems. To address the issues of automation and efficiency in rail fault detection, this study introduces a railway inspection system based on Field Programmable Gate Array (FPGA). This edge AI system collects track images via cameras and uses Convolutional Neural Networks (CNN) to perform real-time detection of track defects and automatically reports fault information. The innovation of this system lies in its high level of automation and detection efficiency. The neural network approach employed by this system achieves a detection accuracy of 88.9%, significantly enhancing the reliability and efficiency of detection. Experimental results demonstrate that this FPGA-based system is 1.39 and 4.67 better in energy efficiency than peer implementation on the GPU and CPU platform, respectively.
Title:
```
  Situational Awareness in Indian Power Grid using Synchrophasor Data
```
Authors: Makarand Sudhakar Ballal (Dept. of Electrical Engineering, VNIT, Nagpur)
Subjects: Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Wide Area Measurement Systems (WAMS) can guide system operators' to increase their situational awareness by expanding observability of their supervise area and adjoining systems. Power system oscillations in the electrical grid are a matter of contention. It may transform the stable operation of the power system into an unstable system when oscillations grow and if no corrective actions are taken to mitigate them. This paper manifests the utilization of the synchrophasor technology for situational awareness by oscillation detection. An algorithm is proposed for the low-frequency oscillation-related analysis based on Prony and Fast Fourier Transform (FFT) methods. It gives better insight during the grid disturbance and other events related to power system operation. The effectiveness of this algorithm is validated by two case studies in the Western Regional Indian power grid.
Title:
```
  Multitask Fine-Tuning and Generative Adversarial Learning for Improved Auxiliary Classification
```
Authors: Christopher Sun, Abishek Satish
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In this study, we implement a novel BERT architecture for multitask fine-tuning on three downstream tasks: sentiment classification, paraphrase detection, and semantic textual similarity prediction. Our model, Multitask BERT, incorporates layer sharing and a triplet architecture, custom sentence pair tokenization, loss pairing, and gradient surgery. Such optimizations yield a 0.516 sentiment classification accuracy, 0.886 paraphase detection accuracy, and 0.864 semantic textual similarity correlation on test data. We also apply generative adversarial learning to BERT, constructing a conditional generator model that maps from latent space to create fake embeddings in $\mathbb{R}^{768}$. These fake embeddings are concatenated with real BERT embeddings and passed into a discriminator model for auxiliary classification. Using this framework, which we refer to as AC-GAN-BERT, we conduct semi-supervised sensitivity analyses to investigate the effect of increasing amounts of unlabeled training data on AC-GAN-BERT's test accuracy. Overall, aside from implementing a high-performing multitask classification system, our novelty lies in the application of adversarial learning to construct a generator that mimics BERT. We find that the conditional generator successfully produces rich embeddings with clear spatial correlation with class labels, demonstrating avoidance of mode collapse. Our findings validate the GAN-BERT approach and point to future directions of generator-aided knowledge distillation.
Title:
```
  Multi-Class Plant Leaf Disease Detection: A CNN-based Approach with Mobile App Integration
```
Authors: Md Aziz Hosen Foysal, Foyez Ahmed, Md Zahurul Haque
Subjects: Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Plant diseases significantly impact agricultural productivity, resulting in economic losses and food insecurity. Prompt and accurate detection is crucial for the efficient management and mitigation of plant diseases. This study investigates advanced techniques in plant disease detection, emphasizing the integration of image processing, machine learning, deep learning methods, and mobile technologies. High-resolution images of plant leaves were captured and analyzed using convolutional neural networks (CNNs) to detect symptoms of various diseases, such as blight, mildew, and rust. This study explores 14 classes of plants and diagnoses 26 unique plant diseases. We focus on common diseases affecting various crops. The model was trained on a diverse dataset encompassing multiple crops and disease types, achieving 98.14% accuracy in disease diagnosis. Finally integrated this model into mobile apps for real-time disease diagnosis.
Title:
```
  CrossInspector: A Static Analysis Approach for Cross-Contract Vulnerability Detection
```
Authors: Xiao Chen
Subjects: Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract With the development of blockchain technology, the detection of smart contract vulnerabilities is increasingly emphasized. However, when detecting vulnerabilities in inter-contract interactions (i.e., cross-contract vulnerabilities) using smart contract bytecode, existing tools often produce many false positives and false negatives due to insufficient recovery of semantic information and inadequate consideration of contract dependencies. We present CrossInspector, a novel framework for detecting cross-contract vulnerabilities at the bytecode level through static analysis. CrossInspector utilizes a trained Transformer model to recover semantic information and considers control flow, data flow, and dependencies related to smart contract state variables to construct a state dependency graph for fine-grained inter-procedural analysis. Additionally, CrossInspector incorporates a pruning method and two parallel optimization mechanisms to accelerate the vulnerability detection process. Experiments on our manually constructed dataset demonstrate that CrossInspector outperforms the state-of-the-art tools in both precision (97\%) and recall (96.75\%), while also significantly reducing the overall time from 16.34 seconds to 7.83 seconds, almost on par with the fastest tool that utilizes bytecode for detection. Additionally, we ran CrossInspector on a randomly selected set of 300 real-world smart contracts and identified 11 cross-contract vulnerabilities that were missed by prior tools.
Title:
```
  DualKanbaFormer: Kolmogorov-Arnold Networks and State Space Model DualKanbaFormer: Kolmogorov-Arnold Networks and State Space Model Transformer for Multimodal Aspect-based Sentiment Analysis
```
Authors: Adamu Lawan, Juhua Pu, Haruna Yunusa, Muhammad Lawan, Aliyu Umar, Adamu Sani Yahya
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Multimodal aspect-based sentiment analysis (MABSA) enhances sentiment detection by combining text with other data types like images. However, despite setting significant benchmarks, attention mechanisms exhibit limitations in efficiently modelling long-range dependencies between aspect and opinion targets within the text. They also face challenges in capturing global-context dependencies for visual representations. To this end, we propose Kolmogorov-Arnold Networks (KANs) and Selective State Space model (Mamba) transformer (DualKanbaFormer), a novel architecture to address the above issues. We leverage the power of Mamba to capture global context dependencies, Multi-head Attention (MHA) to capture local context dependencies, and KANs to capture non-linear modelling patterns for both textual representations (textual KanbaFormer) and visual representations (visual KanbaFormer). Furthermore, we fuse the textual KanbaFormer and visual KanbaFomer with a gated fusion layer to capture the inter-modality dynamics. According to extensive experimental results, our model outperforms some state-of-the-art (SOTA) studies on two public datasets.
Title:
```
  Fast and Modular Autonomy Software for Autonomous Racing Vehicles
```
Authors: Andrew Saba, Aderotimi Adetunji, Adam Johnson, Aadi Kothari, Matthew Sivaprakasam, Joshua Spisak, Prem Bharatia, Arjun Chauhan, Brendan Duff Jr., Noah Gasparro, Charles King, Ryan Larkin, Brian Mao, Micah Nye, Anjali Parashar, Joseph Attias, Aurimas Balciunas, Austin Brown, Chris Chang, Ming Gao, Cindy Heredia, Andrew Keats, Jose Lavariega, William Muckelroy III, Andre Slavescu, Nickolas Stathas, Nayana Suvarna, Chuan Tian Zhang, Sebastian Scherer, Deva Ramanan
Subjects: Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Autonomous motorsports aim to replicate the human racecar driver with software and sensors. As in traditional motorsports, Autonomous Racing Vehicles (ARVs) are pushed to their handling limits in multi-agent scenarios at extremely high ($\geq 150mph$) speeds. This Operational Design Domain (ODD) presents unique challenges across the autonomy stack. The Indy Autonomous Challenge (IAC) is an international competition aiming to advance autonomous vehicle development through ARV competitions. While far from challenging what a human racecar driver can do, the IAC is pushing the state of the art by facilitating full-sized ARV competitions. This paper details the MIT-Pitt-RW Team's approach to autonomous racing in the IAC. In this work, we present our modular and fast approach to agent detection, motion planning and controls to create an autonomy stack. We also provide analysis of the performance of the software stack in single and multi-agent scenarios for rapid deployment in a fast-paced competition environment. We also cover what did and did not work when deployed on a physical system the Dallara AV-21 platform and potential improvements to address these shortcomings. Finally, we convey lessons learned and discuss limitations and future directions for improvement.
Title:
```
  HEAD: A Bandwidth-Efficient Cooperative Perception Approach for Heterogeneous Connected and Autonomous Vehicles
```
Authors: Deyuan Qu, Qi Chen, Yongqi Zhu, Yihao Zhu, Sergei S. Avedisov, Song Fu, Qing Yang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In cooperative perception studies, there is often a trade-off between communication bandwidth and perception performance. While current feature fusion solutions are known for their excellent object detection performance, transmitting the entire sets of intermediate feature maps requires substantial bandwidth. Furthermore, these fusion approaches are typically limited to vehicles that use identical detection models. Our goal is to develop a solution that supports cooperative perception across vehicles equipped with different modalities of sensors. This method aims to deliver improved perception performance compared to late fusion techniques, while achieving precision similar to the state-of-art intermediate fusion, but requires an order of magnitude less bandwidth. We propose HEAD, a method that fuses features from the classification and regression heads in 3D object detection networks. Our method is compatible with heterogeneous detection networks such as LiDAR PointPillars, SECOND, VoxelNet, and camera Bird's-eye View (BEV) Encoder. Given the naturally smaller feature size in the detection heads, we design a self-attention mechanism to fuse the classification head and a complementary feature fusion layer to fuse the regression head. Our experiments, comprehensively evaluated on the V2V4Real and OPV2V datasets, demonstrate that HEAD is a fusion method that effectively balances communication bandwidth and perception performance.
Title:
```
  NAS-BNN: Neural Architecture Search for Binary Neural Networks
```
Authors: Zhihao Lin, Yongtao Wang, Jinhe Zhang, Xiaojie Chu, Haibin Ling
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Binary Neural Networks (BNNs) have gained extensive attention for their superior inferencing efficiency and compression ratio compared to traditional full-precision networks. However, due to the unique characteristics of BNNs, designing a powerful binary architecture is challenging and often requires significant manpower. A promising solution is to utilize Neural Architecture Search (NAS) to assist in designing BNNs, but current NAS methods for BNNs are relatively straightforward and leave a performance gap between the searched models and manually designed ones. To address this gap, we propose a novel neural architecture search scheme for binary neural networks, named NAS-BNN. We first carefully design a search space based on the unique characteristics of BNNs. Then, we present three training strategies, which significantly enhance the training of supernet and boost the performance of all subnets. Our discovered binary model family outperforms previous BNNs for a wide range of operations (OPs) from 20M to 200M. For instance, we achieve 68.20% top-1 accuracy on ImageNet with only 57M OPs. In addition, we validate the transferability of these searched BNNs on the object detection task, and our binary detectors with the searched BNNs achieve a novel state-of-the-art result, e.g., 31.6% mAP with 370M OPs, on MS COCO dataset. The source code and models will be released at this https URL.
Title:
```
  RoboSense: Large-scale Dataset and Benchmark for Multi-sensor Low-speed Autonomous Driving
```
Authors: Haisheng Su, Feixiang Song, Cong Ma, Panpan Cai, Wei Wu, Cewu Lu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Robust object detection and tracking under arbitrary sight of view is challenging yet essential for the development of Autonomous Vehicle technology. With the growing demand of unmanned function vehicles, near-field scene understanding becomes an important research topic in the areas of low-speed autonomous driving. Due to the complexity of driving conditions and diversity of near obstacles such as blind spots and high occlusion, the perception capability of near-field environment is still inferior than its farther counterpart. To further enhance the intelligent ability of unmanned vehicles, in this paper, we construct a multimodal data collection platform based on 3 main types of sensors (Camera, LiDAR and Fisheye), which supports flexible sensor configurations to enable dynamic sight of view for ego vehicle, either global view or local view. Meanwhile, a large-scale multi-sensor dataset is built, named RoboSense, to facilitate near-field scene understanding. RoboSense contains more than 133K synchronized data with 1.4M 3D bounding box and IDs annotated in the full $360^{\circ}$ view, forming 216K trajectories across 7.6K temporal sequences. It has $270\times$ and $18\times$ as many annotations of near-field obstacles within 5$m$ as the previous single-vehicle datasets such as KITTI and nuScenes. Moreover, we define a novel matching criterion for near-field 3D perception and prediction metrics. Based on RoboSense, we formulate 6 popular tasks to facilitate the future development of related research, where the detailed data analysis as well as benchmarks are also provided accordingly.
Title:
```
  Depth-Weighted Detection of Behaviours of Risk in People with Dementia using Cameras
```
Authors: Pratik K. Mishra, Irene Ballester, Andrea Iaboni, Bing Ye, Kristine Newman, Alex Mihailidis, Shehroz S. Khan
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The behavioural and psychological symptoms of dementia, such as agitation and aggression, present a significant health and safety risk in residential care settings. Many care facilities have video cameras in place for digital monitoring of public spaces, which can be leveraged to develop an automated behaviours of risk detection system that can alert the staff to enable timely intervention and prevent the situation from escalating. However, one of the challenges in our previous study was the presence of false alarms due to obstruction of view by activities happening close to the camera. To address this issue, we proposed a novel depth-weighted loss function to train a customized convolutional autoencoder to enforce equivalent importance to the events happening both near and far from the cameras; thus, helping to reduce false alarms and making the method more suitable for real-world deployment. The proposed method was trained using data from nine participants with dementia across three cameras situated in a specialized dementia unit and achieved an area under the curve of receiver operating characteristic of $0.852$, $0.81$ and $0.768$ for the three cameras. Ablation analysis was conducted for the individual components of the proposed method and the performance of the proposed method was investigated for participant-specific and sex-specific behaviours of risk detection. The proposed method performed reasonably well in detecting behaviours of risk in people with dementia motivating further research toward the development of a behaviours of risk detection system suitable for deployment in video surveillance systems in care facilities.
Title:
```
  ConsistencyTrack: A Robust Multi-Object Tracker with a Generation Strategy of Consistency Model
```
Authors: Lifan Jiang, Zhihui Wang, Siqi Yin, Guangxiao Ma, Peng Zhang, Boxi Wu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Multi-object tracking (MOT) is a critical technology in computer vision, designed to detect multiple targets in video sequences and assign each target a unique ID per frame. Existed MOT methods excel at accurately tracking multiple objects in real-time across various scenarios. However, these methods still face challenges such as poor noise resistance and frequent ID switches. In this research, we propose a novel ConsistencyTrack, joint detection and tracking(JDT) framework that formulates detection and association as a denoising diffusion process on perturbed bounding boxes. This progressive denoising strategy significantly improves the model's noise resistance. During the training phase, paired object boxes within two adjacent frames are diffused from ground-truth boxes to a random distribution, and then the model learns to detect and track by reversing this process. In inference, the model refines randomly generated boxes into detection and tracking results through minimal denoising steps. ConsistencyTrack also introduces an innovative target association strategy to address target occlusion. Experiments on the MOT17 and DanceTrack datasets demonstrate that ConsistencyTrack outperforms other compared methods, especially better than DiffusionTrack in inference speed and other performance metrics. Our code is available at this https URL.
Title:
```
  Trustworthy and Responsible AI for Human-Centric Autonomous Decision-Making Systems
```
Authors: Farzaneh Dehghani (1), Mahsa Dibaji (2), Fahim Anzum (3), Lily Dey (3), Alican Basdemir (4), Sayeh Bayat (1,5), Jean-Christophe Boucher (6), Steve Drew (2), Sarah Elaine Eaton (7), Richard Frayne (8), Gouri Ginde (2), Ashley Harris (8), Yani Ioannou (2), Catherine Lebel (8), John Lysack (8), Leslie Salgado Arzuaga (9), Emma Stanley (1), Roberto Souza (2), Ronnie Souza (2), Lana Wells (10), Tyler Williamson (11), Matthias Wilms (8), Zaman Wahid (3), Mark Ungrin (12), Marina Gavrilova (3), Mariana Bento (1,2) ((1) Department of Biomedical Engineering, University of Calgary, Calgary, Canada, (2) Department of Electrical and Software Engineering, University of Calgary, Calgary, Canada, (3) Department of Computer Science, University of Calgary, Calgary, Canada, (4) Department of Philosophy, University of Calgary, Calgary, Canada, (5) Department of Geomatics Engineering, University of Calgary, Calgary, Canada, (6) Department of Political Science, University of Calgary, Calgary, Canada, (7) Werklund School of Education, Specialization, Leadership, University of Calgary, Calgary, Canada, (8) Cumming School of Medicine, Department of Radiology, University of Calgary, Calgary, Canada, (9) Department of Communication, Media, and Film, University of Calgary, Calgary, Canada, (10) Faculty of Social Work, University of Calgary, Calgary, Canada, (11) Centre for Health Informatics, University of Calgary, Calgary, Canada, (12) Faculty of Veterinary Medicine, University of Calgary, Calgary, Canada)
Subjects: Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Artificial Intelligence (AI) has paved the way for revolutionary decision-making processes, which if harnessed appropriately, can contribute to advancements in various sectors, from healthcare to economics. However, its black box nature presents significant ethical challenges related to bias and transparency. AI applications are hugely impacted by biases, presenting inconsistent and unreliable findings, leading to significant costs and consequences, highlighting and perpetuating inequalities and unequal access to resources. Hence, developing safe, reliable, ethical, and Trustworthy AI systems is essential. Our team of researchers working with Trustworthy and Responsible AI, part of the Transdisciplinary Scholarship Initiative within the University of Calgary, conducts research on Trustworthy and Responsible AI, including fairness, bias mitigation, reproducibility, generalization, interpretability, and authenticity. In this paper, we review and discuss the intricacies of AI biases, definitions, methods of detection and mitigation, and metrics for evaluating bias. We also discuss open challenges with regard to the trustworthiness and widespread application of AI across diverse domains of human-centric decision making, as well as guidelines to foster Responsible and Trustworthy AI models.
Title:
```
  TagOOD: A Novel Approach to Out-of-Distribution Detection via Vision-Language Representations and Class Center Learning
```
Authors: Jinglun Li, Xinyu Zhou, Kaixun Jiang, Lingyi Hong, Pinxue Guo, Zhaoyu Chen, Weifeng Ge, Wenqiang Zhang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Multimodal fusion, leveraging data like vision and language, is rapidly gaining traction. This enriched data representation improves performance across various tasks. Existing methods for out-of-distribution (OOD) detection, a critical area where AI models encounter unseen data in real-world scenarios, rely heavily on whole-image features. These image-level features can include irrelevant information that hinders the detection of OOD samples, ultimately limiting overall performance. In this paper, we propose \textbf{TagOOD}, a novel approach for OOD detection that leverages vision-language representations to achieve label-free object feature decoupling from whole images. This decomposition enables a more focused analysis of object semantics, enhancing OOD detection performance. Subsequently, TagOOD trains a lightweight network on the extracted object features to learn representative class centers. These centers capture the central tendencies of IND object classes, minimizing the influence of irrelevant image features during OOD detection. Finally, our approach efficiently detects OOD samples by calculating distance-based metrics as OOD scores between learned centers and test samples. We conduct extensive experiments to evaluate TagOOD on several benchmark datasets and demonstrate its superior performance compared to existing OOD detection methods. This work presents a novel perspective for further exploration of multimodal information utilization in OOD detection, with potential applications across various tasks.
Title:
```
  Hierarchical Visual Categories Modeling: A Joint Representation Learning and Density Estimation Framework for Out-of-Distribution Detection
```
Authors: Jinglun Li, Xinyu Zhou, Pinxue Guo, Yixuan Sun, Yiwen Huang, Weifeng Ge, Wenqiang Zhang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Detecting out-of-distribution inputs for visual recognition models has become critical in safe deep learning. This paper proposes a novel hierarchical visual category modeling scheme to separate out-of-distribution data from in-distribution data through joint representation learning and statistical modeling. We learn a mixture of Gaussian models for each in-distribution category. There are many Gaussian mixture models to model different visual categories. With these Gaussian models, we design an in-distribution score function by aggregating multiple Mahalanobis-based metrics. We don't use any auxiliary outlier data as training samples, which may hurt the generalization ability of out-of-distribution detection algorithms. We split the ImageNet-1k dataset into ten folds randomly. We use one fold as the in-distribution dataset and the others as out-of-distribution datasets to evaluate the proposed method. We also conduct experiments on seven popular benchmarks, including CIFAR, iNaturalist, SUN, Places, Textures, ImageNet-O, and OpenImage-O. Extensive experiments indicate that the proposed method outperforms state-of-the-art algorithms clearly. Meanwhile, we find that our visual representation has a competitive performance when compared with features learned by classical methods. These results demonstrate that the proposed method hasn't weakened the discriminative ability of visual recognition models and keeps high efficiency in detecting out-of-distribution samples.
Title:
```
  VFLIP: A Backdoor Defense for Vertical Federated Learning via Identification and Purification
```
Authors: Yungi Cho, Woorim Han, Miseon Yu, Ho Bae, Yunheung Paek
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Vertical Federated Learning (VFL) focuses on handling vertically partitioned data over FL participants. Recent studies have discovered a significant vulnerability in VFL to backdoor attacks which specifically target the distinct characteristics of VFL. Therefore, these attacks may neutralize existing defense mechanisms designed primarily for Horizontal Federated Learning (HFL) and deep neural networks. In this paper, we present the first backdoor defense, called VFLIP, specialized for VFL. VFLIP employs the identification and purification techniques that operate at the inference stage, consequently improving the robustness against backdoor attacks to a great extent. VFLIP first identifies backdoor-triggered embeddings by adopting a participant-wise anomaly detection approach. Subsequently, VFLIP conducts purification which removes the embeddings identified as malicious and reconstructs all the embeddings based on the remaining embeddings. We conduct extensive experiments on CIFAR10, CINIC10, Imagenette, NUS-WIDE, and BankMarketing to demonstrate that VFLIP can effectively mitigate backdoor attacks in VFL. this https URL
Title:
```
  Can Visual Language Models Replace OCR-Based Visual Question Answering Pipelines in Production? A Case Study in Retail
```
Authors: Bianca Lamm, Janis Keuper
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Most production-level deployments for Visual Question Answering (VQA) tasks are still build as processing pipelines of independent steps including image pre-processing, object- and text detection, Optical Character Recognition (OCR) and (mostly supervised) object classification. However, the recent advances in vision Foundation Models [25] and Vision Language Models (VLMs) [23] raise the question if these custom trained, multi-step approaches can be replaced with pre-trained, single-step VLMs. This paper analyzes the performance and limits of various VLMs in the context of VQA and OCR [5, 9, 12] tasks in a production-level scenario. Using data from the Retail-786k [10] dataset, we investigate the capabilities of pre-trained VLMs to answer detailed questions about advertised products in images. Our study includes two commercial models, GPT-4V [16] and GPT-4o [17], as well as four open-source models: InternVL [5], LLaVA 1.5 [12], LLaVA-NeXT [13], and CogAgent [9]. Our initial results show, that there is in general no big performance gap between open-source and commercial models. However, we observe a strong task dependent variance in VLM performance: while most models are able to answer questions regarding the product brand and price with high accuracy, they completely fail at the same time to correctly identity the specific product name or discount. This indicates the problem of VLMs to solve fine-grained classification tasks as well to model the more abstract concept of discounts.
Title:
```
  CSAD: Unsupervised Component Segmentation for Logical Anomaly Detection
```
Authors: Yu-Hsuan Hsieh, Shang-Hong Lai
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract To improve logical anomaly detection, some previous works have integrated segmentation techniques with conventional anomaly detection methods. Although these methods are effective, they frequently lead to unsatisfactory segmentation results and require manual annotations. To address these drawbacks, we develop an unsupervised component segmentation technique that leverages foundation models to autonomously generate training labels for a lightweight segmentation network without human labeling. Integrating this new segmentation technique with our proposed Patch Histogram module and the Local-Global Student-Teacher (LGST) module, we achieve a detection AUROC of 95.3% in the MVTec LOCO AD dataset, which surpasses previous SOTA methods. Furthermore, our proposed method provides lower latency and higher throughput than most existing approaches.
Title:
```
  Transfer Learning from Simulated to Real Scenes for Monocular 3D Object Detection
```
Authors: Sondos Mohamed, Walter Zimmer, Ross Greer, Ahmed Alaaeldin Ghita, Modesto Castrillón-Santana, Mohan Trivedi, Alois Knoll, Salvatore Mario Carta, Mirko Marras
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Accurately detecting 3D objects from monocular images in dynamic roadside scenarios remains a challenging problem due to varying camera perspectives and unpredictable scene conditions. This paper introduces a two-stage training strategy to address these challenges. Our approach initially trains a model on the large-scale synthetic dataset, RoadSense3D, which offers a diverse range of scenarios for robust feature learning. Subsequently, we fine-tune the model on a combination of real-world datasets to enhance its adaptability to practical conditions. Experimental results of the Cube R-CNN model on challenging public benchmarks show a remarkable improvement in detection performance, with a mean average precision rising from 0.26 to 12.76 on the TUM Traffic A9 Highway dataset and from 2.09 to 6.60 on the DAIR-V2X-I dataset when performing transfer learning. Code, data, and qualitative video results are available on the project website: this https URL.
Title:
```
  Semantic and goal-oriented edge computing for satellite Earth Observation
```
Authors: Beatriz Soret, Israel Leyva-Mayorga, Antonio M. Mercado-Martínez, Marco Moretti, Antonio Jurado-Navas, Marc Martinez-Gost, Celia Sánchez de Miguel, Ainoa Salas-Prendes, Petar Popovski
Subjects: Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The integration of Semantic Communications (SemCom) and edge computing in space networks enables the optimal allocation of the scarce energy, computing, and communication resources for data-intensive applications. We use Earth Observation (EO) as a canonical functionality of satellites and review its main characteristics and challenges. We identify the potential of the space segment, represented by a low Earth orbit (LEO) satellite constellation, to serve as an edge layer for distributed intelligence. Based on that, propose a system architecture that supports semantic and goal-oriented applications for image reconstruction and object detection and localization. The simulation results show the intricate trade-offs among energy, time, and task-performance using a real dataset and State-of-the-Art (SoA) processing and communication parameters.
Title:
```
  MMDRFuse: Distilled Mini-Model with Dynamic Refresh for Multi-Modality Image Fusion
```
Authors: Yanglin Deng, Tianyang Xu, Chunyang Cheng, Xiao-Jun Wu, Josef Kittler
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In recent years, Multi-Modality Image Fusion (MMIF) has been applied to many fields, which has attracted many scholars to endeavour to improve the fusion performance. However, the prevailing focus has predominantly been on the architecture design, rather than the training strategies. As a low-level vision task, image fusion is supposed to quickly deliver output images for observation and supporting downstream tasks. Thus, superfluous computational and storage overheads should be avoided. In this work, a lightweight Distilled Mini-Model with a Dynamic Refresh strategy (MMDRFuse) is proposed to achieve this objective. To pursue model parsimony, an extremely small convolutional network with a total of 113 trainable parameters (0.44 KB) is obtained by three carefully designed supervisions. First, digestible distillation is constructed by emphasising external spatial feature consistency, delivering soft supervision with balanced details and saliency for the target network. Second, we develop a comprehensive loss to balance the pixel, gradient, and perception clues from the source images. Third, an innovative dynamic refresh training strategy is used to collaborate history parameters and current supervision during training, together with an adaptive adjust function to optimise the fusion network. Extensive experiments on several public datasets demonstrate that our method exhibits promising advantages in terms of model efficiency and complexity, with superior performance in multiple image fusion tasks and downstream pedestrian detection application. The code of this work is publicly available at this https URL.
Title:
```
  RIDE: Boosting 3D Object Detection for LiDAR Point Clouds via Rotation-Invariant Analysis
```
Authors: Zhaoxuan Wang, Xu Han, Hongxin Liu, Xianzhi Li
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The rotation robustness property has drawn much attention to point cloud analysis, whereas it still poses a critical challenge in 3D object detection. When subjected to arbitrary rotation, most existing detectors fail to produce expected outputs due to the poor rotation robustness. In this paper, we present RIDE, a pioneering exploration of Rotation-Invariance for the 3D LiDAR-point-based object DEtector, with the key idea of designing rotation-invariant features from LiDAR scenes and then effectively incorporating them into existing 3D detectors. Specifically, we design a bi-feature extractor that extracts (i) object-aware features though sensitive to rotation but preserve geometry well, and (ii) rotation-invariant features, which lose geometric information to a certain extent but are robust to rotation. These two kinds of features complement each other to decode 3D proposals that are robust to arbitrary rotations. Particularly, our RIDE is compatible and easy to plug into the existing one-stage and two-stage 3D detectors, and boosts both detection performance and rotation robustness. Extensive experiments on the standard benchmarks showcase that the mean average precision (mAP) and rotation robustness can be significantly boosted by integrating with our RIDE, with +5.6% mAP and 53% rotation robustness improvement on KITTI, +5.1% and 28% improvement correspondingly on nuScenes. The code will be available soon.
Title:
```
  Deep Learning Based Speckle Filtering for Polarimetric SAR Images. Application to Sentinel-1
```
Authors: Alejandro Mestre-Quereda, Juan M. Lopez-Sanchez
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Speckle suppression in synthetic aperture radar (SAR) images is a key processing step which continues to be a research topic. A wide variety of methods, using either spatially-based approaches or transform-based strategies, have been developed and have shown to provide outstanding results. However, recent advances in deep learning techniques and their application to SAR image despeckling have been demonstrated to offer state-of-the-art results. Unfortunately, they have been mostly applied to single-polarimetric images. The extension of a deep learning-based approach for speckle removal to polarimetric SAR (PolSAR) images is complicated because of the complex nature of the measured covariance matrices for every image pixel, the properties of which must be preserved during filtering. In this work, we propose a complete framework to remove speckle in polarimetric SAR images using a convolutional neural network. The methodology includes a reversible transformation of the original complex covariance matrix to obtain a set of real-valued intensity bands which are fed to the neural network. In addition, the proposed method includes a change detection strategy to avoid the neural network to learn erroneous features in areas strongly affected by temporal changes, so that the network only learns the underlying speckle component present in the data. The method is implemented and tested with dual-polarimetric images acquired by Sentinel-1. Experiments show that the proposed approach offers exceptional results in both speckle reduction and resolution preservation. More importantly, it is also shown that the neural network is not generating artifacts or introducing bias in the filtered images, making them suitable for further polarimetric processing and exploitation.
Title:
```
  TempoFormer: A Transformer for Temporally-aware Representations in Change Detection
```
Authors: Talia Tseriotou, Adam Tsakalidis, Maria Liakata
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Dynamic representation learning plays a pivotal role in understanding the evolution of linguistic content over time. On this front both context and time dynamics as well as their interplay are of prime importance. Current approaches model context via pre-trained representations, which are typically temporally agnostic. Previous work on modeling context and temporal dynamics has used recurrent methods, which are slow and prone to overfitting. Here we introduce TempoFormer, the fist task-agnostic transformer-based and temporally-aware model for dynamic representation learning. Our approach is jointly trained on inter and intra context dynamics and introduces a novel temporal variation of rotary positional embeddings. The architecture is flexible and can be used as the temporal representation foundation of other models or applied to different transformer-based architectures. We show new SOTA performance on three different real-time change detection tasks.
Title:
```
  Advanced POD-Based Performance Evaluation of Classifiers Applied to Human Driver Lane Changing Prediction
```
Authors: Zahra Rastin, Dirk Söffker
Subjects: Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Machine learning (ML) classifiers serve as essential tools facilitating classification and prediction across various domains. The performance of these algorithms should be known to ensure their reliable application. In certain fields, receiver operating characteristic and precision-recall curves are frequently employed to assess machine learning algorithms without accounting for the impact of process parameters. However, it may be essential to evaluate the performance of these algorithms in relation to such parameters. As a performance evaluation metric capable of considering the effects of process parameters, this paper uses a modified probability of detection (POD) approach to assess the reliability of ML-based algorithms. As an example, the POD-based approach is employed to assess ML models used for predicting the lane changing behavior of a vehicle driver. The time remaining to the predicted (and therefore unknown) lane changing event is considered as process parameter. The hit/miss approach to POD is taken here and modified by considering the probability of lane changing derived from ML algorithms at each time step, and obtaining the final result of the analysis accordingly. This improves the reliability of results compared to the standard hit/miss approach, which considers the outcome of the classifiers as either 0 or 1, while also simplifying evaluation compared to the â versus a approach. Performance evaluation results of the proposed approach are compared with those obtained with the standard hit/miss approach and a pre-developed â versus a approach to validate the effectiveness of the proposed method. The comparison shows that this method provides an averaging conservative behavior with the advantage of enhancing the reliability of the hit/miss approach to POD while retaining its simplicity.
Title:
```
  Addressing the challenges of loop detection in agricultural environments
```
Authors: Nicolás Soncini, Javier Civera, Taihú Pire
Subjects: Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract While visual SLAM systems are well studied and achieve impressive results in indoor and urban settings, natural, outdoor and open-field environments are much less explored and still present relevant research challenges. Visual navigation and local mapping have shown a relatively good performance in open-field environments. However, globally consistent mapping and long-term localization still depend on the robustness of loop detection and closure, for which the literature is scarce. In this work we propose a novel method to pave the way towards robust loop detection in open fields, particularly in agricultural settings, based on local feature search and stereo geometric refinement, with a final stage of relative pose estimation. Our method consistently achieves good loop detections, with a median error of 15cm. We aim to characterize open fields as a novel environment for loop detection, understanding the limitations and problems that arise when dealing with them.
Title:
```
  Neural Network Equalizers and Successive Interference Cancellation for Bandlimited Channels with a Nonlinearity
```
Authors: Daniel Plabst, Tobias Prinz, Francesca Diedolo, Thomas Wiegart, Georg Böcherer, Norbert Hanik, Gerhard Kramer
Subjects: Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Neural networks (NNs) inspired by the forward-backward algorithm (FBA) are used as equalizers for bandlimited channels with a memoryless nonlinearity. The NN-equalizers are combined with successive interference cancellation (SIC) to approach the information rates of joint detection and decoding (JDD) with considerably less complexity than JDD and other existing equalizers. Simulations for short-haul optical fiber links with square-law detection illustrate the gains.
Title:
```
  Object Detection for Vehicle Dashcams using Transformers
```
Authors: Osama Mustafa, Khizer Ali, Anam Bibi, Imran Siddiqi, Momina Moetesum
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The use of intelligent automation is growing significantly in the automotive industry, as it assists drivers and fleet management companies, thus increasing their productivity. Dash cams are now been used for this purpose which enables the instant identification and understanding of multiple objects and occurrences in the surroundings. In this paper, we propose a novel approach for object detection in dashcams using transformers. Our system is based on the state-of-the-art DEtection TRansformer (DETR), which has demonstrated strong performance in a variety of conditions, including different weather and illumination scenarios. The use of transformers allows for the consideration of contextual information in decisionmaking, improving the accuracy of object detection. To validate our approach, we have trained our DETR model on a dataset that represents real-world conditions. Our results show that the use of intelligent automation through transformers can significantly enhance the capabilities of dashcam systems. The model achieves an mAP of 0.95 on detection.
Title:
```
  Mining Field Data for Tree Species Recognition at Scale
```
Authors: Dimitri Gominski, Daniel Ortiz-Gonzalo, Martin Brandt, Maurice Mugabowindekwe, Rasmus Fensholt
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Individual tree species labels are particularly hard to acquire due to the expert knowledge needed and the limitations of photointerpretation. Here, we present a methodology to automatically mine species labels from public forest inventory data, using available pretrained tree detection models. We identify tree instances in aerial imagery and match them with field data with close to zero human involvement. We conduct a series of experiments on the resulting dataset, and show a beneficial effect when adding noisy or even unlabeled data points, highlighting a strong potential for large-scale individual species mapping.
Title:
```
  Network transferability of adversarial patches in real-time object detection
```
Authors: Jens Bayer, Stefan Becker, David Münch, Michael Arens
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Adversarial patches in computer vision can be used, to fool deep neural networks and manipulate their decision-making process. One of the most prominent examples of adversarial patches are evasion attacks for object detectors. By covering parts of objects of interest, these patches suppress the detections and thus make the target object 'invisible' to the object detector. Since these patches are usually optimized on a specific network with a specific train dataset, the transferability across multiple networks and datasets is not given. This paper addresses these issues and investigates the transferability across numerous object detector architectures. Our extensive evaluation across various models on two distinct datasets indicates that patches optimized with larger models provide better network transferability than patches that are optimized with smaller models.
Title:
```
  An Empirical Study of API Misuses of Data-Centric Libraries
```
Authors: Akalanka Galappaththi, Sarah Nadi, Christoph Treude
Subjects: Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Developers rely on third-party library Application Programming Interfaces (APIs) when developing software. However, libraries typically come with assumptions and API usage constraints, whose violation results in API misuse. API misuses may result in crashes or incorrect behavior. Even though API misuse is a well-studied area, a recent study of API misuse of deep learning libraries showed that the nature of these misuses and their symptoms are different from misuses of traditional libraries, and as a result highlighted potential shortcomings of current misuse detection tools. We speculate that these observations may not be limited to deep learning API misuses but may stem from the data-centric nature of these APIs. Data-centric libraries often deal with diverse data structures, intricate processing workflows, and a multitude of parameters, which can make them inherently more challenging to use correctly. Therefore, understanding the potential misuses of these libraries is important to avoid unexpected application behavior. To this end, this paper contributes an empirical study of API misuses of five data-centric libraries that cover areas such as data processing, numerical computation, machine learning, and visualization. We identify misuses of these libraries by analyzing data from both Stack Overflow and GitHub. Our results show that many of the characteristics of API misuses observed for deep learning libraries extend to misuses of the data-centric library APIs we study. We also find that developers tend to misuse APIs from data-centric libraries, regardless of whether the API directive appears in the documentation. Overall, our work exposes the challenges of API misuse in data-centric libraries, rather than only focusing on deep learning libraries. Our collected misuses and their characterization lay groundwork for future research to help reduce misuses of these libraries.
Title:
```
  What is YOLOv8: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector
```
Authors: Muhammad Yaseen
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract This study presents a detailed analysis of the YOLOv8 object detection model, focusing on its architecture, training techniques, and performance improvements over previous iterations like YOLOv5. Key innovations, including the CSPNet backbone for enhanced feature extraction, the FPN+PAN neck for superior multi-scale object detection, and the transition to an anchor-free approach, are thoroughly examined. The paper reviews YOLOv8's performance across benchmarks like Microsoft COCO and Roboflow 100, highlighting its high accuracy and real-time capabilities across diverse hardware platforms. Additionally, the study explores YOLOv8's developer-friendly enhancements, such as its unified Python package and CLI, which streamline model training and deployment. Overall, this research positions YOLOv8 as a state-of-the-art solution in the evolving object detection field.
Title:
```
  microYOLO: Towards Single-Shot Object Detection on Microcontrollers
```
Authors: Mark Deutel, Christopher Mutschler, Jürgen Teich
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract This work-in-progress paper presents results on the feasibility of single-shot object detection on microcontrollers using YOLO. Single-shot object detectors like YOLO are widely used, however due to their complexity mainly on larger GPU-based platforms. We present microYOLO, which can be used on Cortex-M based microcontrollers, such as the OpenMV H7 R2, achieving about 3.5 FPS when classifying 128x128 RGB images while using less than 800 KB Flash and less than 350 KB RAM. Furthermore, we share experimental results for three different object detection tasks, analyzing the accuracy of microYOLO on them.
Title:
```
  Robust Statistical Scaling of Outlier Scores: Improving the Quality of Outlier Probabilities for Outliers (Extended Version)
```
Authors: Philipp Röchner, Henrique O. Marques, Ricardo J. G. B. Campello, Arthur Zimek, Franz Rothlauf
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Outlier detection algorithms typically assign an outlier score to each observation in a dataset, indicating the degree to which an observation is an outlier. However, these scores are often not comparable across algorithms and can be difficult for humans to interpret. Statistical scaling addresses this problem by transforming outlier scores into outlier probabilities without using ground-truth labels, thereby improving interpretability and comparability across algorithms. However, the quality of this transformation can be different for outliers and inliers. Missing outliers in scenarios where they are of particular interest - such as healthcare, finance, or engineering - can be costly or dangerous. Thus, ensuring good probabilities for outliers is essential. This paper argues that statistical scaling, as commonly used in the literature, does not produce equally good probabilities for outliers as for inliers. Therefore, we propose robust statistical scaling, which uses robust estimators to improve the probabilities for outliers. We evaluate several variants of our method against other outlier score transformations for real-world datasets and outlier detection algorithms, where it can improve the probabilities for outliers.
Title:
```
  Enhancing Intrusion Detection in IoT Environments: An Advanced Ensemble Approach Using Kolmogorov-Arnold Networks
```
Authors: Amar Amouri, Mohamad Mahmoud Al Rahhal, Yakoub Bazi, Ismail Butun, Imad Mahgoub
Subjects: Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In recent years, the evolution of machine learning techniques has significantly impacted the field of intrusion detection, particularly within the context of the Internet of Things (IoT). As IoT networks expand, the need for robust security measures to counteract potential threats has become increasingly critical. This paper introduces a hybrid Intrusion Detection System (IDS) that synergistically combines Kolmogorov-Arnold Networks (KANs) with the XGBoost algorithm. Our proposed IDS leverages the unique capabilities of KANs, which utilize learnable activation functions to model complex relationships within data, alongside the powerful ensemble learning techniques of XGBoost, known for its high performance in classification tasks. This hybrid approach not only enhances the detection accuracy but also improves the interpretability of the model, making it suitable for dynamic and intricate IoT environments. Experimental evaluations demonstrate that our hybrid IDS achieves an impressive detection accuracy exceeding 99% in distinguishing between benign and malicious activities. Additionally, we were able to achieve F1 scores, precision, and recall that exceeded 98%. Furthermore, we conduct a comparative analysis against traditional Multi-Layer Perceptron (MLP) networks, assessing performance metrics such as Precision, Recall, and F1-score. The results underscore the efficacy of integrating KANs with XGBoost, highlighting the potential of this innovative approach to significantly strengthen the security framework of IoT networks.
Title:
```
  Fall Detection for Smart Living using YOLOv5
```
Authors: Gracile Astlin Pereira
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract This work introduces a fall detection system using the YOLOv5mu model, which achieved a mean average precision (mAP) of 0.995, demonstrating exceptional accuracy in identifying fall events within smart home environments. Enhanced by advanced data augmentation techniques, the model demonstrates significant robustness and adaptability across various conditions. The integration of YOLOv5mu offers precise, real-time fall detection, which is crucial for improving safety and emergency response for residents. Future research will focus on refining the system by incorporating contextual data and exploring multi-sensor approaches to enhance its performance and practical applicability in diverse environments.
Title:
```
  Efficient Slice Anomaly Detection Network for 3D Brain MRI Volume
```
Authors: Zeduo Zhang, Yalda Mohsenzadeh
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Current anomaly detection methods excel with benchmark industrial data but struggle with natural images and medical data due to varying definitions of 'normal' and 'abnormal.' This makes accurate identification of deviations in these fields particularly challenging. Especially for 3D brain MRI data, all the state-of-the-art models are reconstruction-based with 3D convolutional neural networks which are memory-intensive, time-consuming and producing noisy outputs that require further post-processing. We propose a framework called Simple Slice-based Network (SimpleSliceNet), which utilizes a model pre-trained on ImageNet and fine-tuned on a separate MRI dataset as a 2D slice feature extractor to reduce computational cost. We aggregate the extracted features to perform anomaly detection tasks on 3D brain MRI volumes. Our model integrates a conditional normalizing flow to calculate log likelihood of features and employs the Semi-Push-Pull Mechanism to enhance anomaly detection accuracy. The results indicate improved performance, showcasing our model's remarkable adaptability and effectiveness when addressing the challenges exists in brain MRI data. In addition, for the large-scale 3D brain volumes, our model SimpleSliceNet outperforms the state-of-the-art 2D and 3D models in terms of accuracy, memory usage and time consumption. Code is available at: https://anonymous.4open.science/r/SimpleSliceNet-8EA3.
Title:
```
  ClimDetect: A Benchmark Dataset for Climate Change Detection and Attribution
```
Authors: Sungduk Yu, Brian L. White, Anahita Bhiwandiwalla, Musashi Hinck, Matthew Lyle Olson, Tung Nguyen, Vasudev Lal
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Detecting and attributing temperature increases due to climate change is crucial for understanding global warming and guiding adaptation strategies. The complexity of distinguishing human-induced climate signals from natural variability has challenged traditional detection and attribution (D&A) approaches, which seek to identify specific "fingerprints" in climate response variables. Deep learning offers potential for discerning these complex patterns in expansive spatial datasets. However, lack of standard protocols has hindered consistent comparisons across studies. We introduce ClimDetect, a standardized dataset of over 816k daily climate snapshots, designed to enhance model accuracy in identifying climate change signals. ClimDetect integrates various input and target variables used in past research, ensuring comparability and consistency. We also explore the application of vision transformers (ViT) to climate data, a novel and modernizing approach in this context. Our open-access data and code serve as a benchmark for advancing climate science through improved model evaluations. ClimDetect is publicly accessible via Huggingface dataet respository at: this https URL.
Title:
```
  Spatio-Temporal Context Prompting for Zero-Shot Action Detection
```
Authors: Wei-Jhe Huang, Min-Hung Chen, Shang-Hong Lai
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Spatio-temporal action detection encompasses the tasks of localizing and classifying individual actions within a video. Recent works aim to enhance this process by incorporating interaction modeling, which captures the relationship between people and their surrounding context. However, these approaches have primarily focused on fully-supervised learning, and the current limitation lies in the lack of generalization capability to recognize unseen action categories. In this paper, we aim to adapt the pretrained image-language models to detect unseen actions. To this end, we propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction. Meanwhile, our Context Prompting module will utilize contextual information to prompt labels, thereby enhancing the generation of more representative text features. Moreover, to address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism which employs pretrained visual knowledge to find each person's interest context tokens, and then these tokens will be used for prompting to generate text features tailored to each individual. To evaluate the ability to detect unseen actions, we propose a comprehensive benchmark on J-HMDB, UCF101-24, and AVA datasets. The experiments show that our method achieves superior results compared to previous approaches and can be further extended to multi-action videos, bringing it closer to real-world applications. The code and data can be found in this https URL.
Keyword: face recognition

Title:
```
  Started Off Local, Now We're in the Cloud: Forensic Examination of the Amazon Echo Show 15 Smart Display
```
Authors: Jona Crasselt, Gaston Pugliese
Subjects: Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Amazon Echo is one of the most popular product families of smart speakers and displays. Considering their growing presence in modern households as well as the digital traces associated with residents' interactions with these devices, analyses of Echo products are likely to become more common for forensic investigators at "smart home" crime scenes. With this in mind, we present the first forensic examination of the Echo Show 15, Amazon's largest smart display running on Fire OS and the first Echo device with Visual ID, a face recognition feature. We unveil a non-invasive method for accessing the unencrypted file system of the Echo Show 15 based on an undocumented pinout for the eMMC interface which we discovered on the main logic board. On the device, we identify various local usage artifacts, such as searched products, streamed movies, visited websites, metadata of photos and videos as well as logged events of Visual ID about movements and users detected by the built-in camera. Furthermore, we utilize an insecurely stored token on the Echo Show 15 to obtain access to remote user artifacts in Amazon's cloud, including Alexa voice requests, calendars, contacts, conversations, photos, and videos. In this regard, we also identify new Amazon APIs through network traffic analysis of two companion apps, namely Alexa and Photos. Overall, in terms of practical relevance, our findings demonstrate a non-destructive way of data acquisition for Echo Show 15 devices as well as how to lift the scope of forensic traces from local artifacts on the device to remote artifacts stored in the cloud.
Keyword: augmentation

Title:
```
  Handling Geometric Domain Shifts in Semantic Segmentation of Surgical RGB and Hyperspectral Images
```
Authors: Silvia Seidlitz, Jan Sellner, Alexander Studier-Fischer, Alessandro Motta, Berkin Özdemir, Beat P. Müller-Stich, Felix Nickel, Lena Maier-Hein
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Robust semantic segmentation of intraoperative image data holds promise for enabling automatic surgical scene understanding and autonomous robotic surgery. While model development and validation are primarily conducted on idealistic scenes, geometric domain shifts, such as occlusions of the situs, are common in real-world open surgeries. To close this gap, we (1) present the first analysis of state-of-the-art (SOA) semantic segmentation models when faced with geometric out-of-distribution (OOD) data, and (2) propose an augmentation technique called "Organ Transplantation", to enhance generalizability. Our comprehensive validation on six different OOD datasets, comprising 600 RGB and hyperspectral imaging (HSI) cubes from 33 pigs, each annotated with 19 classes, reveals a large performance drop in SOA organ segmentation models on geometric OOD data. This performance decline is observed not only in conventional RGB data (with a dice similarity coefficient (DSC) drop of 46 %) but also in HSI data (with a DSC drop of 45 %), despite the richer spectral information content. The performance decline increases with the spatial granularity of the input data. Our augmentation technique improves SOA model performance by up to 67 % for RGB data and 90 % for HSI data, achieving performance at the level of in-distribution performance on real OOD test data. Given the simplicity and effectiveness of our augmentation method, it is a valuable tool for addressing geometric domain shifts in surgical scene segmentation, regardless of the underlying model. Our code and pre-trained models are publicly available at this https URL.
Title:
```
  A Statistical Framework for Data-dependent Retrieval-Augmented Models
```
Authors: Soumya Basu, Ankit Singh Rawat, Manzil Zaheer
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Modern ML systems increasingly augment input instances with additional relevant information to enhance final prediction. Despite growing interest in such retrieval-augmented models, their fundamental properties and training are not well understood. We propose a statistical framework to study such models with two components: 1) a {\em retriever} to identify the relevant information out of a large corpus via a data-dependent metric; and 2) a {\em predictor} that consumes the input instances along with the retrieved information to make the final predictions. We present a principled method for end-to-end training of both components and draw connections with various training approaches in the literature. Furthermore, we establish excess risk bounds for retrieval-augmented models while delineating the contributions of both retriever and predictor towards the model performance. We validate the utility of our proposed training methods along with the key takeaways from our statistical analysis on open domain question answering task where retrieval augmentation is important.
Title:
```
  Legilimens: Practical and Unified Content Moderation for Large Language Model Services
```
Authors: Jialin Wu, Jiangyi Deng, Shengyuan Pang, Yanjiao Chen, Jiayang Xu, Xinfeng Li, Wenyuan Xu
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Given the societal impact of unsafe content generated by large language models (LLMs), ensuring that LLM services comply with safety standards is a crucial concern for LLM service providers. Common content moderation methods are limited by an effectiveness-and-efficiency dilemma, where simple models are fragile while sophisticated models consume excessive computational resources. In this paper, we reveal for the first time that effective and efficient content moderation can be achieved by extracting conceptual features from chat-oriented LLMs, despite their initial fine-tuning for conversation rather than content moderation. We propose a practical and unified content moderation framework for LLM services, named Legilimens, which features both effectiveness and efficiency. Our red-team model-based data augmentation enhances the robustness of Legilimens against state-of-the-art jailbreaking. Additionally, we develop a framework to theoretically analyze the cost-effectiveness of Legilimens compared to other methods. We have conducted extensive experiments on five host LLMs, seventeen datasets, and nine jailbreaking methods to verify the effectiveness, efficiency, and robustness of Legilimens against normal and adaptive adversaries. A comparison of Legilimens with both commercial and academic baselines demonstrates the superior performance of Legilimens. Furthermore, we confirm that Legilimens can be applied to few-shot scenarios and extended to multi-label classification tasks.
Title:
```
  SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models
```
Authors: Dian Yu, Baolin Peng, Ye Tian, Linfeng Song, Haitao Mi, Dong Yu
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract There is a growing trend of teaching large language models (LLMs) to solve mathematical problems through coding. Existing studies primarily focus on prompting powerful, closed-source models to generate seed training data followed by in-domain data augmentation, equipping LLMs with considerable capabilities for code-aided mathematical reasoning. However, continually training these models on augmented data derived from a few datasets such as GSM8K may impair their generalization abilities and restrict their effectiveness to a narrow range of question types. Conversely, the potential of improving such LLMs by leveraging large-scale, expert-written, diverse math question-answer pairs remains unexplored. To utilize these resources and tackle unique challenges such as code response assessment, we propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation. We also explore different alignment algorithms with self-generated instruction/preference data to foster continuous improvement. Experiments across both in-domain (up to +5.7%) and out-of-domain (+4.4%) benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
Title:
```
  GenDDS: Generating Diverse Driving Video Scenarios with Prompt-to-Video Generative Model
```
Authors: Yongjie Fu, Yunlong Li, Xuan Di
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Autonomous driving training requires a diverse range of datasets encompassing various traffic conditions, weather scenarios, and road types. Traditional data augmentation methods often struggle to generate datasets that represent rare occurrences. To address this challenge, we propose GenDDS, a novel approach for generating driving scenarios generation by leveraging the capabilities of Stable Diffusion XL (SDXL), an advanced latent diffusion model. Our methodology involves the use of descriptive prompts to guide the synthesis process, aimed at producing realistic and diverse driving scenarios. With the power of the latest computer vision techniques, such as ControlNet and Hotshot-XL, we have built a complete pipeline for video generation together with SDXL. We employ the KITTI dataset, which includes real-world driving videos, to train the model. Through a series of experiments, we demonstrate that our model can generate high-quality driving videos that closely replicate the complexity and variability of real-world driving scenarios. This research contributes to the development of sophisticated training data for autonomous driving systems and opens new avenues for creating virtual environments for simulation and validation purposes.
Title:
```
  Fall Detection for Smart Living using YOLOv5
```
Authors: Gracile Astlin Pereira
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract This work introduces a fall detection system using the YOLOv5mu model, which achieved a mean average precision (mAP) of 0.995, demonstrating exceptional accuracy in identifying fall events within smart home environments. Enhanced by advanced data augmentation techniques, the model demonstrates significant robustness and adaptability across various conditions. The integration of YOLOv5mu offers precise, real-time fall detection, which is crucial for improving safety and emergency response for residents. Future research will focus on refining the system by incorporating contextual data and exploring multi-sensor approaches to enhance its performance and practical applicability in diverse environments.

LeeKyungwook / get-arxiv-noti

New submissions for Thursday, 29 August 2024 (showing 325 of 325 entries ) #1250

Keyword: detection

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Keyword: face recognition

Title:

Keyword: augmentation

Title:

Title:

Title:

Title:

Title:

Title: