New submissions for Tuesday, 6 August 2024 (showing 557 of 557 entries )

Keyword: detection

Title:

      Evaluating and Enhancing Trustworthiness of LLMs in Perception Tasks

Authors: Malsha Ashani Mahawatta Dona, Beatriz Cabrero-Daniel, Yinan Yu, Christian Berger
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Today's advanced driver assistance systems (ADAS), like adaptive cruise control or rear collision warning, are finding broader adoption across vehicle classes. Integrating such advanced, multimodal Large Language Models (LLMs) on board a vehicle, which are capable of processing text, images, audio, and other data types, may have the potential to greatly enhance passenger comfort. Yet, an LLM's hallucinations are still a major challenge to be addressed. In this paper, we systematically assessed potential hallucination detection strategies for such LLMs in the context of object detection in vision-based data on the example of pedestrian detection and localization. We evaluate three hallucination detection strategies applied to two state-of-the-art LLMs, the proprietary GPT-4V and the open LLaVA, on two datasets (Waymo/US and PREPER CITY/Sweden). Our results show that these LLMs can describe a traffic situation to an impressive level of detail but are still challenged for further analysis activities such as object localization. We evaluate and extend hallucination detection approaches when applying these LLMs to video sequences in the example of pedestrian detection. Our experiments show that, at the moment, the state-of-the-art proprietary LLM performs much better than the open LLM. Furthermore, consistency enhancement techniques based on voting, such as the Best-of-Three (BO3) method, do not effectively reduce hallucinations in LLMs that tend to exhibit high false negatives in detecting pedestrians. However, extending the hallucination detection by including information from the past helps to improve results.
Title:
```
  Contextual Cross-Modal Attention for Audio-Visual Deepfake Detection and Localization
```
Authors: Vinaya Sree Katamneni, Ajita Rattani
Subjects: Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In the digital age, the emergence of deepfakes and synthetic media presents a significant threat to societal and political integrity. Deepfakes based on multi-modal manipulation, such as audio-visual, are more realistic and pose a greater threat. Current multi-modal deepfake detectors are often based on the attention-based fusion of heterogeneous data streams from multiple modalities. However, the heterogeneous nature of the data (such as audio and visual signals) creates a distributional modality gap and poses a significant challenge in effective fusion and hence multi-modal deepfake detection. In this paper, we propose a novel multi-modal attention framework based on recurrent neural networks (RNNs) that leverages contextual information for audio-visual deepfake detection. The proposed approach applies attention to multi-modal multi-sequence representations and learns the contributing features among them for deepfake detection and localization. Thorough experimental validations on audio-visual deepfake datasets, namely FakeAVCeleb, AV-Deepfake1M, TVIL, and LAV-DF datasets, demonstrate the efficacy of our approach. Cross-comparison with the published studies demonstrates superior performance of our approach with an improved accuracy and precision by 3.47% and 2.05% in deepfake detection and localization, respectively. Thus, obtaining state-of-the-art performance. To facilitate reproducibility, the code and the datasets information is available at this https URL.
Title:
```
  Non-linear Analysis Based ECG Classification of Cardiovascular Disorders
```
Authors: Suraj Kumar Behera, Debanjali Bhattacharya, Ninad Aithal, Neelam Sinha
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Multi-channel ECG-based cardiac disorders detection has an impact on cardiac care and treatment. Limitations of existing methods included variation in ECG waveforms due to the location of electrodes, high non-linearity in the signal, and amplitude measurement in millivolts. The present study reports a non-linear analysis-based methodology that utilizes Recurrence plot visualization. The patterned occurrence of well-defined structures, such as the QRS complex, can be exploited effectively using Recurrence plots. This Recurrence-based method is applied to the publicly available Physikalisch-Technische Bundesanstalt (PTB) dataset from PhysioNet database, where we studied four classes of different cardiac disorders (Myocardial infarction, Bundle branch blocks, Cardiomyopathy, and Dysrhythmia) and healthy controls, achieving an impressive classification accuracy of 100%. Additionally, t-SNE plot visualizations of the latent space embeddings derived from Recurrence plots and Recurrence Quantification Analysis features reveal a clear demarcation between the considered cardiac disorders and healthy individuals, demonstrating the potential of this approach.
Title:
```
  Accelerating Domain-Aware Electron Microscopy Analysis Using Deep Learning Models with Synthetic Data and Image-Wide Confidence Scoring
```
Authors: Matthew J. Lynch, Ryan Jacobs, Gabriella Bruno, Priyam Patki, Dane Morgan, Kevin G. Field
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The integration of machine learning (ML) models enhances the efficiency, affordability, and reliability of feature detection in microscopy, yet their development and applicability are hindered by the dependency on scarce and often flawed manually labeled datasets and a lack of domain awareness. We addressed these challenges by creating a physics-based synthetic image and data generator, resulting in a machine learning model that achieves comparable precision (0.86), recall (0.63), F1 scores (0.71), and engineering property predictions (R2=0.82) to a model trained on human-labeled data. We enhanced both models by using feature prediction confidence scores to derive an image-wide confidence metric, enabling simple thresholding to eliminate ambiguous and out-of-domain images resulting in performance boosts of 5-30% with a filtering-out rate of 25%. Our study demonstrates that synthetic data can eliminate human reliance in ML and provides a means for domain awareness in cases where many feature detections per image are needed.
Title:
```
  Autonomous Integration of Bench-Top Wet Lab Equipment
```
Authors: Zachary Logan, Kam Undieh, Mohammad Goli
Subjects: Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Laboratory automation is an expensive and complicated endeavor with limited inflexible options for small-scale labs. We develop a prototype system for tending to a bench-top centrifuge using computer vision methods for color detection and circular Hough Transforms to detect and localize centrifuge buckets. Initial results show that the prototype is capable of automating the usage of regular bench-top lab equipment.
Title:
```
  Mitigating the Impact of Malware Evolution on API Sequence-based Windows Malware Detector
```
Authors: Xingyuan Wei, Ce Li, Qiujian Lv, Ning Li, Degang Sun, Yan Wang
Subjects: Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In dynamic Windows malware detection, deep learning models are extensively deployed to analyze API sequences. Methods based on API sequences play a crucial role in malware prevention. However, due to the continuous updates of APIs and the changes in API sequence calls leading to the constant evolution of malware variants, the detection capability of API sequence-based malware detection models significantly diminishes over time. We observe that the API sequences of malware samples before and after evolution usually have similar malicious semantics. Specifically, compared to the original samples, evolved malware samples often use the API sequences of the pre-evolution samples to achieve similar malicious behaviors. For instance, they access similar sensitive system resources and extend new malicious functions based on the original functionalities. In this paper, we propose a frame(MME), a framework that can enhance existing API sequence-based malware detectors and mitigate the adverse effects of malware evolution. To help detection models capture the similar semantics of these post-evolution API sequences, our framework represents API sequences using API knowledge graphs and system resource encodings and applies contrastive learning to enhance the model's encoder. Results indicate that, compared to Regular Text-CNN, our framework can significantly reduce the false positive rate by 13.10% and improve the F1-Score by 8.47% on five years of data, achieving the best experimental results. Additionally, evaluations show that our framework can save on the human costs required for model maintenance. We only need 1% of the budget per month to reduce the false positive rate by 11.16% and improve the F1-Score by 6.44%.
Title:
```
  Automated Phishing Detection Using URLs and Webpages
```
Authors: Huilin Wang, Bryan Hooi
Subjects: Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Phishing detection is a critical cybersecurity task that involves the identification and neutralization of fraudulent attempts to obtain sensitive information, thereby safeguarding individuals and organizations from data breaches and financial loss. In this project, we address the constraints of traditional reference-based phishing detection by developing an LLM agent framework. This agent harnesses Large Language Models to actively fetch and utilize online information, thus providing a dynamic reference system for more accurate phishing detection. This innovation circumvents the need for a static knowledge base, offering a significant enhancement in adaptability and efficiency for automated security measures. The project report includes an initial study and problem analysis of existing solutions, which motivated us to develop a new framework. We demonstrate the framework with LLMs simulated as agents and detail the techniques required for construction, followed by a complete implementation with a proof-of-concept as well as experiments to evaluate our solution's performance against other similar solutions. The results show that our approach has achieved with accuracy of 0.945, significantly outperforms the existing solution(DynaPhish) by 0.445. Furthermore, we discuss the limitations of our approach and suggest improvements that could make it more effective. Overall, the proposed framework has the potential to enhance the effectiveness of current reference-based phishing detection approaches and could be adapted for real-world applications.
Title:
```
  Multiple Contexts and Frequencies Aggregation Network forDeepfake Detection
```
Authors: Zifeng Li, Wenzhong Tang, Shijun Gao, Shuai Wang, Yanxiang Wang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Deepfake detection faces increasing challenges since the fast growth of generative models in developing massive and diverse Deepfake technologies. Recent advances rely on introducing heuristic features from spatial or frequency domains rather than modeling general forgery features within backbones. To address this issue, we turn to the backbone design with two intuitive priors from spatial and frequency detectors, \textit{i.e.,} learning robust spatial attributes and frequency distributions that are discriminative for real and fake samples. To this end, we propose an efficient network for face forgery detection named MkfaNet, which consists of two core modules. For spatial contexts, we design a Multi-Kernel Aggregator that adaptively selects organ features extracted by multiple convolutions for modeling subtle facial differences between real and fake faces. For the frequency components, we propose a Multi-Frequency Aggregator to process different bands of frequency components by adaptively reweighing high-frequency and low-frequency features. Comprehensive experiments on seven popular deepfake detection benchmarks demonstrate that our proposed MkfaNet variants achieve superior performances in both within-domain and across-domain evaluations with impressive efficiency of parameter usage.
Title:
```
  IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection
```
Authors: Hong Guan, Yancheng Wang, Lulu Xie, Soham Nag, Rajeev Goel, Niranjan Erappa Narayana Swamy, Yingzhen Yang, Chaowei Xiao, Jonathan Prisby, Ross Maciejewski, Jia Zou
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Effective fraud detection and analysis of government-issued identity documents, such as passports, driver's licenses, and identity cards, are essential in thwarting identity theft and bolstering security on online platforms. The training of accurate fraud detection and analysis tools depends on the availability of extensive identity document datasets. However, current publicly available benchmark datasets for identity document analysis, including MIDV-500, MIDV-2020, and FMIDV, fall short in several respects: they offer a limited number of samples, cover insufficient varieties of fraud patterns, and seldom include alterations in critical personal identifying fields like portrait images, limiting their utility in training models capable of detecting realistic frauds while preserving privacy. In response to these shortcomings, our research introduces a new benchmark dataset, IDNet, designed to advance privacy-preserving fraud detection efforts. The IDNet dataset comprises 837,060 images of synthetically generated identity documents, totaling approximately 490 gigabytes, categorized into 20 types from $10$ U.S. states and 10 European countries. We evaluate the utility and present use cases of the dataset, illustrating how it can aid in training privacy-preserving fraud detection methods, facilitating the generation of camera and video capturing of identity documents, and testing schema unification and other identity document management functionalities.
Title:
```
  A Comparative Analysis of CNN-based Deep Learning Models for Landslide Detection
```
Authors: Omkar Oak, Rukmini Nazre, Soham Naigaonkar, Suraj Sawant, Himadri Vaidya
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Landslides inflict substantial societal and economic damage, underscoring their global significance as recurrent and destructive natural disasters. Recent landslides in northern parts of India and Nepal have caused significant disruption, damaging infrastructure and posing threats to local communities. Convolutional Neural Networks (CNNs), a type of deep learning technique, have shown remarkable success in image processing. Because of their sophisticated architectures, advanced CNN-based models perform better in landslide detection than conventional algorithms. The purpose of this work is to investigate CNNs' potential in more detail, with an emphasis on comparison of CNN based models for better landslide detection. We compared four traditional semantic segmentation models (U-Net, LinkNet, PSPNet, and FPN) and utilized the ResNet50 backbone encoder to implement them. Moreover, we have experimented with the hyperparameters such as learning rates, batch sizes, and regularization techniques to fine-tune the models. We have computed the confusion matrix for each model and used performance metrics including precision, recall and f1-score to evaluate and compare the deep learning models. According to the experimental results, LinkNet gave the best results among the four models having an Accuracy of 97.49% and a F1-score of 85.7% (with 84.49% precision, 87.07% recall). We have also presented a comprehensive comparison of all pixel-wise confusion matrix results and the time taken to train each model.
Title:
```
  WaitGPT: Monitoring and Steering Conversational LLM Agent in Data Analysis with On-the-Fly Code Visualization
```
Authors: Liwenhan Xie, Chengbo Zheng, Haijun Xia, Huamin Qu, Chen Zhu-Tian
Subjects: Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Large language models (LLMs) support data analysis through conversational user interfaces, as exemplified in OpenAI's ChatGPT (formally known as Advanced Data Analysis or Code Interpreter). Essentially, LLMs produce code for accomplishing diverse analysis tasks. However, presenting raw code can obscure the logic and hinder user verification. To empower users with enhanced comprehension and augmented control over analysis conducted by LLMs, we propose a novel approach to transform LLM-generated code into an interactive visual representation. In the approach, users are provided with a clear, step-by-step visualization of the LLM-generated code in real time, allowing them to understand, verify, and modify individual data operations in the analysis. Our design decisions are informed by a formative study (N=8) probing into user practice and challenges. We further developed a prototype named WaitGPT and conducted a user study (N=12) to evaluate its usability and effectiveness. The findings from the user study reveal that WaitGPT facilitates monitoring and steering of data analysis performed by LLMs, enabling participants to enhance error detection and increase their overall confidence in the results.
Title:
```
  LAM3D: Leveraging Attention for Monocular 3D Object Detection
```
Authors: Diana-Alexandra Sas, Leandro Di Bella, Yangxintong Lyu, Florin Oniga, Adrian Munteanu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Since the introduction of the self-attention mechanism and the adoption of the Transformer architecture for Computer Vision tasks, the Vision Transformer-based architectures gained a lot of popularity in the field, being used for tasks such as image classification, object detection and image segmentation. However, efficiently leveraging the attention mechanism in vision transformers for the Monocular 3D Object Detection task remains an open question. In this paper, we present LAM3D, a framework that Leverages self-Attention mechanism for Monocular 3D object Detection. To do so, the proposed method is built upon a Pyramid Vision Transformer v2 (PVTv2) as feature extraction backbone and 2D/3D detection machinery. We evaluate the proposed method on the KITTI 3D Object Detection Benchmark, proving the applicability of the proposed solution in the autonomous driving domain and outperforming reference methods. Moreover, due to the usage of self-attention, LAM3D is able to systematically outperform the equivalent architecture that does not employ self-attention.
Title:
```
  Domain penalisation for improved Out-of-Distribution Generalisation
```
Authors: Shuvam Jena, Sushmetha Sumathi Rajendran, Karthik Seemakurthy, Sasithradevi A, Vijayalakshmi M, Prakash Poornachari
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In the field of object detection, domain generalisation (DG) aims to ensure robust performance across diverse and unseen target domains by learning the robust domain-invariant features corresponding to the objects of interest across multiple source domains. While there are many approaches established for performing DG for the task of classification, there has been a very little focus on object detection. In this paper, we propose a domain penalisation (DP) framework for the task of object detection, where the data is assumed to be sampled from multiple source domains and tested on completely unseen test domains. We assign penalisation weights to each domain, with the values updated based on the detection networks performance on the respective source domains. By prioritising the domains that needs more attention, our approach effectively balances the training process. We evaluate our solution on the GWHD 2021 dataset, a component of the WiLDS benchmark and we compare against ERM and GroupDRO as these are primarily loss function based. Our extensive experimental results reveals that the proposed approach improves the accuracy by 0.3 percent and 0.5 percent on validation and test out-of-distribution (OOD) sets, respectively for FasterRCNN. We also compare the performance of our approach on FCOS detector and show that our approach improves the baseline OOD performance over the existing approaches by 1.3 percent and 1.4 percent on validation and test sets, respectively. This study underscores the potential of performance based domain penalisation in enhancing the generalisation ability of object detection models across diverse environments.
Title:
```
  Advancing Green AI: Efficient and Accurate Lightweight CNNs for Rice Leaf Disease Identification
```
Authors: Khairun Saddami, Yudha Nurdin, Mutia Zahramita, Muhammad Shahreeza Safiruz
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Rice plays a vital role as a primary food source for over half of the world's population, and its production is critical for global food security. Nevertheless, rice cultivation is frequently affected by various diseases that can severely decrease yield and quality. Therefore, early and accurate detection of rice diseases is necessary to prevent their spread and minimize crop losses. In this research, we explore three mobile-compatible CNN architectures, namely ShuffleNet, MobileNetV2, and EfficientNet-B0, for rice leaf disease classification. These models are selected due to their compatibility with mobile devices, as they demand less computational power and memory compared to other CNN models. To enhance the performance of the three models, we added two fully connected layers separated by a dropout layer. We used early stop creation to prevent the model from being overfiting. The results of the study showed that the best performance was achieved by the EfficientNet-B0 model with an accuracy of 99.8%. Meanwhile, MobileNetV2 and ShuffleNet only achieved accuracies of 84.21% and 66.51%, respectively. This study shows that EfficientNet-B0 when combined with the proposed layer and early stop, can produce a high-accuracy model. Keywords: rice leaf detection; green AI; smart agriculture; EfficientNet;
Title:
```
  Large Language Models for Equivalent Mutant Detection: How Far Are We?
```
Authors: Zhao Tian, Honglin Shu, Dong Wang, Xuejie Cao, Yasutaka Kamei, Junjie Chen
Subjects: Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Mutation testing is vital for ensuring software quality. However, the presence of equivalent mutants is known to introduce redundant cost and bias issues, hindering the effectiveness of mutation testing in practical use. Although numerous equivalent mutant detection (EMD) techniques have been proposed, they exhibit limitations due to the scarcity of training data and challenges in generalizing to unseen mutants. Recently, large language models (LLMs) have been extensively adopted in various code-related tasks and have shown superior performance by more accurately capturing program semantics. Yet the performance of LLMs in equivalent mutant detection remains largely unclear. In this paper, we conduct an empirical study on 3,302 method-level Java mutant pairs to comprehensively investigate the effectiveness and efficiency of LLMs for equivalent mutant detection. Specifically, we assess the performance of LLMs compared to existing EMD techniques, examine the various strategies of LLMs, evaluate the orthogonality between EMD techniques, and measure the time overhead of training and inference. Our findings demonstrate that LLM-based techniques significantly outperform existing techniques (i.e., the average improvement of 35.69% in terms of F1-score), with the fine-tuned code embedding strategy being the most effective. Moreover, LLM-based techniques offer an excellent balance between cost (relatively low training and inference time) and effectiveness. Based on our findings, we further discuss the impact of model size and embedding quality, and provide several promising directions for future research. This work is the first to examine LLMs in equivalent mutant detection, affirming their effectiveness and efficiency.
Title:
```
  Optimizing Intrusion Detection System Performance Through Synergistic Hyperparameter Tuning and Advanced Data Processing
```
Authors: Samia Saidane, Francesco Telch, Kussai Shahin, Fabrizio Granelli
Subjects: Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Intrusion detection is vital for securing computer networks against malicious activities. Traditional methods struggle to detect complex patterns and anomalies in network traffic effectively. To address this issue, we propose a system combining deep learning, data balancing (K-means + SMOTE), high-dimensional reduction (PCA and FCBF), and hyperparameter optimization (Extra Trees and BO-TPE) to enhance intrusion detection performance. By training on extensive datasets like CIC IDS 2018 and CIC IDS 2017, our models demonstrate robust performance and generalization. Notably, the ensemble model "VGG19" consistently achieves remarkable accuracy (99.26% on CIC-IDS2017 and 99.22% on CSE-CIC-IDS2018), outperforming other models.
Title:
```
  Tracking Emotional Dynamics in Chat Conversations: A Hybrid Approach using DistilBERT and Emoji Sentiment Analysis
```
Authors: Ayan Igali, Abdulkhak Abdrakhman, Yerdaut Torekhan, Pakizar Shamoi
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Computer-mediated communication has become more important than face-to-face communication in many contexts. Tracking emotional dynamics in chat conversations can enhance communication, improve services, and support well-being in various contexts. This paper explores a hybrid approach to tracking emotional dynamics in chat conversations by combining DistilBERT-based text emotion detection and emoji sentiment analysis. A Twitter dataset was analyzed using various machine learning algorithms, including SVM, Random Forest, and AdaBoost. We contrasted their performance with DistilBERT. Results reveal DistilBERT's superior performance in emotion recognition. Our approach accounts for emotive expressions conveyed through emojis to better understand participants' emotions during chats. We demonstrate how this approach can effectively capture and analyze emotional shifts in real-time conversations. Our findings show that integrating text and emoji analysis is an effective way of tracking chat emotion, with possible applications in customer service, work chats, and social media interactions.
Title:
```
  BEVPlace++: Fast, Robust, and Lightweight LiDAR Global Localization for Unmanned Ground Vehicles
```
Authors: Lun Luo, Siyuan Cao, Xiaorui Li, Jintao Xu, Rui Ai, Zhu Yu, Xieyuanli Chen
Subjects: Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract This article introduces BEVPlace++, a novel, fast, and robust LiDAR global localization method for unmanned ground vehicles. It uses lightweight convolutional neural networks (CNNs) on Bird's Eye View (BEV) image-like representations of LiDAR data to achieve accurate global localization through place recognition followed by 3-DoF pose estimation. Our detailed analyses reveal an interesting fact that CNNs are inherently effective at extracting distinctive features from LiDAR BEV images. Remarkably, keypoints of two BEV images with large translations can be effectively matched using CNN-extracted features. Building on this insight, we design a rotation equivariant module (REM) to obtain distinctive features while enhancing robustness to rotational changes. A Rotation Equivariant and Invariant Network (REIN) is then developed by cascading REM and a descriptor generator, NetVLAD, to sequentially generate rotation equivariant local features and rotation invariant global descriptors. The global descriptors are used first to achieve robust place recognition, and the local features are used for accurate pose estimation. Experimental results on multiple public datasets demonstrate that BEVPlace++, even when trained on a small dataset (3000 frames of KITTI) only with place labels, generalizes well to unseen environments, performs consistently across different days and years, and adapts to various types of LiDAR scanners. BEVPlace++ achieves state-of-the-art performance in subtasks of global localization including place recognition, loop closure detection, and global localization. Additionally, BEVPlace++ is lightweight, runs in real-time, and does not require accurate pose supervision, making it highly convenient for deployment. The source codes are publicly available at \href{this https URL}{this https URL}.
Title:
```
  Supervised Image Translation from Visible to Infrared Domain for Object Detection
```
Authors: Prahlad Anand, Qiranul Saadiyean, Aniruddh Sikdar, Nalini N, Suresh Sundaram
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract This study aims to learn a translation from visible to infrared imagery, bridging the domain gap between the two modalities so as to improve accuracy on downstream tasks including object detection. Previous approaches attempt to perform bi-domain feature fusion through iterative optimization or end-to-end deep convolutional networks. However, we pose the problem as similar to that of image translation, adopting a two-stage training strategy with a Generative Adversarial Network and an object detection model. The translation model learns a conversion that preserves the structural detail of visible images while preserving the texture and other characteristics of infrared images. Images so generated are used to train standard object detection frameworks including Yolov5, Mask and Faster RCNN. We also investigate the usefulness of integrating a super-resolution step into our pipeline to further improve model accuracy, and achieve an improvement of as high as 5.3% mAP.
Title:
```
  CAF-YOLO: A Robust Framework for Multi-Scale Lesion Detection in Biomedical Imagery
```
Authors: Zilin Chen, Shengnan Lu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Object detection is of paramount importance in biomedical image analysis, particularly for lesion identification. While current methodologies are proficient in identifying and pinpointing lesions, they often lack the precision needed to detect minute biomedical entities (e.g., abnormal cells, lung nodules smaller than 3 mm), which are critical in blood and lung pathology. To address this challenge, we propose CAF-YOLO, based on the YOLOv8 architecture, a nimble yet robust method for medical object detection that leverages the strengths of convolutional neural networks (CNNs) and transformers. To overcome the limitation of convolutional kernels, which have a constrained capacity to interact with distant information, we introduce an attention and convolution fusion module (ACFM). This module enhances the modeling of both global and local features, enabling the capture of long-term feature dependencies and spatial autocorrelation. Additionally, to improve the restricted single-scale feature aggregation inherent in feed-forward networks (FFN) within transformer architectures, we design a multi-scale neural network (MSNN). This network improves multi-scale information aggregation by extracting features across diverse scales. Experimental evaluations on widely used datasets, such as BCCD and LUNA16, validate the rationale and efficacy of CAF-YOLO. This methodology excels in detecting and precisely locating diverse and intricate micro-lesions within biomedical imagery. Our codes are available at this https URL.
Title:
```
  Scaling Symbolic Execution to Large Software Systems
```
Authors: Gabor Horvath, Reka Kovacs, Zoltan Porkolab
Subjects: Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Static analysis is the analysis of a program without executing it, usually carried out by an automated tool. Symbolic execution is a popular static analysis technique used both in program verification and in bug detection software. It works by interpreting the code, introducing a symbol for each value unknown at compile time (e.g. user-given inputs), and carrying out calculations symbolically. The analysis engine strives to explore multiple execution paths simultaneously, although checking all paths is an intractable problem, due to the vast number of possibilities. We focus on an error finding framework called the Clang Static Analyzer, and the infrastructure built around it named CodeChecker. The emphasis is on achieving end-to-end scalability. This includes the run time and memory consumption of the analysis, bug presentation to the users, automatic false positive suppression, incremental analysis, pattern discovery in the results, and usage in continuous integration loops. We also outline future directions and open problems concerning these tools. While a rich literature exists on program verification software, error finding tools normally need to settle for survey papers on individual techniques. In this paper, we not only discuss individual methods, but also how these decisions interact and reinforce each other, creating a system that is greater than the sum of its parts. Although the Clang Static Analyzer can only handle C-family languages, the techniques introduced in this paper are mostly language-independent and applicable to other similar static analysis tools.
Title:
```
  A Survey and Evaluation of Adversarial Attacks for Object Detection
```
Authors: Khoi Nguyen Tiet Nguyen, Wenyu Zhang, Kangkang Lu, Yuhuan Wu, Xingjian Zheng, Hui Li Tan, Liangli Zhen
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Deep learning models excel in various computer vision tasks but are susceptible to adversarial examples-subtle perturbations in input data that lead to incorrect predictions. This vulnerability poses significant risks in safety-critical applications such as autonomous vehicles, security surveillance, and aircraft health monitoring. While numerous surveys focus on adversarial attacks in image classification, the literature on such attacks in object detection is limited. This paper offers a comprehensive taxonomy of adversarial attacks specific to object detection, reviews existing adversarial robustness evaluation metrics, and systematically assesses open-source attack methods and model robustness. Key observations are provided to enhance the understanding of attack effectiveness and corresponding countermeasures. Additionally, we identify crucial research challenges to guide future efforts in securing automated object detection systems.
Title:
```
  AnomalySD: Few-Shot Multi-Class Anomaly Detection with Stable Diffusion Model
```
Authors: Zhenyu Yan, Qingqing Fang, Wenxi Lv, Qinliang Su
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Anomaly detection is a critical task in industrial manufacturing, aiming to identify defective parts of products. Most industrial anomaly detection methods assume the availability of sufficient normal data for training. This assumption may not hold true due to the cost of labeling or data privacy policies. Additionally, mainstream methods require training bespoke models for different objects, which incurs heavy costs and lacks flexibility in practice. To address these issues, we seek help from Stable Diffusion (SD) model due to its capability of zero/few-shot inpainting, which can be leveraged to inpaint anomalous regions as normal. In this paper, a few-shot multi-class anomaly detection framework that adopts Stable Diffusion model is proposed, named AnomalySD. To adapt SD to anomaly detection task, we design different hierarchical text descriptions and the foreground mask mechanism for fine-tuning SD. In the inference stage, to accurately mask anomalous regions for inpainting, we propose multi-scale mask strategy and prototype-guided mask strategy to handle diverse anomalous regions. Hierarchical text prompts are also utilized to guide the process of inpainting in the inference stage. The anomaly score is estimated based on inpainting result of all masks. Extensive experiments on the MVTec-AD and VisA datasets demonstrate the superiority of our approach. We achieved anomaly classification and segmentation results of 93.6%/94.8% AUROC on the MVTec-AD dataset and 86.1%/96.5% AUROC on the VisA dataset under multi-class and one-shot settings.
Title:
```
  SR-CIS: Self-Reflective Incremental System with Decoupled Memory and Reasoning
```
Authors: Biqing Qi, Junqi Gao, Xinquan Chen, Dong Li, Weinan Zhang, Bowen Zhou
Subjects: Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The ability of humans to rapidly learn new knowledge while retaining old memories poses a significant challenge for current deep learning models. To handle this challenge, we draw inspiration from human memory and learning mechanisms and propose the Self-Reflective Complementary Incremental System (SR-CIS). Comprising the deconstructed Complementary Inference Module (CIM) and Complementary Memory Module (CMM), SR-CIS features a small model for fast inference and a large model for slow deliberation in CIM, enabled by the Confidence-Aware Online Anomaly Detection (CA-OAD) mechanism for efficient collaboration. CMM consists of task-specific Short-Term Memory (STM) region and a universal Long-Term Memory (LTM) region. By setting task-specific Low-Rank Adaptive (LoRA) and corresponding prototype weights and biases, it instantiates external storage for parameter and representation memory, thus deconstructing the memory module from the inference module. By storing textual descriptions of images during training and combining them with the Scenario Replay Module (SRM) post-training for memory combination, along with periodic short-to-long-term memory restructuring, SR-CIS achieves stable incremental memory with limited storage requirements. Balancing model plasticity and memory stability under constraints of limited storage and low data resources, SR-CIS surpasses existing competitive baselines on multiple standard and few-shot incremental learning benchmarks.
Title:
```
  Single-Point Supervised High-Resolution Dynamic Network for Infrared Small Target Detection
```
Authors: Jing Wu, Rixiang Ni, Feng Huang, Zhaobing Qiu, Liqiong Chen, Changhai Luo, Yunxiang Li, Youli Li
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Infrared small target detection (IRSTD) tasks are extremely challenging for two main reasons: 1) it is difficult to obtain accurate labelling information that is critical to existing methods, and 2) infrared (IR) small target information is easily lost in deep networks. To address these issues, we propose a single-point supervised high-resolution dynamic network (SSHD-Net). In contrast to existing methods, we achieve state-of-the-art (SOTA) detection performance using only single-point supervision. Specifically, we first design a high-resolution cross-feature extraction module (HCEM), that achieves bi-directional feature interaction through stepped feature cascade channels (SFCC). It balances network depth and feature resolution to maintain deep IR small-target information. Secondly, the effective integration of global and local features is achieved through the dynamic coordinate fusion module (DCFM), which enhances the anti-interference ability in complex backgrounds. In addition, we introduce the high-resolution multilevel residual module (HMRM) to enhance the semantic information extraction capability. Finally, we design the adaptive target localization detection head (ATLDH) to improve detection accuracy. Experiments on the publicly available datasets NUDT-SIRST and IRSTD-1k demonstrate the effectiveness of our method. Compared to other SOTA methods, our method can achieve better detection performance with only a single point of supervision.
Title:
```
  AdvQDet: Detecting Query-Based Adversarial Attacks with Adversarial Contrastive Prompt Tuning
```
Authors: Xin Wang, Kai Chen, Xingjun Ma, Zhineng Chen, Jingjing Chen, Yu-Gang Jiang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks even under a black-box setting where the adversary can only query the model. Particularly, query-based black-box adversarial attacks estimate adversarial gradients based on the returned probability vectors of the target model for a sequence of queries. During this process, the queries made to the target model are intermediate adversarial examples crafted at the previous attack step, which share high similarities in the pixel space. Motivated by this observation, stateful detection methods have been proposed to detect and reject query-based attacks. While demonstrating promising results, these methods either have been evaded by more advanced attacks or suffer from low efficiency in terms of the number of shots (queries) required to detect different attacks. Arguably, the key challenge here is to assign high similarity scores for any two intermediate adversarial examples perturbed from the same clean image. To address this challenge, we propose a novel Adversarial Contrastive Prompt Tuning (ACPT) method to robustly fine-tune the CLIP image encoder to extract similar embeddings for any two intermediate adversarial queries. With ACPT, we further introduce a detection framework AdvQDet that can detect 7 state-of-the-art query-based attacks with $>99\%$ detection rate within 5 shots. We also show that ACPT is robust to 3 types of adaptive attacks. Code is available at this https URL.
Title:
```
  MetaWearS: A Shortcut in Wearable Systems Lifecycle with Only a Few Shots
```
Authors: Alireza Amirshahi, Maedeh H.Toosi, Siamak Mohammadi, Stefano Albini, Pasquale Davide Schiavone, Giovanni Ansaloni, Amir Aminifar, David Atienza
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Wearable systems provide continuous health monitoring and can lead to early detection of potential health issues. However, the lifecycle of wearable systems faces several challenges. First, effective model training for new wearable devices requires substantial labeled data from various subjects collected directly by the wearable. Second, subsequent model updates require further extensive labeled data for retraining. Finally, frequent model updating on the wearable device can decrease the battery life in long-term data monitoring. Addressing these challenges, in this paper, we propose MetaWearS, a meta-learning method to reduce the amount of initial data collection required. Moreover, our approach incorporates a prototypical updating mechanism, simplifying the update process by modifying the class prototype rather than retraining the entire model. We explore the performance of MetaWearS in two case studies, namely, the detection of epileptic seizures and the detection of atrial fibrillation. We show that by fine-tuning with just a few samples, we achieve 70% and 82% AUC for the detection of epileptic seizures and the detection of atrial fibrillation, respectively. Compared to a conventional approach, our proposed method performs better with up to 45% AUC. Furthermore, updating the model with only 16 minutes of additional labeled data increases the AUC by up to 5.3%. Finally, MetaWearS reduces the energy consumption for model updates by 456x and 418x for epileptic seizure and AF detection, respectively.
Title:
```
  Towards Automatic Hands-on-Keyboard Attack Detection Using LLMs in EDR Solutions
```
Authors: Amit Portnoy, Ehud Azikri, Shay Kels
Subjects: Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Endpoint Detection and Remediation (EDR) platforms are essential for identifying and responding to cyber threats. This study presents a novel approach using Large Language Models (LLMs) to detect Hands-on-Keyboard (HOK) cyberattacks. Our method involves converting endpoint activity data into narrative forms that LLMs can analyze to distinguish between normal operations and potential HOK attacks. We address the challenges of interpreting endpoint data by segmenting narratives into windows and employing a dual training strategy. The results demonstrate that LLM-based models have the potential to outperform traditional machine learning methods, offering a promising direction for enhancing EDR capabilities and apply LLMs in cybersecurity.
Title:
```
  Isolating Signatures of Cyberattacks under Stressed Grid Conditions
```
Authors: Sanchita Ghosh, Syed Ahsan Raza Naqvi, Sai Pushpak Nandanoori, Soumya Kundu
Subjects: Subjects: Systems and Control (eess.SY); Dynamical Systems (math.DS); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In a controlled cyber-physical network, such as a power grid, any malicious data injection in the sensor measurements can lead to widespread impact due to the actions of the closed-loop controllers. While fast identification of the attack signatures is imperative for reliable operations, it is challenging to do so in a large dynamical network with tightly coupled nodes. A particularly challenging scenario arises when the cyberattacks are strategically launched during a grid stress condition, caused by non-malicious physical disturbances. In this work, we propose an algorithmic framework -- based on Koopman mode (KM) decomposition -- for online identification and visualization of the cyberattack signatures in streaming time-series measurements from a power network. The KMs are capable of capturing the spatial embedding of both natural and anomalous modes of oscillations in the sensor measurements and thus revealing the specific influences of cyberattacks, even under existing non-malicious grid stress events. Most importantly, it enables us to quantitatively compare the outcomes of different potential cyberattacks injected by an attacker. The performance of the proposed algorithmic framework is illustrated on the IEEE 68-bus test system using synthetic attack scenarios. Such knowledge regarding the detection of various cyberattacks will enable us to devise appropriate diagnostic scheme while considering varied constraints arising from different attacks.
Title:
```
  Individualized multi-horizon MRI trajectory prediction for Alzheimer's Disease
```
Authors: Rosemary He, Gabriella Ang, Daniel Tward
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Neurodegeneration as measured through magnetic resonance imaging (MRI) is recognized as a potential biomarker for diagnosing Alzheimer's disease (AD), but is generally considered less specific than amyloid or tau based biomarkers. Due to a large amount of variability in brain anatomy between different individuals, we hypothesize that leveraging MRI time series can help improve specificity, by treating each patient as their own baseline. Here we turn to conditional variational autoencoders to generate individualized MRI predictions given the subject's age, disease status and one previous scan. Using serial imaging data from the Alzheimer's Disease Neuroimaging Initiative, we train a novel architecture to build a latent space distribution which can be sampled from to generate future predictions of changing anatomy. This enables us to extrapolate beyond the dataset and predict MRIs up to 10 years. We evaluated the model on a held-out set from ADNI and an independent dataset (from Open Access Series of Imaging Studies). By comparing to several alternatives, we show that our model produces more individualized images with higher resolution. Further, if an individual already has a follow-up MRI, we demonstrate a usage of our model to compute a likelihood ratio classifier for disease status. In practice, the model may be able to assist in early diagnosis of AD and provide a counterfactual baseline trajectory for treatment effect estimation. Furthermore, it generates a synthetic dataset that can potentially be used for downstream tasks such as anomaly detection and classification.
Title:
```
  Enhancing Human Action Recognition and Violence Detection Through Deep Learning Audiovisual Fusion
```
Authors: Pooya Janani (1), Amirabolfazl Suratgar (1), Afshin Taghvaeipour (2) ((1) Distributed and Intelligent Optimization Research Laboratory, Dept. of Electrical Engineering, Amirkabir University of Technology, Tehran, Iran, (2) Dept. of Mechanical Engineering, Amirkabir University of Technology, Tehran, Iran)
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract This paper proposes a hybrid fusion-based deep learning approach based on two different modalities, audio and video, to improve human activity recognition and violence detection in public places. To take advantage of audiovisual fusion, late fusion, intermediate fusion, and hybrid fusion-based deep learning (HFBDL) are used and compared. Since the objective is to detect and recognize human violence in public places, Real-life violence situation (RLVS) dataset is expanded and used. Simulating results of HFBDL show 96.67\% accuracy on validation data, which is more accurate than the other state-of-the-art methods on this dataset. To showcase our model's ability in real-world scenarios, another dataset of 54 sounded videos of both violent and non-violent situations was recorded. The model could successfully detect 52 out of 54 videos correctly. The proposed method shows a promising performance on real scenarios. Thus, it can be used for human action recognition and violence detection in public places for security purposes.
Title:
```
  Robustness of Watermarking on Text-to-Image Diffusion Models
```
Authors: Xiaodong Wu, Xiangman Li, Jianbing Ni
Subjects: Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Watermarking has become one of promising techniques to not only aid in identifying AI-generated images but also serve as a deterrent against the unethical use of these models. However, the robustness of watermarking techniques has not been extensively studied recently. In this paper, we investigate the robustness of generative watermarking, which is created from the integration of watermarking embedding and text-to-image generation processing in generative models, e.g., latent diffusion models. Specifically, we propose three attacking methods, i.e., discriminator-based attacks, edge prediction-based attacks, and fine-tune-based attacks, under the scenario where the watermark decoder is not accessible. The model is allowed to be fine-tuned to created AI agents with specific generative tasks for personalizing or specializing. We found that generative watermarking methods are robust to direct evasion attacks, like discriminator-based attacks, or manipulation based on the edge information in edge prediction-based attacks but vulnerable to malicious fine-tuning. Experimental results show that our fine-tune-based attacks can decrease the accuracy of the watermark detection to nearly $67.92\%$. In addition, We conduct an ablation study on the length of fine-tuned messages, encoder/decoder's depth and structure to identify key factors that impact the performance of fine-tune-based attacks.
Title:
```
  EOL: Transductive Few-Shot Open-Set Recognition by Enhancing Outlier Logits
```
Authors: Mateusz Ochal, Massimiliano Patacchiola, Malik Boudiaf, Sen Wang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In Few-Shot Learning (FSL), models are trained to recognise unseen objects from a query set, given a few labelled examples from a support set. In standard FSL, models are evaluated on query instances sampled from the same class distribution of the support set. In this work, we explore the more nuanced and practical challenge of Open-Set Few-Shot Recognition (OSFSL). Unlike standard FSL, OSFSL incorporates unknown classes into the query set, thereby requiring the model not only to classify known classes but also to identify outliers. Building on the groundwork laid by previous studies, we define a novel transductive inference technique that leverages the InfoMax principle to exploit the unlabelled query set. We called our approach the Enhanced Outlier Logit (EOL) method. EOL refines class prototype representations through model calibration, effectively balancing the inlier-outlier ratio. This calibration enhances pseudo-label accuracy for the query set and improves the optimisation objective within the transductive inference process. We provide a comprehensive empirical evaluation demonstrating that EOL consistently surpasses traditional methods, recording performance improvements ranging from approximately $+1.3%$ to $+6.3%$ across a variety of classification and outlier detection metrics and benchmarks, even in the presence of inlier-outlier imbalance.
Title:
```
  PromptSAM+: Malware Detection based on Prompt Segment Anything Model
```
Authors: Xingyuan Wei, Yichen Liu, Ce Li, Ning Li, Degang Sun, Yan Wang
Subjects: Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Machine learning and deep learning (ML/DL) have been extensively applied in malware detection, and some existing methods demonstrate robust performance. However, several issues persist in the field of malware detection: (1) Existing work often overemphasizes accuracy at the expense of practicality, rarely considering false positive and false negative rates as important metrics. (2) Considering the evolution of malware, the performance of classifiers significantly declines over time, greatly reducing the practicality of malware detectors. (3) Prior ML/DL-based efforts heavily rely on ample labeled data for model training, largely dependent on feature engineering or domain knowledge to build feature databases, making them vulnerable if correct labels are scarce. With the development of computer vision, vision-based malware detection technology has also rapidly evolved. In this paper, we propose a visual malware general enhancement classification framework, `PromptSAM+', based on a large visual network segmentation model, the Prompt Segment Anything Model(named PromptSAM+). Our experimental results indicate that 'PromptSAM+' is effective and efficient in malware detection and classification, achieving high accuracy and low rates of false positives and negatives. The proposed method outperforms the most advanced image-based malware detection technologies on several datasets. 'PromptSAM+' can mitigate aging in existing image-based malware classifiers, reducing the considerable manpower needed for labeling new malware samples through active learning. We conducted experiments on datasets for both Windows and Android platforms, achieving favorable outcomes. Additionally, our ablation experiments on several datasets demonstrate that our model identifies effective modules within the large visual network.
Title:
```
  KAN-RCBEVDepth: A multi-modal fusion algorithm in object detection for autonomous driving
```
Authors: Zhihao Lai, Chuanhao Liu, Shihui Sheng, Zhiqiang Zhang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Accurate 3D object detection in autonomous driving is critical yet challenging due to occlusions, varying object scales, and complex urban environments. This paper introduces the RCBEV-KAN algorithm, a pioneering method designed to enhance 3D object detection by fusing multimodal sensor data from cameras, LiDAR, and millimeter-wave radar. Our innovative Bird's Eye View (BEV)-based approach, utilizing a Transformer architecture, significantly boosts detection precision and efficiency by seamlessly integrating diverse data sources, improving spatial relationship handling, and optimizing computational processes. Experimental results show that the RCBEV-KAN model demonstrates superior performance across most detection categories, achieving higher Mean Distance AP (0.389 vs. 0.316, a 23% improvement), better ND Score (0.484 vs. 0.415, a 17% improvement), and faster Evaluation Time (71.28s, 8% faster). These results indicate that RCBEV-KAN is more accurate, reliable, and efficient, making it ideal for dynamic and challenging autonomous driving environments.
Title:
```
  AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos
```
Authors: Feichi Lu, Zijian Dong, Jie Song, Otmar Hilliges
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Despite progress in human motion capture, existing multi-view methods often face challenges in estimating the 3D pose and shape of multiple closely interacting people. This difficulty arises from reliance on accurate 2D joint estimations, which are hard to obtain due to occlusions and body contact when people are in close interaction. To address this, we propose a novel method leveraging the personalized implicit neural avatar of each individual as a prior, which significantly improves the robustness and precision of this challenging pose estimation task. Concretely, the avatars are efficiently reconstructed via layered volume rendering from sparse multi-view videos. The reconstructed avatar prior allows for the direct optimization of 3D poses based on color and silhouette rendering loss, bypassing the issues associated with noisy 2D detections. To handle interpenetration, we propose a collision loss on the overlapping shape regions of avatars to add penetration constraints. Moreover, both 3D poses and avatars are optimized in an alternating manner. Our experimental results demonstrate state-of-the-art performance on several public datasets.
Title:
```
  Rethinking Affect Analysis: A Protocol for Ensuring Fairness and Consistency
```
Authors: Guanyu Hu, Dimitrios Kollias, Eleni Papadopoulou, Paraskevi Tzouveli, Jie Wei, Xinyu Yang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Evaluating affect analysis methods presents challenges due to inconsistencies in database partitioning and evaluation protocols, leading to unfair and biased results. Previous studies claim continuous performance improvements, but our findings challenge such assertions. Using these insights, we propose a unified protocol for database partitioning that ensures fairness and comparability. We provide detailed demographic annotations (in terms of race, gender and age), evaluation metrics, and a common framework for expression recognition, action unit detection and valence-arousal estimation. We also rerun the methods with the new protocol and introduce a new leaderboards to encourage future research in affect recognition with a fairer comparison. Our annotations, code, and pre-trained models are available on \hyperlink{this https URL}{Github}.
Title:
```
  AssemAI: Interpretable Image-Based Anomaly Detection for Manufacturing Pipelines
```
Authors: Renjith Prasad, Chathurangi Shyalika, Ramtin Zand, Fadi El Kalach, Revathy Venkataramanan, Ramy Harik, Amit Sheth
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Anomaly detection in manufacturing pipelines remains a critical challenge, intensified by the complexity and variability of industrial environments. This paper introduces AssemAI, an interpretable image-based anomaly detection system tailored for smart manufacturing pipelines. Our primary contributions include the creation of a tailored image dataset and the development of a custom object detection model, YOLO-FF, designed explicitly for anomaly detection in manufacturing assembly environments. Utilizing the preprocessed image dataset derived from an industry-focused rocket assembly pipeline, we address the challenge of imbalanced image data and demonstrate the importance of image-based methods in anomaly detection. The proposed approach leverages domain knowledge in data preparation, model development and reasoning. We compare our method against several baselines, including simple CNN and custom Visual Transformer (ViT) models, showcasing the effectiveness of our custom data preparation and pretrained CNN integration. Additionally, we incorporate explainability techniques at both user and model levels, utilizing ontology for user-friendly explanations and SCORE-CAM for in-depth feature and model analysis. Finally, the model was also deployed in a real-time setting. Our results include ablation studies on the baselines, providing a comprehensive evaluation of the proposed system. This work highlights the broader impact of advanced image-based anomaly detection in enhancing the reliability and efficiency of smart manufacturing processes.
Title:
```
  Dense Feature Interaction Network for Image Inpainting Localization
```
Authors: Ye Yao, Tingfeng Han, Shan Jia, Siwei Lyu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Image inpainting, which is the task of filling in missing areas in an image, is a common image editing technique. Inpainting can be used to conceal or alter image contents in malicious manipulation of images, driving the need for research in image inpainting detection. Existing methods mostly rely on a basic encoder-decoder structure, which often results in a high number of false positives or misses the inpainted regions, especially when dealing with targets of varying semantics and scales. Additionally, the absence of an effective approach to capture boundary artifacts leads to less accurate edge localization. In this paper, we describe a new method for inpainting detection based on a Dense Feature Interaction Network (DeFI-Net). DeFI-Net uses a novel feature pyramid architecture to capture and amplify multi-scale representations across various stages, thereby improving the detection of image inpainting by better revealing feature-level interactions. Additionally, the network can adaptively direct the lower-level features, which carry edge and shape information, to refine the localization of manipulated regions while integrating the higher-level semantic features. Using DeFI-Net, we develop a method combining complementary representations to accurately identify inpainted areas. Evaluation on five image inpainting datasets demonstrate the effectiveness of our approach, which achieves state-of-the-art performance in detecting inpainting across diverse models.
Title:
```
  Evaluating Vision-Language Models for Zero-Shot Detection, Classification, and Association of Motorcycles, Passengers, and Helmets
```
Authors: Lucas Choi, Ross Greer
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Motorcycle accidents pose significant risks, particularly when riders and passengers do not wear helmets. This study evaluates the efficacy of an advanced vision-language foundation model, OWLv2, in detecting and classifying various helmet-wearing statuses of motorcycle occupants using video data. We extend the dataset provided by the CVPR AI City Challenge and employ a cascaded model approach for detection and classification tasks, integrating OWLv2 and CNN models. The results highlight the potential of zero-shot learning to address challenges arising from incomplete and biased training datasets, demonstrating the usage of such models in detecting motorcycles, helmet usage, and occupant positions under varied conditions. We have achieved an average precision of 0.5324 for helmet detection and provided precision-recall curves detailing the detection and classification performance. Despite limitations such as low-resolution data and poor visibility, our research shows promising advancements in automated vehicle safety and traffic safety enforcement systems.
Title:
```
  Advancing Post-OCR Correction: A Comparative Study of Synthetic Data
```
Authors: Shuhao Guan, Derek Greene
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract This paper explores the application of synthetic data in the post-OCR domain on multiple fronts by conducting experiments to assess the impact of data volume, augmentation, and synthetic data generation methods on model performance. Furthermore, we introduce a novel algorithm that leverages computer vision feature detection algorithms to calculate glyph similarity for constructing post-OCR synthetic data. Through experiments conducted across a variety of languages, including several low-resource ones, we demonstrate that models like ByT5 can significantly reduce Character Error Rates (CER) without the need for manually annotated data, and our proposed synthetic data generation method shows advantages over traditional methods, particularly in low-resource languages.
Title:
```
  One-Shot Collaborative Data Distillation
```
Authors: Rayne Holland, Chandra Thapa, Sarah Ali Siddiqui, Wei Shao, Seyit Camtepe
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Large machine-learning training datasets can be distilled into small collections of informative synthetic data samples. These synthetic sets support efficient model learning and reduce the communication cost of data sharing. Thus, high-fidelity distilled data can support the efficient deployment of machine learning applications in distributed network environments. A naive way to construct a synthetic set in a distributed environment is to allow each client to perform local data distillation and to merge local distillations at a central server. However, the quality of the resulting set is impaired by heterogeneity in the distributions of the local data held by clients. To overcome this challenge, we introduce the first collaborative data distillation technique, called CollabDM, which captures the global distribution of the data and requires only a single round of communication between client and server. Our method outperforms the state-of-the-art one-shot learning method on skewed data in distributed learning environments. We also show the promising practical benefits of our method when applied to attack detection in 5G networks.
Title:
```
  Mixture-of-Noises Enhanced Forgery-Aware Predictor for Multi-Face Manipulation Detection and Localization
```
Authors: Changtao Miao, Qi Chu, Tao Gong, Zhentao Tan, Zhenchao Jin, Wanyi Zhuang, Man Luo, Honggang Hu, Nenghai Yu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract With the advancement of face manipulation technology, forgery images in multi-face scenarios are gradually becoming a more complex and realistic challenge. Despite this, detection and localization methods for such multi-face manipulations remain underdeveloped. Traditional manipulation localization methods either indirectly derive detection results from localization masks, resulting in limited detection performance, or employ a naive two-branch structure to simultaneously obtain detection and localization results, which cannot effectively benefit the localization capability due to limited interaction between two tasks. This paper proposes a new framework, namely MoNFAP, specifically tailored for multi-face manipulation detection and localization. The MoNFAP primarily introduces two novel modules: the Forgery-aware Unified Predictor (FUP) Module and the Mixture-of-Noises Module (MNM). The FUP integrates detection and localization tasks using a token learning strategy and multiple forgery-aware transformers, which facilitates the use of classification information to enhance localization capability. Besides, motivated by the crucial role of noise information in forgery detection, the MNM leverages multiple noise extractors based on the concept of the mixture of experts to enhance the general RGB features, further boosting the performance of our framework. Finally, we establish a comprehensive benchmark for multi-face detection and localization and the proposed \textit{MoNFAP} achieves significant performance. The codes will be made available.
Title:
```
  On the Robustness of Malware Detectors to Adversarial Samples
```
Authors: Muhammad Salman, Benjamin Zi Hao Zhao, Hassan Jameel Asghar, Muhammad Ikram, Sidharth Kaushik, Mohamed Ali Kaafar
Subjects: Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Adversarial examples add imperceptible alterations to inputs with the objective to induce misclassification in machine learning models. They have been demonstrated to pose significant challenges in domains like image classification, with results showing that an adversarially perturbed image to evade detection against one classifier is most likely transferable to other classifiers. Adversarial examples have also been studied in malware analysis. Unlike images, program binaries cannot be arbitrarily perturbed without rendering them non-functional. Due to the difficulty of crafting adversarial program binaries, there is no consensus on the transferability of adversarially perturbed programs to different detectors. In this work, we explore the robustness of malware detectors against adversarially perturbed malware. We investigate the transferability of adversarial attacks developed against one detector, against other machine learning-based malware detectors, and code similarity techniques, specifically, locality sensitive hashing-based detectors. Our analysis reveals that adversarial program binaries crafted for one detector are generally less effective against others. We also evaluate an ensemble of detectors and show that they can potentially mitigate the impact of adversarial program binaries. Finally, we demonstrate that substantial program changes made to evade detection may result in the transformation technique being identified, implying that the adversary must make minimal changes to the program binary.
Title:
```
  Optimization of Iterative Blind Detection based on Expectation Maximization and Belief Propagation
```
Authors: Luca Schmid, Tomer Raviv, Nir Shlezinger, Laurent Schmalen
Subjects: Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract We study iterative blind symbol detection for block-fading linear inter-symbol interference channels. Based on the factor graph framework, we design a joint channel estimation and detection scheme that combines the expectation maximization (EM) algorithm and the ubiquitous belief propagation (BP) algorithm. Interweaving the iterations of both schemes significantly reduces the EM algorithm's computational burden while retaining its excellent performance. To this end, we apply simple yet effective model-based learning methods to find a suitable parameter update schedule by introducing momentum in both the EM parameter updates as well as in the BP message passing. Numerical simulations verify that the proposed method can learn efficient schedules that generalize well and even outperform coherent BP detection in high signal-to-noise scenarios.
Title:
```
  A Lean Transformer Model for Dynamic Malware Analysis and Detection
```
Authors: Tony Quertier, Benjamin Marais, Grégoire Barrué, Stéphane Morucci, Sévan Azé, Sébastien Salladin
Subjects: Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Malware is a fast-growing threat to the modern computing world and existing lines of defense are not efficient enough to address this issue. This is mainly due to the fact that many prevention solutions rely on signature-based detection methods that can easily be circumvented by hackers. Therefore, there is a recurrent need for behavior-based analysis where a suspicious file is ran in a secured environment and its traces are collected to reports for analysis. Previous works have shown some success leveraging Neural Networks and API calls sequences extracted from these execution reports. Recently, Large Language Models and Generative AI have demonstrated impressive capabilities mainly in Natural Language Processing tasks and promising applications in the cybersecurity field for both attackers and defenders. In this paper, we design an Encoder-Only model, based on the Transformers architecture, to detect malicious files, digesting their API call sequences collected by an execution emulation solution. We are also limiting the size of the model architecture and the number of its parameters since it is often considered that Large Language Models may be overkill for specific tasks such as the one we are dealing with hereafter. In addition to achieving decent detection results, this approach has the advantage of reducing our carbon footprint by limiting training and inference times and facilitating technical operations with less hardware requirements. We also carry out some analysis of our results and highlight the limits and possible improvements when using Transformers to analyze malicious files.
Title:
```
  From Generalist to Specialist: Exploring CWE-Specific Vulnerability Detection
```
Authors: Syafiq Al Atiiq, Christian Gehrmann, Kevin Dahlén, Karim Khalil
Subjects: Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Vulnerability Detection (VD) using machine learning faces a significant challenge: the vast diversity of vulnerability types. Each Common Weakness Enumeration (CWE) represents a unique category of vulnerabilities with distinct characteristics, code semantics, and patterns. Treating all vulnerabilities as a single label with a binary classification approach may oversimplify the problem, as it fails to capture the nuances and context-specific to each CWE. As a result, a single binary classifier might merely rely on superficial text patterns rather than understanding the intricacies of each vulnerability type. Recent reports showed that even the state-of-the-art Large Language Model (LLM) with hundreds of billions of parameters struggles to generalize well to detect vulnerabilities. Our work investigates a different approach that leverages CWE-specific classifiers to address the heterogeneity of vulnerability types. We hypothesize that training separate classifiers for each CWE will enable the models to capture the unique characteristics and code semantics associated with each vulnerability category. To confirm this, we conduct an ablation study by training individual classifiers for each CWE and evaluating their performance independently. Our results demonstrate that CWE-specific classifiers outperform a single binary classifier trained on all vulnerabilities. Building upon this, we explore strategies to combine them into a unified vulnerability detection system using a multiclass approach. Even if the lack of large and high-quality datasets for vulnerability detection is still a major obstacle, our results show that multiclass detection can be a better path toward practical vulnerability detection in the future. All our models and code to produce our results are open-sourced.
Title:
```
  Machine Learning Applications in Medical Prognostics: A Comprehensive Review
```
Authors: Michael Fascia
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Machine learning (ML) has revolutionized medical prognostics by integrating advanced algorithms with clinical data to enhance disease prediction, risk assessment, and patient outcome forecasting. This comprehensive review critically examines the application of various ML techniques in medical prognostics, focusing on their efficacy, challenges, and future directions. The methodologies discussed include Random Forest (RF) for sepsis prediction, logistic regression for cardiovascular risk assessment, Convolutional Neural Networks (CNNs) for cancer detection, and Long Short-Term Memory (LSTM) networks for predicting clinical deterioration. RF models demonstrate robust performance in handling high-dimensional data and capturing non-linear relationships, making them particularly effective for sepsis prediction. Logistic regression remains valuable for its interpretability and ease of use in cardiovascular risk assessment. CNNs have shown exceptional accuracy in cancer detection, leveraging their ability to learn complex visual patterns from medical imaging. LSTM networks excel in analyzing temporal data, providing accurate predictions of clinical deterioration. The review highlights the strengths and limitations of each technique, the importance of model interpretability, and the challenges of data quality and privacy. Future research directions include the integration of multi-modal data sources, the application of transfer learning, and the development of continuous learning systems. These advancements aim to enhance the predictive power and clinical applicability of ML models, ultimately improving patient outcomes in healthcare settings.
Title:
```
  Tensorial template matching for fast cross-correlation with rotations and its application for tomography
```
Authors: Antonio Martinez-Sanchez (1), Ulrike Homberg (2), José María Almira (1), Harold Phelippeau (2) ((1) University of Murcia, Spain, (2) Thermo Fisher Scientific)
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Object detection is a main task in computer vision. Template matching is the reference method for detecting objects with arbitrary templates. However, template matching computational complexity depends on the rotation accuracy, being a limiting factor for large 3D images (tomograms). Here, we implement a new algorithm called tensorial template matching, based on a mathematical framework that represents all rotations of a template with a tensor field. Contrary to standard template matching, the computational complexity of the presented algorithm is independent of the rotation accuracy. Using both, synthetic and real data from tomography, we demonstrate that tensorial template matching is much faster than template matching and has the potential to improve its accuracy
Title:
```
  From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future
```
Authors: Haolin Jin, Linghan Huang, Haipeng Cai, Jun Yan, Bo Li, Huaming Chen
Subjects: Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract With the rise of large language models (LLMs), researchers are increasingly exploring their applications in var ious vertical domains, such as software engineering. LLMs have achieved remarkable success in areas including code generation and vulnerability detection. However, they also exhibit numerous limitations and shortcomings. LLM-based agents, a novel tech nology with the potential for Artificial General Intelligence (AGI), combine LLMs as the core for decision-making and action-taking, addressing some of the inherent limitations of LLMs such as lack of autonomy and self-improvement. Despite numerous studies and surveys exploring the possibility of using LLMs in software engineering, it lacks a clear distinction between LLMs and LLM based agents. It is still in its early stage for a unified standard and benchmarking to qualify an LLM solution as an LLM-based agent in its domain. In this survey, we broadly investigate the current practice and solutions for LLMs and LLM-based agents for software engineering. In particular we summarise six key topics: requirement engineering, code generation, autonomous decision-making, software design, test generation, and software maintenance. We review and differentiate the work of LLMs and LLM-based agents from these six topics, examining their differences and similarities in tasks, benchmarks, and evaluation metrics. Finally, we discuss the models and benchmarks used, providing a comprehensive analysis of their applications and effectiveness in software engineering. We anticipate this work will shed some lights on pushing the boundaries of LLM-based agents in software engineering for future research.
Title:
```
  Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection
```
Authors: Ting Lei, Shaofeng Yin, Yuxin Peng, Yang Liu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Zero-shot Human-Object Interaction (HOI) detection has emerged as a frontier topic due to its capability to detect HOIs beyond a predefined set of categories. This task entails not only identifying the interactiveness of human-object pairs and localizing them but also recognizing both seen and unseen interaction categories. In this paper, we introduce a novel framework for zero-shot HOI detection using Conditional Multi-Modal Prompts, namely CMMP. This approach enhances the generalization of large foundation models, such as CLIP, when fine-tuned for HOI detection. Unlike traditional prompt-learning methods, we propose learning decoupled vision and language prompts for interactiveness-aware visual feature extraction and generalizable interaction classification, respectively. Specifically, we integrate prior knowledge of different granularity into conditional vision prompts, including an input-conditioned instance prior and a global spatial pattern prior. The former encourages the image encoder to treat instances belonging to seen or potentially unseen HOI concepts equally while the latter provides representative plausible spatial configuration of the human and object under interaction. Besides, we employ language-aware prompt learning with a consistency constraint to preserve the knowledge of the large foundation model to enable better generalization in the text branch. Extensive experiments demonstrate the efficacy of our detector with conditional multi-modal prompts, outperforming previous state-of-the-art on unseen classes of various zero-shot settings. The code and models are available at \url{this https URL}.
Title:
```
  Estimating Pore Location of PBF-LB/M Processes with Segmentation Models
```
Authors: Hans Aoyang Zhou, Jan Theunissen, Marco Kemmerling, Anas Abdelrazeq, Johannes Henrich Schleifenbaum, Robert H. Schmitt
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Reliably manufacturing defect free products is still an open challenge for Laser Powder Bed Fusion processes. Particularly, pores that occur frequently have a negative impact on mechanical properties like fatigue performance. Therefore, an accurate localisation of pores is mandatory for quality assurance, but requires time-consuming post-processing steps like computer tomography scans. Although existing solutions using in-situ monitoring data can detect pore occurrence within a layer, they are limited in their localisation precision. Therefore, we propose a pore localisation approach that estimates their position within a single layer using a Gaussian kernel density estimation. This allows segmentation models to learn the correlation between in-situ monitoring data and the derived probability distribution of pore occurrence. Within our experiments, we compare the prediction performance of different segmentation models depending on machine parameter configuration and geometry features. From our results, we conclude that our approach allows a precise localisation of pores that requires minimal data preprocessing. Our research extends the literature by providing a foundation for more precise pore detection systems.
Title:
```
  Introducing a Comprehensive, Continuous, and Collaborative Survey of Intrusion Detection Datasets
```
Authors: Philipp Bönninghausen, Rafael Uetz, Martin Henze
Subjects: Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Researchers in the highly active field of intrusion detection largely rely on public datasets for their experimental evaluations. However, the large number of existing datasets, the discovery of previously unknown flaws therein, and the frequent publication of new datasets make it hard to select suitable options and sufficiently understand their respective limitations. Hence, there is a great risk of drawing invalid conclusions from experimental results with respect to detection performance of novel methods in the real world. While there exist various surveys on intrusion detection datasets, they have deficiencies in providing researchers with a profound decision basis since they lack comprehensiveness, actionable details, and up-to-dateness. In this paper, we present COMIDDS, an ongoing effort to comprehensively survey intrusion detection datasets with an unprecedented level of detail, implemented as a website backed by a public GitHub repository. COMIDDS allows researchers to quickly identify suitable datasets depending on their requirements and provides structured and critical information on each dataset, including actual data samples and links to relevant publications. COMIDDS is freely accessible, regularly updated, and open to contributions.
Title:
```
  Single-tap Latency Reduction with Single- or Double- tap Prediction
```
Authors: Naoto Nishida, Kaori Ikematsu, Junichi Sato, Shota Yamanaka, Kota Tsubouchi
Subjects: Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Touch surfaces are widely utilized for smartphones, tablet PCs, and laptops (touchpad), and single and double taps are the most basic and common operations on them. The detection of single or double taps causes the single-tap latency problem, which creates a bottleneck in terms of the sensitivity of touch inputs. To reduce the single-tap latency, we propose a novel machine-learning-based tap prediction method called PredicTaps. Our method predicts whether a detected tap is a single tap or the first contact of a double tap without having to wait for the hundreds of milliseconds conventionally required. We present three evaluations and one user evaluation that demonstrate its broad applicability and usability for various tap situations on two form factors (touchpad and smartphone). The results showed PredicTaps reduces the single-tap latency from 150-500 ms to 12 ms on laptops and to 17.6 ms on smartphones without reducing usability.
Title:
```
  HQOD: Harmonious Quantization for Object Detection
```
Authors: Long Huang, Zhiwei Dong, Song-Lu Chen, Ruiyao Zhang, Shutong Ti, Feng Chen, Xu-Cheng Yin
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Task inharmony problem commonly occurs in modern object detectors, leading to inconsistent qualities between classification and regression tasks. The predicted boxes with high classification scores but poor localization positions or low classification scores but accurate localization positions will worsen the performance of detectors after Non-Maximum Suppression. Furthermore, when object detectors collaborate with Quantization-Aware Training (QAT), we observe that the task inharmony problem will be further exacerbated, which is considered one of the main causes of the performance degradation of quantized detectors. To tackle this issue, we propose the Harmonious Quantization for Object Detection (HQOD) framework, which consists of two components. Firstly, we propose a task-correlated loss to encourage detectors to focus on improving samples with lower task harmony quality during QAT. Secondly, a harmonious Intersection over Union (IoU) loss is incorporated to balance the optimization of the regression branch across different IoU levels. The proposed HQOD can be easily integrated into different QAT algorithms and detectors. Remarkably, on the MS COCO dataset, our 4-bit ATSS with ResNet-50 backbone achieves a state-of-the-art mAP of 39.6%, even surpassing the full-precision one.
Title:
```
  Artificial Intelligence for Public Health Surveillance in Africa: Applications and Opportunities
```
Authors: Jean Marie Tshimula, Mitterrand Kalengayi, Dieumerci Makenga, Dorcas Lilonge, Marius Asumani, Déborah Madiya, Élie Nkuba Kalonji, Hugues Kanda, René Manassé Galekwa, Josias Kumbu, Hardy Mikese, Grace Tshimula, Jean Tshibangu Muabila, Christian N. Mayemba, D'Jeff K. Nkashama, Kalonji Kalala, Steve Ataky, Tighana Wenge Basele, Mbuyi Mukendi Didier, Selain K. Kasereka, Maximilien V. Dialufuma, Godwill Ilunga Wa Kumwita, Lionel Muyuku, Jean-Paul Kimpesa, Dominique Muteba, Aaron Aruna Abedi, Lambert Mukendi Ntobo, Gloria M. Bundutidi, Désiré Kulimba Mashinda, Emmanuel Kabengele Mpinga, Nathanaël M. Kasoro
Subjects: Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Artificial Intelligence (AI) is revolutionizing various fields, including public health surveillance. In Africa, where health systems frequently encounter challenges such as limited resources, inadequate infrastructure, failed health information systems and a shortage of skilled health professionals, AI offers a transformative opportunity. This paper investigates the applications of AI in public health surveillance across the continent, presenting successful case studies and examining the benefits, opportunities, and challenges of implementing AI technologies in African healthcare settings. Our paper highlights AI's potential to enhance disease monitoring and health outcomes, and support effective public health interventions. The findings presented in the paper demonstrate that AI can significantly improve the accuracy and timeliness of disease detection and prediction, optimize resource allocation, and facilitate targeted public health strategies. Additionally, our paper identified key barriers to the widespread adoption of AI in African public health systems and proposed actionable recommendations to overcome these challenges.
Title:
```
  Operational range bounding of spectroscopy models with anomaly detection
```
Authors: Luís F. Simões, Pierluigi Casale, Marília Felismino, Kai Hou Yip, Ingo P. Waldmann, Giovanna Tinetti, Theresa Lueftinger
Subjects: Subjects: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Safe operation of machine learning models requires architectures that explicitly delimit their operational ranges. We evaluate the ability of anomaly detection algorithms to provide indicators correlated with degraded model performance. By placing acceptance thresholds over such indicators, hard boundaries are formed that define the model's coverage. As a use case, we consider the extraction of exoplanetary spectra from transit light curves, specifically within the context of ESA's upcoming Ariel mission. Isolation Forests are shown to effectively identify contexts where prediction models are likely to fail. Coverage/error trade-offs are evaluated under conditions of data and concept drift. The best performance is seen when Isolation Forests model projections of the prediction model's explainability SHAP values.
Title:
```
  Massive MIMO-OTFS-Based Random Access for Cooperative LEO Satellite Constellations
```
Authors: Boxiao Shen, Yongpeng Wu, Shiqi Gong, Heng Liu, Björn Ottersten, Wenjun Zhang
Subjects: Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract This paper investigates joint device identification, channel estimation, and symbol detection for cooperative multi-satellite-enhanced random access, where orthogonal time-frequency space modulation with the large antenna array is utilized to combat the dynamics of the terrestrial-satellite links (TSLs). We introduce the generalized complex exponential basis expansion model to parameterize TSLs, thereby reducing the pilot overhead. By exploiting the block sparsity of the TSLs in the angular domain, a message passing algorithm is designed for initial channel estimation. Subsequently, we examine two cooperative modes to leverage the spatial diversity within satellite constellations: the centralized mode, where computations are performed at a high-power central server, and the distributed mode, where computations are offloaded to edge satellites with minimal signaling overhead. Specifically, in the centralized mode, device identification is achieved by aggregating backhaul information from edge satellites, and channel estimation and symbol detection are jointly enhanced through a structured approximate expectation propagation (AEP) algorithm. In the distributed mode, edge satellites share channel information and exchange soft information about data symbols, leading to a distributed version of AEP. The introduced basis expansion model for TSLs enables the efficient implementation of both centralized and distributed algorithms via fast Fourier transform. Simulation results demonstrate that proposed schemes significantly outperform conventional algorithms in terms of the activity error rate, the normalized mean squared error, and the symbol error rate. Notably, the distributed mode achieves performance comparable to the centralized mode with only two exchanges of soft information about data symbols within the constellation.
Title:
```
  Modelling Visual Semantics via Image Captioning to extract Enhanced Multi-Level Cross-Modal Semantic Incongruity Representation with Attention for Multimodal Sarcasm Detection
```
Authors: Sajal Aggarwal, Ananya Pandey, Dinesh Kumar Vishwakarma
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Sarcasm is a type of irony, characterized by an inherent mismatch between the literal interpretation and the intended connotation. Though sarcasm detection in text has been extensively studied, there are situations in which textual input alone might be insufficient to perceive sarcasm. The inclusion of additional contextual cues, such as images, is essential to recognize sarcasm in social media data effectively. This study presents a novel framework for multimodal sarcasm detection that can process input triplets. Two components of these triplets comprise the input text and its associated image, as provided in the datasets. Additionally, a supplementary modality is introduced in the form of descriptive image captions. The motivation behind incorporating this visual semantic representation is to more accurately capture the discrepancies between the textual and visual content, which are fundamental to the sarcasm detection task. The primary contributions of this study are: (1) a robust textual feature extraction branch that utilizes a cross-lingual language model; (2) a visual feature extraction branch that incorporates a self-regulated residual ConvNet integrated with a lightweight spatially aware attention module; (3) an additional modality in the form of image captions generated using an encoder-decoder architecture capable of reading text embedded in images; (4) distinct attention modules to effectively identify the incongruities between the text and two levels of image representations; (5) multi-level cross-domain semantic incongruity representation achieved through feature fusion. Compared with cutting-edge baselines, the proposed model achieves the best accuracy of 92.89% and 64.48%, respectively, on the Twitter multimodal sarcasm and MultiBully datasets.
Title:
```
  YOWOv3: An Efficient and Generalized Framework for Human Action Detection and Recognition
```
Authors: Duc Manh Nguyen Dang, Viet Hang Duong, Jia Ching Wang, Nhan Bui Duc
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In this paper, we propose a new framework called YOWOv3, which is an improved version of YOWOv2, designed specifically for the task of Human Action Detection and Recognition. This framework is designed to facilitate extensive experimentation with different configurations and supports easy customization of various components within the model, reducing efforts required for understanding and modifying the code. YOWOv3 demonstrates its superior performance compared to YOWOv2 on two widely used datasets for Human Action Detection and Recognition: UCF101-24 and AVAv2.2. Specifically, the predecessor model YOWOv2 achieves an mAP of 85.2% and 20.3% on UCF101-24 and AVAv2.2, respectively, with 109.7M parameters and 53.6 GFLOPs. In contrast, our model - YOWOv3, with only 59.8M parameters and 39.8 GFLOPs, achieves an mAP of 88.33% and 20.31% on UCF101-24 and AVAv2.2, respectively. The results demonstrate that YOWOv3 significantly reduces the number of parameters and GFLOPs while still achieving comparable performance.
Title:
```
  Command-line Obfuscation Detection using Small Language Models
```
Authors: Vojtech Outrata, Michael Adam Polak, Martin Kopp
Subjects: Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract To avoid detection, adversaries often use command-line obfuscation. There are numerous techniques of the command-line obfuscation, all designed to alter the command-line syntax without affecting its original functionality. This variability forces most security solutions to create an exhaustive enumeration of signatures for even a single pattern. In contrast to using signatures, we have implemented a scalable NLP-based detection method that leverages a custom-trained, small transformer language model that can be applied to any source of execution logs. The evaluation on top of real-world telemetry demonstrates that our approach yields high-precision detections even on high-volume telemetry from a diverse set of environments spanning from universities and businesses to healthcare or finance. The practical value is demonstrated in a case study of real-world samples detected by our model. We show the model's superiority to signatures on established malware known to employ obfuscation and showcase previously unseen obfuscated samples detected by our model.
Title:
```
  Detection of Compromised Functions in a Serverless Cloud Environment
```
Authors: Danielle Lavi, Oleg Brodt, Dudu Mimran, Yuval Elovici, Asaf Shabtai
Subjects: Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Serverless computing is an emerging cloud paradigm with serverless functions at its core. While serverless environments enable software developers to focus on developing applications without the need to actively manage the underlying runtime infrastructure, they open the door to a wide variety of security threats that can be challenging to mitigate with existing methods. Existing security solutions do not apply to all serverless architectures, since they require significant modifications to the serverless infrastructure or rely on third-party services for the collection of more detailed data. In this paper, we present an extendable serverless security threat detection model that leverages cloud providers' native monitoring tools to detect anomalous behavior in serverless applications. Our model aims to detect compromised serverless functions by identifying post-exploitation abnormal behavior related to different types of attacks on serverless functions, and therefore, it is a last line of defense. Our approach is not tied to any specific serverless application, is agnostic to the type of threats, and is adaptable through model adjustments. To evaluate our model's performance, we developed a serverless cybersecurity testbed in an AWS cloud environment, which includes two different serverless applications and simulates a variety of attack scenarios that cover the main security threats faced by serverless functions. Our evaluation demonstrates our model's ability to detect all implemented attacks while maintaining a negligible false alarm rate.
Keyword: face recognition

Title:
```
  Transferable Adversarial Facial Images for Privacy Protection
```
Authors: Minghui Li, Jiangxiong Wang, Hao Zhang, Ziqi Zhou, Shengshan Hu, Xiaobing Pei
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The success of deep face recognition (FR) systems has raised serious privacy concerns due to their ability to enable unauthorized tracking of users in the digital world. Previous studies proposed introducing imperceptible adversarial noises into face images to deceive those face recognition models, thus achieving the goal of enhancing facial privacy protection. Nevertheless, they heavily rely on user-chosen references to guide the generation of adversarial noises, and cannot simultaneously construct natural and highly transferable adversarial face images in black-box scenarios. In light of this, we present a novel face privacy protection scheme with improved transferability while maintain high visual quality. We propose shaping the entire face space directly instead of exploiting one kind of facial characteristic like makeup information to integrate adversarial noises. To achieve this goal, we first exploit global adversarial latent search to traverse the latent space of the generative model, thereby creating natural adversarial face images with high transferability. We then introduce a key landmark regularization module to preserve the visual identity information. Finally, we investigate the impacts of various kinds of latent spaces and find that $\mathcal{F}$ latent space benefits the trade-off between visual naturalness and adversarial transferability. Extensive experiments over two datasets demonstrate that our approach significantly enhances attack transferability while maintaining high visual quality, outperforming state-of-the-art methods by an average 25% improvement in deep FR models and 10% improvement on commercial FR APIs, including Face++, Aliyun, and Tencent.
Title:
```
  Model Hijacking Attack in Federated Learning
```
Authors: Zheng Li, Siyuan Wu, Ruichuan Chen, Paarijaat Aditya, Istemi Ekin Akkus, Manohar Vanga, Min Zhang, Hao Li, Yang Zhang
Subjects: Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Machine learning (ML), driven by prominent paradigms such as centralized and federated learning, has made significant progress in various critical applications ranging from autonomous driving to face recognition. However, its remarkable success has been accompanied by various attacks. Recently, the model hijacking attack has shown that ML models can be hijacked to execute tasks different from their original tasks, which increases both accountability and parasitic computational risks. Nevertheless, thus far, this attack has only focused on centralized learning. In this work, we broaden the scope of this attack to the federated learning domain, where multiple clients collaboratively train a global model without sharing their data. Specifically, we present HijackFL, the first-of-its-kind hijacking attack against the global model in federated learning. The adversary aims to force the global model to perform a different task (called hijacking task) from its original task without the server or benign client noticing. To accomplish this, unlike existing methods that use data poisoning to modify the target model's parameters, HijackFL searches for pixel-level perturbations based on their local model (without modifications) to align hijacking samples with the original ones in the feature space. When performing the hijacking task, the adversary applies these cloaks to the hijacking samples, compelling the global model to identify them as original samples and predict them accordingly. We conduct extensive experiments on four benchmark datasets and three popular models. Empirical results demonstrate that its attack performance outperforms baselines. We further investigate the factors that affect its performance and discuss possible defenses to mitigate its impact.
Title:
```
  HyperSpaceX: Radial and Angular Exploration of HyperSpherical Dimensions
```
Authors: Chiranjeev Chiranjeev, Muskan Dosi, Kartik Thakral, Mayank Vatsa, Richa Singh
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Traditional deep learning models rely on methods such as softmax cross-entropy and ArcFace loss for tasks like classification and face recognition. These methods mainly explore angular features in a hyperspherical space, often resulting in entangled inter-class features due to dense angular data across many classes. In this paper, a new field of feature exploration is proposed known as HyperSpaceX which enhances class discrimination by exploring both angular and radial dimensions in multi-hyperspherical spaces, facilitated by a novel DistArc loss. The proposed DistArc loss encompasses three feature arrangement components: two angular and one radial, enforcing intra-class binding and inter-class separation in multi-radial arrangement, improving feature discriminability. Evaluation of HyperSpaceX framework for the novel representation utilizes a proposed predictive measure that accounts for both angular and radial elements, providing a more comprehensive assessment of model accuracy beyond standard metrics. Experiments across seven object classification and six face recognition datasets demonstrate state-of-the-art (SoTA) results obtained from HyperSpaceX, achieving up to a 20% performance improvement on large-scale object datasets in lower dimensions and up to 6% gain in higher dimensions.
Keyword: augmentation

Title:
```
  Trainable Pointwise Decoder Module for Point Cloud Segmentation
```
Authors: Bike Chen, Chen Gong, Antti Tikanmäki, Juha Röning
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Point cloud segmentation (PCS) aims to make per-point predictions and enables robots and autonomous driving cars to understand the environment. The range image is a dense representation of a large-scale outdoor point cloud, and segmentation models built upon the image commonly execute efficiently. However, the projection of the point cloud onto the range image inevitably leads to dropping points because, at each image coordinate, only one point is kept despite multiple points being projected onto the same location. More importantly, it is challenging to assign correct predictions to the dropped points that belong to the classes different from the kept point class. Besides, existing post-processing methods, such as K-nearest neighbor (KNN) search and kernel point convolution (KPConv), cannot be trained with the models in an end-to-end manner or cannot process varying-density outdoor point clouds well, thereby enabling the models to achieve sub-optimal performance. To alleviate this problem, we propose a trainable pointwise decoder module (PDM) as the post-processing approach, which gathers weighted features from the neighbors and then makes the final prediction for the query point. In addition, we introduce a virtual range image-guided copy-rotate-paste (VRCrop) strategy in data augmentation. VRCrop constrains the total number of points and eliminates undesirable artifacts in the augmented point cloud. With PDM and VRCrop, existing range image-based segmentation models consistently perform better than their counterparts on the SemanticKITTI, SemanticPOSS, and nuScenes datasets.
Title:
```
  Full-range Head Pose Geometric Data Augmentations
```
Authors: Huei-Chung Hu, Xuyang Wu, Haowei Liu, Ting-Ruen Wei, Hsin-Tai Wu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Many head pose estimation (HPE) methods promise the ability to create full-range datasets, theoretically allowing the estimation of the rotation and positioning of the head from various angles. However, these methods are only accurate within a range of head angles; exceeding this specific range led to significant inaccuracies. This is dominantly explained by unclear specificity of the coordinate systems and Euler Angles used in the foundational rotation matrix calculations. Here, we addressed these limitations by presenting (1) methods that accurately infer the correct coordinate system and Euler angles in the correct axis-sequence, (2) novel formulae for 2D geometric augmentations of the rotation matrices under the (SPECIFIC) coordinate system, (3) derivations for the correct drawing routines for rotation matrices and poses, and (4) mathematical experimentation and verification that allow proper pitch-yaw coverage for full-range head pose dataset generation. Performing our augmentation techniques to existing head pose estimation methods demonstrated a significant improvement to the model performance. Code will be released upon paper acceptance.
Title:
```
  Generating High-quality Symbolic Music Using Fine-grained Discriminators
```
Authors: Zhedong Zhang, Liang Li, Jiehua Zhang, Zhenghui Hu, Hongkui Wang, Chenggang Yan, Jian Yang, Yuankai Qi
Subjects: Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Existing symbolic music generation methods usually utilize discriminator to improve the quality of generated music via global perception of music. However, considering the complexity of information in music, such as rhythm and melody, a single discriminator cannot fully reflect the differences in these two primary dimensions of music. In this work, we propose to decouple the melody and rhythm from music, and design corresponding fine-grained discriminators to tackle the aforementioned issues. Specifically, equipped with a pitch augmentation strategy, the melody discriminator discerns the melody variations presented by the generated samples. By contrast, the rhythm discriminator, enhanced with bar-level relative positional encoding, focuses on the velocity of generated notes. Such a design allows the generator to be more explicitly aware of which aspects should be adjusted in the generated music, making it easier to mimic human-composed music. Experimental results on the POP909 benchmark demonstrate the favorable performance of the proposed method compared to several state-of-the-art methods in terms of both objective and subjective metrics.
Title:
```
  Invariant Graph Learning Meets Information Bottleneck for Out-of-Distribution Generalization
```
Authors: Wenyu Mao, Jiancan Wu, Haoyang Liu, Yongduo Sui, Xiang Wang
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Graph out-of-distribution (OOD) generalization remains a major challenge in graph learning since graph neural networks (GNNs) often suffer from severe performance degradation under distribution shifts. Invariant learning, aiming to extract invariant features across varied distributions, has recently emerged as a promising approach for OOD generation. Despite the great success of invariant learning in OOD problems for Euclidean data (i.e., images), the exploration within graph data remains constrained by the complex nature of graphs. Existing studies, such as data augmentation or causal intervention, either suffer from disruptions to invariance during the graph manipulation process or face reliability issues due to a lack of supervised signals for causal parts. In this work, we propose a novel framework, called Invariant Graph Learning based on Information bottleneck theory (InfoIGL), to extract the invariant features of graphs and enhance models' generalization ability to unseen distributions. Specifically, InfoIGL introduces a redundancy filter to compress task-irrelevant information related to environmental factors. Cooperating with our designed multi-level contrastive learning, we maximize the mutual information among graphs of the same class in the downstream classification tasks, preserving invariant features for prediction to a great extent. An appealing feature of InfoIGL is its strong generalization ability without depending on supervised signal of invariance. Experiments on both synthetic and real-world datasets demonstrate that our method achieves state-of-the-art performance under OOD generalization for graph classification tasks. The source code is available at this https URL.
Title:
```
  ST-SACLF: Style Transfer Informed Self-Attention Classifier for Bias-Aware Painting Classification
```
Authors: Mridula Vijendran, Frederick W. B. Li, Jingjing Deng, Hubert P. H. Shum
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Painting classification plays a vital role in organizing, finding, and suggesting artwork for digital and classic art galleries. Existing methods struggle with adapting knowledge from the real world to artistic images during training, leading to poor performance when dealing with different datasets. Our innovation lies in addressing these challenges through a two-step process. First, we generate more data using Style Transfer with Adaptive Instance Normalization (AdaIN), bridging the gap between diverse styles. Then, our classifier gains a boost with feature-map adaptive spatial attention modules, improving its understanding of artistic details. Moreover, we tackle the problem of imbalanced class representation by dynamically adjusting augmented samples. Through a dual-stage process involving careful hyperparameter search and model fine-tuning, we achieve an impressive 87.24\% accuracy using the ResNet-50 backbone over 40 training epochs. Our study explores quantitative analyses that compare different pretrained backbones, investigates model optimization through ablation studies, and examines how varying augmentation levels affect model performance. Complementing this, our qualitative experiments offer valuable insights into the model's decision-making process using spatial attention and its ability to differentiate between easy and challenging samples based on confidence ranking.
Title:
```
  Label Augmentation for Neural Networks Robustness
```
Authors: Fatemeh Amerehi, Patrick Healy
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Out-of-distribution generalization can be categorized into two types: common perturbations arising from natural variations in the real world and adversarial perturbations that are intentionally crafted to deceive neural networks. While deep neural networks excel in accuracy under the assumption of identical distributions between training and test data, they often encounter out-of-distribution scenarios resulting in a significant decline in accuracy. Data augmentation methods can effectively enhance robustness against common corruptions, but they typically fall short in improving robustness against adversarial perturbations. In this study, we develop Label Augmentation (LA), which enhances robustness against both common and intentional perturbations and improves uncertainty estimation. Our findings indicate a Clean error rate improvement of up to 23.29% when employing LA in comparisons to the baseline. Additionally, it enhances robustness under common corruptions benchmark by up to 24.23%. When tested against FGSM and PGD attacks, improvements in adversarial robustness are noticeable, with enhancements of up to 53.18% for FGSM and 24.46% for PGD attacks.
Title:
```
  Unsupervised Representation Learning by Balanced Self Attention Matching
```
Authors: Daniel Shalam, Simon Korman
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Many leading self-supervised methods for unsupervised representation learning, in particular those for embedding image features, are built on variants of the instance discrimination task, whose optimization is known to be prone to instabilities that can lead to feature collapse. Different techniques have been devised to circumvent this issue, including the use of negative pairs with different contrastive losses, the use of external memory banks, and breaking of symmetry by using separate encoding networks with possibly different structures. Our method, termed BAM, rather than directly matching features of different views (augmentations) of input images, is based on matching their self-attention vectors, which are the distributions of similarities to the entire set of augmented images of a batch. We obtain rich representations and avoid feature collapse by minimizing a loss that matches these distributions to their globally balanced and entropy regularized version, which is obtained through a simple self-optimal-transport computation. We ablate and verify our method through a wide set of experiments that show competitive performance with leading methods on both semi-supervised and transfer-learning benchmarks. Our implementation and pre-trained models are available at this http URL .
Title:
```
  Advancing Post-OCR Correction: A Comparative Study of Synthetic Data
```
Authors: Shuhao Guan, Derek Greene
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract This paper explores the application of synthetic data in the post-OCR domain on multiple fronts by conducting experiments to assess the impact of data volume, augmentation, and synthetic data generation methods on model performance. Furthermore, we introduce a novel algorithm that leverages computer vision feature detection algorithms to calculate glyph similarity for constructing post-OCR synthetic data. Through experiments conducted across a variety of languages, including several low-resource ones, we demonstrate that models like ByT5 can significantly reduce Character Error Rates (CER) without the need for manually annotated data, and our proposed synthetic data generation method shows advantages over traditional methods, particularly in low-resource languages.
Title:
```
  The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024
```
Authors: He Wang, Lei Xie
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP (Team 237) in the second Chinese Continuous Visual Speech Recognition Challenge (CNVSRC 2024), engaging in all four tracks, including the fixed and open tracks of Single-Speaker VSR Task and Multi-Speaker VSR Task. In terms of data processing, we leverage the lip motion extractor from the baseline1 to produce multiscale video data. Besides, various augmentation techniques are applied during training, encompassing speed perturbation, random rotation, horizontal flipping, and color transformation. The VSR model adopts an end-to-end architecture with joint CTC/attention loss, introducing Enhanced ResNet3D visual frontend, E-Branchformer encoder, and Bi-directional Transformer decoder. Our approach yields a 30.47% CER for the Single-Speaker Task and 34.30% CER for the Multi-Speaker Task, securing second place in the open track of the Single-Speaker Task and first place in the other three tracks.

LeeKyungwook / get-arxiv-noti

New submissions for Tuesday, 6 August 2024 (showing 557 of 557 entries ) #1216

Keyword: detection

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Keyword: face recognition

Title:

Title:

Title:

Keyword: augmentation

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title: