New submissions for Tue, 16 Apr 24

Keyword: detection

Transformer-based Joint Modelling for Automatic Essay Scoring and Off-Topic Detection

Authors: Authors: Sourya Dipta Das, Yash Vadi, Kuldeep Yadav
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08655
Pdf link: https://arxiv.org/pdf/2404.08655
Abstract Automated Essay Scoring (AES) systems are widely popular in the market as they constitute a cost-effective and time-effective option for grading systems. Nevertheless, many studies have demonstrated that the AES system fails to assign lower grades to irrelevant responses. Thus, detecting the off-topic response in automated essay scoring is crucial in practical tasks where candidates write unrelated text responses to the given task in the question. In this paper, we are proposing an unsupervised technique that jointly scores essays and detects off-topic essays. The proposed Automated Open Essay Scoring (AOES) model uses a novel topic regularization module (TRM), which can be attached on top of a transformer model, and is trained using a proposed hybrid loss function. After training, the AOES model is further used to calculate the Mahalanobis distance score for off-topic essay detection. Our proposed method outperforms the baseline we created and earlier conventional methods on two essay-scoring datasets in off-topic detection as well as on-topic scoring. Experimental evaluation results on different adversarial strategies also show how the suggested method is robust for detecting possible human-level perturbations.
Identifying Banking Transaction Descriptions via Support Vector Machine Short-Text Classification Based on a Specialized Labelled Corpus
Authors: Authors: Silvia García-Méndez, Milagros Fernández-Gavilanes, Jonathan Juncal-Martínez, Francisco J. González-Castaño, Oscar Barba Seara
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08664
Pdf link: https://arxiv.org/pdf/2404.08664
Abstract Short texts are omnipresent in real-time news, social network commentaries, etc. Traditional text representation methods have been successfully applied to self-contained documents of medium size. However, information in short texts is often insufficient, due, for example, to the use of mnemonics, which makes them hard to classify. Therefore, the particularities of specific domains must be exploited. In this article we describe a novel system that combines Natural Language Processing techniques with Machine Learning algorithms to classify banking transaction descriptions for personal finance management, a problem that was not previously considered in the literature. We trained and tested that system on a labelled dataset with real customer transactions that will be available to other researchers on request. Motivated by existing solutions in spam detection, we also propose a short text similarity detector to reduce training set size based on the Jaccard distance. Experimental results with a two-stage classifier combining this detector with a SVM indicate a high accuracy in comparison with alternative approaches, taking into account complexity and computing time. Finally, we present a use case with a personal finance application, CoinScrap, which is available at Google Play and App Store.
Your Finetuned Large Language Model is Already a Powerful Out-of-distribution Detector
Authors: Authors: Andi Zhang, Tim Z. Xiao, Weiyang Liu, Robert Bamler, Damon Wischik
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2404.08679
Pdf link: https://arxiv.org/pdf/2404.08679
Abstract We revisit the likelihood ratio between a pretrained large language model (LLM) and its finetuned variant as a criterion for out-of-distribution (OOD) detection. The intuition behind such a criterion is that, the pretrained LLM has the prior knowledge about OOD data due to its large amount of training data, and once finetuned with the in-distribution data, the LLM has sufficient knowledge to distinguish their difference. Leveraging the power of LLMs, we show that, for the first time, the likelihood ratio can serve as an effective OOD detector. Moreover, we apply the proposed LLM-based likelihood ratio to detect OOD questions in question-answering (QA) systems, which can be used to improve the performance of specialized LLMs for general questions. Given that likelihood can be easily obtained by the loss functions within contemporary neural network frameworks, it is straightforward to implement this approach in practice. Since both the pretrained LLMs and its various finetuned models are available, our proposed criterion can be effortlessly incorporated for OOD detection without the need for further training. We conduct comprehensive evaluation across on multiple settings, including far OOD, near OOD, spam detection, and QA scenarios, to demonstrate the effectiveness of the method.
FastLogAD: Log Anomaly Detection with Mask-Guided Pseudo Anomaly Generation and Discrimination
Authors: Authors: Yifei Lin, Hanqiu Deng, Xingyu Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08750
Pdf link: https://arxiv.org/pdf/2404.08750
Abstract Nowadays large computers extensively output logs to record the runtime status and it has become crucial to identify any suspicious or malicious activities from the information provided by the realtime logs. Thus, fast log anomaly detection is a necessary task to be implemented for automating the infeasible manual detection. Most of the existing unsupervised methods are trained only on normal log data, but they usually require either additional abnormal data for hyperparameter selection or auxiliary datasets for discriminative model optimization. In this paper, aiming for a highly effective discriminative model that enables rapid anomaly detection,we propose FastLogAD, a generator-discriminator framework trained to exhibit the capability of generating pseudo-abnormal logs through the Mask-Guided Anomaly Generation (MGAG) model and efficiently identifying the anomalous logs via the Discriminative Abnormality Separation (DAS) model. Particularly, pseudo-abnormal logs are generated by replacing randomly masked tokens in a normal sequence with unlikely candidates. During the discriminative stage, FastLogAD learns a distinct separation between normal and pseudoabnormal samples based on their embedding norms, allowing the selection of a threshold without exposure to any test data and achieving competitive performance. Extensive experiments on several common benchmarks show that our proposed FastLogAD outperforms existing anomaly detection approaches. Furthermore, compared to previous methods, FastLogAD achieves at least x10 speed increase in anomaly detection over prior work. Our implementation is available at https://github.com/YifeiLin0226/FastLogAD.
Detecting AI-Generated Images via CLIP
Authors: Authors: A.G. Moskowitz, T. Gaona, J. Peterson
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08788
Pdf link: https://arxiv.org/pdf/2404.08788
Abstract As AI-generated image (AIGI) methods become more powerful and accessible, it has become a critical task to determine if an image is real or AI-generated. Because AIGI lack the signatures of photographs and have their own unique patterns, new models are needed to determine if an image is AI-generated. In this paper, we investigate the ability of the Contrastive Language-Image Pre-training (CLIP) architecture, pre-trained on massive internet-scale data sets, to perform this differentiation. We fine-tune CLIP on real images and AIGI from several generative models, enabling CLIP to determine if an image is AI-generated and, if so, determine what generation method was used to create it. We show that the fine-tuned CLIP architecture is able to differentiate AIGI as well or better than models whose architecture is specifically designed to detect AIGI. Our method will significantly increase access to AIGI-detecting tools and reduce the negative effects of AIGI on society, as our CLIP fine-tuning procedures require no architecture changes from publicly available model repositories and consume significantly less GPU resources than other AIGI detection models.
Enhancing IoT Malware Detection through Adaptive Model Parallelism and Resource Optimization
Authors: Authors: Sreenitha Kasarapu, Sanket Shukla, Sai Manoj Pudukotai Dinakarrao
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2404.08808
Pdf link: https://arxiv.org/pdf/2404.08808
Abstract The widespread integration of IoT devices has greatly improved connectivity and computational capabilities, facilitating seamless communication across networks. Despite their global deployment, IoT devices are frequently targeted for security breaches due to inherent vulnerabilities. Among these threats, malware poses a significant risk to IoT devices. The lack of built-in security features and limited resources present challenges for implementing effective malware detection techniques on IoT devices. Moreover, existing methods assume access to all device resources for malware detection, which is often not feasible for IoT devices deployed in critical real-world scenarios. To overcome this challenge, this study introduces a novel approach to malware detection tailored for IoT devices, leveraging resource and workload awareness inspired by model parallelism. Initially, the device assesses available resources for malware detection using a lightweight regression model. Based on resource availability, ongoing workload, and communication costs, the malware detection task is dynamically allocated either on-device or offloaded to neighboring IoT nodes with sufficient resources. To uphold data integrity and user privacy, instead of transferring the entire malware detection task, the classifier is divided and distributed across multiple nodes, then integrated at the parent node for detection. Experimental results demonstrate that this proposed technique achieves a significant speedup of 9.8 x compared to on-device inference, while maintaining a high malware detection accuracy of 96.7%.
E3: Ensemble of Expert Embedders for Adapting Synthetic Image Detectors to New Generators Using Limited Data
Authors: Authors: Aref Azizpour, Tai D. Nguyen, Manil Shrestha, Kaidi Xu, Edward Kim, Matthew C. Stamm
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08814
Pdf link: https://arxiv.org/pdf/2404.08814
Abstract As generative AI progresses rapidly, new synthetic image generators continue to emerge at a swift pace. Traditional detection methods face two main challenges in adapting to these generators: the forensic traces of synthetic images from new techniques can vastly differ from those learned during training, and access to data for these new generators is often limited. To address these issues, we introduce the Ensemble of Expert Embedders (E3), a novel continual learning framework for updating synthetic image detectors. E3 enables the accurate detection of images from newly emerged generators using minimal training data. Our approach does this by first employing transfer learning to develop a suite of expert embedders, each specializing in the forensic traces of a specific generator. Then, all embeddings are jointly analyzed by an Expert Knowledge Fusion Network to produce accurate and reliable detection decisions. Our experiments demonstrate that E3 outperforms existing continual learning methods, including those developed specifically for synthetic image detection.
Empowering Malware Detection Efficiency within Processing-in-Memory Architecture
Authors: Authors: Sreenitha Kasarapu, Sathwika Bavikadi, Sai Manoj Pudukotai Dinakarrao
Subjects: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2404.08818
Pdf link: https://arxiv.org/pdf/2404.08818
Abstract The widespread integration of embedded systems across various industries has facilitated seamless connectivity among devices and bolstered computational capabilities. Despite their extensive applications, embedded systems encounter significant security threats, with one of the most critical vulnerabilities being malicious software, commonly known as malware. In recent times, malware detection techniques leveraging Machine Learning have gained popularity. Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) have proven particularly efficient in image processing tasks. However, one major drawback of neural network architectures is their substantial computational resource requirements. Continuous training of malware detection models with updated malware and benign samples demands immense computational resources, presenting a challenge for real-world applications. In response to these concerns, we propose a Processing-in-Memory (PIM)-based architecture to mitigate memory access latency, thereby reducing the resources consumed during model updates. To further enhance throughput and minimize energy consumption, we incorporate precision scaling techniques tailored for CNN models. Our proposed PIM architecture exhibits a 1.09x higher throughput compared to existing Lookup Table (LUT)-based PIM architectures. Additionally, precision scaling combined with PIM enhances energy efficiency by 1.5x compared to full-precision operations, without sacrificing performance. This innovative approach offers a promising solution to the resource-intensive nature of malware detection model updates, paving the way for more efficient and sustainable cybersecurity practices.
Multi-fingered Robotic Hand Grasping in Cluttered Environments through Hand-object Contact Semantic Mapping
Authors: Authors: Lei Zhang, Kaixin Bai, Guowen Huang, Zhaopeng Chen, Jianwei Zhang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.08844
Pdf link: https://arxiv.org/pdf/2404.08844
Abstract The integration of optimization method and generative models has significantly advanced dexterous manipulation techniques for five-fingered hand grasping. Yet, the application of these techniques in cluttered environments is a relatively unexplored area. To address this research gap, we have developed a novel method for generating five-fingered hand grasp samples in cluttered settings. This method emphasizes simulated grasp quality and the nuanced interaction between the hand and surrounding objects. A key aspect of our approach is our data generation method, capable of estimating contact spatial and semantic representations and affordance grasps based on object affordance information. Furthermore, our Contact Semantic Conditional Variational Autoencoder (CoSe-CVAE) network is adept at creating comprehensive contact maps from point clouds, incorporating both spatial and semantic data. We introduce a unique grasp detection technique that efficiently formulates mechanical hand grasp poses from these maps. Additionally, our evaluation model is designed to assess grasp quality and collision probability, significantly improving the practicality of five-fingered hand grasping in complex scenarios. Our data generation method outperforms previous datasets in grasp diversity, scene diversity, modality diversity. Our grasp generation method has demonstrated remarkable success, outperforming established baselines with 81.0% average success rate in real-world single-object grasping and 75.3% success rate in multi-object grasping. The dataset and supplementary materials can be found at https://sites.google.com/view/ffh-clutteredgrasping, and we will release the code upon publication.
ChangeAnywhere: Sample Generation for Remote Sensing Change Detection via Semantic Latent Diffusion Model
Authors: Authors: Kai Tang, Jin Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08892
Pdf link: https://arxiv.org/pdf/2404.08892
Abstract Remote sensing change detection (CD) is a pivotal technique that pinpoints changes on a global scale based on multi-temporal images. With the recent expansion of deep learning, supervised deep learning-based CD models have shown satisfactory performance. However, CD sample labeling is very time-consuming as it is densely labeled and requires expert knowledge. To alleviate this problem, we introduce ChangeAnywhere, a novel CD sample generation method using the semantic latent diffusion model and single-temporal images. Specifically, ChangeAnywhere leverages the relative ease of acquiring large single-temporal semantic datasets to generate large-scale, diverse, and semantically annotated bi-temporal CD datasets. ChangeAnywhere captures the two essentials of CD samples, i.e., change implies semantically different, and non-change implies reasonable change under the same semantic constraints. We generated ChangeAnywhere-100K, the largest synthesis CD dataset with 100,000 pairs of CD samples based on the proposed method. The ChangeAnywhere-100K significantly improved both zero-shot and few-shot performance on two CD benchmark datasets for various deep learning-based CD models, as demonstrated by transfer experiments. This paper delineates the enormous potential of ChangeAnywhere for CD sample generation and demonstrates the subsequent enhancement of model performance. Therefore, ChangeAnywhere offers a potent tool for remote sensing CD. All codes and pre-trained models will be available at https://github.com/tangkai-RS/ChangeAnywhere.
Meply: A Large-scale Dataset and Baseline Evaluations for Metastatic Perirectal Lymph Node Detection and Segmentation
Authors: Authors: Weidong Guo, Hantao Zhang, Shouhong Wan, Bingbing Zou, Wanqin Wang, Chenyang Qiu, Jun Li, Peiquan Jin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08916
Pdf link: https://arxiv.org/pdf/2404.08916
Abstract Accurate segmentation of metastatic lymph nodes in rectal cancer is crucial for the staging and treatment of rectal cancer. However, existing segmentation approaches face challenges due to the absence of pixel-level annotated datasets tailored for lymph nodes around the rectum. Additionally, metastatic lymph nodes are characterized by their relatively small size, irregular shapes, and lower contrast compared to the background, further complicating the segmentation task. To address these challenges, we present the first large-scale perirectal metastatic lymph node CT image dataset called Meply, which encompasses pixel-level annotations of 269 patients diagnosed with rectal cancer. Furthermore, we introduce a novel lymph-node segmentation model named CoSAM. The CoSAM utilizes sequence-based detection to guide the segmentation of metastatic lymph nodes in rectal cancer, contributing to improved localization performance for the segmentation model. It comprises three key components: sequence-based detection module, segmentation module, and collaborative convergence unit. To evaluate the effectiveness of CoSAM, we systematically compare its performance with several popular segmentation methods using the Meply dataset. Our code and dataset will be publicly available at: https://github.com/kanydao/CoSAM.
Label-free Anomaly Detection in Aerial Agricultural Images with Masked Image Modeling
Authors: Authors: Sambal Shikhar, Anupam Sobti
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08931
Pdf link: https://arxiv.org/pdf/2404.08931
Abstract Detecting various types of stresses (nutritional, water, nitrogen, etc.) in agricultural fields is critical for farmers to ensure maximum productivity. However, stresses show up in different shapes and sizes across different crop types and varieties. Hence, this is posed as an anomaly detection task in agricultural images. Accurate anomaly detection in agricultural UAV images is vital for early identification of field irregularities. Traditional supervised learning faces challenges in adapting to diverse anomalies, necessitating extensive annotated data. In this work, we overcome this limitation with self-supervised learning using a masked image modeling approach. Masked Autoencoders (MAE) extract meaningful normal features from unlabeled image samples which produces high reconstruction error for the abnormal pixels during reconstruction. To remove the need of using only normal" data while training, we use an anomaly suppression loss mechanism that effectively minimizes the reconstruction of anomalous pixels and allows the model to learn anomalous areas without explicitly separatingnormal" images for training. Evaluation on the Agriculture-Vision data challenge shows a mIOU score improvement in comparison to prior state of the art in unsupervised and self-supervised methods. A single model generalizes across all the anomaly categories in the Agri-Vision Challenge Dataset
Shifting Spotlight for Co-supervision: A Simple yet Efficient Single-branch Network to See Through Camouflage
Authors: Authors: Yang Hu, Jinxia Zhang, Kaihua Zhang, Yin Yuan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.08936
Pdf link: https://arxiv.org/pdf/2404.08936
Abstract Efficient and accurate camouflaged object detection (COD) poses a challenge in the field of computer vision. Recent approaches explored the utility of edge information for network co-supervision, achieving notable advancements. However, these approaches introduce an extra branch for complex edge extraction, complicate the model architecture and increases computational demands. Addressing this issue, our work replicates the effect that animal's camouflage can be easily revealed under a shifting spotlight, and leverages it for network co-supervision to form a compact yet efficient single-branch network, the Co-Supervised Spotlight Shifting Network (CS$^3$Net). The spotlight shifting strategy allows CS$^3$Net to learn additional prior within a single-branch framework, obviating the need for resource demanding multi-branch design. To leverage the prior of spotlight shifting co-supervision, we propose Shadow Refinement Module (SRM) and Projection Aware Attention (PAA) for feature refinement and enhancement. To ensure the continuity of multi-scale features aggregation, we utilize the Extended Neighbor Connection Decoder (ENCD) for generating the final predictions. Empirical evaluations on public datasets confirm that our CS$^3$Net offers an optimal balance between efficiency and performance: it accomplishes a 32.13% reduction in Multiply-Accumulate (MACs) operations compared to leading efficient COD models, while also delivering superior performance.
Zero-Shot Code Representation Learning via Prompt Tuning
Authors: Authors: Nan Cui, Xiaodong Gu, Beijun Shen
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2404.08947
Pdf link: https://arxiv.org/pdf/2404.08947
Abstract Learning code representations has been the core prerequisite of many software engineering tasks such as code clone detection and code generation. State-of-the-art program representation techniques mainly utilize pre-trained language models (PLMs) such as CodeBERT. A Transformer encoder is firstly pre-trained on a large-scale code corpus to acquire general knowledge about source code. The pre-trained model is then fine-tuned on specific tasks using an amount of labeled data. However, gathering training samples for the downstream tasks can be prohibitively expensive and impractical for domain-specific languages or project-specific tasks. Besides, pre-training and downstream tasks are usually heterogeneous, which makes it difficult to fully explore the knowledge learned during pre-training. In this paper, we propose Zecoler, a zero-shot approach for learning code representations. Zecoler is built upon a pre-trained programming language model. In order to elicit knowledge from the PLMs efficiently, Zecoler casts the downstream tasks to the same form of pre-training objectives by inserting train-able prompts into the original input. These prompts can guide PLMs on how to generate better results. Subsequently, we employ the prompt tuning technique to search for the optimal prompts for PLMs automatically. This enables the representation model to efficiently fit the downstream tasks through fine-tuning on the dataset in source language domain and then reuse the pre-trained knowledge for the target domain in a zero-shot style. We evaluate Zecoler in five code intelligence tasks including code clone detection, code search, method name prediction, code summarization, and code generation. The results show that our approach significantly outperforms baseline models under the zero-shot setting.
Timing Advance Estimation in Low Earth Orbit Satellite Networks
Authors: Authors: Jianfeng Zhu, Yaohua Sun, Mugen Peng
Subjects: Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2404.08960
Pdf link: https://arxiv.org/pdf/2404.08960
Abstract Low earth orbit (LEO) satellite communication based on 3GPP standard is seen as a promising solution to rolling out communication services in areas without terrestrial base stations. However, due to the fast movement of satellites and large beam footprint size, the existing 5G timing advance (TA) estimation mechanism cannot be directly applied when global navigation satellite system is unavailable. In this article, an enhanced TA estimation approach is proposed for LEO satellite communication networks. Specifically, a user-side time-frequency pre-compensation method is introduced at first, which leverages frequency offset measurement on synchronization signal blocks broadcasted by satellites in initial cell search phase. For the random access phase, the upper bound of inter-preamble interference incurred by partial-period cross-correlation operations is derived for a preamble format advised by 3GPP, and it is shown that the interference level is closely related to the square of the number of such operations. Inspired by this result, a cyclic prefix free preamble format is further designed, which features extended guard time, differential power allocation and flexible preamble structure. Numerical results show that our proposal can reduce the missed detection rate of preamble within a beam. Particularly, the missed detection rates of preamble under 32, 48, and 64 users are lower than 1% when SNR = -6 dB, which is a significant improvement compared to baselines. In addition, our proposal can limit the TA estimation error of the detected users to the time length of 25 time-domain sampling points when the subcarrier spacing is 30 kHz and operation frequency is 27 GHz.
BG-YOLO: A Bidirectional-Guided Method for Underwater Object Detection
Authors: Authors: Jian Zhang, Ruiteng Zhang, Xinyue Yan, Xiting Zhuang, Ruicheng Cao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08979
Pdf link: https://arxiv.org/pdf/2404.08979
Abstract Degraded underwater images decrease the accuracy of underwater object detection. However, existing methods for underwater image enhancement mainly focus on improving the indicators in visual aspects, which may not benefit the tasks of underwater image detection, and may lead to serious degradation in performance. To alleviate this problem, we proposed a bidirectional-guided method for underwater object detection, referred to as BG-YOLO. In the proposed method, network is organized by constructing an enhancement branch and a detection branch in a parallel way. The enhancement branch consists of a cascade of an image enhancement subnet and an object detection subnet. And the detection branch only consists of a detection subnet. A feature guided module connects the shallow convolution layer of the two branches. When training the enhancement branch, the object detection subnet in the enhancement branch guides the image enhancement subnet to be optimized towards the direction that is most conducive to the detection task. The shallow feature map of the trained enhancement branch will be output to the feature guided module, constraining the optimization of detection branch through consistency loss and prompting detection branch to learn more detailed information of the objects. And hence the detection performance will be refined. During the detection tasks, only detection branch will be reserved so that no additional cost of computation will be introduced. Extensive experiments demonstrate that the proposed method shows significant improvement in performance of the detector in severely degraded underwater scenes while maintaining a remarkable detection speed.
A Fourier-enhanced multi-modal 3D small object optical mark recognition and positioning method for percutaneous abdominal puncture surgical navigation
Authors: Authors: Zezhao Guo (1), Yanzhong Guo (2), Zhanfang Zhao (1) ((1) College of information and Engineering, Hebei GEO University, (2) Beijing Yingrui Pioneer Medical Technology Co., Ltd)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2404.08990
Pdf link: https://arxiv.org/pdf/2404.08990
Abstract Navigation for thoracoabdominal puncture surgery is used to locate the needle entry point on the patient's body surface. The traditional reflective ball navigation method is difficult to position the needle entry point on the soft, irregular, smooth chest and abdomen. Due to the lack of clear characteristic points on the body surface using structured light technology, it is difficult to identify and locate arbitrary needle insertion points. Based on the high stability and high accuracy requirements of surgical navigation, this paper proposed a novel method, a muti-modal 3D small object medical marker detection method, which identifies the center of a small single ring as the needle insertion point. Moreover, this novel method leverages Fourier transform enhancement technology to augment the dataset, enrich image details, and enhance the network's capability. The method extracts the Region of Interest (ROI) of the feature image from both enhanced and original images, followed by generating a mask map. Subsequently, the point cloud of the ROI from the depth map is obtained through the registration of ROI point cloud contour fitting. In addition, this method employs Tukey loss for optimal precision. The experimental results show this novel method proposed in this paper not only achieves high-precision and high-stability positioning, but also enables the positioning of any needle insertion point.
Towards Efficient Device Identification in Massive Random Access: A Multi-stage Approach
Authors: Authors: Jyotish Robin, Elza Erkip
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2404.09062
Pdf link: https://arxiv.org/pdf/2404.09062
Abstract Efficient and low-latency wireless connectivity between the base station (BS) and a sparse set of sporadically active devices from a massive number of devices is crucial for random access in emerging massive machine-type communications (mMTC). This paper addresses the challenge of identifying active devices while meeting stringent access delay and reliability constraints in mMTC environments. A novel multi-stage active device identification framework is proposed where we aim to refine a partial estimate of the active device set using feedback and hypothesis testing across multiple stages eventually leading to an exact recovery of active devices after the final stage of processing. In our proposed approach, active devices independently transmit binary preambles during each stage, leveraging feedback signals from the BS, whereas the BS employs a non-coherent binary energy detection. The minimum user identification cost associated with our multi-stage non-coherent active device identification framework with feedback, in terms of the required number of channel-uses, is quantified using information-theoretic techniques in the asymptotic regime of total number of devices $\ell$ when the number of active devices $k$ scales as $k = {\Theta}(1)$. Practical implementations of our multi-stage active device identification schemes, leveraging Belief Propagation (BP) techniques, are also presented and evaluated. Simulation results show that our multi-stage BP strategies exhibit superior performance over single-stage strategies, even when considering overhead costs associated with feedback and hypothesis testing.
Fusion-Mamba for Cross-modality Object Detection
Authors: Authors: Wenhao Dong, Haodong Zhu, Shaohui Lin, Xiaoyan Luo, Yunhang Shen, Xuhui Liu, Juan Zhang, Guodong Guo, Baochang Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.09146
Pdf link: https://arxiv.org/pdf/2404.09146
Abstract Cross-modality fusing complementary information from different modalities effectively improves object detection performance, making it more useful and robust for a wider range of applications. Existing fusion strategies combine different types of images or merge different backbone features through elaborated neural network modules. However, these methods neglect that modality disparities affect cross-modality fusion performance, as different modalities with different camera focal lengths, placements, and angles are hardly fused. In this paper, we investigate cross-modality fusion by associating cross-modal features in a hidden state space based on an improved Mamba with a gating mechanism. We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction, thereby reducing disparities between cross-modal features and enhancing the representation consistency of fused features. FMB contains two modules: the State Space Channel Swapping (SSCS) module facilitates shallow feature fusion, and the Dual State Space Fusion (DSSF) enables deep fusion in a hidden state space. Through extensive experiments on public datasets, our proposed approach outperforms the state-of-the-art methods on $m$AP with 5.9% on $M^3FD$ and 4.9% on FLIR-Aligned datasets, demonstrating superior object detection performance. To the best of our knowledge, this is the first work to explore the potential of Mamba for cross-modal fusion and establish a new baseline for cross-modality object detection.
Coreset Selection for Object Detection
Authors: Authors: Hojun Lee, Suyoung Kim, Junhoo Lee, Jaeyoung Yoo, Nojun Kwak
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.09161
Pdf link: https://arxiv.org/pdf/2404.09161
Abstract Coreset selection is a method for selecting a small, representative subset of an entire dataset. It has been primarily researched in image classification, assuming there is only one object per image. However, coreset selection for object detection is more challenging as an image can contain multiple objects. As a result, much research has yet to be done on this topic. Therefore, we introduce a new approach, Coreset Selection for Object Detection (CSOD). CSOD generates imagewise and classwise representative feature vectors for multiple objects of the same class within each image. Subsequently, we adopt submodular optimization for considering both representativeness and diversity and utilize the representative vectors in the submodular optimization process to select a subset. When we evaluated CSOD on the Pascal VOC dataset, CSOD outperformed random selection by +6.4%p in AP$_{50}$ when selecting 200 images.
HANet: A Hierarchical Attention Network for Change Detection With Bitemporal Very-High-Resolution Remote Sensing Images
Authors: Authors: Chengxi Han, Chen Wu, Haonan Guo, Meiqi Hu, Hongruixuan Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.09178
Pdf link: https://arxiv.org/pdf/2404.09178
Abstract Benefiting from the developments in deep learning technology, deep-learning-based algorithms employing automatic feature extraction have achieved remarkable performance on the change detection (CD) task. However, the performance of existing deep-learning-based CD methods is hindered by the imbalance between changed and unchanged pixels. To tackle this problem, a progressive foreground-balanced sampling strategy on the basis of not adding change information is proposed in this article to help the model accurately learn the features of the changed pixels during the early training process and thereby improve detection performance.Furthermore, we design a discriminative Siamese network, hierarchical attention network (HANet), which can integrate multiscale features and refine detailed features. The main part of HANet is the HAN module, which is a lightweight and effective self-attention mechanism. Extensive experiments and ablation studies on two CDdatasets with extremely unbalanced labels validate the effectiveness and efficiency of the proposed method.
Change Guiding Network: Incorporating Change Prior to Guide Change Detection in Remote Sensing Imagery
Authors: Authors: Chengxi Han, Chen Wu, Haonan Guo, Meiqi Hu, Jiepan Li, Hongruixuan Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2404.09179
Pdf link: https://arxiv.org/pdf/2404.09179
Abstract The rapid advancement of automated artificial intelligence algorithms and remote sensing instruments has benefited change detection (CD) tasks. However, there is still a lot of space to study for precise detection, especially the edge integrity and internal holes phenomenon of change features. In order to solve these problems, we design the Change Guiding Network (CGNet), to tackle the insufficient expression problem of change features in the conventional U-Net structure adopted in previous methods, which causes inaccurate edge detection and internal holes. Change maps from deep features with rich semantic information are generated and used as prior information to guide multi-scale feature fusion, which can improve the expression ability of change features. Meanwhile, we propose a self-attention module named Change Guide Module (CGM), which can effectively capture the long-distance dependency among pixels and effectively overcome the problem of the insufficient receptive field of traditional convolutional neural networks. On four major CD datasets, we verify the usefulness and efficiency of the CGNet, and a large number of experiments and ablation studies demonstrate the effectiveness of CGNet. We're going to open-source our code at https://github.com/ChengxiHAN/CGNet-CD.
FaceCat: Enhancing Face Recognition Security with a Unified Generative Model Framework
Authors: Authors: Jiawei Chen, Xiao Yang, Yinpeng Dong, Hang Su, Jianteng Peng, Zhaoxia Yin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.09193
Pdf link: https://arxiv.org/pdf/2404.09193
Abstract Face anti-spoofing (FAS) and adversarial detection (FAD) have been regarded as critical technologies to ensure the safety of face recognition systems. As a consequence of their limited practicality and generalization, some existing methods aim to devise a framework capable of concurrently detecting both threats to address the challenge. Nevertheless, these methods still encounter challenges of insufficient generalization and suboptimal robustness, potentially owing to the inherent drawback of discriminative models. Motivated by the rich structural and detailed features of face generative models, we propose FaceCat which utilizes the face generative model as a pre-trained model to improve the performance of FAS and FAD. Specifically, FaceCat elaborately designs a hierarchical fusion mechanism to capture rich face semantic features of the generative model. These features then serve as a robust foundation for a lightweight head, designed to execute FAS and FAD tasks simultaneously. As relying solely on single-modality data often leads to suboptimal performance, we further propose a novel text-guided multi-modal alignment strategy that utilizes text prompts to enrich feature representation, thereby enhancing performance. For fair evaluations, we build a comprehensive protocol with a wide range of 28 attack types to benchmark the performance. Extensive experiments validate the effectiveness of FaceCat generalizes significantly better and obtains excellent robustness against input transformations.
Joint Near Field Uplink Communication and Localization Using Message Passing-Based Sparse Bayesian Learning
Authors: Authors: Fei Liu, Zhengdao Yuan, Qinghua Guo, Yuanyuan Zhang, Zhongyong Wang, J. Andrew Zhang
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2404.09201
Pdf link: https://arxiv.org/pdf/2404.09201
Abstract This work deals with the problem of uplink communication and localization in an integrated sensing and communication system, where users are in the near field (NF) of antenna aperture due to the use of high carrier frequency and large antenna arrays at base stations. We formulate joint NF signal detection and localization as a problem of recovering signals with a sparse pattern. To solve the problem, we develop a message passing based sparse Bayesian learning (SBL) algorithm, where multiple unitary approximate message passing (UAMP)-based sparse signal estimators work jointly to recover the sparse signals with low complexity. Simulation results demonstrate the effectiveness of the proposed method.
DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
Authors: Authors: Lewei Yao, Renjie Pi, Jianhua Han, Xiaodan Liang, Hang Xu, Wei Zhang, Zhenguo Li, Dan Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.09216
Pdf link: https://arxiv.org/pdf/2404.09216
Abstract Existing open-vocabulary object detectors typically require a predefined set of categories from users, significantly confining their application scenarios. In this paper, we introduce DetCLIPv3, a high-performing detector that excels not only at both open-vocabulary object detection, but also generating hierarchical labels for detected objects. DetCLIPv3 is characterized by three core designs: 1. Versatile model architecture: we derive a robust open-set detection framework which is further empowered with generation ability via the integration of a caption head. 2. High information density data: we develop an auto-annotation pipeline leveraging visual large language model to refine captions for large-scale image-text pairs, providing rich, multi-granular object labels to enhance the training. 3. Efficient training strategy: we employ a pre-training stage with low-resolution inputs that enables the object captioner to efficiently learn a broad spectrum of visual concepts from extensive image-text paired data. This is followed by a fine-tuning stage that leverages a small number of high-resolution samples to further enhance detection performance. With these effective designs, DetCLIPv3 demonstrates superior open-vocabulary detection performance, \eg, our Swin-T backbone model achieves a notable 47.0 zero-shot fixed AP on the LVIS minival benchmark, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively. DetCLIPv3 also achieves a state-of-the-art 19.7 AP in dense captioning task on VG dataset, showcasing its strong generative capability.
Fault Detection in Mobile Networks Using Diffusion Models
Authors: Authors: Mohamad Nabeel, Doumitrou Daniil Nimara, Tahar Zanouda
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2404.09240
Pdf link: https://arxiv.org/pdf/2404.09240
Abstract In today's hyper-connected world, ensuring the reliability of telecom networks becomes increasingly crucial. Telecom networks encompass numerous underlying and intertwined software and hardware components, each providing different functionalities. To ensure the stability of telecom networks, telecom software, and hardware vendors developed several methods to detect any aberrant behavior in telecom networks and enable instant feedback and alerts. These approaches, although powerful, struggle to generalize due to the unsteady nature of the software-intensive embedded system and the complexity and diversity of multi-standard mobile networks. In this paper, we present a system to detect anomalies in telecom networks using a generative AI model. We evaluate several strategies using diffusion models to train the model for anomaly detection using multivariate time-series data. The contributions of this paper are threefold: (i) A proposal of a framework for utilizing diffusion models for time-series anomaly detection in telecom networks, (ii) A proposal of a particular Diffusion model architecture that outperforms other state-of-the-art techniques, (iii) Experiments on a real-world dataset to demonstrate that our model effectively provides explainable results, exposing some of its limitations and suggesting future research avenues to enhance its capabilities further.
An Empirical Evaluation of Manually Created Equivalent Mutants
Authors: Authors: Philipp Straubinger, Alexander Degenhart, Gordon Fraser
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2404.09241
Pdf link: https://arxiv.org/pdf/2404.09241
Abstract Mutation testing consists of evaluating how effective test suites are at detecting artificially seeded defects in the source code, and guiding the improvement of the test suites. Although mutation testing tools are increasingly adopted in practice, equivalent mutants, i.e., mutants that differ only in syntax but not semantics, hamper this process. While prior research investigated how frequently equivalent mutants are produced by mutation testing tools and how effective existing methods of detecting these equivalent mutants are, it remains unclear to what degree humans also create equivalent mutants, and how well they perform at identifying these. We therefore study these questions in the context of Code Defenders, a mutation testing game, in which players competitively produce mutants and tests. Using manual inspection as well as automated identification methods we establish that less than 10 % of manually created mutants are equivalent. Surprisingly, our findings indicate that a significant portion of developers struggle to accurately identify equivalent mutants, emphasizing the need for improved detection mechanisms and developer training in mutation testing.
TEXT2TASTE: A Versatile Egocentric Vision System for Intelligent Reading Assistance Using Large Language Model
Authors: Authors: Wiktor Mucha, Florin Cuconasu, Naome A. Etori, Valia Kalokyri, Giovanni Trappolini
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.09254
Pdf link: https://arxiv.org/pdf/2404.09254
Abstract The ability to read, understand and find important information from written text is a critical skill in our daily lives for our independence, comfort and safety. However, a significant part of our society is affected by partial vision impairment, which leads to discomfort and dependency in daily activities. To address the limitations of this part of society, we propose an intelligent reading assistant based on smart glasses with embedded RGB cameras and a Large Language Model (LLM), whose functionality goes beyond corrective lenses. The video recorded from the egocentric perspective of a person wearing the glasses is processed to localise text information using object detection and optical character recognition methods. The LLM processes the data and allows the user to interact with the text and responds to a given query, thus extending the functionality of corrective lenses with the ability to find and summarize knowledge from the text. To evaluate our method, we create a chat-based application that allows the user to interact with the system. The evaluation is conducted in a real-world setting, such as reading menus in a restaurant, and involves four participants. The results show robust accuracy in text retrieval. The system not only provides accurate meal suggestions but also achieves high user satisfaction, highlighting the potential of smart glasses and LLMs in assisting people with special needs.
Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection
Authors: Authors: Jin Yang, Ping Wei, Huan Li, Ziyang Ren
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.09263
Pdf link: https://arxiv.org/pdf/2404.09263
Abstract Video moment retrieval and highlight detection are two highly valuable tasks in video understanding, but until recently they have been jointly studied. Although existing studies have made impressive advancement recently, they predominantly follow the data-driven bottom-up paradigm. Such paradigm overlooks task-specific and inter-task effects, resulting in poor model performance. In this paper, we propose a novel task-driven top-down framework TaskWeave for joint moment retrieval and highlight detection. The framework introduces a task-decoupled unit to capture task-specific and common representations. To investigate the interplay between the two tasks, we propose an inter-task feedback mechanism, which transforms the results of one task as guiding masks to assist the other task. Different from existing methods, we present a task-dependent joint loss function to optimize the model. Comprehensive experiments and in-depth ablation studies on QVHighlights, TVSum, and Charades-STA datasets corroborate the effectiveness and flexibility of the proposed framework. Codes are available at https://github.com/EdenGabriel/TaskWeave.
Reap the Wild Wind: Detecting Media Storms in Large-Scale News Corpora
Authors: Authors: Dror K. Markus, Effi Levi, Tamir Sheafer, Shaul R. Shenhav
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.09299
Pdf link: https://arxiv.org/pdf/2404.09299
Abstract Media Storms, dramatic outbursts of attention to a story, are central components of media dynamics and the attention landscape. Despite their significance, there has been little systematic and empirical research on this concept due to issues of measurement and operationalization. We introduce an iterative human-in-the-loop method to identify media storms in a large-scale corpus of news articles. The text is first transformed into signals of dispersion based on several textual characteristics. In each iteration, we apply unsupervised anomaly detection to these signals; each anomaly is then validated by an expert to confirm the presence of a storm, and those results are then used to tune the anomaly detection in the next iteration. We demonstrate the applicability of this method in two scenarios: first, supplementing an initial list of media storms within a specific time frame; and second, detecting media storms in new time periods. We make available a media storm dataset compiled using both scenarios. Both the method and dataset offer the basis for comprehensive empirical research into the concept of media storms, including characterizing them and predicting their outbursts and durations, in mainstream media or social media platforms.
Counteracting Concept Drift by Learning with Future Malware Predictions
Authors: Authors: Branislav Bosansky, Lada Hospodkova, Michal Najman, Maria Rigaki, Elnaz Babayeva, Viliam Lisy
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.09352
Pdf link: https://arxiv.org/pdf/2404.09352
Abstract The accuracy of deployed malware-detection classifiers degrades over time due to changes in data distributions and increasing discrepancies between training and testing data. This phenomenon is known as the concept drift. While the concept drift can be caused by various reasons in general, new malicious files are created by malware authors with a clear intention of avoiding detection. The existence of the intention opens a possibility for predicting such future samples. Including predicted samples in training data should consequently increase the accuracy of the classifiers on new testing data. We compare two methods for predicting future samples: (1) adversarial training and (2) generative adversarial networks (GANs). The first method explicitly seeks for adversarial examples against the classifier that are then used as a part of training data. Similarly, GANs also generate synthetic training data. We use GANs to learn changes in data distributions within different time periods of training data and then apply these changes to generate samples that could be in testing data. We compare these prediction methods on two different datasets: (1) Ember public dataset and (2) the internal dataset of files incoming to Avast. We show that while adversarial training yields more robust classifiers, this method is not a good predictor of future malware in general. This is in contrast with previously reported positive results in different domains (including natural language processing and spam detection). On the other hand, we show that GANs can be successfully used as predictors of future malware. We specifically examine malware families that exhibit significant changes in their data distributions over time and the experimental results confirm that GAN-based predictions can significantly improve the accuracy of the classifier on new, previously unseen data.
Exploring Feedback Generation in Automated Skeletal Movement Assessment: A Comprehensive Overview
Authors: Authors: Tal Hakim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.09359
Pdf link: https://arxiv.org/pdf/2404.09359
Abstract The application of machine-learning solutions to movement assessment from skeleton videos has attracted significant research attention in recent years. This advancement has made rehabilitation at home more accessible, utilizing movement assessment algorithms that can operate on affordable equipment for human pose detection from 2D or 3D videos. While the primary objective of automatic assessment tasks is to score movements, the automatic generation of feedback highlighting key movement issues has the potential to significantly enhance and accelerate the rehabilitation process. In this study, we explain the types of feedback that can be generated, review existing solutions for automatic feedback generation, and discuss future research directions. To our knowledge, this is the first comprehensive review of feedback generation in skeletal movement assessment.
Tasks People Prompt: A Taxonomy of LLM Downstream Tasks in Software Verification and Falsification Approaches
Authors: Authors: Víctor A. Braberman, Flavia Bonomo-Braberman, Yiannis Charalambous, Juan G. Colonna, Lucas C. Cordeiro, Rosiane de Freitas
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.09384
Pdf link: https://arxiv.org/pdf/2404.09384
Abstract Prompting has become one of the main approaches to leverage emergent capabilities of Large Language Models [Brown et al. NeurIPS 2020, Wei et al. TMLR 2022, Wei et al. NeurIPS 2022]. During the last year, researchers and practitioners have been playing with prompts to see how to make the most of LLMs. By homogeneously dissecting 80 papers, we investigate in deep how software testing and verification research communities have been abstractly architecting their LLM-enabled solutions. More precisely, first, we want to validate whether downstream tasks are an adequate concept to convey the blueprint of prompt-based solutions. We also aim at identifying number and nature of such tasks in solutions. For such goal, we develop a novel downstream task taxonomy that enables pinpointing some engineering patterns in a rather varied spectrum of Software Engineering problems that encompasses testing, fuzzing, debugging, vulnerability detection, static analysis and program verification approaches.
Identification of cardiovascular diseases through ECG classification using wavelet transformation
Authors: Authors: Morteza Maleki
Subjects: Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2404.09393
Pdf link: https://arxiv.org/pdf/2404.09393
Abstract Cardiovascular diseases are the leading cause of mortality globally, necessitating advancements in diagnostic techniques. This study explores the application of wavelet transformation for classifying electrocardiogram (ECG) signals to identify various cardiovascular conditions. Utilizing the MIT-BIH Arrhythmia Database, we employed both continuous and discrete wavelet transforms to decompose ECG signals into frequency sub-bands, from which we extracted eight statistical features per band. These features were then used to train and test various classifiers, including K-Nearest Neighbors and Support Vector Machines, among others. The classifiers demonstrated high efficacy, with some achieving an accuracy of up to 96% on test data, suggesting that wavelet-based feature extraction significantly enhances the prediction of cardiovascular abnormalities in ECG data. The findings advocate for further exploration of wavelet transforms in medical diagnostics to improve automation and accuracy in disease detection. Future work will focus on optimizing feature selection and classifier parameters to refine predictive performance further.
VFMM3D: Releasing the Potential of Image by Vision Foundation Model for Monocular 3D Object Detection
Authors: Authors: Bonan Ding, Jin Xie, Jing Nie, Jiale Cao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.09431
Pdf link: https://arxiv.org/pdf/2404.09431
Abstract Due to its cost-effectiveness and widespread availability, monocular 3D object detection, which relies solely on a single camera during inference, holds significant importance across various applications, including autonomous driving and robotics. Nevertheless, directly predicting the coordinates of objects in 3D space from monocular images poses challenges. Therefore, an effective solution involves transforming monocular images into LiDAR-like representations and employing a LiDAR-based 3D object detector to predict the 3D coordinates of objects. The key step in this method is accurately converting the monocular image into a reliable point cloud form. In this paper, we present VFMM3D, an innovative approach that leverages the capabilities of Vision Foundation Models (VFMs) to accurately transform single-view images into LiDAR point cloud representations. VFMM3D utilizes the Segment Anything Model (SAM) and Depth Anything Model (DAM) to generate high-quality pseudo-LiDAR data enriched with rich foreground information. Specifically, the Depth Anything Model (DAM) is employed to generate dense depth maps. Subsequently, the Segment Anything Model (SAM) is utilized to differentiate foreground and background regions by predicting instance masks. These predicted instance masks and depth maps are then combined and projected into 3D space to generate pseudo-LiDAR points. Finally, any object detectors based on point clouds can be utilized to predict the 3D coordinates of objects. Comprehensive experiments are conducted on the challenging 3D object detection dataset KITTI. Our VFMM3D establishes a new state-of-the-art performance. Additionally, experimental results demonstrate the generality of VFMM3D, showcasing its seamless integration into various LiDAR-based 3D object detectors.
The 8th AI City Challenge
Authors: Authors: Shuo Wang, David C. Anastasiu, Zheng Tang, Ming-Ching Chang, Yue Yao, Liang Zheng, Mohammed Shaiqur Rahman, Meenakshi S. Arya, Anuj Sharma, Pranamesh Chakraborty, Sanjita Prajapati, Quan Kong, Norimasa Kobori, Munkhjargal Gochoo, Munkh-Erdene Otgonbold, Fady Alnajjar, Ganzorig Batnasan, Ping-Yang Chen, Jun-Wei Hsieh, Xunlei Wu, Sameer Satish Pusegaonkar, Yizhou Wang, Sujit Biswas, Rama Chellappa
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.09432
Pdf link: https://arxiv.org/pdf/2404.09432
Abstract The eighth AI City Challenge highlighted the convergence of computer vision and artificial intelligence in areas like retail, warehouse settings, and Intelligent Traffic Systems (ITS), presenting significant research opportunities. The 2024 edition featured five tracks, attracting unprecedented interest from 726 teams in 47 countries and regions. Track 1 dealt with multi-target multi-camera (MTMC) people tracking, highlighting significant enhancements in camera count, character number, 3D annotation, and camera matrices, alongside new rules for 3D tracking and online tracking algorithm encouragement. Track 2 introduced dense video captioning for traffic safety, focusing on pedestrian accidents using multi-camera feeds to improve insights for insurance and prevention. Track 3 required teams to classify driver actions in a naturalistic driving analysis. Track 4 explored fish-eye camera analytics using the FishEye8K dataset. Track 5 focused on motorcycle helmet rule violation detection. The challenge utilized two leaderboards to showcase methods, with participants setting new benchmarks, some surpassing existing state-of-the-art achievements.
SpamDam: Towards Privacy-Preserving and Adversary-Resistant SMS Spam Detection
Authors: Authors: Yekai Li, Rufan Zhang, Wenxin Rong, Xianghang Mi
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.09481
Pdf link: https://arxiv.org/pdf/2404.09481
Abstract In this study, we introduce SpamDam, a SMS spam detection framework designed to overcome key challenges in detecting and understanding SMS spam, such as the lack of public SMS spam datasets, increasing privacy concerns of collecting SMS data, and the need for adversary-resistant detection models. SpamDam comprises four innovative modules: an SMS spam radar that identifies spam messages from online social networks(OSNs); an SMS spam inspector for statistical analysis; SMS spam detectors(SSDs) that enable both central training and federated learning; and an SSD analyzer that evaluates model resistance against adversaries in realistic scenarios. Leveraging SpamDam, we have compiled over 76K SMS spam messages from Twitter and Weibo between 2018 and 2023, forming the largest dataset of its kind. This dataset has enabled new insights into recent spam campaigns and the training of high-performing binary and multi-label classifiers for spam detection. Furthermore, effectiveness of federated learning has been well demonstrated to enable privacy-preserving SMS spam detection. Additionally, we have rigorously tested the adversarial robustness of SMS spam detection models, introducing the novel reverse backdoor attack, which has shown effectiveness and stealthiness in practical tests.
Dynamic fault detection and diagnosis of industrial alkaline water electrolyzer process with variational Bayesian dictionary learning
Authors: Authors: Qi Zhang, Lei Xie, Weihua Xu, Hongye Su
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.09524
Pdf link: https://arxiv.org/pdf/2404.09524
Abstract Alkaline Water Electrolysis (AWE) is one of the simplest green hydrogen production method using renewable energy. AWE system typically yields process variables that are serially correlated and contaminated by measurement uncertainty. A novel robust dynamic variational Bayesian dictionary learning (RDVDL) monitoring approach is proposed to improve the reliability and safety of AWE operation. RDVDL employs a sparse Bayesian dictionary learning to preserve the dynamic mechanism information of AWE process which allows the easy interpretation of fault detection results. To improve the robustness to measurement uncertainty, a low-rank vector autoregressive (VAR) method is derived to reliably extract the serial correlation from process variables. The effectiveness of the proposed approach is demonstrated with an industrial hydrogen production process, and RDVDL can efficiently detect and diagnose critical AWE faults.
RanLayNet: A Dataset for Document Layout Detection used for Domain Adaptation and Generalization
Authors: Authors: Avinash Anand, Raj Jaiswal, Mohit Gupta, Siddhesh S Bangar, Pijush Bhuyan, Naman Lal, Rajeev Singh, Ritika Jha, Rajiv Ratn Shah, Shin'ichi Satoh
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.09530
Pdf link: https://arxiv.org/pdf/2404.09530
Abstract Large ground-truth datasets and recent advances in deep learning techniques have been useful for layout detection. However, because of the restricted layout diversity of these datasets, training on them requires a sizable number of annotated instances, which is both expensive and time-consuming. As a result, differences between the source and target domains may significantly impact how well these models function. To solve this problem, domain adaptation approaches have been developed that use a small quantity of labeled data to adjust the model to the target domain. In this research, we introduced a synthetic document dataset called RanLayNet, enriched with automatically assigned labels denoting spatial positions, ranges, and types of layout elements. The primary aim of this endeavor is to develop a versatile dataset capable of training models with robustness and adaptability to diverse document formats. Through empirical experimentation, we demonstrate that a deep layout identification model trained on our dataset exhibits enhanced performance compared to a model trained solely on actual documents. Moreover, we conduct a comparative analysis by fine-tuning inference models using both PubLayNet and IIIT-AR-13K datasets on the Doclaynet dataset. Our findings emphasize that models enriched with our dataset are optimal for tasks such as achieving 0.398 and 0.588 mAP95 score in the scientific document domain for the TABLE class.
Machine Learning Techniques for Python Source Code Vulnerability Detection
Authors: Authors: Talaya Farasat, Joachim Posegga
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.09537
Pdf link: https://arxiv.org/pdf/2404.09537
Abstract Software vulnerabilities are a fundamental reason for the prevalence of cyber attacks and their identification is a crucial yet challenging problem in cyber security. In this paper, we apply and compare different machine learning algorithms for source code vulnerability detection specifically for Python programming language. Our experimental evaluation demonstrates that our Bidirectional Long Short-Term Memory (BiLSTM) model achieves a remarkable performance (average Accuracy = 98.6%, average F-Score = 94.7%, average Precision = 96.2%, average Recall = 93.3%, average ROC = 99.3%), thereby, establishing a new benchmark for vulnerability detection in Python source code.
Characterization and Mitigation of Insufficiencies in Automated Driving Systems
Authors: Authors: Yuting Fu, Jochen Seemann, Caspar Hanselaar, Tim Beurskens, Andrei Terechko, Emilia Silvas, Maurice Heemels
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2404.09557
Pdf link: https://arxiv.org/pdf/2404.09557
Abstract Automated Driving (AD) systems have the potential to increase safety, comfort and energy efficiency. Recently, major automotive companies have started testing and validating AD systems (ADS) on public roads. Nevertheless, the commercial deployment and wide adoption of ADS have been moderate, partially due to system functional insufficiencies (FI) that undermine passenger safety and lead to hazardous situations on the road. FIs are defined in ISO 21448 Safety Of The Intended Functionality (SOTIF). FIs are insufficiencies in sensors, actuators and algorithm implementations, including neural networks and probabilistic calculations. Examples of FIs in ADS include inaccurate ego-vehicle localization on the road, incorrect prediction of a cyclist maneuver, unreliable detection of a pedestrian, etc. The main goal of our study is to formulate a generic architectural design pattern, which is compatible with existing methods and ADS, to improve FI mitigation and enable faster commercial deployment of ADS. First, we studied the 2021 autonomous vehicles disengagement reports published by the California Department of Motor Vehicles (DMV). The data clearly show that disengagements are five times more often caused by FIs rather than by system faults. We then made a comprehensive list of insufficiencies and their characteristics by analyzing over 10 hours of publicly available road test videos. In particular, we identified insufficiency types in four major categories: world model, motion plan, traffic rule, and operational design domain. The insufficiency characterization helps making the SOTIF analyses of triggering conditions more systematic and comprehensive. Based on our FI characterization, simulation experiments and literature survey, we define a novel generic architectural design pattern Daruma to dynamically select the channel that is least likely to have a FI at the moment.
Joint Contrastive Learning with Feature Alignment for Cross-Corpus EEG-based Emotion Recognition
Authors: Authors: Qile Liu, Zhihao Zhou, Jiyuan Wang, Zhen Liang
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.09559
Pdf link: https://arxiv.org/pdf/2404.09559
Abstract The integration of human emotions into multimedia applications shows great potential for enriching user experiences and enhancing engagement across various digital platforms. Unlike traditional methods such as questionnaires, facial expressions, and voice analysis, brain signals offer a more direct and objective understanding of emotional states. However, in the field of electroencephalography (EEG)-based emotion recognition, previous studies have primarily concentrated on training and testing EEG models within a single dataset, overlooking the variability across different datasets. This oversight leads to significant performance degradation when applying EEG models to cross-corpus scenarios. In this study, we propose a novel Joint Contrastive learning framework with Feature Alignment (JCFA) to address cross-corpus EEG-based emotion recognition. The JCFA model operates in two main stages. In the pre-training stage, a joint domain contrastive learning strategy is introduced to characterize generalizable time-frequency representations of EEG signals, without the use of labeled data. It extracts robust time-based and frequency-based embeddings for each EEG sample, and then aligns them within a shared latent time-frequency space. In the fine-tuning stage, JCFA is refined in conjunction with downstream tasks, where the structural connections among brain electrodes are considered. The model capability could be further enhanced for the application in emotion detection and interpretation. Extensive experimental results on two well-recognized emotional datasets show that the proposed JCFA model achieves state-of-the-art (SOTA) performance, outperforming the second-best method by an average accuracy increase of 4.09% in cross-corpus EEG-based emotion recognition tasks.
Reliability Estimation of News Media Sources: Birds of a Feather Flock Together
Authors: Authors: Sergio Burdisso, Dairazalia Sánchez-Cortés, Esaú Villatoro-Tello, Petr Motlicek
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.09565
Pdf link: https://arxiv.org/pdf/2404.09565
Abstract Evaluating the reliability of news sources is a routine task for journalists and organizations committed to acquiring and disseminating accurate information. Recent research has shown that predicting sources' reliability represents an important first-prior step in addressing additional challenges such as fake news detection and fact-checking. In this paper, we introduce a novel approach for source reliability estimation that leverages reinforcement learning strategies for estimating the reliability degree of news sources. Contrary to previous research, our proposed approach models the problem as the estimation of a reliability degree, and not a reliability label, based on how all the news media sources interact with each other on the Web. We validated the effectiveness of our method on a news media reliability dataset that is an order of magnitude larger than comparable existing datasets. Results show that the estimated reliability degrees strongly correlates with journalists-provided scores (Spearman=0.80) and can effectively predict reliability labels (macro-avg. F$_1$ score=81.05). We release our implementation and dataset, aiming to provide a valuable resource for the NLP community working on information verification.
Enhancing Code Vulnerability Detection via Vulnerability-Preserving Data Augmentation
Authors: Authors: Shangqing Liu, Wei Ma, Jian Wang, Xiaofei Xie, Ruitao Feng, Yang Liu
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2404.09599
Pdf link: https://arxiv.org/pdf/2404.09599
Abstract Source code vulnerability detection aims to identify inherent vulnerabilities to safeguard software systems from potential attacks. Many prior studies overlook diverse vulnerability characteristics, simplifying the problem into a binary (0-1) classification task for example determining whether it is vulnerable or not. This poses a challenge for a single deep learning-based model to effectively learn the wide array of vulnerability characteristics. Furthermore, due to the challenges associated with collecting large-scale vulnerability data, these detectors often overfit limited training datasets, resulting in lower model generalization performance. To address the aforementioned challenges, in this work, we introduce a fine-grained vulnerability detector namely FGVulDet. Unlike previous approaches, FGVulDet employs multiple classifiers to discern characteristics of various vulnerability types and combines their outputs to identify the specific type of vulnerability. Each classifier is designed to learn type-specific vulnerability semantics. Additionally, to address the scarcity of data for some vulnerability types and enhance data diversity for learning better vulnerability semantics, we propose a novel vulnerability-preserving data augmentation technique to augment the number of vulnerabilities. Taking inspiration from recent advancements in graph neural networks for learning program semantics, we incorporate a Gated Graph Neural Network (GGNN) and extend it to an edge-aware GGNN to capture edge-type information. FGVulDet is trained on a large-scale dataset from GitHub, encompassing five different types of vulnerabilities. Extensive experiments compared with static-analysis-based approaches and learning-based approaches have demonstrated the effectiveness of FGVulDet.
Privacy-Preserving Intrusion Detection using Convolutional Neural Networks
Authors: Authors: Martin Kodys, Zhongmin Dai, Vrizlynn L. L. Thing
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.09625
Pdf link: https://arxiv.org/pdf/2404.09625
Abstract Privacy-preserving analytics is designed to protect valuable assets. A common service provision involves the input data from the client and the model on the analyst's side. The importance of the privacy preservation is fuelled by legal obligations and intellectual property concerns. We explore the use case of a model owner providing an analytic service on customer's private data. No information about the data shall be revealed to the analyst and no information about the model shall be leaked to the customer. Current methods involve costs: accuracy deterioration and computational complexity. The complexity, in turn, results in a longer processing time, increased requirement on computing resources, and involves data communication between the client and the server. In order to deploy such service architecture, we need to evaluate the optimal setting that fits the constraints. And that is what this paper addresses. In this work, we enhance an attack detection system based on Convolutional Neural Networks with privacy-preserving technology based on PriMIA framework that is initially designed for medical data.
Do LLMs Understand Visual Anomalies? Uncovering LLM Capabilities in Zero-shot Anomaly Detection
Authors: Authors: Jiaqi Zhu, Shaofeng Cai, Fang Deng, Junran Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2404.09654
Pdf link: https://arxiv.org/pdf/2404.09654
Abstract Large vision-language models (LVLMs) are markedly proficient in deriving visual representations guided by natural language. Recent explorations have utilized LVLMs to tackle zero-shot visual anomaly detection (VAD) challenges by pairing images with textual descriptions indicative of normal and abnormal conditions, referred to as anomaly prompts. However, existing approaches depend on static anomaly prompts that are prone to cross-semantic ambiguity, and prioritize global image-level representations over crucial local pixel-level image-to-text alignment that is necessary for accurate anomaly localization. In this paper, we present ALFA, a training-free approach designed to address these challenges via a unified model. We propose a run-time prompt adaptation strategy, which first generates informative anomaly prompts to leverage the capabilities of a large language model (LLM). This strategy is enhanced by a contextual scoring mechanism for per-image anomaly prompt adaptation and cross-semantic ambiguity mitigation. We further introduce a novel fine-grained aligner to fuse local pixel-level semantics for precise anomaly localization, by projecting the image-text alignment from global to local semantic spaces. Extensive evaluations on the challenging MVTec and VisA datasets confirm ALFA's effectiveness in harnessing the language potential for zero-shot VAD, achieving significant PRO improvements of 12.1% on MVTec AD and 8.9% on VisA compared to state-of-the-art zero-shot VAD approaches.
Search-Space Reduction Via Essential Vertices Revisited: Vertex Multicut and Cograph Deletion
Authors: Authors: Bart M. P. Jansen, Ruben F. A. Verhaegh
Subjects: Data Structures and Algorithms (cs.DS)
Arxiv link: https://arxiv.org/abs/2404.09769
Pdf link: https://arxiv.org/pdf/2404.09769
Abstract For an optimization problem $\Pi$ on graphs whose solutions are vertex sets, a vertex $v$ is called $c$-essential for $\Pi$ if all solutions of size at most $c \cdot OPT$ contain $v$. Recent work showed that polynomial-time algorithms to detect $c$-essential vertices can be used to reduce the search space of fixed-parameter tractable algorithms solving such problems parameterized by the size $k$ of the solution. We provide several new upper- and lower bounds for detecting essential vertices. For example, we give a polynomial-time algorithm for $3$-Essential detection for Vertex Multicut, which translates into an algorithm that finds a minimum multicut of an undirected $n$-vertex graph $G$ in time $2^{O(\ell^3)} \cdot n^{O(1)}$, where $\ell$ is the number of vertices in an optimal solution that are not $3$-essential. Our positive results are obtained by analyzing the integrality gaps of certain linear programs. Our lower bounds show that for sufficiently small values of $c$, the detection task becomes NP-hard assuming the Unique Games Conjecture. For example, we show that ($2-\varepsilon$)-Essential detection for Directed Feedback Vertex Set is NP-hard under this conjecture, thereby proving that the existing algorithm that detects $2$-essential vertices is best-possible.
The Performance of Sequential Deep Learning Models in Detecting Phishing Websites Using Contextual Features of URLs
Authors: Authors: Saroj Gopali, Akbar S. Namin, Faranak Abri, Keith S. Jones
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.09802
Pdf link: https://arxiv.org/pdf/2404.09802
Abstract Cyber attacks continue to pose significant threats to individuals and organizations, stealing sensitive data such as personally identifiable information, financial information, and login credentials. Hence, detecting malicious websites before they cause any harm is critical to preventing fraud and monetary loss. To address the increasing number of phishing attacks, protective mechanisms must be highly responsive, adaptive, and scalable. Fortunately, advances in the field of machine learning, coupled with access to vast amounts of data, have led to the adoption of various deep learning models for timely detection of these cyber crimes. This study focuses on the detection of phishing websites using deep learning models such as Multi-Head Attention, Temporal Convolutional Network (TCN), BI-LSTM, and LSTM where URLs of the phishing websites are treated as a sequence. The results demonstrate that Multi-Head Attention and BI-LSTM model outperform some other deep learning-based algorithms such as TCN and LSTM in producing better precision, recall, and F1-scores.
A Novel HARQ-CC Assisted SCMA Scheme
Authors: Authors: Man Wang, Zheng Shi, Yunfei Li, Xianda Wu, Weiqiang Tan, Xinrong Ye
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2404.09814
Pdf link: https://arxiv.org/pdf/2404.09814
Abstract This letter proposes a novel hybrid automatic repeat request with chase combining assisted sparse code multiple access (HARQ-CC-SCMA) scheme. Depending on whether the same superimposed packet are retransmitted, synchronous and asynchronous modes are considered for retransmissions. Moreover, factor graph aggregation (FGA) and Log-likelihood ratio combination (LLRC) are proposed for multi-user detection. Regarding FGA, a large-scale factor graph is constructed by combining all the received superimposed signals and message passing algorithm (MAP) is applied to calculate log-likelihood ratio (LLR). Whereas, owing to the same unsuccessful messages required to be retransmitted, LLRC adds up LLRs of erroneously received packets in previous HARQ rounds together with currently received packets for channel decoding and saves the LLRs for failed users. Finally, Monte Carlo simulations are preformed to show that FGA surpasses LLRC and HARQ with incremental redundancy (HARQ-IR) in synchronous mode. However, LLRC performs better than FGA at low signal-to-noise ratio (SNR) in asynchronous mode. This is because failed messages after the maximum allowable HARQ rounds in this mode can yield significant error propagation in low SNR regime.
Negation Triplet Extraction with Syntactic Dependency and Semantic Consistency
Authors: Authors: Yuchen Shi, Deqing Yang, Jingping Liu, Yanghua Xiao, Zongyu Wang, Huimin Xu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.09830
Pdf link: https://arxiv.org/pdf/2404.09830
Abstract Previous works of negation understanding mainly focus on negation cue detection and scope resolution, without identifying negation subject which is also significant to the downstream tasks. In this paper, we propose a new negation triplet extraction (NTE) task which aims to extract negation subject along with negation cue and scope. To achieve NTE, we devise a novel Syntax&Semantic-Enhanced Negation Extraction model, namely SSENE, which is built based on a generative pretrained language model (PLM) {of Encoder-Decoder architecture} with a multi-task learning framework. Specifically, the given sentence's syntactic dependency tree is incorporated into the PLM's encoder to discover the correlations between the negation subject, cue and scope. Moreover, the semantic consistency between the sentence and the extracted triplet is ensured by an auxiliary task learning. Furthermore, we have constructed a high-quality Chinese dataset NegComment based on the users' reviews from the real-world platform of Meituan, upon which our evaluations show that SSENE achieves the best NTE performance compared to the baselines. Our ablation and case studies also demonstrate that incorporating the syntactic information helps the PLM's recognize the distant dependency between the subject and cue, and the auxiliary task learning is helpful to extract the negation triplets with more semantic consistency.
How Far Have We Gone in Stripped Binary Code Understanding Using Large Language Models
Authors: Authors: Xiuwei Shang, Shaoyin Cheng, Guoqiang Chen, Yanming Zhang, Li Hu, Xiao Yu, Gangyang Li, Weiming Zhang, Nenghai Yu
Subjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2404.09836
Pdf link: https://arxiv.org/pdf/2404.09836
Abstract Binary code analysis plays a pivotal role in various software security applications, such as software maintenance, malware detection, software vulnerability discovery, patch analysis, etc. However, unlike source code, understanding binary code is challenging for reverse engineers due to the absence of semantic information. Therefore, automated tools are needed to assist human players in interpreting binary code. In recent years, two groups of technologies have shown promising prospects: (1) Deep learning-based technologies have demonstrated competitive results in tasks related to binary code understanding, furthermore, (2) Large Language Models (LLMs) have been extensively pre-trained at the source-code level for tasks such as code understanding and generation. This makes participants wonder about the ability of LLMs in binary code understanding. In this work, we propose a benchmark to evaluate the effectiveness of LLMs in real-world reverse engineering scenarios. The benchmark covers two key binary code understanding tasks, including function name recovery and binary code summarization. We gain valuable insights into their capabilities and limitations through extensive evaluations of popular LLMs using our benchmark. Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis. Our results highlight the great potential of the LLMs in advancing the field of binary code understanding.
STMixer: A One-Stage Sparse Action Detector
Authors: Authors: Tao Wu, Mengqi Cao, Ziteng Gao, Gangshan Wu, Limin Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.09842
Pdf link: https://arxiv.org/pdf/2404.09842
Abstract Traditional video action detectors typically adopt the two-stage pipeline, where a person detector is first employed to generate actor boxes and then 3D RoIAlign is used to extract actor-specific features for classification. This detection paradigm requires multi-stage training and inference, and the feature sampling is constrained inside the box, failing to effectively leverage richer context information outside. Recently, a few query-based action detectors have been proposed to predict action instances in an end-to-end manner. However, they still lack adaptability in feature sampling and decoding, thus suffering from the issues of inferior performance or slower convergence. In this paper, we propose two core designs for a more flexible one-stage sparse action detector. First, we present a query-based adaptive feature sampling module, which endows the detector with the flexibility of mining a group of discriminative features from the entire spatio-temporal domain. Second, we devise a decoupled feature mixing module, which dynamically attends to and mixes video features along the spatial and temporal dimensions respectively for better feature decoding. Based on these designs, we instantiate two detection pipelines, that is, STMixer-K for keyframe action detection and STMixer-T for action tubelet detection. Without bells and whistles, our STMixer detectors obtain state-of-the-art results on five challenging spatio-temporal action detection benchmarks for keyframe action detection or action tube detection.
Explainable Online Unsupervised Anomaly Detection for Cyber-Physical Systems via Causal Discovery from Time Series
Authors: Authors: Daniele Meli
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2404.09871
Pdf link: https://arxiv.org/pdf/2404.09871
Abstract Online unsupervised detection of anomalies is crucial to guarantee the correct operation of cyber-physical systems and the safety of humans interacting with them. State-of-the-art approaches based on deep learning via neural networks achieve outstanding performance at anomaly recognition, evaluating the discrepancy between a normal model of the system (with no anomalies) and the real-time stream of sensor time series. However, large training data and time are typically required, and explainability is still a challenge to identify the root of the anomaly and implement predictive maintainance. In this paper, we use causal discovery to learn a normal causal graph of the system, and we evaluate the persistency of causal links during real-time acquisition of sensor data to promptly detect anomalies. On two benchmark anomaly detection datasets, we show that our method has higher training efficiency, outperforms the accuracy of state-of-the-art neural architectures and correctly identifies the sources of $>10$ different anomalies. The code for experimental replication is at this http URL
Glitch Tokens in Large Language Models: Categorization Taxonomy and Effective Detection
Authors: Authors: Yuxi Li, Yi Liu, Gelei Deng, Ying Zhang, Wenjia Song, Ling Shi, Kailong Wang, Yuekang Li, Yang Liu, Haoyu Wang
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2404.09894
Pdf link: https://arxiv.org/pdf/2404.09894
Abstract With the expanding application of Large Language Models (LLMs) in various domains, it becomes imperative to comprehensively investigate their unforeseen behaviors and consequent outcomes. In this study, we introduce and systematically explore the phenomenon of "glitch tokens", which are anomalous tokens produced by established tokenizers and could potentially compromise the models' quality of response. Specifically, we experiment on seven top popular LLMs utilizing three distinct tokenizers and involving a totally of 182,517 tokens. We present categorizations of the identified glitch tokens and symptoms exhibited by LLMs when interacting with glitch tokens. Based on our observation that glitch tokens tend to cluster in the embedding space, we propose GlitchHunter, a novel iterative clustering-based technique, for efficient glitch token detection. The evaluation shows that our approach notably outperforms three baseline methods on eight open-source LLMs. To the best of our knowledge, we present the first comprehensive study on glitch tokens. Our new detection further provides valuable insights into mitigating tokenization-related errors in LLMs.
Novel Joint Estimation and Decoding Metrics for Short-Block length Transmission Systems
Authors: Authors: Mody Sy, Raymond Knopp
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2404.09943
Pdf link: https://arxiv.org/pdf/2404.09943
Abstract This paper presents Bit-Interleaved Coded Modulation metrics for joint estimation detection using training or reference signal transmission strategies for short to long block length channels. We show that it is possible to enhance the performance and sensitivity through joint detection-estimation compared to standard receivers, especially when the channel state information is unknown and the density of the training dimensions is low. The performance analysis makes use of a full 5G transmitter and receiver chains for both Polar and LDPC coded transmissions paired with BPSK/QPSK modulation schemes. We consider transmissions where reference signals are interleaved with data and both are transmitted over a small number of OFDM symbols so that near-perfect channel estimation cannot be achieved. This is particularly adapted to mini-slot transmissions for ultra-reliable, low-latency communications (URLLC) or for short packet random access use cases. We characterize the performance for up to eight receiving antennas in order to determine the performance gain offered by the proposed BICM detection in realistic base station receiver scenarios. Our findings demonstrate that when the detection windows used in the metric units is on the order of four modulated symbols the proposed BICM metrics can be used to achieve detection performance that is close to that of a coherent receiver with perfect channel state information for both polar and LDPC coded configurations. Furthermore, we show that for transmissions with low DMRS density, a good trade-off can be achieved in terms of additional coding gain and improved channel estimation quality by adaptive DMRS power adjustment.
Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs
Authors: Authors: Adi Simhi, Jonathan Herzig, Idan Szpektor, Yonatan Belinkov
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.09971
Pdf link: https://arxiv.org/pdf/2404.09971
Abstract Large language models (LLMs) are susceptible to hallucination, which sparked a widespread effort to detect and prevent them. Recent work attempts to mitigate hallucinations by intervening in the model's computation during generation, using different setups and heuristics. Those works lack separation between different hallucination causes. In this work, we first introduce an approach for constructing datasets based on the model knowledge for detection and intervention methods in closed-book and open-book question-answering settings. We then characterize the effect of different choices for intervention, such as the intervened components (MLPs, attention block, residual stream, and specific heads), and how often and how strongly to intervene. We find that intervention success varies depending on the component, with some components being detrimental to language modeling capabilities. Finally, we find that interventions can benefit from pre-hallucination steering direction instead of post-hallucination. The code is available at https://github.com/technion-cs-nlp/hallucination-mitigation
Keyword: face recognition

FaceCat: Enhancing Face Recognition Security with a Unified Generative Model Framework
Authors: Authors: Jiawei Chen, Xiao Yang, Yinpeng Dong, Hang Su, Jianteng Peng, Zhaoxia Yin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.09193
Pdf link: https://arxiv.org/pdf/2404.09193
Abstract Face anti-spoofing (FAS) and adversarial detection (FAD) have been regarded as critical technologies to ensure the safety of face recognition systems. As a consequence of their limited practicality and generalization, some existing methods aim to devise a framework capable of concurrently detecting both threats to address the challenge. Nevertheless, these methods still encounter challenges of insufficient generalization and suboptimal robustness, potentially owing to the inherent drawback of discriminative models. Motivated by the rich structural and detailed features of face generative models, we propose FaceCat which utilizes the face generative model as a pre-trained model to improve the performance of FAS and FAD. Specifically, FaceCat elaborately designs a hierarchical fusion mechanism to capture rich face semantic features of the generative model. These features then serve as a robust foundation for a lightweight head, designed to execute FAS and FAD tasks simultaneously. As relying solely on single-modality data often leads to suboptimal performance, we further propose a novel text-guided multi-modal alignment strategy that utilizes text prompts to enrich feature representation, thereby enhancing performance. For fair evaluations, we build a comprehensive protocol with a wide range of 28 attack types to benchmark the performance. Extensive experiments validate the effectiveness of FaceCat generalizes significantly better and obtains excellent robustness against input transformations.
AI-KD: Towards Alignment Invariant Face Image Quality Assessment Using Knowledge Distillation
Authors: Authors: Žiga Babnik, Fadi Boutros, Naser Damer, Peter Peer, Vitomir Štruc
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.09555
Pdf link: https://arxiv.org/pdf/2404.09555
Abstract Face Image Quality Assessment (FIQA) techniques have seen steady improvements over recent years, but their performance still deteriorates if the input face samples are not properly aligned. This alignment sensitivity comes from the fact that most FIQA techniques are trained or designed using a specific face alignment procedure. If the alignment technique changes, the performance of most existing FIQA techniques quickly becomes suboptimal. To address this problem, we present in this paper a novel knowledge distillation approach, termed AI-KD that can extend on any existing FIQA technique, improving its robustness to alignment variations and, in turn, performance with different alignment procedures. To validate the proposed distillation approach, we conduct comprehensive experiments on 6 face datasets with 4 recent face recognition models and in comparison to 7 state-of-the-art FIQA techniques. Our results show that AI-KD consistently improves performance of the initial FIQA techniques not only with misaligned samples, but also with properly aligned facial images. Furthermore, it leads to a new state-of-the-art, when used with a competitive initial FIQA approach. The code for AI-KD is made publicly available from: https://github.com/LSIbabnikz/AI-KD.
Keyword: augmentation

How Does Message Passing Improve Collaborative Filtering?
Authors: Authors: Mingxuan Ju, William Shiao, Zhichun Guo, Yanfang Ye, Yozen Liu, Neil Shah, Tong Zhao
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08660
Pdf link: https://arxiv.org/pdf/2404.08660
Abstract Collaborative filtering (CF) has exhibited prominent results for recommender systems and been broadly utilized for real-world applications. A branch of research enhances CF methods by message passing used in graph neural networks, due to its strong capabilities of extracting knowledge from graph-structured data, like user-item bipartite graphs that naturally exist in CF. They assume that message passing helps CF methods in a manner akin to its benefits for graph-based learning tasks in general. However, even though message passing empirically improves CF, whether or not this assumption is correct still needs verification. To address this gap, we formally investigate why message passing helps CF from multiple perspectives and show that many assumptions made by previous works are not entirely accurate. With our curated ablation studies and theoretical analyses, we discover that (1) message passing improves the CF performance primarily by additional representations passed from neighbors during the forward pass instead of additional gradient updates to neighbor representations during the model back-propagation and (ii) message passing usually helps low-degree nodes more than high-degree nodes. Utilizing these novel findings, we present Test-time Aggregation for CF, namely TAG-CF, a test-time augmentation framework that only conducts message passing once at inference time. The key novelty of TAG-CF is that it effectively utilizes graph knowledge while circumventing most of notorious computational overheads of message passing. Besides, TAG-CF is extremely versatile can be used as a plug-and-play module to enhance representations trained by different CF supervision signals. Evaluated on six datasets, TAG-CF consistently improves the recommendation performance of CF methods without graph by up to 39.2% on cold users and 31.7% on all users, with little to no extra computational overheads.
Text clustering applied to data augmentation in legal contexts
Authors: Authors: Lucas José Gonçalves Freitas, Thaís Rodrigues, Guilherme Rodrigues, Pamella Edokawa, Ariane Farias
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08683
Pdf link: https://arxiv.org/pdf/2404.08683
Abstract Data analysis and machine learning are of preeminent importance in the legal domain, especially in tasks like clustering and text classification. In this study, we harnessed the power of natural language processing tools to enhance datasets meticulously curated by experts. This process significantly improved the classification workflow for legal texts using machine learning techniques. We considered the Sustainable Development Goals (SDGs) data from the United Nations 2030 Agenda as a practical case study. Data augmentation clustering-based strategy led to remarkable enhancements in the accuracy and sensitivity metrics of classification models. For certain SDGs within the 2030 Agenda, we observed performance gains of over 15%. In some cases, the example base expanded by a noteworthy factor of 5. When dealing with unclassified legal texts, data augmentation strategies centered around clustering prove to be highly effective. They provide a valuable means to expand the existing knowledge base without the need for labor-intensive manual classification efforts.
Single-image driven 3d viewpoint training data augmentation for effective wine label recognition
Authors: Authors: Yueh-Cheng Huang, Hsin-Yi Chen, Cheng-Jui Hung, Jen-Hui Chuang, Jenq-Neng Hwang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.08820
Pdf link: https://arxiv.org/pdf/2404.08820
Abstract Confronting the critical challenge of insufficient training data in the field of complex image recognition, this paper introduces a novel 3D viewpoint augmentation technique specifically tailored for wine label recognition. This method enhances deep learning model performance by generating visually realistic training samples from a single real-world wine label image, overcoming the challenges posed by the intricate combinations of text and logos. Classical Generative Adversarial Network (GAN) methods fall short in synthesizing such intricate content combination. Our proposed solution leverages time-tested computer vision and image processing strategies to expand our training dataset, thereby broadening the range of training samples for deep learning applications. This innovative approach to data augmentation circumvents the constraints of limited training resources. Using the augmented training images through batch-all triplet metric learning on a Vision Transformer (ViT) architecture, we can get the most discriminative embedding features for every wine label, enabling us to perform one-shot recognition of existing wine labels in the training classes or future newly collected wine labels unavailable in the training. Experimental results show a significant increase in recognition accuracy over conventional 2D data augmentation techniques.
A Lightweight Spatiotemporal Network for Online Eye Tracking with Event Camera
Authors: Authors: Yan Ru Pei, Sasskia Brüers, Sébastien Crouzet, Douglas McLelland, Olivier Coenen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.08858
Pdf link: https://arxiv.org/pdf/2404.08858
Abstract Event-based data are commonly encountered in edge computing environments where efficiency and low latency are critical. To interface with such data and leverage their rich temporal features, we propose a causal spatiotemporal convolutional network. This solution targets efficient implementation on edge-appropriate hardware with limited resources in three ways: 1) deliberately targets a simple architecture and set of operations (convolutions, ReLU activations) 2) can be configured to perform online inference efficiently via buffering of layer outputs 3) can achieve more than 90% activation sparsity through regularization during training, enabling very significant efficiency gains on event-based processors. In addition, we propose a general affine augmentation strategy acting directly on the events, which alleviates the problem of dataset scarcity for event-based systems. We apply our model on the AIS 2024 event-based eye tracking challenge, reaching a score of 0.9916 p10 accuracy on the Kaggle private testset.
An evaluation framework for synthetic data generation models
Authors: Authors: Ioannis E. Livieris, Nikos Alimpertis, George Domalis, Dimitris Tsakalidis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.08866
Pdf link: https://arxiv.org/pdf/2404.08866
Abstract Nowadays, the use of synthetic data has gained popularity as a cost-efficient strategy for enhancing data augmentation for improving machine learning models performance as well as addressing concerns related to sensitive data privacy. Therefore, the necessity of ensuring quality of generated synthetic data, in terms of accurate representation of real data, consists of primary importance. In this work, we present a new framework for evaluating synthetic data generation models' ability for developing high-quality synthetic data. The proposed approach is able to provide strong statistical and theoretical information about the evaluation framework and the compared models' ranking. Two use case scenarios demonstrate the applicability of the proposed framework for evaluating the ability of synthetic data generation models to generated high quality data. The implementation code can be found in https://github.com/novelcore/synthetic_data_evaluation_framework.
DeDoDe v2: Analyzing and Improving the DeDoDe Keypoint Detector
Authors: Authors: Johan Edstedt, Georg Bökman, Zhenjun Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.08928
Pdf link: https://arxiv.org/pdf/2404.08928
Abstract In this paper, we analyze and improve into the recently proposed DeDoDe keypoint detector. We focus our analysis on some key issues. First, we find that DeDoDe keypoints tend to cluster together, which we fix by performing non-max suppression on the target distribution of the detector during training. Second, we address issues related to data augmentation. In particular, the DeDoDe detector is sensitive to large rotations. We fix this by including 90-degree rotations as well as horizontal flips. Finally, the decoupled nature of the DeDoDe detector makes evaluation of downstream usefulness problematic. We fix this by matching the keypoints with a pretrained dense matcher (RoMa) and evaluating two-view pose estimates. We find that the original long training is detrimental to performance, and therefore propose a much shorter training schedule. We integrate all these improvements into our proposed detector DeDoDe v2 and evaluate it with the original DeDoDe descriptor on the MegaDepth-1500 and IMC2022 benchmarks. Our proposed detector significantly increases pose estimation results, notably from 75.9 to 78.3 mAA on the IMC2022 challenge. Code and weights are available at https://github.com/Parskatt/DeDoDe
Improving Personalisation in Valence and Arousal Prediction using Data Augmentation
Authors: Authors: Munachiso Nwadike, Jialin Li, Hanan Salam
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.09042
Pdf link: https://arxiv.org/pdf/2404.09042
Abstract In the field of emotion recognition and Human-Machine Interaction (HMI), personalised approaches have exhibited their efficacy in capturing individual-specific characteristics and enhancing affective prediction accuracy. However, personalisation techniques often face the challenge of limited data for target individuals. This paper presents our work on an enhanced personalisation strategy, that leverages data augmentation to develop tailored models for continuous valence and arousal prediction. Our proposed approach, Distance Weighting Augmentation (DWA), employs a weighting-based augmentation method that expands a target individual's dataset, leveraging distance metrics to identify similar samples at the segment-level. Experimental results on the MuSe-Personalisation 2023 Challenge dataset demonstrate that our method significantly improves the performance of features sets which have low baseline performance, on the test set. This improvement in poor-performing features comes without sacrificing performance on high-performing features. In particular, our method achieves a maximum combined testing CCC of 0.78, compared to the reported baseline score of 0.76 (reproduced at 0.72). It also achieved a peak arousal and valence scores of 0.81 and 0.76, compared to reproduced baseline scores of 0.76 and 0.67 respectively. Through this work, we make significant contributions to the advancement of personalised affective computing models, enhancing the practicality and adaptability of data-level personalisation in real world contexts.
GCC: Generative Calibration Clustering
Authors: Authors: Haifeng Xia, Hai Huang, Zhengming Ding
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.09115
Pdf link: https://arxiv.org/pdf/2404.09115
Abstract Deep clustering as an important branch of unsupervised representation learning focuses on embedding semantically similar samples into the identical feature space. This core demand inspires the exploration of contrastive learning and subspace clustering. However, these solutions always rely on the basic assumption that there are sufficient and category-balanced samples for generating valid high-level representation. This hypothesis actually is too strict to be satisfied for real-world applications. To overcome such a challenge, the natural strategy is utilizing generative models to augment considerable instances. How to use these novel samples to effectively fulfill clustering performance improvement is still difficult and under-explored. In this paper, we propose a novel Generative Calibration Clustering (GCC) method to delicately incorporate feature learning and augmentation into clustering procedure. First, we develop a discriminative feature alignment mechanism to discover intrinsic relationship across real and generated samples. Second, we design a self-supervised metric learning to generate more reliable cluster assignment to boost the conditional diffusion generation. Extensive experimental results on three benchmarks validate the effectiveness and advantage of our proposed method over the state-of-the-art methods.
DKE-Research at SemEval-2024 Task 2: Incorporating Data Augmentation with Generative Models and Biomedical Knowledge to Enhance Inference Robustness
Authors: Authors: Yuqi Wang, Zeqiang Wang, Wei Wang, Qi Chen, Kaizhu Huang, Anh Nguyen, Suparna De
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.09206
Pdf link: https://arxiv.org/pdf/2404.09206
Abstract Safe and reliable natural language inference is critical for extracting insights from clinical trial reports but poses challenges due to biases in large pre-trained language models. This paper presents a novel data augmentation technique to improve model robustness for biomedical natural language inference in clinical trials. By generating synthetic examples through semantic perturbations and domain-specific vocabulary replacement and adding a new task for numerical and quantitative reasoning, we introduce greater diversity and reduce shortcut learning. Our approach, combined with multi-task learning and the DeBERTa architecture, achieved significant performance gains on the NLI4CT 2024 benchmark compared to the original language models. Ablation studies validate the contribution of each augmentation method in improving robustness. Our best-performing model ranked 12th in terms of faithfulness and 8th in terms of consistency, respectively, out of the 32 participants.
PANet: A Physics-guided Parametric Augmentation Net for Image Dehazing by Hazing
Authors: Authors: Chih-Ling Chang, Fu-Jen Tsai, Zi-Ling Huang, Lin Gu, Chia-Wen Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.09269
Pdf link: https://arxiv.org/pdf/2404.09269
Abstract Image dehazing faces challenges when dealing with hazy images in real-world scenarios. A huge domain gap between synthetic and real-world haze images degrades dehazing performance in practical settings. However, collecting real-world image datasets for training dehazing models is challenging since both hazy and clean pairs must be captured under the same conditions. In this paper, we propose a Physics-guided Parametric Augmentation Network (PANet) that generates photo-realistic hazy and clean training pairs to effectively enhance real-world dehazing performance. PANet comprises a Haze-to-Parameter Mapper (HPM) to project hazy images into a parameter space and a Parameter-to-Haze Mapper (PHM) to map the resampled haze parameters back to hazy images. In the parameter space, we can pixel-wisely resample individual haze parameter maps to generate diverse hazy images with physically-explainable haze conditions unseen in the training set. Our experimental results demonstrate that PANet can augment diverse realistic hazy images to enrich existing hazy image benchmarks so as to effectively boost the performances of state-of-the-art image dehazing models.
RoofDiffusion: Constructing Roofs from Severely Corrupted Point Data via Diffusion
Authors: Authors: Kyle Shih-Huang Lo, Jörg Peters, Eric Spellman
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2404.09290
Pdf link: https://arxiv.org/pdf/2404.09290
Abstract Accurate completion and denoising of roof height maps are crucial to reconstructing high-quality 3D buildings. Repairing sparse points can enhance low-cost sensor use and reduce UAV flight overlap. RoofDiffusion is a new end-to-end self-supervised diffusion technique for robustly completing, in particular difficult, roof height maps. RoofDiffusion leverages widely-available curated footprints and can so handle up to 99\% point sparsity and 80\% roof area occlusion (regional incompleteness). A variant, No-FP RoofDiffusion, simultaneously predicts building footprints and heights. Both quantitatively outperform state-of-the-art unguided depth completion and representative inpainting methods for Digital Elevation Models (DEM), on both a roof-specific benchmark and the BuildingNet dataset. Qualitative assessments show the effectiveness of RoofDiffusion for datasets with real-world scans including AHN3, Dales3D, and USGS 3DEP LiDAR. Tested with the leading City3D algorithm, preprocessing height maps with RoofDiffusion noticeably improves 3D building reconstruction. RoofDiffusion is complemented by a new dataset of 13k complex roof geometries, focusing on long-tail issues in remote sensing; a novel simulation of tree occlusion; and a wide variety of large-area roof cut-outs for data augmentation and benchmarking.
Virtually Enriched NYU Depth V2 Dataset for Monocular Depth Estimation: Do We Need Artificial Augmentation?
Authors: Authors: Dmitry Ignatov, Andrey Ignatov, Radu Timofte
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.09469
Pdf link: https://arxiv.org/pdf/2404.09469
Abstract We present ANYU, a new virtually augmented version of the NYU depth v2 dataset, designed for monocular depth estimation. In contrast to the well-known approach where full 3D scenes of a virtual world are utilized to generate artificial datasets, ANYU was created by incorporating RGB-D representations of virtual reality objects into the original NYU depth v2 images. We specifically did not match each generated virtual object with an appropriate texture and a suitable location within the real-world image. Instead, an assignment of texture, location, lighting, and other rendering parameters was randomized to maximize a diversity of the training data, and to show that it is randomness that can improve the generalizing ability of a dataset. By conducting extensive experiments with our virtually modified dataset and validating on the original NYU depth v2 and iBims-1 benchmarks, we show that ANYU improves the monocular depth estimation performance and generalization of deep neural networks with considerably different architectures, especially for the current state-of-the-art VPD model. To the best of our knowledge, this is the first work that augments a real-world dataset with randomly generated virtual 3D objects for monocular depth estimation. We make our ANYU dataset publicly available in two training configurations with 10% and 100% additional synthetically enriched RGB-D pairs of training images, respectively, for efficient training and empirical exploration of virtual augmentation at https://github.com/ABrain-One/ANYU
Deep image learning of quantitative structure-property relationships of cooper alloys via feature augmentation on Geodesic curve in shape space
Authors: Authors: Yuexing Han, Guanxin Wan, Bing Wang, Yi Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.09515
Pdf link: https://arxiv.org/pdf/2404.09515
Abstract Understanding how the structure of materials affects their properties is a cornerstone of materials science and engineering. However, traditional methods have struggled to accurately describe the quantitative structure-property relationships for complex structures. In our study, we bridge this gap by leveraging machine learning to analyze images of materials' microstructures, thus offering a novel way to understand and predict the properties of materials based on their microstructures. We introduce a method known as FAGC (Feature Augmentation on Geodesic Curves), specifically demonstrated for Cu-Cr-Zr alloys. This approach utilizes machine learning to examine the shapes within images of the alloys' microstructures and predict their mechanical and electronic properties. This generative FAGC approach can effectively expand the relatively small training datasets due to the limited availability of materials images labeled with quantitative properties. The process begins with extracting features from the images using neural networks. These features are then mapped onto the Pre-shape space to construct the Geodesic curves. Along these curves, new features are generated, effectively increasing the dataset. Moreover, we design a pseudo-labeling mechanism for these newly generated features to further enhance the training dataset. Our FAGC method has shown remarkable results, significantly improving the accuracy of predicting the electronic conductivity and hardness of Cu-Cr-Zr alloys, with R-squared values of 0.978 and 0.998, respectively. These outcomes underscore the potential of FAGC to address the challenge of limited image data in materials science, providing a powerful tool for establishing detailed and quantitative relationships between complex microstructures and material properties.
Enhancing Code Vulnerability Detection via Vulnerability-Preserving Data Augmentation
Authors: Authors: Shangqing Liu, Wei Ma, Jian Wang, Xiaofei Xie, Ruitao Feng, Yang Liu
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2404.09599
Pdf link: https://arxiv.org/pdf/2404.09599
Abstract Source code vulnerability detection aims to identify inherent vulnerabilities to safeguard software systems from potential attacks. Many prior studies overlook diverse vulnerability characteristics, simplifying the problem into a binary (0-1) classification task for example determining whether it is vulnerable or not. This poses a challenge for a single deep learning-based model to effectively learn the wide array of vulnerability characteristics. Furthermore, due to the challenges associated with collecting large-scale vulnerability data, these detectors often overfit limited training datasets, resulting in lower model generalization performance. To address the aforementioned challenges, in this work, we introduce a fine-grained vulnerability detector namely FGVulDet. Unlike previous approaches, FGVulDet employs multiple classifiers to discern characteristics of various vulnerability types and combines their outputs to identify the specific type of vulnerability. Each classifier is designed to learn type-specific vulnerability semantics. Additionally, to address the scarcity of data for some vulnerability types and enhance data diversity for learning better vulnerability semantics, we propose a novel vulnerability-preserving data augmentation technique to augment the number of vulnerabilities. Taking inspiration from recent advancements in graph neural networks for learning program semantics, we incorporate a Gated Graph Neural Network (GGNN) and extend it to an edge-aware GGNN to capture edge-type information. FGVulDet is trained on a large-scale dataset from GitHub, encompassing five different types of vulnerabilities. Extensive experiments compared with static-analysis-based approaches and learning-based approaches have demonstrated the effectiveness of FGVulDet.
XoFTR: Cross-modal Feature Matching Transformer
Authors: Authors: Önder Tuzcuoğlu, Aybora Köksal, Buğra Sofu, Sinan Kalkan, A. Aydın Alatan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.09692
Pdf link: https://arxiv.org/pdf/2404.09692
Abstract We introduce, XoFTR, a cross-modal cross-view method for local feature matching between thermal infrared (TIR) and visible images. Unlike visible images, TIR images are less susceptible to adverse lighting and weather conditions but present difficulties in matching due to significant texture and intensity differences. Current hand-crafted and learning-based methods for visible-TIR matching fall short in handling viewpoint, scale, and texture diversities. To address this, XoFTR incorporates masked image modeling pre-training and fine-tuning with pseudo-thermal image augmentation to handle the modality differences. Additionally, we introduce a refined matching pipeline that adjusts for scale discrepancies and enhances match reliability through sub-pixel level refinement. To validate our approach, we collect a comprehensive visible-thermal dataset, and show that our method outperforms existing methods on many benchmarks.
FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance, Head-pose, and Facial Expression Features
Authors: Authors: Andre Rochow, Max Schwarz, Sven Behnke
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.09736
Pdf link: https://arxiv.org/pdf/2404.09736
Abstract The task of face reenactment is to transfer the head motion and facial expressions from a driving video to the appearance of a source image, which may be of a different person (cross-reenactment). Most existing methods are CNN-based and estimate optical flow from the source image to the current driving frame, which is then inpainted and refined to produce the output animation. We propose a transformer-based encoder for computing a set-latent representation of the source image(s). We then predict the output color of a query pixel using a transformer-based decoder, which is conditioned with keypoints and a facial expression vector extracted from the driving frame. Latent representations of the source person are learned in a self-supervised manner that factorize their appearance, head pose, and facial expressions. Thus, they are perfectly suited for cross-reenactment. In contrast to most related work, our method naturally extends to multiple source images and can thus adapt to person-specific facial dynamics. We also propose data augmentation and regularization schemes that are necessary to prevent overfitting and support generalizability of the learned representations. We evaluated our approach in a randomized user study. The results indicate superior performance compared to the state-of-the-art in terms of motion transfer quality and temporal consistency.
Can We Break Free from Strong Data Augmentations in Self-Supervised Learning?
Authors: Authors: Shruthi Gowda, Elahe Arani, Bahram Zonooz
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.09752
Pdf link: https://arxiv.org/pdf/2404.09752
Abstract Self-supervised learning (SSL) has emerged as a promising solution for addressing the challenge of limited labeled data in deep neural networks (DNNs), offering scalability potential. However, the impact of design dependencies within the SSL framework remains insufficiently investigated. In this study, we comprehensively explore SSL behavior across a spectrum of augmentations, revealing their crucial role in shaping SSL model performance and learning mechanisms. Leveraging these insights, we propose a novel learning approach that integrates prior knowledge, with the aim of curtailing the need for extensive data augmentations and thereby amplifying the efficacy of learned representations. Notably, our findings underscore that SSL models imbued with prior knowledge exhibit reduced texture bias, diminished reliance on shortcuts and augmentations, and improved robustness against both natural and adversarial corruptions. These findings not only illuminate a new direction in SSL research, but also pave the way for enhancing DNN performance while concurrently alleviating the imperative for intensive data augmentation, thereby enhancing scalability and real-world problem-solving capabilities.
A Recipe for CAC: Mosaic-based Generalized Loss for Improved Class-Agnostic Counting
Authors: Authors: Tsung-Han Chou, Brian Wang, Wei-Chen Chiu, Jun-Cheng Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.09826
Pdf link: https://arxiv.org/pdf/2404.09826
Abstract Class agnostic counting (CAC) is a vision task that can be used to count the total occurrence number of any given reference objects in the query image. The task is usually formulated as a density map estimation problem through similarity computation among a few image samples of the reference object and the query image. In this paper, we point out a severe issue of the existing CAC framework: Given a multi-class setting, models don't consider reference images and instead blindly match all dominant objects in the query image. Moreover, the current evaluation metrics and dataset cannot be used to faithfully assess the model's generalization performance and robustness. To this end, we discover that the combination of mosaic augmentation with generalized loss is essential for addressing the aforementioned issue of CAC models to count objects of majority (i.e. dominant objects) regardless of the references. Furthermore, we introduce a new evaluation protocol and metrics for resolving the problem behind the existing CAC evaluation scheme and better benchmarking CAC models in a more fair manner. Besides, extensive evaluation results demonstrate that our proposed recipe can consistently improve the performance of different CAC models. The code will be released upon acceptance.
Accelerating Ensemble Error Bar Prediction with Single Models Fits
Authors: Authors: Vidit Agrawal, Shixin Zhang, Lane E. Schultz, Dane Morgan
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Arxiv link: https://arxiv.org/abs/2404.09896
Pdf link: https://arxiv.org/pdf/2404.09896
Abstract Ensemble models can be used to estimate prediction uncertainties in machine learning models. However, an ensemble of N models is approximately N times more computationally demanding compared to a single model when it is used for inference. In this work, we explore fitting a single model to predicted ensemble error bar data, which allows us to estimate uncertainties without the need for a full ensemble. Our approach is based on three models: Model A for predictive accuracy, Model $A{E}$ for traditional ensemble-based error bar prediction, and Model B, fit to data from Model $A{E}$, to be used for predicting the values of $A_{E}$ but with only one model evaluation. Model B leverages synthetic data augmentation to estimate error bars efficiently. This approach offers a highly flexible method of uncertainty quantification that can approximate that of ensemble methods but only requires a single extra model evaluation over Model A during inference. We assess this approach on a set of problems in materials science.
MMInA: Benchmarking Multihop Multimodal Internet Agents
Authors: Authors: Ziniu Zhang, Shulin Tian, Liangyu Chen, Ziwei Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.09992
Pdf link: https://arxiv.org/pdf/2404.09992
Abstract Autonomous embodied agents live on an Internet of multimedia websites. Can they hop around multimodal websites to complete complex user tasks? Existing benchmarks fail to assess them in a realistic, evolving environment for their embodiment across websites. To answer this question, we present MMInA, a multihop and multimodal benchmark to evaluate the embodied agents for compositional Internet tasks, with several appealing properties: 1) Evolving real-world multimodal websites. Our benchmark uniquely operates on evolving real-world websites, ensuring a high degree of realism and applicability to natural user tasks. Our data includes 1,050 human-written tasks covering various domains such as shopping and travel, with each task requiring the agent to autonomously extract multimodal information from web pages as observations; 2) Multihop web browsing. Our dataset features naturally compositional tasks that require information from or actions on multiple websites to solve, to assess long-range reasoning capabilities on web tasks; 3) Holistic evaluation. We propose a novel protocol for evaluating an agent's progress in completing multihop tasks. We experiment with both standalone (multimodal) language models and heuristic-based web agents. Extensive experiments demonstrate that while long-chain multihop web tasks are easy for humans, they remain challenging for state-of-the-art web agents. We identify that agents are more likely to fail on the early hops when solving tasks of more hops, which results in lower task success rates. To address this issue, we propose a simple memory augmentation approach replaying past action trajectories to reflect. Our method significantly improved both the single-hop and multihop web browsing abilities of agents. See our code and data at https://mmina.cliangyu.com

LeeKyungwook / get-arxiv-noti

New submissions for Tue, 16 Apr 24 #1064

Keyword: detection

Transformer-based Joint Modelling for Automatic Essay Scoring and Off-Topic Detection

Identifying Banking Transaction Descriptions via Support Vector Machine Short-Text Classification Based on a Specialized Labelled Corpus

Your Finetuned Large Language Model is Already a Powerful Out-of-distribution Detector

FastLogAD: Log Anomaly Detection with Mask-Guided Pseudo Anomaly Generation and Discrimination

Detecting AI-Generated Images via CLIP

Enhancing IoT Malware Detection through Adaptive Model Parallelism and Resource Optimization

E3: Ensemble of Expert Embedders for Adapting Synthetic Image Detectors to New Generators Using Limited Data

Empowering Malware Detection Efficiency within Processing-in-Memory Architecture

Multi-fingered Robotic Hand Grasping in Cluttered Environments through Hand-object Contact Semantic Mapping

ChangeAnywhere: Sample Generation for Remote Sensing Change Detection via Semantic Latent Diffusion Model

Meply: A Large-scale Dataset and Baseline Evaluations for Metastatic Perirectal Lymph Node Detection and Segmentation

Label-free Anomaly Detection in Aerial Agricultural Images with Masked Image Modeling

Shifting Spotlight for Co-supervision: A Simple yet Efficient Single-branch Network to See Through Camouflage

Zero-Shot Code Representation Learning via Prompt Tuning

Timing Advance Estimation in Low Earth Orbit Satellite Networks

BG-YOLO: A Bidirectional-Guided Method for Underwater Object Detection

A Fourier-enhanced multi-modal 3D small object optical mark recognition and positioning method for percutaneous abdominal puncture surgical navigation

Towards Efficient Device Identification in Massive Random Access: A Multi-stage Approach

Fusion-Mamba for Cross-modality Object Detection

Coreset Selection for Object Detection

HANet: A Hierarchical Attention Network for Change Detection With Bitemporal Very-High-Resolution Remote Sensing Images

Change Guiding Network: Incorporating Change Prior to Guide Change Detection in Remote Sensing Imagery

FaceCat: Enhancing Face Recognition Security with a Unified Generative Model Framework

Joint Near Field Uplink Communication and Localization Using Message Passing-Based Sparse Bayesian Learning

DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection

Fault Detection in Mobile Networks Using Diffusion Models

An Empirical Evaluation of Manually Created Equivalent Mutants

TEXT2TASTE: A Versatile Egocentric Vision System for Intelligent Reading Assistance Using Large Language Model

Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection

Reap the Wild Wind: Detecting Media Storms in Large-Scale News Corpora

Counteracting Concept Drift by Learning with Future Malware Predictions

Exploring Feedback Generation in Automated Skeletal Movement Assessment: A Comprehensive Overview

Tasks People Prompt: A Taxonomy of LLM Downstream Tasks in Software Verification and Falsification Approaches

Identification of cardiovascular diseases through ECG classification using wavelet transformation

VFMM3D: Releasing the Potential of Image by Vision Foundation Model for Monocular 3D Object Detection

The 8th AI City Challenge

SpamDam: Towards Privacy-Preserving and Adversary-Resistant SMS Spam Detection

Dynamic fault detection and diagnosis of industrial alkaline water electrolyzer process with variational Bayesian dictionary learning

RanLayNet: A Dataset for Document Layout Detection used for Domain Adaptation and Generalization

Machine Learning Techniques for Python Source Code Vulnerability Detection

Characterization and Mitigation of Insufficiencies in Automated Driving Systems

Joint Contrastive Learning with Feature Alignment for Cross-Corpus EEG-based Emotion Recognition

Reliability Estimation of News Media Sources: Birds of a Feather Flock Together

Enhancing Code Vulnerability Detection via Vulnerability-Preserving Data Augmentation

Privacy-Preserving Intrusion Detection using Convolutional Neural Networks

Do LLMs Understand Visual Anomalies? Uncovering LLM Capabilities in Zero-shot Anomaly Detection

Search-Space Reduction Via Essential Vertices Revisited: Vertex Multicut and Cograph Deletion

The Performance of Sequential Deep Learning Models in Detecting Phishing Websites Using Contextual Features of URLs

A Novel HARQ-CC Assisted SCMA Scheme

Negation Triplet Extraction with Syntactic Dependency and Semantic Consistency

How Far Have We Gone in Stripped Binary Code Understanding Using Large Language Models

STMixer: A One-Stage Sparse Action Detector

Explainable Online Unsupervised Anomaly Detection for Cyber-Physical Systems via Causal Discovery from Time Series

Glitch Tokens in Large Language Models: Categorization Taxonomy and Effective Detection

Novel Joint Estimation and Decoding Metrics for Short-Block length Transmission Systems

Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs

Keyword: face recognition

FaceCat: Enhancing Face Recognition Security with a Unified Generative Model Framework

AI-KD: Towards Alignment Invariant Face Image Quality Assessment Using Knowledge Distillation

Keyword: augmentation

How Does Message Passing Improve Collaborative Filtering?

Text clustering applied to data augmentation in legal contexts

Single-image driven 3d viewpoint training data augmentation for effective wine label recognition

A Lightweight Spatiotemporal Network for Online Eye Tracking with Event Camera

An evaluation framework for synthetic data generation models

DeDoDe v2: Analyzing and Improving the DeDoDe Keypoint Detector

Improving Personalisation in Valence and Arousal Prediction using Data Augmentation

GCC: Generative Calibration Clustering

DKE-Research at SemEval-2024 Task 2: Incorporating Data Augmentation with Generative Models and Biomedical Knowledge to Enhance Inference Robustness

PANet: A Physics-guided Parametric Augmentation Net for Image Dehazing by Hazing

RoofDiffusion: Constructing Roofs from Severely Corrupted Point Data via Diffusion

Virtually Enriched NYU Depth V2 Dataset for Monocular Depth Estimation: Do We Need Artificial Augmentation?

Deep image learning of quantitative structure-property relationships of cooper alloys via feature augmentation on Geodesic curve in shape space

Enhancing Code Vulnerability Detection via Vulnerability-Preserving Data Augmentation

XoFTR: Cross-modal Feature Matching Transformer

FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance, Head-pose, and Facial Expression Features

Can We Break Free from Strong Data Augmentations in Self-Supervised Learning?