Abstract
Extreme edge-AI systems, such as those in readout ASICs for radiation detection, must operate under stringent hardware constraints such as micron-level dimensions, sub-milliwatt power, and nanosecond-scale speed while providing clear accuracy advantages over traditional architectures. Finding ideal solutions means identifying optimal AI and ASIC design choices from a design space that has explosively expanded during the merger of these domains, creating non-trivial couplings which together act upon a small set of solutions as constraints tighten. It is impractical, if not impossible, to manually determine ideal choices among possibilities that easily exceed billions even in small-size problems. Existing methods to bridge this gap have leveraged theoretical understanding of hardware to facilitate architecture search. However, the assumptions made in computing such theoretical metrics are too idealized to provide sufficient guidance during the difficult search for a practical implementation. Meanwhile, theoretical estimates for many other crucial metrics (like delay) do not even exist and are similarly variable, dependent on parameters of the process design kit (PDK). To address these challenges, we present a study that employs intelligent search using multi-objective Bayesian optimization, integrating both neural network search and ASIC synthesis in the loop. This approach provides reliable feedback on the collective impact of all cross-domain design choices. We showcase the effectiveness of our approach by finding several Pareto-optimal design choices for effective and efficient neural networks that perform real-time feature extraction from input pulses within the individual pixels of a readout ASIC.
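As an illustration of what "Pareto-optimal design choices" means here, the following minimal sketch filters a set of candidate designs down to its non-dominated subset. It covers only the Pareto-filtering step, not the Bayesian optimization or the ASIC synthesis loop, and the objective triples (error, power, delay) are hypothetical numbers chosen for the example, not results from the paper.

```python
def dominates(a, b):
    """True if design a is at least as good as b in every objective
    and strictly better in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (error, power in mW, delay in ns) triples; illustration only.
candidates = [
    (0.10, 0.80, 5.0),
    (0.12, 0.50, 4.0),
    (0.10, 0.90, 5.0),  # same error and delay as the first, but more power
    (0.20, 0.40, 6.0),
    (0.25, 0.60, 7.0),  # worse than (0.20, 0.40, 6.0) in every objective
]

front = pareto_front(candidates)
print(front)
```

The quadratic scan is acceptable for small candidate sets; a search over billions of design points would rely on the optimizer to propose candidates rather than enumerate them.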
Title:
Learning Visual Grounding from Generative Vision and Language Model
Authors: Shijie Wang, Dahun Kim, Ali Taalimi, Chen Sun, Weicheng Kuo
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Visual grounding tasks aim to localize image regions based on natural language references. In this work, we explore whether generative VLMs predominantly trained on image-text data could be leveraged to scale up the text annotation of visual grounding data. We find that grounding knowledge already exists in generative VLMs and can be elicited by proper prompting. We thus prompt a VLM to generate object-level descriptions by feeding it object regions from existing object detection datasets. We further propose attribute modeling to explicitly capture the important object attributes, and spatial relation modeling to capture inter-object relationships, both of which are common linguistic patterns in referring expressions. Our constructed dataset (500K images, 1M objects, 16M referring expressions) is one of the largest grounding datasets to date, and the first grounding dataset with purely model-generated queries and human-annotated objects. To verify the quality of this data, we conduct zero-shot transfer experiments to the popular RefCOCO benchmarks for both referring expression comprehension (REC) and segmentation (RES) tasks. On both tasks, our model significantly outperforms the state-of-the-art approaches without using human-annotated visual grounding data. Our results demonstrate the promise of generative VLMs to scale up visual grounding in the real world. Code and models will be released.
Title:
Operating System And Artificial Intelligence: A Systematic Review
Abstract
In the dynamic landscape of technology, the convergence of Artificial Intelligence (AI) and Operating Systems (OS) has emerged as a pivotal arena for innovation. Our exploration focuses on the symbiotic relationship between AI and OS, emphasizing how AI-driven tools enhance OS performance, security, and efficiency, while OS advancements facilitate more sophisticated AI applications. We delve into various AI techniques employed to optimize OS functionalities, including memory management, process scheduling, and intrusion detection. Simultaneously, we analyze the role of OS in providing essential services and infrastructure that enable effective AI application execution, from resource allocation to data processing. The article also addresses challenges and future directions in this domain, emphasizing the imperative of secure and efficient AI integration within OS frameworks. By examining case studies and recent developments, our review provides a comprehensive overview of the current state of AI-OS integration, underscoring its significance in shaping the next generation of computing technologies. Finally, we explore the promising prospects of Intelligent OSes, considering not only how innovative OS architectures will pave the way for groundbreaking opportunities but also how AI will significantly contribute to advancing these next-generation OSs.
Title:
The Research of Group Re-identification from Multiple Cameras
Authors: Hao Xiao
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Object re-identification is of increasing importance in visual surveillance. Most existing works focus on re-identifying individuals across multiple cameras, while the application of group re-identification (Re-ID) is rarely discussed. We redefine group re-identification as a process that includes pedestrian detection, feature extraction, graph model construction, and graph matching. Group re-identification is very challenging since it is not only affected by the viewpoint and human pose variations found in traditional re-identification tasks, but also suffers from changes in group layout and group membership. To address the above challenges, this paper introduces a novel approach which leverages the multi-granularity information inside groups to facilitate group re-identification. We first introduce a multi-granularity Re-ID process, which derives features for multi-granularity objects (people/people-subgroups) in a group and iteratively evaluates their importance during group Re-ID, so as to handle group-wise misalignments due to viewpoint change and group dynamics. We further introduce a multi-order matching scheme. It adaptively selects representative people/people-subgroups in each group and integrates the multi-granularity information from these people/people-subgroups to obtain group-wise matching, hence achieving a more reliable matching score between groups. Experimental results on various datasets demonstrate the effectiveness of our approach.
Title:
Advancing Melanoma Diagnosis with Self-Supervised Neural Networks: Evaluating the Effectiveness of Different Techniques
Abstract
We investigate the potential of self-supervision in improving the accuracy of deep learning models trained to classify melanoma patches. Various self-supervision techniques such as rotation prediction, missing patch prediction, and corruption removal were implemented and assessed for their impact on the convolutional neural network's performance. Preliminary results suggest a positive influence of self-supervision methods on the model's accuracy. The study notably demonstrates the efficacy of the corruption removal method in enhancing model performance. Despite observable improvements, we conclude that the self-supervised models have considerable potential for further enhancement, achievable through training over more epochs or expanding the dataset. We suggest exploring other self-supervision methods like Bootstrap Your Own Latent (BYOL) and contrastive learning in future research, emphasizing the cost-benefit trade-off due to their resource-intensive nature. The findings underline the promise of self-supervision in augmenting melanoma detection capabilities of deep learning models.
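The rotation-prediction pretext task mentioned above can be sketched in a few lines: each unlabeled patch yields four training samples, one per rotation, and the rotation index serves as a free label. This is a minimal sketch assuming images are plain nested lists of pixel values; the function names are hypothetical and the abstract does not describe the authors' actual pipeline.

```python
def rotate90(img):
    """Rotate a 2D list-of-lists image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def rotation_pretext_samples(img):
    """Return [(rotated_image, label)] with labels 0..3 standing for
    0, 90, 180, and 270 degrees of clockwise rotation."""
    samples = []
    current = [list(row) for row in img]  # copy so the input is not mutated
    for label in range(4):
        samples.append((current, label))
        current = rotate90(current)
    return samples


patch = [[1, 2],
         [3, 4]]
samples = rotation_pretext_samples(patch)
for rotated, label in samples:
    print(label, rotated)
```

A classifier trained to predict the label from the rotated patch must learn orientation-sensitive features, which is the property the pretext task relies on.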
Title:
A New Lightweight Hybrid Graph Convolutional Neural Network -- CNN Scheme for Scene Classification using Object Detection Inference
Abstract
Scene understanding plays an important role in several high-level computer vision applications, such as autonomous vehicles, intelligent video surveillance, or robotics. However, too few solutions have been proposed for indoor/outdoor scene classification to ensure scene context adaptability for computer vision frameworks. We propose the first Lightweight Hybrid Graph Convolutional Neural Network (LH-GCNN)-CNN framework as an add-on to object detection models. The proposed approach uses the output of the CNN object detection model to predict the observed scene type by generating a coherent GCNN representing the semantic and geometric content of the observed scene. This new method, applied to natural scenes, achieves an efficiency of over 90% for scene classification in a COCO-derived dataset containing a large number of different scenes, while requiring fewer parameters than traditional CNN methods. For the benefit of the scientific community, we will make the source code publicly available: this https URL.
Title:
DefTesPY: Cyber defense model with enhanced data modeling and analysis for Tesla company via Python Language
Authors: Naresh Kshetri, Irin Sultana, Mir Mehedi Rahman, Darshana Shah
Subjects:
Cryptography and Security (cs.CR)
Abstract
Several types of cyber-attacks on automobiles and business firms keep rising even as we prepare to counter cybercrimes with several new technologies and defense models. Cyber defense (also, counter intelligence) is a computer network defense mechanism that involves response to activities, critical infrastructure protection, and information assurance for corporations, government bodies, and other conceivable networks. Cyber defense focuses on preventing, detecting, and responding to assaults or threats in a timely manner so that no infrastructure or information is compromised. With the increasing volume and complexity of cyber threats, most companies need cyber defense to protect sensitive information and assets. We can control attacker actions by utilizing firewalls at different levels, an intrusion detection system (IDS), and an intrusion prevention system (IPS), which can be installed independently or in combination with other protection approaches. Tesla is an American clean energy and automotive company in Austin, Texas, USA. The recent data breach at Tesla affected over 75,000 individuals, and the company pinpoints two former employees as the offenders, revealing more than 23,000 internal files from 2015 to 2022. In this work, we emphasize data modeling and data analysis using a cyber defense model and Python, with a survey of the Tesla company. We propose a defense model, DefTesPY, with enhanced data modeling and data analysis based on the cyber-attacks and cybercrimes that Tesla has encountered to date.
Title:
CrowdMAC: Masked Crowd Density Completion for Robust Crowd Density Forecasting
Abstract
A crowd density forecasting task aims to predict how the crowd density map will change in the future from observed past crowd density maps. However, the past crowd density maps are often incomplete due to the miss-detection of pedestrians, and it is crucial to develop a robust crowd density forecasting model against the miss-detection. This paper presents a MAsked crowd density Completion framework for crowd density forecasting (CrowdMAC), which is simultaneously trained to forecast future crowd density maps from partially masked past crowd density maps (i.e., forecasting maps from past maps with miss-detection) while reconstructing the masked observation maps (i.e., imputing past maps with miss-detection). Additionally, we propose Temporal-Density-aware Masking (TDM), which non-uniformly masks tokens in the observed crowd density map, considering the sparsity of the crowd density maps and the informativeness of the subsequent frames for the forecasting task. Moreover, we introduce multi-task masking to enhance training efficiency. In the experiments, CrowdMAC achieves state-of-the-art performance on seven large-scale datasets, including SDD, ETH-UCY, inD, JRDB, VSCrowd, FDST, and croHD. We also demonstrate the robustness of the proposed method against both synthetic and realistic miss-detections.
Title:
Early Detection of Coffee Leaf Rust Through Convolutional Neural Networks Trained on Low-Resolution Images
Abstract
Coffee leaf rust, a foliar disease caused by the fungus Hemileia vastatrix, poses a major threat to coffee production, especially in Central America. Climate change further aggravates this issue, as it shortens the latency period between initial infection and the emergence of visible symptoms in diseases like leaf rust. Shortened latency periods can lead to more severe plant epidemics and faster spread of diseases. There is, hence, an urgent need for effective disease management strategies. To address these challenges, we explore the potential of deep learning models for enhancing early disease detection. However, deep learning models require extensive processing power and large amounts of data for model training, resources that are typically scarce. To overcome these barriers, we propose a preprocessing technique that involves convolving training images with a high-pass filter to enhance lesion-leaf contrast, significantly improving model efficacy in resource-limited environments. This method and our model demonstrated a strong performance, achieving over 90% across all evaluation metrics--including precision, recall, F1-score, and the Dice coefficient. Our experiments show that this approach outperforms other methods, including two different image preprocessing techniques and using unaltered, full-color images.
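The preprocessing idea, convolving an image with a high-pass filter so that local intensity changes such as lesion edges stand out, can be sketched as follows. The abstract does not state which kernel is used, so the 3x3 Laplacian-style kernel below is an assumption, and the image is a toy grayscale grid rather than real leaf data.

```python
# Assumed 3x3 high-pass kernel (coefficients sum to zero); the paper's
# actual filter is not given in the abstract.
HIGH_PASS = [
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
]


def convolve2d_valid(image, kernel):
    """Slide the kernel over a 2D list-of-lists grayscale image, keeping only
    positions where the kernel fits entirely inside the image ("valid" mode).
    The kernel is not flipped; for a symmetric kernel like HIGH_PASS this is
    identical to true convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output


flat_region = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]   # uniform leaf surface
bright_spot = [[0, 0, 0], [0, 10, 0], [0, 0, 0]]  # isolated lesion-like pixel

flat_response = convolve2d_valid(flat_region, HIGH_PASS)
spot_response = convolve2d_valid(bright_spot, HIGH_PASS)
print(flat_response, spot_response)
```

Because the kernel coefficients sum to zero, uniform regions are suppressed while isolated intensity changes are amplified, which is the contrast-enhancing behavior the abstract attributes to the high-pass step.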
Title:
Enhancing Skin Disease Classification Leveraging Transformer-based Deep Learning Architectures and Explainable AI
Authors: Jayanth Mohan, Arrun Sivasubramanian, V Sowmya, Ravi Vinayakumar
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Skin diseases affect over a third of the global population, yet their impact is often underestimated. Automating skin disease classification to assist doctors with their prognosis might be difficult. Nevertheless, due to efficient feature extraction pipelines, deep learning techniques have shown much promise for various tasks, including dermatological disease identification. This study uses a skin disease dataset with 31 classes and evaluates all versions of Vision Transformers, Swin Transformers, and DinoV2 on it. The analysis is also extended to compare with the benchmark convolution-based architectures presented in the literature. Transfer learning with ImageNet1k weights on the skin disease dataset contributes to a high test accuracy of 96.48% and an F1-Score of 0.9727 using DinoV2, which is almost a 10% improvement over this data's current benchmark results. The performance of DinoV2 was also compared for the HAM10000 and Dermnet datasets to test the model's robustness, and the trained model surpasses the benchmark results by a slight margin in test accuracy and in F1-Score on the 23 and 7 class datasets. The results are substantiated using explainable AI frameworks like GradCAM and SHAP, which provide precise image locations to map the disease, assisting dermatologists in early detection, prompt prognosis, and treatment.
Title:
A Two-Phase Visualization System for Continuous Human-AI Collaboration in Sequelae Analysis and Modeling
Authors: Yang Ouyang, Chenyang Zhang, He Wang, Tianle Ma, Chang Jiang, Yuheng Yan, Zuoqin Yan, Xiaojuan Ma, Chuhan Shi, Quan Li
Abstract
In healthcare, AI techniques are widely used for tasks like risk assessment and anomaly detection. Despite AI's potential as a valuable assistant, its role in complex medical data analysis often oversimplifies human-AI collaboration dynamics. To address this, we collaborated with a local hospital, engaging six physicians and one data scientist in a formative study. From this collaboration, we propose a framework integrating two-phase interactive visualization systems: one for Human-Led, AI-Assisted Retrospective Analysis and another for AI-Mediated, Human-Reviewed Iterative Modeling. This framework aims to enhance understanding and discussion around effective human-AI collaboration in healthcare.
Title:
Automatic Real-word Error Correction in Persian Text
Authors: Seyed Mohammad Sadegh Dashti, Amid Khatibi Bardsiri, Mehdi Jafari Shahbazzadeh
Subjects:
Computation and Language (cs.CL)
Abstract
Automatic spelling correction stands as a pivotal challenge within the ambit of natural language processing (NLP), demanding nuanced solutions. Traditional spelling correction techniques are typically only capable of detecting and correcting non-word errors, such as typos and misspellings. However, context-sensitive errors, also known as real-word errors, are more challenging to detect because they are valid words that are used incorrectly in a given context. The Persian language, characterized by its rich morphology and complex syntax, presents formidable challenges to automatic spelling correction systems. Furthermore, the limited availability of Persian language resources makes it difficult to train effective spelling correction models. This paper introduces a cutting-edge approach for precise and efficient real-word error correction in Persian text. Our methodology adopts a structured, multi-tiered approach, employing semantic analysis, feature selection, and advanced classifiers to enhance error detection and correction efficacy. The innovative architecture discovers and stores semantic similarities between words and phrases in Persian text. The classifiers accurately identify real-word errors, while the semantic ranking algorithm determines the most probable corrections for real-word errors, taking into account specific spelling correction and context properties such as context, semantic similarity, and edit-distance measures. Evaluations have demonstrated that our proposed method surpasses previous Persian real-word error correction models. Our method achieves an impressive F-measure of 96.6% in the detection phase and an accuracy of 99.1% in the correction phase. These results clearly indicate that our approach is a highly promising solution for automatic real-word error correction in Persian text.
Title:
Toward Efficient Convolutional Neural Networks With Structured Ternary Patterns
Abstract
High-efficiency deep learning (DL) models are necessary not only to facilitate their use in devices with limited resources but also to reduce the resources required for training. Convolutional neural networks (ConvNets) typically exert severe demands on local device resources and this conventionally limits their adoption within mobile and embedded platforms. This brief presents work toward utilizing static convolutional filters generated from the space of local binary patterns (LBPs) and Haar features to design efficient ConvNet architectures. These are referred to as Structured Ternary Patterns (STePs) and can be generated during network initialization in a systematic way instead of having learnable weight parameters, thus reducing the total weight updates. The ternary values require significantly less storage and, with the appropriate low-level implementation, can also lead to inference improvements. The proposed approach is validated using four image classification datasets, demonstrating that common network backbones can be made more efficient and provide competitive results. It is also demonstrated that it is possible to generate completely custom STeP-based networks that provide good trade-offs for on-device applications such as unmanned aerial vehicle (UAV)-based aerial vehicle detection. The experimental results show that the proposed method maintains high detection accuracy while reducing the trainable parameters by 40-80%. This work motivates further research toward good priors for non-learnable weights that can make DL architectures more efficient without having to alter the network during or after training.
Title:
Retrieval Augmented Generation Integrated Large Language Models in Smart Contract Vulnerability Detection
Authors: Jeffy Yu
Subjects:
Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Abstract
The rapid growth of Decentralized Finance (DeFi) has been accompanied by substantial financial losses due to smart contract vulnerabilities, underscoring the critical need for effective security auditing. With attacks becoming more frequent, the necessity and demand for auditing services has escalated. This especially creates a financial burden for independent developers and small businesses, who often have limited available funding for these services. Our study builds upon existing frameworks by integrating Retrieval-Augmented Generation (RAG) with large language models (LLMs), specifically employing GPT-4-1106 for its 128k token context window. We construct a vector store of 830 known vulnerable contracts, leveraging Pinecone for vector storage, OpenAI's text-embedding-ada-002 for embeddings, and LangChain to construct the RAG-LLM pipeline. Prompts were designed to provide a binary answer for vulnerability detection. We first test 52 smart contracts 40 times each against a provided vulnerability type, verifying the replicability and consistency of the RAG-LLM. Encouraging results were observed, with a 62.7% success rate in guided detection of vulnerabilities. Second, we challenge the model under a "blind" audit setup, without the vulnerability type provided in the prompt, wherein 219 contracts undergo 40 tests each. This setup evaluates the general vulnerability detection capabilities without hinted context assistance. Under these conditions, a 60.71% success rate was observed. While the results are promising, we still emphasize the need for human auditing at this time. We provide this study as a proof of concept for a cost-effective smart contract auditing process, moving towards democratic access to security.
Title:
Realistic Surgical Image Dataset Generation Based On 3D Gaussian Splatting
Authors: Tianle Zeng, Gerardo Loza Galindo, Junlei Hu, Pietro Valdastri, Dominic Jones
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Computer vision technologies markedly enhance the automation capabilities of robotic-assisted minimally invasive surgery (RAMIS) through advanced tool tracking, detection, and localization. However, the limited availability of comprehensive surgical datasets for training represents a significant challenge in this field. This research introduces a novel method that employs 3D Gaussian Splatting to generate synthetic surgical datasets. We propose a method for extracting and combining 3D Gaussian representations of surgical instruments and background operating environments, transforming and combining them to generate high-fidelity synthetic surgical scenarios. We developed a data recording system capable of acquiring images alongside tool and camera poses in a surgical scene. Using this pose data, we synthetically replicate the scene, thereby enabling direct comparisons of the synthetic image quality (29.592 PSNR). As a further validation, we compared two YOLOv5 models trained on the synthetic and real data, respectively, and assessed their performance in an unseen real-world test dataset. Comparing the performances, we observe an improvement in neural network performance, with the synthetic-trained model outperforming the real-world trained model by 12%, testing both on real-world data.
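The image-quality figure quoted above is a peak signal-to-noise ratio (PSNR), conventionally reported in decibels. For reference, the standard definition is PSNR = 10 * log10(MAX^2 / MSE), where MSE is the mean squared pixel error between the real and synthetic image. The sketch below implements that textbook formula for grayscale images stored as nested lists; the 8-bit peak value of 255 is an assumption, since the abstract does not state the image format.

```python
import math


def psnr(reference, rendered, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized
    2D list-of-lists images. Returns math.inf for identical images."""
    squared_error = 0.0
    count = 0
    for ref_row, ren_row in zip(reference, rendered):
        for ref_px, ren_px in zip(ref_row, ren_row):
            squared_error += (ref_px - ren_px) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


real = [[100, 100], [100, 100]]       # toy "captured" image
synthetic = [[110, 90], [100, 100]]   # toy "rendered" image

print(round(psnr(real, synthetic), 2))
```

Higher values indicate a closer match: a PSNR near 30 dB corresponds to a mean squared error of roughly 65 on an 8-bit scale.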
Title:
An Explainable Fast Deep Neural Network for Emotion Recognition
Authors: Francesco Di Luzio, Antonello Rosato, Massimo Panella
Abstract
In the context of artificial intelligence, the inherent human attribute of engaging in logical reasoning to facilitate decision-making is mirrored by the concept of explainability, which pertains to the ability of a model to provide a clear and interpretable account of how it arrived at a particular outcome. This study explores explainability techniques for binary deep neural architectures in the framework of emotion classification through video analysis. We investigate the optimization of input features to binary classifiers for emotion recognition, with facial landmark detection using an improved version of the Integrated Gradients explainability method. The main contribution of this paper consists in the employment of an innovative explainable artificial intelligence algorithm to understand the crucial facial landmark movements during emotional feeling, using this information also for improving the performance of deep learning-based emotion classifiers. By means of explainability, we can optimize the number and the position of the facial landmarks used as input features for facial emotion recognition, lowering the impact of noisy landmarks and thus increasing the accuracy of the developed models. In order to test the effectiveness of the proposed approach, we considered a set of deep binary models for emotion classification trained initially with a complete set of facial landmarks, which are progressively reduced based on a suitable optimization procedure. The obtained results prove the robustness of the proposed explainable approach in terms of understanding the relevance of the different facial points for the different emotions, also improving the classification accuracy and diminishing the computational cost.
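For orientation, the standard Integrated Gradients method attributes a prediction to each input feature by averaging the model's gradient along a straight path from a baseline to the input and scaling by the input-baseline difference. The sketch below shows that standard formulation only, not the improved variant the paper proposes, and it uses a hypothetical toy model with an analytic gradient in place of a neural network.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum.
    grad_fn(point) must return the gradient of the model output with
    respect to each input feature at that point."""
    avg_grads = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps  # midpoint of the s-th path segment
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grads[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grads)]


# Hypothetical toy model: f(p) = sum(w_i * p_i^2). The third feature has
# zero weight, standing in for an uninformative ("noisy") landmark.
WEIGHTS = [1.0, 0.5, 0.0]


def model(p):
    return sum(w * v * v for w, v in zip(WEIGHTS, p))


def model_grad(p):
    return [2.0 * w * v for w, v in zip(WEIGHTS, p)]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(model_grad, x, baseline)
print(attributions)
```

The attributions sum to the difference between the model output at the input and at the baseline, and a feature the model ignores receives zero attribution, which is the kind of signal that could justify dropping a landmark from the input set.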
Title:
Automated Patient Positioning with Learned 3D Hand Gestures
Abstract
Positioning patients for scanning and interventional procedures is a critical task that requires high precision and accuracy. The conventional workflow involves manually adjusting the patient support to align the center of the target body part with the laser projector or other guiding devices. This process is not only time-consuming but also prone to inaccuracies. In this work, we propose an automated patient positioning system that utilizes a camera to detect specific hand gestures from technicians, allowing users to indicate the target patient region to the system and initiate automated positioning. Our approach relies on a novel multi-stage pipeline to recognize and interpret the technicians' gestures, translating them into precise motions of medical devices. We evaluate our proposed pipeline during actual MRI scanning procedures, using RGB-Depth cameras to capture the process. Results show that our system achieves accurate and precise patient positioning with minimal technician intervention. Furthermore, we validate our method on HaGRID, a large-scale hand gesture dataset, demonstrating its effectiveness in hand detection and gesture recognition.
Title:
Self-supervised transformer-based pre-training method with General Plant Infection dataset
Abstract
Pest and disease classification is a challenging issue in agriculture. The performance of deep learning models is intricately linked to training data diversity and quantity, posing issues for plant pest and disease datasets that remain underdeveloped. This study addresses these challenges by constructing a comprehensive dataset and proposing an advanced network architecture that combines Contrastive Learning and Masked Image Modeling (MIM). The dataset comprises diverse plant species and pest categories, making it one of the largest and most varied in the field. The proposed network architecture demonstrates effectiveness in addressing plant pest and disease recognition tasks, achieving notable detection accuracy. This approach offers a viable solution for rapid, efficient, and cost-effective plant pest and disease detection, thereby reducing agricultural production costs. To advance research in plant pest and disease recognition, our code and dataset will be publicly available in the GitHub repository at this https URL
Title:
RayFormer: Improving Query-Based Multi-Camera 3D Object Detection via Ray-Centric Strategies
Abstract
Recent advances in query-based multi-camera 3D object detection are characterized by initializing object queries in 3D space and then sampling features from perspective-view images to perform multi-round query refinement. In such a framework, query points near the same camera ray are likely to sample similar features from very close pixels, resulting in ambiguous query features and degraded detection accuracy. To this end, we introduce RayFormer, a camera-ray-inspired query-based 3D object detector that aligns the initialization and feature extraction of object queries with the optical characteristics of cameras. Specifically, RayFormer transforms perspective-view image features into bird's eye view (BEV) via the lift-splat-shoot method and segments the BEV map into sectors based on the camera rays. Object queries are uniformly and sparsely initialized along each camera ray, facilitating the projection of different queries onto different areas in the image to extract distinct features. In addition, we leverage the instance information of images to supplement the uniformly initialized object queries with additional queries along the rays from 2D object detection boxes. To extract unique object-level features that cater to distinct queries, we design a ray sampling method that suitably organizes the distribution of feature sampling points on both images and bird's eye view. Extensive experiments are conducted on the nuScenes dataset to validate our proposed ray-inspired model design. The proposed RayFormer achieves 55.5% mAP and 63.3% NDS, respectively. Our code will be made available.
Title:
Efficient Intrusion Detection: Combining $\chi^2$ Feature Selection with CNN-BiLSTM on the UNSW-NB15 Dataset
Authors: Mohammed Jouhari, Hafsa Benaddi, Khalil Ibrahimi
Subjects:
Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR)
Abstract
Intrusion Detection Systems (IDSs) have played a significant role in the detection and prevention of cyber-attacks in traditional computing systems. It is not surprising that this technology is now being applied to secure Internet of Things (IoT) networks against cyber threats. However, the limited computational resources available on IoT devices pose a challenge for deploying conventional computing-based IDSs. IDSs designed for IoT environments must demonstrate high classification performance, and utilize low-complexity models. Developing intrusion detection models in the field of IoT has seen significant advancements. However, achieving a balance between high classification performance and reduced complexity remains a challenging endeavor. In this research, we present an effective IDS model that addresses this issue by combining a lightweight Convolutional Neural Network (CNN) with bidirectional Long Short-Term Memory (BiLSTM). Additionally, we employ feature selection techniques to minimize the number of features inputted into the model, thereby reducing its complexity. This approach renders the proposed model highly suitable for resource-constrained IoT devices, ensuring it meets their computation capability requirements. Creating a model that meets the demands of IoT devices and attains enhanced precision is a challenging task. However, our suggested model outperforms previous works in the literature by attaining a remarkable accuracy rate of 97.90% within a prediction time of 1.1 seconds for binary classification. Furthermore, it achieves an accuracy rate of 97.09% within a prediction time of 2.10 seconds for multiclassification.
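The chi-square feature selection named in the title ranks each feature by how strongly it depends on the class label and keeps only the top-ranked ones. The sketch below uses the classical Pearson chi-square statistic on a contingency table of categorical feature values versus labels. It is an illustration under assumptions: the feature names and toy data are hypothetical, and the authors' exact scoring routine (for example, a library implementation operating on non-negative numeric features) may differ.

```python
from collections import Counter


def chi_square_score(feature, labels):
    """Pearson chi-square statistic between one categorical feature column
    and the class labels: sum of (observed - expected)^2 / expected over
    all cells of the contingency table."""
    n = len(labels)
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(feature, labels))
    score = 0.0
    for f_value, f_count in feature_counts.items():
        for l_value, l_count in label_counts.items():
            expected = f_count * l_count / n
            observed = joint_counts.get((f_value, l_value), 0)
            score += (observed - expected) ** 2 / expected
    return score


def top_k_features(columns, labels, k):
    """Return the names of the k features with the highest chi-square score."""
    scores = {name: chi_square_score(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy traffic records: label 1 = attack, 0 = normal.
labels = [0, 0, 1, 1]
columns = {
    "informative_flag": [0, 0, 1, 1],  # perfectly aligned with the label
    "noise_flag": [0, 1, 0, 1],        # independent of the label
}

selected = top_k_features(columns, labels, k=1)
print(selected)
```

Dropping low-scoring features before training shrinks the model's input dimension, which is the complexity reduction the abstract attributes to feature selection.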
Title:
Requiem for a drone: a machine-learning based framework for stealthy attacks against unmanned autonomous vehicles
Authors: Kyo Hyun Kim, Denizhan Kara, Vineetha Paruchuri, Sibin Mohan, Greg Kimberly, Jae Kim, Josh Eckhardt
Subjects:
Cryptography and Security (cs.CR)
Abstract
There is a space of uncertainty in the modeling of vehicular dynamics of autonomous systems due to noise in sensor readings, environmental factors or modeling errors. We present Requiem, a software-only, blackbox approach that exploits this space in a stealthy manner causing target systems, e.g., unmanned aerial vehicles (UAVs), to significantly deviate from their mission parameters. Our system achieves this by modifying sensor values, all while avoiding detection by onboard anomaly detectors (hence, "stealthy"). The Requiem framework uses a combination of multiple deep learning models (that we refer to as "surrogates" and "spoofers") coupled with extensive, realistic simulations on a software-in-the-loop quadrotor UAV system. Requiem makes no assumptions about either the (types of) sensors or the onboard state estimation algorithm(s) -- it works so long as the latter is "learnable". We demonstrate the effectiveness of our system using various attacks across multiple missions as well as multiple sets of statistical analyses. We show that Requiem successfully exploits the modeling errors (i.e., causes significant deviations from planned mission parameters) while remaining stealthy (no detection even after tens of meters of deviation) and is generalizable (Requiem has potential to work across different attacks and sensor types).
Title:
ViT LoS V2X: Vision Transformers for Environment-aware LoS Blockage Prediction for 6G Vehicular Networks
Authors: Ghazi Gharsallah, Georges Kaddoum
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Abstract
As wireless communication technology progresses towards the sixth generation (6G), high-frequency millimeter-wave (mmWave) communication has emerged as a promising candidate for enabling vehicular networks. It offers high data rates and low-latency communication. However, obstacles such as buildings, trees, and other vehicles can cause signal attenuation and blockage, leading to communication failures that can result in fatal accidents or traffic congestion. Predicting blockages is crucial for ensuring reliable and efficient communications. Furthermore, the advent of 6G technology is anticipated to integrate advanced sensing capabilities, utilizing a variety of sensor types. These sensors, ranging from traditional RF sensors to cameras and Lidar sensors, are expected to provide access to rich multimodal data, thereby enriching communication systems with a wealth of additional contextual information. Leveraging this multimodal data becomes essential for making precise network management decisions, including the crucial task of blockage detection. In this paper, we propose a Deep Learning (DL)-based approach that combines Convolutional Neural Networks (CNNs) and customized Vision Transformers (ViTs) to effectively extract essential information from multimodal data and predict blockages in vehicular networks. Our method capitalizes on the synergistic strengths of CNNs and ViTs to extract features from time-series multimodal data, which include images and beam vectors. To capture temporal dependencies between the extracted features and the blockage state at future time steps, we employ a Gated Recurrent Unit (GRU)-based architecture. Our results show that the proposed approach achieves high accuracy and outperforms state-of-the-art solutions, achieving more than 95% accurate predictions.
Title:
A General Framework for Data-Use Auditing of ML Models
Authors: Zonghao Huang, Neil Zhenqiang Gong, Michael K. Reiter
Subjects:
Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract
Auditing the use of data in training machine-learning (ML) models is an increasingly pressing challenge, as myriad ML practitioners routinely leverage the effort of content creators to train models without their permission. In this paper, we propose a general method to audit an ML model for the use of a data-owner's data in training, without prior knowledge of the ML task for which the data might be used. Our method leverages any existing black-box membership inference method, together with a sequential hypothesis test of our own design, to detect data use with a quantifiable, tunable false-detection rate. We show the effectiveness of our proposed framework by applying it to audit data use in two types of ML models, namely image classifiers and foundation models.
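The abstract does not describe the authors' sequential test, so the following is only a minimal sketch of the general idea using Wald's classical sequential probability ratio test (SPRT) as a stand-in. The function name `sprt`, the probabilities `p0` and `p1`, and the meaning of each observation (1 when a black-box membership-inference score favours the owner's item over a comparable item known to be unused) are all illustrative assumptions, not details from the paper.

```python
import math

def sprt(observations, p0=0.5, p1=0.7, alpha=0.05, beta=0.05):
    """Wald's SPRT on 0/1 observations (a stand-in, not the paper's test).

    H0: success probability is p0 (data not used in training).
    H1: success probability is p1 (data used in training).
    alpha bounds the false-detection rate, beta the missed-detection rate.
    Returns (decision, number_of_observations_consumed).
    """
    upper = math.log((1 - beta) / alpha)   # crossing above accepts H1
    lower = math.log(beta / (1 - alpha))   # crossing below accepts H0
    llr = 0.0
    n = 0
    for n, x in enumerate(observations, start=1):
        # Accumulate the log-likelihood ratio of H1 against H0.
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "data used", n
        if llr <= lower:
            return "data not used", n
    return "undecided", n

decision, queries = sprt([1] * 20)
```

Lowering `alpha` in this sketch raises the upper threshold, so more consistent evidence is needed before use is declared; this is one simple way a false-detection rate can be made tunable.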
Title:
UAV Active Perception and Motion Control for Improving Navigation Using Low-Cost Sensors
Abstract
In this study, a model pipeline is proposed that combines computer vision with control-theoretic methods and utilizes low-cost sensors. The proposed work enables perception-aware motion control for a quadrotor UAV to detect and navigate to objects of interest such as wind turbines and electric towers. The distance to the object of interest was estimated utilizing RGB as the primary sensory input. For the needs of the study, the Microsoft AirSim simulator was used. As a first step, a YOLOv8 model was integrated, providing the basic position setpoints towards the detection. From the YOLOv8 inference, a target yaw angle was derived. The subsequent algorithms, combining computationally efficient computer vision methods with YOLOv8, actively drove the drone to measure the height of the detection. Based on the height, an estimate of the depth was retrieved. In addition to this step, a convolutional neural network, namely ActvePerceptionNet, was developed, aiming at active YOLOv8 inference. The latter was validated for wind turbines, where the rotational motion of the propeller was found to affect object confidence in a nearly periodic fashion. The results of the simulation experiments conducted in this study showed efficient object height and distance estimation and effective localization.
Title:
Rethinking Feature Backbone Fine-tuning for Remote Sensing Object Detection
Authors: Yechan Kim, JongHyun Park, SooYeon Kim, Moongu Jeon
Abstract
Recently, numerous methods have achieved impressive performance in remote sensing object detection, relying on convolution or transformer architectures. Such detectors typically have a feature backbone to extract useful features from raw input images. For the remote sensing domain, a common practice among current detectors is to initialize the backbone with pre-training on ImageNet consisting of natural scenes. Fine-tuning the backbone is typically required to generate features suitable for remote-sensing images. However, this could hinder the extraction of basic visual features in long-term training, thus restricting performance improvement. To mitigate this issue, we propose a novel method named DBF (Dynamic Backbone Freezing) for feature backbone fine-tuning on remote sensing object detection. Our method aims to handle the dilemma of whether the backbone should extract low-level generic features or possess specific knowledge of the remote sensing domain, by introducing a module called 'Freezing Scheduler' to dynamically manage the update of backbone features during training. Extensive experiments on DOTA and DIOR-R show that our approach enables more accurate model learning while substantially reducing computational costs. Our method can be seamlessly adopted without additional effort due to its straightforward design.
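The abstract does not specify how the Freezing Scheduler decides when the backbone is updated. As a rough illustration only, the sketch below assumes a fixed periodic rule: the backbone stays frozen for `freeze_steps` training steps and is then updated for `train_steps` steps. The function name and both parameters are hypothetical.

```python
def backbone_trainable(step, freeze_steps=3, train_steps=1):
    """Hypothetical periodic rule: True when the backbone should be updated.

    Each period consists of `freeze_steps` frozen steps followed by
    `train_steps` steps in which backbone weights receive updates.
    """
    period = freeze_steps + train_steps
    return step % period >= freeze_steps

# Flags for the first eight training steps under the default settings.
schedule = [backbone_trainable(s) for s in range(8)]
updated_fraction = sum(schedule) / len(schedule)
```

In a PyTorch training loop, the returned flag could be applied to each backbone parameter with `requires_grad_(flag)`. Under the assumed defaults the backbone is updated on a quarter of the steps, which gives a feel for where savings in backward-pass computation could come from.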
Title:
Mitigating Deep Reinforcement Learning Backdoors in the Neural Activation Space
Authors: Sanyam Vyas, Chris Hicks, Vasilios Mavroudis
Abstract
This paper investigates the threat of backdoors in Deep Reinforcement Learning (DRL) agent policies and proposes a novel method for their detection at runtime. Our study focuses on elusive in-distribution backdoor triggers. Such triggers are designed to induce a deviation in the behaviour of a backdoored agent while blending into the expected data distribution to evade detection. Through experiments conducted in the Atari Breakout environment, we demonstrate the limitations of current sanitisation methods when faced with such triggers and investigate why they present a challenging defence problem. We then evaluate the hypothesis that backdoor triggers might be easier to detect in the neural activation space of the DRL agent's policy network. Our statistical analysis shows that indeed the activation patterns in the agent's policy network are distinct in the presence of a trigger, regardless of how well the trigger is concealed in the environment. Based on this, we propose a new defence approach that uses a classifier trained on clean environment samples and detects abnormal activations. Our results show that even lightweight classifiers can effectively prevent malicious actions with considerable accuracy, indicating the potential of this research direction even against sophisticated adversaries.
Title:
Semi-Supervised Pipe Video Temporal Defect Interval Localization
Authors: Zhu Huang, Gang Pan, Chao Kang, YaoZhi Lv
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
In sewer pipe Closed-Circuit Television (CCTV) inspection, accurate temporal defect localization is essential for effective defect classification, detection, segmentation and quantification. Industry standards typically do not require time-interval annotations, even though they are more informative than time-point annotations for defect localization, resulting in additional annotation costs when fully supervised methods are used. Additionally, differences in scene types and camera motion patterns between pipe inspections and Temporal Action Localization (TAL) hinder the effective transfer of point-supervised TAL methods. Therefore, this study introduces a Semi-supervised multi-Prototype-based method incorporating visual Odometry for enhanced attention guidance (PipeSPO). PipeSPO fully leverages unlabeled data through unsupervised pretext tasks and utilizes time-point annotated data with a weakly supervised multi-prototype-based method, relying on visual odometry features to capture camera pose information. Experiments on real-world datasets demonstrate that PipeSPO achieves 41.89% average precision across Intersection over Union (IoU) thresholds of 0.1-0.7, improving by 8.14% over current state-of-the-art methods.
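For readers unfamiliar with the metric, the sketch below shows how a predicted defect interval can be scored against an annotated one with temporal IoU and checked against the thresholds 0.1 to 0.7. The step of 0.1 between thresholds and the example intervals are assumptions; the abstract states only the range.

```python
def temporal_iou(pred, truth):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0

# Thresholds 0.1, 0.2, ..., 0.7 (the 0.1 step is an assumption).
thresholds = [round(0.1 * k, 1) for k in range(1, 8)]

# Hypothetical prediction and annotation: 2 s overlap over a 6 s union.
iou = temporal_iou((2.0, 6.0), (4.0, 8.0))
hits = [t for t in thresholds if iou >= t]
```

Here the prediction counts as correct at thresholds 0.1, 0.2 and 0.3 only; average precision is then computed per threshold and averaged over all of them.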
Title:
Error Detection and Constraint Recovery in Hierarchical Multi-Label Classification without Prior Knowledge
Abstract
Recent advances in Hierarchical Multi-label Classification (HMC), particularly neurosymbolic-based approaches, have demonstrated improved consistency and accuracy by enforcing constraints on a neural model during training. However, such work assumes the existence of such constraints a priori. In this paper, we relax this strong assumption and present an approach based on Error Detection Rules (EDR) that allows for learning explainable rules about the failure modes of machine learning models. We show that these rules are not only effective in detecting when a machine learning classifier has made an error but can also be leveraged as constraints for HMC, thereby allowing the recovery of explainable constraints even if they are not provided. We show that our approach is effective in detecting machine learning errors and recovering constraints, is noise tolerant, and can function as a source of knowledge for neurosymbolic models on multiple datasets, including a newly introduced military vehicle recognition dataset.
Title:
Multiple Object Detection and Tracking in Panoramic Videos for Cycling Safety Analysis
Abstract
Panoramic cycling videos can record 360° views around the cyclists. Thus, it is essential to conduct automatic road user analysis on them using computer vision models to provide data for studies on cycling safety. However, the features of panoramic data such as severe distortions, large number of small objects and boundary continuity have brought great challenges to the existing CV models, including poor performance and evaluation methods that are no longer applicable. In addition, due to the lack of data with annotations, it is not easy to re-train the models. In response to these problems, the project proposed and implemented a three-step methodology: (1) improve the prediction performance of the pre-trained object detection models on panoramic data by projecting the original image into 4 perspective sub-images; (2) introduce supports for boundary continuity and category information into DeepSORT, a commonly used multiple object tracking model, and set an improved detection model as its detector; (3) using the tracking results, develop an application for detecting the overtaking behaviour of the surrounding vehicles. Evaluated on the panoramic cycling dataset built by the project, the proposed methodology improves the average precision of YOLO v5m6 and Faster RCNN-FPN under any input resolution setting. In addition, it raises MOTA and IDF1 of DeepSORT by 7.6% and 9.7% respectively. When detecting the overtakes in the test videos, it achieves the F-score of 0.88. The code is available on GitHub at this http URL to ensure the reproducibility and further improvements of results.
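Boundary continuity means that the left and right edges of a panoramic frame are physically adjacent, so an object crossing the seam looks like two far-apart boxes to a standard IoU. The sketch below is one simple way to account for this and is not the project's actual DeepSORT modification: it assumes boxes are given as (x1, y1, x2, y2), that a box crossing the seam is stored with x2 larger than the image width, and that no box is wider than the image.

```python
def iou(a, b):
    """Plain IoU of axis-aligned boxes (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def wrapped_iou(a, b, width):
    """IoU when the frame wraps horizontally: try b shifted by one image width."""
    shifts = (-width, 0, width)
    return max(iou(a, (b[0] + s, b[1], b[2] + s, b[3])) for s in shifts)

# Hypothetical 1000-pixel-wide frame: `a` crosses the seam, `b` sits at the left edge.
a = (950, 100, 1050, 200)
b = (0, 100, 50, 200)
plain = iou(a, b)
wrapped = wrapped_iou(a, b, 1000)
```

With the plain IoU these two boxes do not overlap at all, whereas the wrapped version recognises that `b` covers the half of `a` that lies beyond the seam.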
Title:
A Community-Centric Perspective for Characterizing and Detecting Anti-Asian Violence-Provoking Speech
Authors: Gaurav Verma, Rynaa Grover, Jiawei Zhou, Binny Mathew, Jordan Kraemer, Munmun De Choudhury, Srijan Kumar
Subjects:
Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Abstract
Violence-provoking speech -- speech that implicitly or explicitly promotes violence against the members of the targeted community -- contributed to a massive surge in anti-Asian crimes during the pandemic. While previous works have characterized and built tools for detecting other forms of harmful speech, like fear speech and hate speech, our work takes a community-centric approach to studying anti-Asian violence-provoking speech. Using data from ~420k Twitter posts spanning a 3-year duration (January 1, 2020 to February 1, 2023), we develop a codebook to characterize anti-Asian violence-provoking speech and collect a community-crowdsourced dataset to facilitate its large-scale detection using state-of-the-art classifiers. We contrast the capabilities of natural language processing classifiers, ranging from BERT-based to LLM-based classifiers, in detecting violence-provoking speech with their capabilities to detect anti-Asian hateful speech. In contrast to prior work that has demonstrated the effectiveness of such classifiers in detecting hateful speech ($F_1 = 0.89$), our work shows that accurate and reliable detection of violence-provoking speech is a challenging task ($F_1 = 0.69$). We discuss the implications of our findings, particularly the need for proactive interventions to support Asian communities during public health crises. The resources related to the study are available at this https URL.
Title:
An Adaptive System for Wearable Devices to Detect Stress Using Physiological Signals
Authors: Gelei Xu, Ruiyang Qin, Zhi Zheng, Yiyu Shi
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Timely stress detection is crucial for protecting vulnerable groups from long-term detrimental effects by enabling early intervention. Wearable devices, by collecting real-time physiological signals, offer a solution for accurate stress detection accommodating individual differences. This position paper introduces an adaptive framework for personalized stress detection using PPG and EDA signals. Unlike traditional methods that rely on a generalized model, which may suffer performance drops when applied to new users due to domain shifts, this framework aims to provide each user with a personalized model for higher stress detection accuracy. The framework involves three stages: developing a generalized model offline with an initial dataset, adapting the model to the user's unlabeled data, and fine-tuning it with a small set of labeled data obtained through user interaction. This approach not only offers a foundation for mobile applications that provide personalized stress detection and intervention but also has the potential to address a wider range of mental health issues beyond stress detection using physiological signals.
Title:
LSM-GNN: Large-scale Storage-based Multi-GPU GNN Training by Optimizing Data Transfer Scheme
Authors: Jeongmin Brian Park, Kun Wu, Vikram Sharma Mailthody, Zaid Quresh, Scott Mahlke, Wen-mei Hwu
Abstract
Graph Neural Networks (GNNs) are widely used today in recommendation systems, fraud detection, and node/link classification tasks. Real-world GNNs continue to scale in size and require a large memory footprint for storing graphs and embeddings that often exceed the memory capacities of the target GPUs used for training. To address limited memory capacities, traditional GNN training approaches use graph partitioning and sharding techniques to scale up across multiple GPUs within a node and/or scale out across multiple nodes. However, this approach suffers from the high computational costs of graph partitioning algorithms and inefficient communication across GPUs. To address these overheads, we propose Large-scale Storage-based Multi-GPU GNN framework (LSM-GNN), a storage-based approach to train GNN models that utilizes a novel communication layer enabling GPU software caches to function as a system-wide shared cache with low overheads. LSM-GNN incorporates a hybrid eviction policy that intelligently manages cache space by using both static and dynamic node information to significantly enhance cache performance. Furthermore, we introduce the Preemptive Victim-buffer Prefetcher (PVP), a mechanism for prefetching node feature data from a Victim Buffer located in CPU pinned-memory to further reduce the pressure on the storage devices. Experimental results show that despite the lower compute capabilities and memory capacities, LSM-GNN in a single node with two GPUs offers superior performance over the two-node-four-GPU Dist-DGL baseline and provides up to 3.75x speed up on end-to-end epoch time while running large-scale GNN training.
Title:
Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection
Authors: Kwanyong Park, Kuniaki Saito, Donghyun Kim
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
Vision-language (VL) models often exhibit a limited understanding of complex expressions of visual objects (e.g., attributes, shapes, and their relations), given complex and diverse language queries. Traditional approaches attempt to improve VL models using hard negative synthetic text, but their effectiveness is limited. In this paper, we harness the exceptional compositional understanding capabilities of generative foundational models. We introduce a novel method for structured synthetic data generation aimed at enhancing the compositional understanding of VL models in language-based object detection. Our framework generates densely paired positive and negative triplets (image, text descriptions, and bounding boxes) in both image and text domains. By leveraging these synthetic triplets, we transform 'weaker' VL models into 'stronger' models in terms of compositional understanding, a process we call "Weak-to-Strong Compositional Learning" (WSCL). To achieve this, we propose a new compositional contrastive learning formulation that discovers semantics and structures in complex descriptions from synthetic triplets. As a result, VL models trained with our synthetic data generation exhibit a significant performance boost in the Omnilabel benchmark by up to +5AP and the D3 benchmark by +6.9AP upon existing baselines.
Title:
Appearance-Based Loop Closure Detection for Online Large-Scale and Long-Term Operation
Authors: Mathieu Labbé, François Michaud
Subjects:
Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
In appearance-based localization and mapping, loop closure detection is the process used to determine if the current observation comes from a previously visited location or a new one. As the size of the internal map increases, so does the time required to compare new observations with all stored locations, eventually limiting online processing. This paper presents an online loop closure detection approach for large-scale and long-term operation. The approach is based on a memory management method, which limits the number of locations used for loop closure detection so that the computation time remains under real-time constraints. The idea consists of keeping the most recent and frequently observed locations in a Working Memory (WM) used for loop closure detection, and transferring the others into a Long-Term Memory (LTM). When a match is found between the current location and one stored in WM, associated locations stored in LTM can be updated and remembered for additional loop closure detections. Results demonstrate the approach's adaptability and scalability using ten standard data sets from other appearance-based loop closure approaches, one custom data set using real images taken over a 2 km loop of our university campus, and one custom data set (7 hours) using virtual images from the racing video game "Need for Speed: Most Wanted".
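The WM/LTM split described above can be pictured with a toy sketch. It assumes a fixed WM capacity (at least 1), uses a plain observation count as the "frequently observed" weight, and omits the retrieval of associated locations from LTM; the class name `MemoryManager` and its fields are hypothetical and much simpler than a real implementation, which bounds WM by processing time.

```python
class MemoryManager:
    """Toy Working Memory / Long-Term Memory split (capacity must be >= 1)."""

    def __init__(self, wm_capacity):
        self.wm_capacity = wm_capacity
        self.wm = {}    # location id -> [observation count, last seen time]
        self.ltm = {}   # locations no longer compared for loop closure
        self.clock = 0

    def observe(self, loc):
        """Record an observation of `loc`, then shrink WM to its capacity."""
        self.clock += 1
        entry = self.wm.setdefault(loc, [0, self.clock])
        entry[0] += 1
        entry[1] = self.clock
        while len(self.wm) > self.wm_capacity:
            # Transfer the least observed (ties: least recent) location,
            # never the one that was just observed.
            victim = min((k for k in self.wm if k != loc),
                         key=lambda k: tuple(self.wm[k]))
            self.ltm[victim] = self.wm.pop(victim)

m = MemoryManager(wm_capacity=2)
for location in ["a", "a", "b", "c"]:
    m.observe(location)
```

After these four observations, location "b" (seen once, and earlier than "c") is transferred to LTM, while the twice-seen "a" and the newest "c" stay in WM. Because only WM is searched, the comparison cost per new observation stays bounded regardless of how large the full map grows.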
Title:
Online Global Loop Closure Detection for Large-Scale Multi-Session Graph-Based SLAM
Abstract
For large-scale and long-term simultaneous localization and mapping (SLAM), a robot has to deal with unknown initial positioning caused by either the kidnapped robot problem or multi-session mapping. This paper addresses these problems by tying the SLAM system with a global loop closure detection approach, which intrinsically handles these situations. However, online processing for global loop closure detection approaches is generally influenced by the size of the environment. The proposed graph-based SLAM system uses a memory management approach that only considers portions of the map to satisfy online processing requirements. The approach is tested and demonstrated using five indoor mapping sessions of a building using a robot equipped with a laser rangefinder and a Kinect.
Title:
FMDNN: A Fuzzy-guided Multi-granular Deep Neural Network for Histopathological Image Classification
Authors: Weiping Ding, Tianyi Zhou, Jiashuang Huang, Shu Jiang, Tao Hou, Chin-Teng Lin
Abstract
Histopathological image classification constitutes a pivotal task in computer-aided diagnostics. The precise identification and categorization of histopathological images are of paramount significance for early disease detection and treatment. In the diagnostic process of pathologists, a multi-tiered approach is typically employed to assess abnormalities in cell regions at different magnifications. However, feature extraction is often performed at a single granularity, overlooking the multi-granular characteristics of cells. To address this issue, we propose the Fuzzy-guided Multi-granularity Deep Neural Network (FMDNN). Inspired by the multi-granular diagnostic approach of pathologists, we perform feature extraction on cell structures at coarse, medium, and fine granularity, enabling the model to fully harness the information in histopathological images. We incorporate the theory of fuzzy logic to address the challenge of redundant key information arising during multi-granular feature extraction. Cell features are described from different perspectives using multiple fuzzy membership functions, which are fused to create universal fuzzy features. A fuzzy-guided cross-attention module guides universal fuzzy features toward multi-granular features. We propagate these features through an encoder to all patch tokens, aiming to achieve enhanced classification accuracy and robustness. In experiments on multiple public datasets, our model exhibits a significant improvement in accuracy over commonly used classification methods for histopathological image classification and shows commendable interpretability.
Title:
Open-CD: A Comprehensive Toolbox for Change Detection
Abstract
We present Open-CD, a change detection toolbox that contains a rich set of change detection methods as well as related components and modules. The toolbox started from a series of open source general vision task tools, including OpenMMLab Toolkits, PyTorch Image Models, etc. It has gradually evolved into a unified platform that covers many popular change detection methods and contemporary modules. It not only includes training and inference code, but also provides some useful scripts for data analysis. We believe this toolbox is by far the most complete change detection toolbox. In this report, we introduce the various features, supported methods and applications of Open-CD. In addition, we conduct a benchmarking study on different methods and components. We hope that the toolbox and benchmark can serve the growing research community by providing a flexible toolkit to reimplement existing methods and develop new change detectors. Code and models are available at this https URL. This report also includes brief descriptions of the algorithms supported in Open-CD, mainly contributed by their authors. We sincerely encourage researchers in this field to participate in this project and work together to create a more open community. This toolkit and report will be kept updated.
Title:
Explore the LiDAR-Camera Dynamic Adjustment Fusion for 3D Object Detection
Authors: Yiran Yang, Xu Gao, Tong Wang, Xin Hao, Yifeng Shi, Xiao Tan, Xiaoqing Ye, Jingdong Wang
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Camera and LiDAR serve as informative sensors for accurate and robust autonomous driving systems. However, these sensors are heterogeneous in nature, resulting in distributional modality gaps that present significant challenges for fusion. To address this, a robust fusion technique is crucial, particularly for enhancing 3D object detection. In this paper, we introduce a dynamic adjustment technology aimed at aligning modal distributions and learning effective modality representations to enhance the fusion process. Specifically, we propose a triphase domain aligning module. This module adjusts the feature distributions from both the camera and LiDAR, bringing them closer to the ground truth domain and minimizing differences. Additionally, we explore improved representation acquisition methods for dynamic fusion, which include modal interaction and specialty enhancement. Finally, we introduce an adaptive learning technique that merges semantic and geometric information for dynamic instance optimization. Extensive experiments on the nuScenes dataset show performance competitive with state-of-the-art approaches. Our code will be released in the future.
Title:
RoadPainter: Points Are Ideal Navigators for Topology transformER
Abstract
Topology reasoning aims to provide a precise understanding of road scenes, enabling autonomous systems to identify safe and efficient routes. In this paper, we present RoadPainter, an innovative approach for detecting and reasoning the topology of lane centerlines using multi-view images. The core concept behind RoadPainter is to extract a set of points from each centerline mask to improve the accuracy of centerline prediction. We start by implementing a transformer decoder that integrates a hybrid attention mechanism and a real-virtual separation strategy to predict coarse lane centerlines and establish topological associations. Then, we generate centerline instance masks guided by the centerline points from the transformer decoder. Moreover, we derive an additional set of points from each mask and combine them with previously detected centerline points for further refinement. Additionally, we introduce an optional module that incorporates a Standard Definition (SD) map to further optimize centerline detection and enhance topological reasoning performance. Experimental evaluations on the OpenLane-V2 dataset demonstrate the state-of-the-art performance of RoadPainter.
Title:
MAVEN-Fact: A Large-scale Event Factuality Detection Dataset
Authors: Chunyang Li, Hao Peng, Xiaozhi Wang, Yunjia Qi, Lei Hou, Bin Xu, Juanzi Li
Subjects:
Computation and Language (cs.CL)
Abstract
The Event Factuality Detection (EFD) task determines the factuality of textual events, i.e., classifying whether an event is a fact, possibility, or impossibility, which is essential for faithfully understanding and utilizing event knowledge. However, due to the lack of high-quality large-scale data, event factuality detection is under-explored in event understanding research, which limits the development of the EFD community. To address these issues and provide faithful event understanding, we introduce MAVEN-Fact, a large-scale and high-quality EFD dataset based on the MAVEN dataset. MAVEN-Fact includes factuality annotations of 112,276 events, making it the largest EFD dataset. Extensive experiments demonstrate that MAVEN-Fact is challenging for both conventional fine-tuned models and large language models (LLMs). Thanks to the comprehensive annotations of event arguments and relations in MAVEN, MAVEN-Fact also supports further analyses, and we find that adopting event arguments and relations helps in event factuality detection for fine-tuned models but does not benefit LLMs. Furthermore, we preliminarily study an application case of event factuality detection and find it helps in mitigating event-related hallucination in LLMs. Our dataset and code can be obtained from this https URL.
Title:
Learning High-resolution Vector Representation from Multi-Camera Images for 3D Object Detection
Abstract
The Bird's-Eye-View (BEV) representation is a critical factor that directly impacts the 3D object detection performance, but the traditional BEV grid representation induces quadratic computational cost as the spatial resolution grows. To address this limitation, we present a new camera-based 3D object detector with high-resolution vector representation: VectorFormer. The presented high-resolution vector representation is combined with the lower-resolution BEV representation to efficiently exploit 3D geometry from multi-camera images at a high resolution through our two novel modules: vector scattering and gathering. To this end, the learned vector representation with richer scene contexts can serve as the decoding query for final predictions. We conduct extensive experiments on the nuScenes dataset and demonstrate state-of-the-art performance in NDS and inference time. Furthermore, we investigate query-BEV-based methods incorporated with our proposed vector representation and observe a consistent performance improvement.
Title:
Sparse Prior Is Not All You Need: When Differential Directionality Meets Saliency Coherence for Infrared Small Target Detection
Abstract
Infrared small target detection is crucial for the efficacy of infrared search and tracking systems. Current tensor decomposition methods emphasize representing small targets with sparsity but struggle to separate targets from complex backgrounds due to insufficient use of intrinsic directional information and reduced target visibility during decomposition. To address these challenges, this study introduces a Sparse Differential Directionality prior (SDD) framework. SDD leverages the distinct directional characteristics of targets to differentiate them from the background, applying mixed sparse constraints on the differential directional images and continuity difference matrix of the temporal component, both derived from Tucker decomposition. We further enhance target detectability with a saliency coherence strategy that intensifies target contrast against the background during hierarchical decomposition. A Proximal Alternating Minimization-based (PAM) algorithm efficiently solves our proposed model. Experimental results on several real-world datasets validate our method's effectiveness, outperforming ten state-of-the-art methods in target detection and clutter suppression. Our code is available at this https URL.
Title:
Poisoning with A Pill: Circumventing Detection in Federated Learning
Authors: Hanxi Guo, Hao Wang, Tao Song, Tianhang Zheng, Yang Hua, Haibing Guan, Xiangyu Zhang
Subjects:
Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
Without direct access to the client's data, federated learning (FL) is well-known for its unique strength in data privacy protection among existing distributed machine learning techniques. However, its distributive and iterative nature makes FL inherently vulnerable to various poisoning attacks. To counteract these threats, extensive defenses have been proposed to filter out malicious clients, using various detection metrics. Based on our analysis of existing attacks and defenses, we find that there is a lack of attention to model redundancy. In neural networks, various model parameters contribute differently to the model's performance. However, existing attacks in FL manipulate all the model update parameters with the same strategy, making them easily detectable by common defenses. Meanwhile, the defenses also tend to analyze the overall statistical features of the entire model updates, leaving room for sophisticated attacks. Based on these observations, this paper proposes a generic and attack-agnostic augmentation approach designed to enhance the effectiveness and stealthiness of existing FL poisoning attacks against detection in FL, pointing out the inherent flaws of existing defenses and exposing the necessity of fine-grained FL security. Specifically, we employ a three-stage methodology that strategically constructs, generates, and injects poison (generated by existing attacks) into a pill (a tiny subnet with a novel structure) during the FL training; we name the stages pill construction, pill poisoning, and pill injection, respectively. Extensive experimental results show that FL poisoning attacks enhanced by our method can bypass all the popular defenses, increasing the error rate by up to 7x, and by more than 2x on average, on both IID and non-IID data, in both cross-silo and cross-device FL systems.
Title:
Automated Road Safety: Enhancing Sign and Surface Damage Detection with AI
Authors: Davide Merolla, Vittorio Latorre, Antonio Salis, Gianluca Boanelli
Abstract
Public transportation plays a crucial role in our lives, and the road network is a vital component in the implementation of smart cities. Recent advancements in AI have enabled the development of advanced monitoring systems capable of detecting anomalies in road surfaces and road signs, which, if unaddressed, can lead to serious road accidents. This paper presents an innovative approach to enhance road safety through the detection and classification of traffic signs and road surface damage using advanced deep learning techniques. This integrated approach supports proactive maintenance strategies, improving road safety and resource allocation for the Molise region and the city of Campobasso. The resulting system, developed as part of the Casa delle Tecnologie Emergenti (House of Emergent Technologies) Molise (Molise CTE) research project funded by the Italian Minister of Economic Growth (MIMIT), leverages cutting-edge technologies such as Cloud Computing and High Performance Computing with GPU utilization. It serves as a valuable tool for municipalities, enabling quick detection of anomalies and the prompt organization of maintenance operations.
Title:
Bidirectional skip-frame prediction for video anomaly detection with intra-domain disparity-driven attention
Authors: Jiahao Lyu, Minghua Zhao, Jing Hu, Runtao Xi, Xuewen Huang, Shuangli Du, Cheng Shi, Tian Ma
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
With the widespread deployment of video surveillance devices and the demand for intelligent system development, video anomaly detection (VAD) has become an important part of constructing intelligent surveillance systems. Expanding the discriminative boundary between normal and abnormal events to enhance performance is the common goal and challenge of VAD. To address this problem, we propose a Bidirectional Skip-frame Prediction (BiSP) network based on a dual-stream autoencoder, from the perspective of learning the intra-domain disparity between different features. The BiSP skips frames in the training phase to achieve forward and backward frame prediction respectively, and in the testing phase, it utilizes bidirectional consecutive frames to co-predict the same intermediate frames, thus expanding the degree of disparity between normal and abnormal events. The BiSP designs the variance channel attention and context spatial attention from the perspectives of movement patterns and object scales, respectively, thus maximizing the disparity between normal and abnormal events during feature extraction and delivery across different dimensions. Extensive experiments on four benchmark datasets demonstrate the effectiveness of the proposed BiSP, which substantially outperforms state-of-the-art competing methods.
Title:
YOLO-pdd: A Novel Multi-scale PCB Defect Detection Method Using Deep Representations with Sequential Images
Abstract
With the rapid growth of the PCB manufacturing industry, there is an increasing demand for computer vision inspection to detect defects during production. Improving the accuracy and generalization of PCB defect detection models remains a significant challenge. This paper proposes a high-precision, robust, and real-time end-to-end method for PCB defect detection based on deep Convolutional Neural Networks (CNN). Traditional methods often suffer from low accuracy and limited applicability. We propose a novel approach combining YOLOv5 and multiscale modules for hierarchical residual-like connections. In PCB defect detection, noise can confuse the background and small targets. The YOLOv5 model provides a strong foundation with its real-time processing and accurate object detection capabilities. The multi-scale module extends traditional approaches by incorporating hierarchical residual-like connections within a single block, enabling multiscale feature extraction. This plug-and-play module significantly enhances performance by extracting features at multiple scales and levels, which are useful for identifying defects of varying sizes and complexities. Our multi-scale architecture integrates feature extraction, defect localization, and classification into a unified network. Experiments on a large-scale PCB dataset demonstrate significant improvements in precision, recall, and F1-score compared to existing methods. This work advances computer vision inspection for PCB defect detection, providing a reliable solution for high-precision, robust, real-time, and domain-adaptive defect detection in the PCB manufacturing industry.
Title:
Developing a Reliable, General-Purpose Hallucination Detection and Mitigation Service: Insights and Lessons Learned
Authors: Song Wang, Xun Wang, Jie Mei, Yujia Xie, Sean Muarray, Zhang Li, Lingfeng Wu, Si-Qing Chen, Wayne Xiong
Subjects:
Computation and Language (cs.CL)
Abstract
Hallucination, a phenomenon where large language models (LLMs) produce output that is factually incorrect or unrelated to the input, is a major challenge for LLM applications that require accuracy and dependability. In this paper, we introduce a reliable and high-speed production system aimed at detecting and rectifying the hallucination issue within LLMs. Our system encompasses named entity recognition (NER), natural language inference (NLI), span-based detection (SBD), and an intricate decision tree-based process to reliably detect a wide range of hallucinations in LLM responses. Furthermore, our team has crafted a rewriting mechanism that maintains an optimal mix of precision, response time, and cost-effectiveness. We detail the core elements of our framework and underscore the paramount challenges tied to response time, availability, and performance metrics, which are crucial for real-world deployment of these technologies. Our extensive evaluation, utilizing offline data and live production traffic, confirms the efficacy of our proposed framework and service.
Title:
TextureCrop: Enhancing Synthetic Image Detection through Texture-based Cropping
Abstract
Generative AI technologies produce hyper-realistic imagery that can be used for nefarious purposes such as producing misleading or harmful content, among others. This makes Synthetic Image Detection (SID) an essential tool for defending against AI-generated harmful content. Current SID methods typically resize input images to a fixed resolution or perform center-cropping due to computational concerns, leading to challenges in effectively detecting artifacts in high-resolution images. To this end, we propose TextureCrop, a novel image pre-processing technique. By focusing on high-frequency image parts where generation artifacts are prevalent, TextureCrop effectively enhances SID accuracy while maintaining manageable memory requirements. Experimental results demonstrate a consistent improvement in AUC across various detectors by 5.7% compared to center cropping and by 14% compared to resizing, across high-resolution images from the Forensynths and Synthbuster datasets.
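The idea of cropping toward high-frequency regions can be illustrated with a minimal sketch. The abstract does not say how TextureCrop scores texture, so everything here is an assumption made for illustration: the mean absolute pixel difference as a texture measure, the sliding-window layout, and the hypothetical function names `texture_score` and `top_texture_crops`.

```python
import numpy as np


def texture_score(patch):
    """Hypothetical texture measure: mean absolute difference between neighbouring pixels."""
    gy = np.abs(np.diff(patch, axis=0)).mean()  # vertical differences
    gx = np.abs(np.diff(patch, axis=1)).mean()  # horizontal differences
    return gx + gy


def top_texture_crops(image, size, stride, k):
    """Return the (top, left) corners of the k highest-scoring square crops.

    `image` is a 2-D grayscale array; crops are taken on a regular grid.
    """
    height, width = image.shape
    scored = []
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            patch = image[top:top + size, left:left + size]
            scored.append((texture_score(patch), top, left))
    # Highest texture score first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(top, left) for _, top, left in scored[:k]]
```

Under these assumptions, only the selected crops would be passed to a detector, which keeps memory bounded for high-resolution inputs without resizing away fine artifacts.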
Title:
Semi-Supervised Learning for Anomaly Detection in Blockchain-based Supply Chains
Authors: Do Hai Son, Bui Duc Manh, Tran Viet Khoa, Nguyen Linh Trung, Dinh Thai Hoang, Hoang Trong Minh, Yibeltal Alem, Le Quang Minh
Subjects:
Cryptography and Security (cs.CR)
Abstract
Blockchain-based supply chain (BSC) systems have developed rapidly in recent years and can play an important role in our society in the future. In this study, we develop an anomaly detection model for BSC systems. Our proposed model can detect cyber-attacks at various levels, including the network layer, consensus layer, and beyond, by analyzing only the traffic data at the network layer. To do this, we first build a BSC system at our laboratory to perform experiments and collect datasets. We then propose a novel semi-supervised DAE-MLP (Deep AutoEncoder-Multilayer Perceptron) that combines the advantages of supervised and unsupervised learning to detect anomalies in BSC systems. The experimental results demonstrate the effectiveness of our model for anomaly detection within BSCs, achieving a detection accuracy of 96.5%. Moreover, DAE-MLP can effectively detect new attacks by improving the F1-score up to 33.1% after updating the MLP component.
Title:
Norface: Improving Facial Expression Analysis by Identity Normalization
Authors: Hanwei Liu, Rudong An, Zhimeng Zhang, Bowen Ma, Wei Zhang, Yan Song, Yujing Hu, Wei Chen, Yu Ding
Abstract
Facial Expression Analysis remains a challenging task due to unexpected task-irrelevant noise, such as identity, head pose, and background. To address this issue, this paper proposes a novel framework, called Norface, that is unified for both Action Unit (AU) analysis and Facial Emotion Recognition (FER) tasks. Norface consists of a normalization network and a classification network. First, the carefully designed normalization network strives to directly remove the above task-irrelevant noise, by maintaining facial expression consistency but normalizing all original images to a common identity with consistent pose and background. Then, these additional normalized images are fed into the classification network. Due to consistent identity and other factors (e.g. head pose, background, etc.), the normalized images enable the classification network to extract useful expression information more effectively. Additionally, the classification network incorporates a Mixture of Experts to refine the latent representation, including handling the input of facial representations and the output of multiple (AU or emotion) labels. Extensive experiments validate the carefully designed framework with the insight of identity normalization. The proposed method outperforms existing SOTA methods in multiple facial expression analysis tasks, including AU detection, AU intensity estimation, and FER tasks, as well as their cross-dataset tasks. For the normalized datasets and code please visit this https URL.
Title:
SS-SFR: Synthetic Scenes Spatial Frequency Response on Virtual KITTI and Degraded Automotive Simulations for Object Detection
Authors: Daniel Jakab, Alexander Braun, Cathaoir Agnew, Reenu Mohandas, Brian Michael Deegan, Dara Molloy, Enda Ward, Tony Scanlan, Ciarán Eising
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Automotive simulation can potentially compensate for a lack of training data in computer vision applications. However, there has been little to no image quality evaluation of automotive simulation and the impact of optical degradations on simulation is little explored. In this work, we investigate Virtual KITTI and the impact of applying variations of Gaussian blur on image sharpness. Furthermore, we consider object detection, a common computer vision application on three different state-of-the-art models, thus allowing us to characterize the relationship between object detection and sharpness. It was found that while image sharpness (MTF50) degrades from an average of 0.245 cy/px to approximately 0.119 cy/px, object detection performance stays largely robust within 0.58% (Faster RCNN), 1.45% (YOLOF) and 1.93% (DETR) across all respective held-out test sets.
Title:
AI-Driven Fast and Early Detection of IoT Botnet Threats: A Comprehensive Network Traffic Analysis Approach
Abstract
In the rapidly evolving landscape of cyber threats targeting the Internet of Things (IoT) ecosystem, and in light of the surge in botnet-driven Distributed Denial of Service (DDoS) and brute force attacks, this study focuses on the early detection of IoT bots. It specifically addresses the detection of stealth bot communication that precedes and orchestrates attacks. This study proposes a comprehensive methodology for analyzing IoT network traffic, including considerations for both unidirectional and bidirectional flow, as well as packet formats. It explores a wide spectrum of network features critical for representing network traffic and characterizing benign IoT traffic patterns effectively. Moreover, it delves into the modeling of traffic using various semi-supervised learning techniques. Through extensive experimentation with the IoT-23 dataset - a comprehensive collection featuring diverse botnet types and traffic scenarios - we have demonstrated the feasibility of detecting botnet traffic corresponding to different operations and types of bots, specifically focusing on stealth command and control (C2) communications. The results show that C2 communication can be identified with a 100% success rate through packet-based methods and 94% via flow-based approaches, with a false positive rate of 1.53%.
Title:
Counter Turing Test ($CT^2$): Investigating AI-Generated Text Detection for Hindi -- Ranking LLMs based on Hindi AI Detectability Index ($ADI_{hi}$)
Authors: Ishan Kavathekar, Anku Rani, Ashmit Chamoli, Ponnurangam Kumaraguru, Amit Sheth, Amitava Das
Subjects:
Computation and Language (cs.CL)
Abstract
The widespread adoption of large language models (LLMs) and awareness around multilingual LLMs have raised concerns regarding the potential risks and repercussions linked to the misapplication of AI-generated text, necessitating increased vigilance. While these models are primarily trained for English, their extensive training on vast datasets covering almost the entire web equips them with capabilities to perform well in numerous other languages. AI-Generated Text Detection (AGTD) has emerged as a topic that has already received immediate attention in research, with some initial methods having been proposed, soon followed by the emergence of techniques to bypass detection. In this paper, we report our investigation on AGTD for the Indic language Hindi. Our major contributions are fourfold: i) we examine 26 LLMs to evaluate their proficiency in generating Hindi text, ii) we introduce the AI-generated news article in Hindi ($AG_{hi}$) dataset, iii) we evaluate the effectiveness of five recently proposed AGTD techniques: ConDA, J-Guard, RADAR, RAIDAR and Intrinsic Dimension Estimation for detecting AI-generated Hindi text, iv) we propose the Hindi AI Detectability Index ($ADI_{hi}$), which shows a spectrum to understand the evolving landscape of eloquence of AI-generated text in Hindi. We will make the codes and datasets available to encourage further research.
Title:
A Life-long Learning Intrusion Detection System for 6G-Enabled IoV
Abstract
The introduction of 6G technology into the Internet of Vehicles (IoV) promises to revolutionize connectivity with ultra-high data rates and seamless network coverage. However, this technological leap also brings significant challenges, particularly for the dynamic and diverse IoV landscape, which must meet the rigorous reliability and security requirements of 6G networks. Furthermore, integrating 6G will likely increase the IoV's susceptibility to a spectrum of emerging cyber threats. Therefore, it is crucial for security mechanisms to dynamically adapt and learn new attack patterns, keeping pace with the rapid evolution and diversification of these threats - a capability currently lacking in existing systems. This paper presents a novel intrusion detection system leveraging the paradigm of life-long (or continual) learning. Our methodology combines class-incremental learning with federated learning, an approach ideally suited to the distributed nature of the IoV. This strategy effectively harnesses the collective intelligence of Connected and Automated Vehicles (CAVs) and edge computing capabilities to train the detection system. To the best of our knowledge, this study is the first to synergize class-incremental learning with federated learning specifically for cyber attack detection. Through comprehensive experiments on a recent network traffic dataset, our system has exhibited a robust adaptability in learning new cyber attack patterns, while effectively retaining knowledge of previously encountered ones. Additionally, it has proven to maintain high accuracy and a low false positive rate.
Title:
Diffusion for Out-of-Distribution Detection on Road Scenes and Beyond
Authors: Silvio Galesso, Philipp Schröppel, Hssan Driss, Thomas Brox
Abstract
In recent years, research on out-of-distribution (OoD) detection for semantic segmentation has mainly focused on road scenes -- a domain with a constrained amount of semantic diversity. In this work, we challenge this constraint and extend the domain of this task to general natural images. To this end, we introduce: 1. the ADE-OoD benchmark, which is based on the ADE20k dataset and includes images from diverse domains with a high semantic diversity, and 2. a novel approach that uses Diffusion score matching for OoD detection (DOoD) and is robust to the increased semantic diversity. ADE-OoD features indoor and outdoor images, defines 150 semantic categories as in-distribution, and contains a variety of OoD objects. For DOoD, we train a diffusion model with an MLP architecture on semantic in-distribution embeddings and build on the score matching interpretation to compute pixel-wise OoD scores at inference time. On common road scene OoD benchmarks, DOoD performs on par with or better than the state of the art, without using outliers for training or making assumptions about the data domain. On ADE-OoD, DOoD outperforms previous approaches, but leaves much room for future improvements.
Title:
Towards Open-World Object-based Anomaly Detection via Self-Supervised Outlier Synthesis
Authors: Brian K. S. Isaac-Medina, Yona Falinie A. Gaus, Neelanjan Bhowmik, Toby P. Breckon
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Object detection is a pivotal task in computer vision that has received significant attention in previous years. Nonetheless, the capability of a detector to localise objects out of the training distribution remains unexplored. Whilst recent approaches in object-level out-of-distribution (OoD) detection heavily rely on class labels, such approaches contradict truly open-world scenarios where the class distribution is often unknown. In this context, anomaly detection focuses on detecting unseen instances rather than classifying detections as OoD. This work aims to bridge this gap by leveraging an open-world object detector and an OoD detector via virtual outlier synthesis. This is achieved by using the detector backbone features to first learn object pseudo-classes via self-supervision. These pseudo-classes serve as the basis for class-conditional virtual outlier sampling of anomalous features that are classified by an OoD head. Our approach empowers our overall object detector architecture to learn anomaly-aware feature representations without relying on class labels, hence enabling truly open-world object anomaly detection. Empirical validation of our approach demonstrates its effectiveness across diverse datasets encompassing various imaging modalities (visible, infrared, and X-ray). Moreover, our method establishes state-of-the-art performance on object-level anomaly detection, achieving an average recall score improvement of over 5.4% for natural images and 23.5% for a security X-ray dataset compared to the current approaches. In addition, our method detects anomalies in datasets where current approaches fail. Code available at this https URL.
Title:
STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay
Authors: Yongcan Yu, Lijun Sheng, Ran He, Jian Liang
Abstract
Test-time adaptation (TTA) aims to address the distribution shift between the training and test data with only unlabeled data at test time. Existing TTA methods often focus on improving recognition performance specifically for test data associated with classes in the training set. However, during the open-world inference process, there are inevitably test data instances from unknown classes, commonly referred to as outliers. This paper addresses the problem of performing both sample recognition and outlier rejection during inference when outliers are present. To address this problem, we propose a new approach called STAble Memory rePlay (STAMP), which performs optimization over a stable memory bank instead of the risky mini-batch. In particular, the memory bank is dynamically updated by selecting low-entropy and label-consistent samples in a class-balanced manner. In addition, we develop a self-weighted entropy minimization strategy that assigns higher weight to low-entropy samples. Extensive results demonstrate that STAMP outperforms existing TTA methods in terms of both recognition and outlier detection performance. The code is released at this https URL.
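A self-weighted entropy objective of the kind described above can be sketched as follows. The abstract does not give the weighting function, so the `exp(-entropy)` weights normalized over the batch are an assumption chosen only because they give low-entropy samples higher weight; the function names are hypothetical and this is not the released STAMP code.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each row of a probability matrix."""
    return -(probs * np.log(probs + eps)).sum(axis=1)


def self_weighted_entropy(logits):
    """Return (weighted entropy loss, per-sample weights).

    Assumed weighting: w_i proportional to exp(-H_i), so confident
    (low-entropy) samples contribute more to the loss.
    """
    sample_entropy = entropy(softmax(logits))
    weights = np.exp(-sample_entropy)
    weights = weights / weights.sum()
    loss = float((weights * sample_entropy).sum())
    return loss, weights
```

In an actual adaptation loop the weights would typically be treated as constants when differentiating the loss; that detail is also an assumption here.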
Title:
RADA: Robust and Accurate Feature Learning with Domain Adaptation
Authors: Jingtai He, Gehao Zhang, Tingting Liu, Songlin Du
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Recent advancements in keypoint detection and descriptor extraction have shown impressive performance in local feature learning tasks. However, existing methods generally exhibit suboptimal performance under extreme conditions such as significant appearance changes and domain shifts. In this study, we introduce a multi-level feature aggregation network that incorporates two pivotal components to facilitate the learning of robust and accurate features with domain adaptation. First, we employ domain adaptation supervision to align high-level feature distributions across different domains to achieve invariant domain representations. Second, we propose a Transformer-based booster that enhances descriptor robustness by integrating visual and geometric information through wave position encoding concepts, effectively handling complex conditions. To ensure the accuracy and robustness of features, we adopt a hierarchical architecture to capture comprehensive information and apply meticulous targeted supervision to keypoint detection, descriptor extraction, and their coupled processing. Extensive experiments demonstrate that our method, RADA, achieves excellent results in image matching, camera pose estimation, and visual localization tasks.
Title:
AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection
Abstract
Zero-shot anomaly detection (ZSAD) targets the identification of anomalies within images from arbitrary novel categories. This study introduces AdaCLIP for the ZSAD task, leveraging a pre-trained vision-language model (VLM), CLIP. AdaCLIP incorporates learnable prompts into CLIP and optimizes them through training on auxiliary annotated anomaly detection data. Two types of learnable prompts are proposed: static and dynamic. Static prompts are shared across all images, serving to preliminarily adapt CLIP for ZSAD. In contrast, dynamic prompts are generated for each test image, providing CLIP with dynamic adaptation capabilities. The combination of static and dynamic prompts is referred to as hybrid prompts, and yields enhanced ZSAD performance. Extensive experiments conducted across 14 real-world anomaly detection datasets from industrial and medical domains indicate that AdaCLIP outperforms other ZSAD methods and can generalize better to different categories and even domains. Finally, our analysis highlights the importance of diverse auxiliary data and optimized prompts for enhanced generalization capacity. Code is available at this https URL.
Keyword: face recognition
Title:
ARoFace: Alignment Robustness to Improve Low-Quality Face Recognition
Authors: Mohammad Saeed Ebrahimi Saadabadi, Sahar Rahimi Malakshan, Ali Dabouei, Nasser M. Nasrabadi
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Aiming to enhance Face Recognition (FR) on Low-Quality (LQ) inputs, recent studies suggest incorporating synthetic LQ samples into training. Although promising, the quality factors that are considered in these works are general rather than FR-specific, e.g., atmospheric turbulence, resolution, etc. Motivated by the observation of the vulnerability of current FR models to even small Face Alignment Errors (FAE) in LQ images, we present a simple yet effective method that considers FAE as another quality factor that is tailored to FR. We seek to improve LQ FR by enhancing FR models' robustness to FAE. To this aim, we formalize the problem as a combination of differentiable spatial transformations and adversarial data augmentation in FR. We perturb the alignment of the training samples using a controllable spatial transformation and enrich the training with samples expressing FAE. We demonstrate the benefits of the proposed method by conducting evaluations on IJB-B, IJB-C, IJB-S (+4.3% Rank1), and TinyFace (+2.63%). Code is available at this https URL.
Title:
HyperSurf: Quadruped Robot Leg Capable of Surface Recognition with GRU and Real-to-Sim Transferring
Abstract
This paper introduces a system of data collection acceleration and real-to-sim transferring for surface recognition on a quadruped robot. The system features a mechanical single-leg setup capable of stepping on various easily interchangeable surfaces. Additionally, it incorporates a GRU-based Surface Recognition System, inspired by the system detailed in the Dog-Surf paper. This setup facilitates the expansion of dataset collection for model training, enabling data acquisition from hard-to-reach surfaces in laboratory conditions. Furthermore, it opens avenues for transferring surface properties from reality to simulation, thereby allowing the training of optimal gaits for legged robots in simulation environments using a pre-prepared library of digital twins of surfaces. Moreover, enhancements have been made to the GRU-based Surface Recognition System, allowing for the integration of data from both the quadruped robot and the single-leg setup. The dataset and code have been made publicly available.
Keyword: augmentation
Title:
The Collection of a Human Robot Collaboration Dataset for Cooperative Assembly in Glovebox Environments
Authors: Shivansh Sharma, Mathew Huang, Sanat Nair, Alan Wen, Christina Petlowany, Juston Moore, Selma Wanna, Mitch Pryor
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Industry 4.0 introduced AI as a transformative solution for modernizing manufacturing processes. Its successor, Industry 5.0, envisions humans as collaborators and experts guiding these AI-driven manufacturing solutions. Developing these techniques necessitates algorithms capable of safe, real-time identification of human positions in a scene, particularly their hands, during collaborative assembly. Although substantial efforts have curated datasets for hand segmentation, most focus on residential or commercial domains. Existing datasets targeting industrial settings predominantly rely on synthetic data, which we demonstrate does not effectively transfer to real-world operations. Moreover, these datasets lack uncertainty estimations critical for safe collaboration. Addressing these gaps, we present HAGS: Hand and Glove Segmentation Dataset. This dataset provides 1200 challenging examples to build applications toward hand and glove segmentation in industrial human-robot collaboration scenarios as well as assess out-of-distribution images, constructed via green screen augmentations, to determine ML-classifier robustness. We study state-of-the-art, real-time segmentation models to evaluate existing methods. Our dataset and baselines are publicly available: this https URL and this https URL.
Title:
Denoising Long- and Short-term Interests for Sequential Recommendation
Authors: Xinyu Zhang, Beibei Li, Beihong Jin
Subjects:
Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Abstract
User interests can be viewed over different time scales, mainly including stable long-term preferences and changing short-term intentions, and their combination facilitates the comprehensive sequential recommendation. However, existing work that focuses on different time scales of user modeling has ignored the negative effects of different time-scale noise, which hinders capturing actual user interests and cannot be resolved by conventional sequential denoising methods. In this paper, we propose a Long- and Short-term Interest Denoising Network (LSIDN), which employs different encoders and tailored denoising strategies to extract long- and short-term interests, respectively, achieving both comprehensive and robust user modeling. Specifically, we employ a session-level interest extraction and evolution strategy to avoid introducing inter-session behavioral noise into long-term interest modeling; we also adopt contrastive learning equipped with a homogeneous exchanging augmentation to alleviate the impact of unintentional behavioral noise on short-term interest modeling. Results of experiments on two public datasets show that LSIDN consistently outperforms state-of-the-art models and achieves significant robustness.
Title:
Data Augmentation in Graph Neural Networks: The Role of Generated Synthetic Graphs
Abstract
Graphs are crucial for representing interrelated data and aiding predictive modeling by capturing complex relationships. Achieving high-quality graph representation is important for identifying linked patterns, leading to improvements in Graph Neural Networks (GNNs) to better capture data structures. However, challenges such as data scarcity, high collection costs, and ethical concerns limit progress. As a result, generative models and data augmentation have become more and more popular. This study explores using generated graphs for data augmentation, comparing the performance of combining generated graphs with real graphs, and examining the effect of different quantities of generated graphs on graph classification tasks. The experiments show that balancing scalability and quality requires different generators based on graph size. Our results introduce a new approach to graph data augmentation, ensuring consistent labels and enhancing classification performance.
Title:
CrossDehaze: Scaling Up Image Dehazing with Cross-Data Vision Alignment and Augmentation
Abstract
In recent years, as computer vision tasks have increasingly relied on high-quality image inputs, the task of image dehazing has received significant attention. Previously, many methods based on priors and deep learning have been proposed to address the task of image dehazing. Ignoring the domain gap between different data, former dehazing methods usually adopt multiple datasets for explicit training, which often undermines the methods themselves. To address this problem, we propose a novel method of internal and external data augmentation to improve the existing dehazing methodology. By using a cross-data external augmentor, the dataset inherits samples from different domains that are firmly aligned, making the model learn more robust and generalizable features. By using the internal data augmentation method, the model can fully exploit local information within the images, thereby obtaining more image details. To demonstrate the effectiveness of our proposed method, we conduct training on both the Natural Image Dataset (NID) and the Remote Sensing Image Dataset (RSID). Experimental results show that our method clearly resolves the domain gap in different dehazing datasets and presents a new pipeline for joint training in the dehazing task. Our approach significantly outperforms other advanced methods in dehazing and produces dehazed images that are closest to real haze-free images. The code will be available at: this https URL
Title:
ARoFace: Alignment Robustness to Improve Low-Quality Face Recognition
Authors: Mohammad Saeed Ebrahimi Saadabadi, Sahar Rahimi Malakshan, Ali Dabouei, Nasser M. Nasrabadi
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Aiming to enhance Face Recognition (FR) on Low-Quality (LQ) inputs, recent studies suggest incorporating synthetic LQ samples into training. Although promising, the quality factors that are considered in these works are general rather than FR-specific, e.g., atmospheric turbulence, resolution, etc. Motivated by the observation of the vulnerability of current FR models to even small Face Alignment Errors (FAE) in LQ images, we present a simple yet effective method that considers FAE as another quality factor that is tailored to FR. We seek to improve LQ FR by enhancing FR models' robustness to FAE. To this end, we formalize the problem as a combination of differentiable spatial transformations and adversarial data augmentation in FR. We perturb the alignment of the training samples using a controllable spatial transformation and enrich the training with samples expressing FAE. We demonstrate the benefits of the proposed method by conducting evaluations on IJB-B, IJB-C, IJB-S (+4.3% Rank1), and TinyFace (+2.63%). this https URL
Title:
TADA: Temporal Adversarial Data Augmentation for Time Series Data
Authors: Byeong Tak Lee, Joon-myoung Kwon, Yong-Yeon Jo
Abstract
Domain generalization involves training machine learning models to perform robustly on unseen samples from out-of-distribution datasets. Adversarial Data Augmentation (ADA) is a commonly used approach that enhances model adaptability by incorporating synthetic samples designed to simulate potential unseen samples. While ADA effectively addresses amplitude-related distribution shifts, it falls short in managing temporal shifts, which are essential for time series data. To address this limitation, we propose Temporal Adversarial Data Augmentation for Time Series Data (TADA), which incorporates a time warping technique specifically targeting temporal shifts. Recognizing the challenge of non-differentiability in traditional time warping, we make it differentiable by leveraging phase shifts in the frequency domain. Our evaluations across diverse domains demonstrate that TADA significantly outperforms existing ADA variants, enhancing model performance across time series datasets with varied distributions.
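The frequency-domain idea in the TADA abstract rests on the Fourier shift theorem: delaying a signal by tau samples multiplies each frequency bin by a phase factor that is smooth in tau. The following is a minimal sketch of that theorem only, not the authors' time-warping method; it applies one uniform circular shift, and the function names (`dft`, `idft`, `phase_shift`) are hypothetical. It uses a naive O(n^2) transform in pure Python to stay self-contained.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [
        sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def phase_shift(x, tau):
    """Circularly delay the real signal x by tau samples (tau may be fractional).

    Assumption: a single global shift, which is much simpler than a warp.
    The output depends on tau only through exp(-2*pi*i*f*tau/n), so it is
    a smooth function of tau.
    """
    n = len(x)
    X = dft(x)
    shifted = []
    for k in range(n):
        # Use signed frequency indices so fractional shifts stay (nearly) real.
        f = k if k <= n // 2 else k - n
        shifted.append(X[k] * cmath.exp(-2j * math.pi * f * tau / n))
    # Discard the residual imaginary part left by rounding (and by the
    # Nyquist bin when tau is fractional and n is even).
    return [v.real for v in idft(shifted)]


signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
rolled = phase_shift(signal, 2)      # integer delay: a circular roll
halfway = phase_shift(signal, 0.5)   # fractional delay: interpolated values
```

For an integer tau the result matches a circular roll of the input up to floating-point error, while fractional values of tau interpolate between samples, which is what makes a shift parameter amenable to gradient-based optimization.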
Title:
Unifying Invariant and Variant Features for Graph Out-of-Distribution via Probability of Necessity and Sufficiency
Abstract
Graph Out-of-Distribution (OOD), requiring that models trained on biased data generalize to the unseen test data, has considerable real-world applications. One of the most mainstream methods is to extract the invariant subgraph by aligning the original and augmented data with the help of environment augmentation. However, these solutions might lead to the loss or redundancy of semantic subgraphs and result in suboptimal generalization. To address this challenge, we propose exploiting Probability of Necessity and Sufficiency (PNS) to extract sufficient and necessary invariant substructures. Beyond that, we further leverage the domain variant subgraphs related to the labels to boost the generalization performance in an ensemble manner. Specifically, we first consider the data generation process for graph data. Under mild conditions, we show that the sufficient and necessary invariant subgraph can be extracted by minimizing an upper bound, built on the theoretical advance of the probability of necessity and sufficiency. To further bridge the theory and algorithm, we devise the model called Sufficiency and Necessity Inspired Graph Learning (SNIGL), which ensembles an invariant subgraph classifier on top of latent sufficient and necessary invariant subgraphs, and a domain variant subgraph classifier specific to the test domain for generalization enhancement. Experimental results demonstrate that our SNIGL model outperforms the state-of-the-art techniques on six public benchmarks, highlighting its effectiveness in real-world scenarios.
Title:
Iterative Ensemble Training with Anti-Gradient Control for Mitigating Memorization in Diffusion Models
Abstract
Diffusion models, known for their tremendous ability to generate novel and high-quality samples, have recently raised concerns due to their data memorization behavior, which poses privacy risks. Recent approaches for memorization mitigation either focused only on the text modality problem in cross-modal generation tasks or utilized data augmentation strategies. In this paper, we propose a novel training framework for diffusion models from the perspective of visual modality, which is more generic and fundamental for mitigating memorization. To facilitate "forgetting" of stored information in diffusion model parameters, we propose an iterative ensemble training strategy by splitting the data into multiple shards for training multiple models and intermittently aggregating these model parameters. Moreover, practical analysis of losses illustrates that the training loss for easily memorable images tends to be noticeably lower. Thus, we propose an anti-gradient control method to exclude the sample with a lower loss value from the current mini-batch to avoid memorizing. Extensive experiments and analysis on four datasets are conducted to illustrate the effectiveness of our method, and results show that our method successfully reduces memory capacity while even improving the performance slightly. Moreover, to save computing cost, we successfully apply our method to fine-tune well-trained diffusion models for a limited number of epochs, demonstrating the applicability of our method. Code is available at this https URL.
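The two mechanisms named in this abstract, dropping low-loss samples from a mini-batch and periodically aggregating the parameters of models trained on separate shards, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the fixed loss threshold, the plain element-wise mean as the aggregation rule, the scalar parameters, and the function names `keep_high_loss` and `average_parameters`. The abstract does not specify any of these details.

```python
def keep_high_loss(losses, threshold):
    """Return indices of samples whose loss is at least `threshold`.

    Assumption: a fixed threshold decides which samples count as
    "easily memorable" (low loss) and are skipped for this update.
    """
    return [i for i, loss in enumerate(losses) if loss >= threshold]


def average_parameters(shard_models):
    """Element-wise mean of parameter dictionaries sharing the same keys.

    Assumption: aggregation is a plain average of scalar parameters.
    """
    n = len(shard_models)
    keys = shard_models[0].keys()
    return {k: sum(m[k] for m in shard_models) / n for k in keys}


# Per-sample losses for one mini-batch; samples 1 and 3 look memorized.
batch_losses = [0.9, 0.05, 0.4, 0.01]
kept = keep_high_loss(batch_losses, threshold=0.1)

# Parameters of two models trained on different data shards.
merged = average_parameters([{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}])
```

In a real training loop the kept indices would select which per-sample losses contribute to the gradient step, and the merged parameters would be copied back into each shard model before training resumes.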
Title:
Poisoning with A Pill: Circumventing Detection in Federated Learning
Authors: Hanxi Guo, Hao Wang, Tao Song, Tianhang Zheng, Yang Hua, Haibing Guan, Xiangyu Zhang
Subjects:
Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
Without direct access to the client's data, federated learning (FL) is well-known for its unique strength in data privacy protection among existing distributed machine learning techniques. However, its distributive and iterative nature makes FL inherently vulnerable to various poisoning attacks. To counteract these threats, extensive defenses have been proposed to filter out malicious clients, using various detection metrics. Based on our analysis of existing attacks and defenses, we find that there is a lack of attention to model redundancy. In neural networks, various model parameters contribute differently to the model's performance. However, existing attacks in FL manipulate all the model update parameters with the same strategy, making them easily detectable by common defenses. Meanwhile, the defenses also tend to analyze the overall statistical features of the entire model updates, leaving room for sophisticated attacks. Based on these observations, this paper proposes a generic and attack-agnostic augmentation approach designed to enhance the effectiveness and stealthiness of existing FL poisoning attacks against detection in FL, pointing out the inherent flaws of existing defenses and exposing the necessity of fine-grained FL security. Specifically, we employ a three-stage methodology that strategically constructs, generates, and injects poison (generated by existing attacks) into a pill (a tiny subnet with a novel structure) during the FL training; the stages are named pill construction, pill poisoning, and pill injection, respectively. Extensive experimental results show that FL poisoning attacks enhanced by our method can bypass all the popular defenses, achieving up to a 7x increase in error rate, and more than a 2x increase on average, on both IID and non-IID data, in both cross-silo and cross-device FL systems.
Title:
LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models
Authors: Xi Chen, Songyang Zhang, Qibing Bai, Kai Chen, Satoshi Nakamura
Subjects:
Computation and Language (cs.CL)
Abstract
We introduce LLaST, a framework for building high-performance Large Language model based Speech-to-text Translation systems. We address the limitations of end-to-end speech translation (E2E ST) models by exploring model architecture design and optimization techniques tailored for LLMs. Our approach includes LLM-based speech translation architecture design, ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization. Our approach demonstrates superior performance on the CoVoST-2 benchmark and showcases exceptional scaling capabilities powered by LLMs. We believe this effective method will serve as a strong baseline for speech translation and provide insights for future improvements of the LLM-based speech translation framework. We release the data, code and models at this https URL.
Title:
Domain-Adaptive 2D Human Pose Estimation via Dual Teachers in Extremely Low-Light Conditions
Authors: Yihao Ai, Yifei Qi, Bo Wang, Yu Cheng, Xinchao Wang, Robby T. Tan
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Existing 2D human pose estimation research predominantly concentrates on well-lit scenarios, with limited exploration of poor lighting conditions, which are a prevalent aspect of daily life. Recent studies on low-light pose estimation require the use of paired well-lit and low-light images with ground truths for training, which are impractical due to the inherent challenges associated with annotation on low-light images. To this end, we introduce a novel approach that eliminates the need for low-light ground truths. Our primary novelty lies in leveraging two complementary teacher networks to generate more reliable pseudo labels, enabling our model to achieve competitive performance on extremely low-light images without the need for training with low-light ground truths. Our framework consists of two stages. In the first stage, our model is trained on well-lit data with low-light augmentations. In the second stage, we propose a dual-teacher framework to utilize the unlabeled low-light data, where a center-based main teacher produces the pseudo labels for relatively visible cases, while a keypoints-based complementary teacher focuses on producing the pseudo labels for the persons missed by the main teacher. With the pseudo labels from both teachers, we propose a person-specific low-light augmentation to challenge a student model in training to outperform the teachers. Experimental results on a real low-light dataset (ExLPose-OCN) show that our method achieves a 6.8% (2.4 AP) improvement over the state-of-the-art (SOTA) method, even though no low-light ground-truth data is used in our approach, in contrast to the SOTA method. Our code will be available at: this https URL.
Title:
Learning deep illumination-robust features from multispectral filter array images
Authors: Anis Amziane
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Multispectral (MS) snapshot cameras equipped with an MS filter array (MSFA) capture multiple spectral bands in a single shot, resulting in a raw mosaic image where each pixel holds only one channel value. The fully-defined MS image is estimated from the raw one through demosaicing, which inevitably introduces spatio-spectral artifacts. Moreover, training on fully-defined MS images can be computationally intensive, particularly with deep neural networks (DNNs), and may result in features lacking discrimination power due to suboptimal learning of spatio-spectral interactions. Furthermore, outdoor MS image acquisition occurs under varying lighting conditions, leading to illumination-dependent features. This paper presents an original approach to learn discriminant and illumination-robust features directly from raw images. It involves: raw spectral constancy to mitigate the impact of illumination, MSFA-preserving transformations suited for raw image augmentation to train DNNs on diverse raw textures, and raw-mixing to capture discriminant spatio-spectral interactions in raw images. Experiments on MS image classification show that our approach outperforms both handcrafted and recent deep learning-based methods, while also requiring significantly less computational effort. The source code will be available.
Title:
MODRL-TA: A Multi-Objective Deep Reinforcement Learning Framework for Traffic Allocation in E-Commerce Search
Abstract
Traffic allocation is a process of redistributing natural traffic to products by adjusting their positions in the post-search phase, aimed at effectively fostering merchant growth, precisely meeting customer demands, and ensuring the maximization of interests across various parties within e-commerce platforms. Existing methods based on learning to rank neglect the long-term value of traffic allocation, whereas reinforcement learning approaches struggle to balance multiple objectives and suffer from cold starts in real-world data environments. To address these issues, this paper proposes a multi-objective deep reinforcement learning framework consisting of multi-objective Q-learning (MOQ), a decision fusion algorithm (DFM) based on the cross-entropy method (CEM), and a progressive data augmentation system (PDA). Specifically, MOQ constructs ensemble RL models, each dedicated to an objective, such as click-through rate, conversion rate, etc. These models individually determine the position of items as actions, aiming to estimate the long-term value of multiple objectives from an individual perspective. Then we employ DFM to dynamically adjust weights among objectives to maximize long-term value, addressing temporal dynamics in objective preferences in e-commerce scenarios. Initially, PDA trains MOQ with simulated data from offline logs. As experiments progress, it strategically integrates real user interaction data, ultimately replacing the simulated dataset to alleviate distributional shifts and the cold start problem. Experimental results on real-world online e-commerce systems demonstrate the significant improvements of MODRL-TA, and we have successfully deployed MODRL-TA on an e-commerce search platform.
Title:
Refining Corpora from a Model Calibration Perspective for Chinese Spelling Correction
Authors: Dingyao Yu, Yang An, Wei Ye, Xiongfeng Xiao, Shaoguang Mao, Tao Ge, Shikun Zhang
Subjects:
Computation and Language (cs.CL)
Abstract
Chinese Spelling Correction (CSC) commonly lacks large-scale high-quality corpora, due to the labor-intensive labeling of spelling errors in real-life human writing or typing scenarios. Two data augmentation methods are widely adopted: (1) Random Replacement with the guidance of confusion sets and (2) OCR/ASR-based Generation that simulates character misuse. However, both methods inevitably introduce noisy data (e.g., false spelling errors), potentially leading to over-correction. By carefully analyzing the two types of corpora, we find that though the latter achieves more robust generalization performance, the former yields better-calibrated CSC models. We then provide a theoretical analysis of this empirical observation, based on which a corpus refining strategy is proposed. Specifically, OCR/ASR-based data samples are fed into a well-calibrated CSC model trained on random replacement-based corpora and then filtered based on prediction confidence. By learning a simple BERT-based model on the refined OCR/ASR-based corpus, we achieve state-of-the-art performance on three widely-used benchmarks, while significantly alleviating over-correction (e.g., lowering false positive predictions).
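The refining step described above, scoring OCR/ASR-generated samples with a calibrated model and filtering them by confidence, can be sketched generically. This is an illustration under stated assumptions, not the authors' procedure: the abstract does not say which side of the threshold is kept, so keeping samples at or above a threshold is an assumption, as are the threshold value, the `refine_corpus` name, and the lookup-table stand-in for a real model.

```python
def refine_corpus(samples, confidence_fn, threshold=0.5):
    """Keep the samples whose model confidence is at least `threshold`.

    Assumptions: `confidence_fn` wraps a calibrated model and returns a
    probability in [0, 1]; low-confidence samples are treated as noisy.
    """
    return [s for s in samples if confidence_fn(s) >= threshold]


# Toy (source, target) pairs standing in for OCR/ASR-generated data.
corpus = [
    ("teh cat", "the cat"),
    ("a dog", "a hog"),
    ("grren tea", "green tea"),
]

# Hypothetical confidence scores a calibrated model might assign.
toy_scores = {
    ("teh cat", "the cat"): 0.93,
    ("a dog", "a hog"): 0.12,
    ("grren tea", "green tea"): 0.81,
}

refined = refine_corpus(corpus, toy_scores.get, threshold=0.5)
```

With these toy scores the second pair, which introduces a false spelling error, falls below the threshold and is dropped from the refined corpus.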
Title:
Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning
Abstract
Can we endow visuomotor robots with generalization capabilities to operate in diverse open-world scenarios? In this paper, we propose Maniwhere, a generalizable framework tailored for visual reinforcement learning, enabling the trained robot policies to generalize across a combination of multiple visual disturbance types. Specifically, we introduce a multi-view representation learning approach fused with a Spatial Transformer Network (STN) module to capture shared semantic information and correspondences among different viewpoints. In addition, we employ a curriculum-based randomization and augmentation approach to stabilize the RL training process and strengthen the visual generalization ability. To exhibit the effectiveness of Maniwhere, we meticulously design 8 tasks encompassing articulated objects, bi-manual, and dexterous hand manipulation tasks, demonstrating Maniwhere's strong visual generalization and sim2real transfer abilities across 3 hardware platforms. Our experiments show that Maniwhere significantly outperforms existing state-of-the-art methods. Videos are provided at this https URL.
Keyword: detection
Keyword: face recognition
Keyword: augmentation