Abstract
A myriad of factors affect large scale ads delivery systems and influence both user experience and revenue. One such factor is proactive detection and calibration of seasonal advertisements to help with increasing conversion and user satisfaction. In this paper, we present Proactive Detection and Calibration of Seasonal Advertisements (PDCaSA), a research problem that is of interest for the ads ranking and recommendation community, both in the industrial setting as well as in research. Our paper provides detailed guidelines from various angles of this problem tested in, and motivated by a large-scale industrial ads ranking system. We share our findings including the clear statement of the problem and its motivation rooted in real-world systems, evaluation metrics, and sheds lights to the existing challenges, lessons learned, and best practices of data annotation and machine learning modeling to tackle this problem. Lastly, we present a conclusive solution we took during this research exploration: to detect seasonality, we leveraged Multimodal LLMs (MLMs) which on our in-house benchmark achieved 0.97 top F1 score. Based on our findings, we envision MLMs as a teacher for knowledge distillation, a machine labeler, and a part of the ensembled and tiered seasonality detection system, which can empower ads ranking systems with enriched seasonal information.
Title:
On the Black-box Explainability of Object Detection Models for Safe and Trustworthy Industrial Applications
Authors: Alain Andres, Aitor Martinez-Seras, Ibai Laña, Javier Del Ser
Abstract
In the realm of human-machine interaction, artificial intelligence has become a powerful tool for accelerating data modeling tasks. Object detection methods have achieved outstanding results and are widely used in critical domains like autonomous driving and video surveillance. However, their adoption in high-risk applications, where errors may cause severe consequences, remains limited. Explainable Artificial Intelligence (XAI) methods aim to address this issue, but many existing techniques are model-specific and designed for classification tasks, making them less effective for object detection and difficult for non-specialists to interpret. In this work we focus on model-agnostic XAI methods for object detection models and propose D-MFPP, an extension of the Morphological Fragmental Perturbation Pyramid (MFPP), which uses segmentation-based mask generation. Additionally, we introduce D-Deletion, a novel metric combining faithfulness and localization, adapted specifically to meet the unique demands of object detectors. We evaluate these methods on real-world industrial and robotic datasets, examining the influence of parameters such as the number of masks, model size, and image resolution on the quality of explanations. Our experiments use single-stage object detection models applied to two safety-critical robotic environments: i) a shared human-robot workspace where safety is of paramount importance, and ii) an assembly area of battery kits, where safety is critical due to the potential for damage among high-risk components. Our findings evince that D-Deletion effectively gauges the performance of explanations when multiple elements of the same class appear in the same scene, while D-MFPP provides a promising alternative to D-RISE when fewer masks are used.
Title:
A Bellman-Ford algorithm for the path-length-weighted distance in graphs
Authors: R. Arnau, J. M. Calabuig, L. M. García Raffi, E. A. Sánchez Pérez, S. Sanjuan
Subjects: Subjects:
Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM); Mathematical Software (cs.MS)
Abstract
Consider a finite directed graph without cycles in which the arrows are weighted. We present an algorithm for the computation of a new distance, called path-length-weighted distance, which has proven useful for graph analysis in the context of fraud detection. The idea is that the new distance explicitly takes into account the size of the paths in the calculations. Thus, although our algorithm is based on arguments similar to those at work for the Bellman-Ford and Dijkstra methods, it is in fact essentially different. We lay out the appropriate framework for its computation, showing the constraints and requirements for its use, along with some illustrative examples.
Title:
Transparent Tagging for Strategic Social Nudges on User-Generated Misinformation
Authors: Ya-Ting Yang, Tao Li, Quanyan Zhu
Subjects: Subjects:
Computer Science and Game Theory (cs.GT); Social and Information Networks (cs.SI)
Abstract
Social network platforms (SNP), such as X and TikTok, rely heavily on user-generated content to attract users and advertisers, yet they have limited control over content provision, which leads to the proliferation of misinformation across platforms. As countermeasures, SNPs have implemented various policies, such as tweet labeling, to notify users about potentially misleading information, influencing users' responses, either favorably or unfavorably, to the tagged contents. The population-level response creates a social nudge to the content provider that encourages it to supply more authentic content without exerting direct control over the provider. Yet, when designing such tagging policies to leverage social nudges, SNP must be cautious about the potential misdetection of misinformation (wrongly detecting factual content as misinformation and vice versa), which impairs its credibility to generic users and, hence, its ability to create social nudges. This work establishes a Bayesian persuaded branching process to study SNP's tagging policy design under misdetection. Misinformation circulation is modeled by a multi-type branching process, where users are persuaded through tagging to give positive and negative comments that influence the spread of misinformation. When translated into posterior belief space, the SNP's problem is reduced to an equality-constrained convex optimization, the optimal condition of which is given by the Lagrangian characterization. The key finding is that SNP's optimal policy is simply transparent tagging, i.e., revealing the content's authenticity to the user, albeit midsection, which nudges the provider not to generate misinformation. We corroborate our findings using numerical simulations.
Title:
Longitudinal Mammogram Exam-based Breast Cancer Diagnosis Models: Vulnerability to Adversarial Attacks
Abstract
In breast cancer detection and diagnosis, the longitudinal analysis of mammogram images is crucial. Contemporary models excel in detecting temporal imaging feature changes, thus enhancing the learning process over sequential imaging exams. Yet, the resilience of these longitudinal models against adversarial attacks remains underexplored. In this study, we proposed a novel attack method that capitalizes on the feature-level relationship between two sequential mammogram exams of a longitudinal model, guided by both cross-entropy loss and distance metric learning, to achieve significant attack efficacy, as implemented using attack transferring in a black-box attacking manner. We performed experiments on a cohort of 590 breast cancer patients (each has two sequential mammogram exams) in a case-control setting. Results showed that our proposed method surpassed several state-of-the-art adversarial attacks in fooling the diagnosis models to give opposite outputs. Our method remained effective even if the model was trained with the common defending method of adversarial training.
Title:
Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment
Authors: Jiaqi Wu, Simin Chen, Zehua Wang, Wei Chen, Zijian Tian, F. Richard Yu, Victor C. M. Leung
Abstract
As the volume of image data grows, data-oriented cloud computing in Internet of Video Things (IoVT) systems encounters latency issues. Task-oriented edge computing addresses this by shifting data analysis to the edge. However, limited computational power of edge devices poses challenges for executing visual tasks. Existing methods struggle to balance high model performance with low resource consumption; lightweight neural networks often underperform, while device-specific models designed by Neural Architecture Search (NAS) fail to adapt to heterogeneous devices. For these issues, we propose a novel co-design framework to optimize neural network architecture and deployment strategies during inference for high-throughput. Specifically, it implements a dynamic model structure based on re-parameterization, coupled with a Roofline-based model partitioning strategy to enhance the computational performance of edge devices. We also employ a multi-objective co-optimization approach to balance throughput and accuracy. Additionally, we derive mathematical consistency and convergence of partitioned models. Experimental results demonstrate significant improvements in throughput (12.05\% on MNIST, 18.83\% on ImageNet) and superior classification accuracy compared to baseline algorithms. Our method consistently achieves stable performance across different devices, underscoring its adaptability. Simulated experiments further confirm its efficacy in high-accuracy, real-time detection for small objects in IoVT systems.
Title:
CausAdv: A Causal-based Framework for Detecting Adversarial Examples
Abstract
Deep learning has led to tremendous success in many real-world applications of computer vision, thanks to sophisticated architectures such as Convolutional neural networks (CNNs). However, CNNs have been shown to be vulnerable to crafted adversarial perturbations in inputs. These inputs appear almost indistinguishable from natural images, yet they are incorrectly classified by CNN architectures. This vulnerability of adversarial examples has led researchers to focus on enhancing the robustness of deep learning models in general, and CNNs in particular, by creating defense and detection methods to distinguish adversarials inputs from natural ones. In this paper, we address the adversarial robustness of CNNs through causal reasoning. We propose CausAdv: a causal framework for detecting adversarial examples based on counterfactual reasoning. CausAdv learns causal and non-causal features of every input, and quantifies the counterfactual information (CI) of every filter of the last convolutional layer. Then we perform statistical analysis on the filters CI of every sample, whether clan or adversarials, to demonstrate how adversarial examples indeed exhibit different CI distributions compared to clean samples. Our results show that causal reasoning enhances the process of adversarials detection without the need to train a separate detector. In addition, we illustrate the efficiency of causal explanations as a helpful detection technique through visualizing the causal features. The results can be reproduced using the code available in the repository: this https URL.
Title:
GWQ: Gradient-Aware Weight Quantization for Large Language Models
Abstract
Large language models (LLMs) show impressive performance in solving complex languagetasks. However, its large number of parameterspresent significant challenges for the deployment and application of the model on edge devices. Compressing large language models to low bits can enable them to run on resource-constrained devices, often leading to performance degradation. To address this problem, we propose gradient-aware weight quantization (GWQ), the first quantization approach for low-bit weight quantization that leverages gradients to localize outliers, requiring only a minimal amount of calibration data for outlier detection. GWQ retains the weights corresponding to the top 1% outliers preferentially at FP16 precision, while the remaining non-outlier weights are stored in a low-bit format. GWQ found experimentally that utilizing the sensitive weights in the gradient localization model is more scientific compared to utilizing the sensitive weights in the Hessian matrix localization model. Compared to current quantization methods, GWQ can be applied to multiple language models and achieves lower PPL on the WikiText2 and C4 dataset. In the zero-shot task, GWQ quantized models have higher accuracy compared to other quantization this http URL is also suitable for multimodal model quantization, and the quantized Qwen-VL family model is more accurate than other methods. zero-shot target detection task dataset RefCOCO outperforms the current stat-of-the-arts method SPQR. GWQ achieves 1.2x inference speedup in comparison to the original model, and effectively reduces the inference memory.
Title:
DiabML: AI-assisted diabetes diagnosis method with meta-heuristic-based feature selection
Abstract
Diabetes is a chronic disorder identified by the high sugar level in the blood that can cause various different disorders such as kidney failure, heart attack, sightlessness, and stroke. Developments in the healthcare domain by facilitating the early detection of diabetes risk can help not only caregivers but also patients. AIoMT is a recent technology that integrates IoT and machine learning methods to give services for medical purposes, which is a powerful technology for the early detection of diabetes. In this paper, we take advantage of AIoMT and propose a hybrid diabetes risk detection method, DiabML, which uses the BWO algorithm and ML methods. BWO is utilized for feature selection and SMOTE for imbalance handling in the pre-processing procedure. The simulation results prove the superiority of the proposed DiabML method compared to the existing works. DiabML achieves 86.1\% classification accuracy by AdaBoost classifier outperforms the relevant existing methods.
Title:
Technical Report for SoccerNet Challenge 2022 -- Replay Grounding Task
Abstract
In order to make full use of video information, we transform the replay grounding problem into a video action location problem. We apply a unified network Faster-TAD proposed by us for temporal action detection to get the results of replay grounding. Finally, by observing the data distribution of the training data, we refine the output of the model to get the final submission.
Title:
Technical Report for ActivityNet Challenge 2022 -- Temporal Action Localization
Abstract
In the task of temporal action localization of ActivityNet-1.3 datasets, we propose to locate the temporal boundaries of each action and predict action class in untrimmed videos. We first apply VideoSwinTransformer as feature extractor to extract different features. Then we apply a unified network following Faster-TAD to simultaneously obtain proposals and semantic labels. Last, we ensemble the results of different temporal action detection models which complement each other. Faster-TAD simplifies the pipeline of TAD and gets remarkable performance, obtaining comparable results as those of multi-step approaches.
Title:
AAD-LLM: Adaptive Anomaly Detection Using Large Language Models
Authors: Alicia Russell-Gilbert, Alexander Sommers, Andrew Thompson, Logan Cummins, Sudip Mittal, Shahram Rahimi, Maria Seale, Joseph Jaboure, Thomas Arnold, Joshua Church
Abstract
For data-constrained, complex and dynamic industrial environments, there is a critical need for transferable and multimodal methodologies to enhance anomaly detection and therefore, prevent costs associated with system failures. Typically, traditional PdM approaches are not transferable or multimodal. This work examines the use of Large Language Models (LLMs) for anomaly detection in complex and dynamic manufacturing systems. The research aims to improve the transferability of anomaly detection models by leveraging Large Language Models (LLMs) and seeks to validate the enhanced effectiveness of the proposed approach in data-sparse industrial applications. The research also seeks to enable more collaborative decision-making between the model and plant operators by allowing for the enriching of input series data with semantics. Additionally, the research aims to address the issue of concept drift in dynamic industrial settings by integrating an adaptability mechanism. The literature review examines the latest developments in LLM time series tasks alongside associated adaptive anomaly detection methods to establish a robust theoretical framework for the proposed architecture. This paper presents a novel model framework (AAD-LLM) that doesn't require any training or finetuning on the dataset it is applied to and is multimodal. Results suggest that anomaly detection can be converted into a "language" task to deliver effective, context-aware detection in data-constrained industrial applications. This work, therefore, contributes significantly to advancements in anomaly detection methodologies.
Title:
Comparative Evaluation of Applicability Domain Definition Methods for Regression Models
Abstract
The applicability domain refers to the range of data for which the prediction of the predictive model is expected to be reliable and accurate and using a model outside its applicability domain can lead to incorrect results. The ability to define the regions in data space where a predictive model can be safely used is a necessary condition for having safer and more reliable predictions to assure the reliability of new predictions. However, defining the applicability domain of a model is a challenging problem, as there is no clear and universal definition or metric for it. This work aims to make the applicability domain more quantifiable and pragmatic. Eight applicability domain detection techniques were applied to seven regression models, trained on five different datasets, and their performance was benchmarked using a validation framework. We also propose a novel approach based on non-deterministic Bayesian neural networks to define the applicability domain of the model. Our method exhibited superior accuracy in defining the Applicability Domain compared to previous methods, highlighting its potential in this regard.
Title:
Scalable AI Framework for Defect Detection in Metal Additive Manufacturing
Authors: Duy Nhat Phan, Sushant Jha, James P. Mavo, Erin L. Lanigan, Linh Nguyen, Lokendra Poudel, Rahul Bhowmik
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Abstract
Additive Manufacturing (AM) is transforming the manufacturing sector by enabling efficient production of intricately designed products and small-batch components. However, metal parts produced via AM can include flaws that cause inferior mechanical properties, including reduced fatigue response, yield strength, and fracture toughness. To address this issue, we leverage convolutional neural networks (CNN) to analyze thermal images of printed layers, automatically identifying anomalies that impact these properties. We also investigate various synthetic data generation techniques to address limited and imbalanced AM training data. Our models' defect detection capabilities were assessed using images of Nickel alloy 718 layers produced on a laser powder bed fusion AM machine and synthetic datasets with and without added noise. Our results show significant accuracy improvements with synthetic data, emphasizing the importance of expanding training sets for reliable defect detection. Specifically, Generative Adversarial Networks (GAN)-generated datasets streamlined data preparation by eliminating human intervention while maintaining high performance, thereby enhancing defect detection capabilities. Additionally, our denoising approach effectively improves image quality, ensuring reliable defect detection. Finally, our work integrates these models in the CLoud ADditive MAnufacturing (CLADMA) module, a user-friendly interface, to enhance their accessibility and practicality for AM applications. This integration supports broader adoption and practical implementation of advanced defect detection in AM processes.
Title:
Raspberry PhenoSet: A Phenology-based Dataset for Automated Growth Detection and Yield Estimation
Authors: Parham Jafary, Anna Bazangeya, Michelle Pham, Lesley G. Campbell, Sajad Saeedi, Kourosh Zareinia, Habiba Bougherara
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract
The future of the agriculture industry is intertwined with automation. Accurate fruit detection, yield estimation, and harvest time estimation are crucial for optimizing agricultural practices. These tasks can be carried out by robots to reduce labour costs and improve the efficiency of the process. To do so, deep learning models should be trained to perform knowledge-based tasks, which outlines the importance of contributing valuable data to the literature. In this paper, we introduce Raspberry PhenoSet, a phenology-based dataset designed for detecting and segmenting raspberry fruit across seven developmental stages. To the best of our knowledge, Raspberry PhenoSet is the first fruit dataset to integrate biology-based classification with fruit detection tasks, offering valuable insights for yield estimation and precise harvest timing. This dataset contains 1,853 high-resolution images, the highest quality in the literature, captured under controlled artificial lighting in a vertical farm. The dataset has a total of 6,907 instances of mask annotations, manually labelled to reflect the seven phenology stages. We have also benchmarked Raspberry PhenoSet using several state-of-the-art deep learning models, including YOLOv8, YOLOv10, RT-DETR, and Mask R-CNN, to provide a comprehensive evaluation of their performance on the dataset. Our results highlight the challenges of distinguishing subtle phenology stages and underscore the potential of Raspberry PhenoSet for both deep learning model development and practical robotic applications in agriculture, particularly in yield prediction and supply chain management. The dataset and the trained models are publicly available for future studies.
Title:
FedDTPT: Federated Discrete and Transferable Prompt Tuning for Black-Box Large Language Models
Abstract
In recent years, large language models (LLMs) have significantly advanced the field of natural language processing (NLP). By fine-tuning LLMs with data from specific scenarios, these foundation models can better adapt to various downstream tasks. However, the fine-tuning process poses privacy leakage risks, particularly in centralized data processing scenarios. To address user privacy concerns, federated learning (FL) has been introduced to mitigate the risks associated with centralized data collection from multiple sources. Nevertheless, the privacy of LLMs themselves is equally critical, as potential malicious attacks challenge their security, an issue that has received limited attention in current research. Consequently, establishing a trusted multi-party model fine-tuning environment is essential. Additionally, the local deployment of large LLMs incurs significant storage costs and high computational demands. To address these challenges, we propose for the first time a federated discrete and transferable prompt tuning, namely FedDTPT, for black-box large language models. In the client optimization phase, we adopt a token-level discrete prompt optimization method that leverages a feedback loop based on prediction accuracy to drive gradient-free prompt optimization through the MLM API. For server optimization, we employ an attention mechanism based on semantic similarity to filter all local prompt tokens, along with an embedding distance elbow detection and DBSCAN clustering strategy to enhance the filtering process. Experimental results demonstrate that, compared to state-of-the-art methods, our approach achieves higher accuracy, reduced communication overhead, and robustness to non-iid data in a black-box setting. Moreover, the optimized prompts are transferable.
Title:
Mixed Reality Teleoperation Assistance for Direct Control of Humanoids
Authors: Luigi Penco, Kazuhiko Momose, Stephen McCrory, Dexton Anderson, Nicholas Kitchel, Duncan Calvert, Robert J. Griffin
Abstract
Teleoperation plays a crucial role in enabling robot operations in challenging environments, yet existing limitations in effectiveness and accuracy necessitate the development of innovative strategies for improving teleoperated tasks. This article introduces a novel approach that utilizes mixed reality and assistive autonomy to enhance the efficiency and precision of humanoid robot teleoperation. By leveraging Probabilistic Movement Primitives, object detection, and Affordance Templates, the assistance combines user motion with autonomous capabilities, achieving task efficiency while maintaining human-like robot motion. Experiments and feasibility studies on the Nadia robot confirm the effectiveness of the proposed framework.
Title:
Identify Backdoored Model in Federated Learning via Individual Unlearning
Authors: Jiahao Xu, Zikai Zhang, Rui Hu
Subjects: Subjects:
Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
Backdoor attacks present a significant threat to the robustness of Federated Learning (FL) due to their stealth and effectiveness. They maintain both the main task of the FL system and the backdoor task simultaneously, causing malicious models to appear statistically similar to benign ones, which enables them to evade detection by existing defense methods. We find that malicious parameters in backdoored models are inactive on the main task, resulting in a significantly large empirical loss during the machine unlearning process on clean inputs. Inspired by this, we propose MASA, a method that utilizes individual unlearning on local models to identify malicious models in FL. To improve the performance of MASA in challenging non-independent and identically distributed (non-IID) settings, we design pre-unlearning model fusion that integrates local models with knowledge learned from other datasets to mitigate the divergence in their unlearning behaviors caused by the non-IID data distributions of clients. Additionally, we propose a new anomaly detection metric with minimal hyperparameters to filter out malicious models efficiently. Extensive experiments on IID and non-IID datasets across six different attacks validate the effectiveness of MASA. To the best of our knowledge, this is the first work to leverage machine unlearning to identify malicious models in FL. Code is available at \url{this https URL}.
Title:
Emoji Attack: A Method for Misleading Judge LLMs in Safety Risk Detection
Authors: Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson
Subjects: Subjects:
Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
Jailbreaking attacks show how Large Language Models (LLMs) can be tricked into generating harmful outputs using malicious prompts. To prevent these attacks, other LLMs are often used as judges to evaluate the harmfulness of the generated content. However, relying on LLMs as judges can introduce biases into the detection process, which in turn compromises the effectiveness of the evaluation. In this paper, we show that Judge LLMs, like other LLMs, are also affected by token segmentation bias. This bias occurs when tokens are split into smaller sub-tokens, altering their embeddings. This makes it harder for the model to detect harmful content. Specifically, this bias can cause sub-tokens to differ significantly from the original token in the embedding space, leading to incorrect "safe" predictions for harmful content. To exploit this bias in Judge LLMs, we introduce the Emoji Attack -- a method that places emojis within tokens to increase the embedding differences between sub-tokens and their originals. These emojis create new tokens that further distort the token embeddings, exacerbating the bias. To counter the Emoji Attack, we design prompts that help LLMs filter out unusual characters. However, this defense can still be bypassed by using a mix of emojis and other characters. The Emoji Attack can also be combined with existing jailbreaking prompts using few-shot learning, which enables LLMs to generate harmful responses with emojis. These responses are often mistakenly labeled as "safe" by Judge LLMs, allowing the attack to slip through. Our experiments with six state-of-the-art Judge LLMs show that the Emoji Attack allows 25\% of harmful responses to bypass detection by Llama Guard and Llama Guard 2, and up to 75\% by ShieldLM. These results highlight the need for stronger Judge LLMs to address this vulnerability.
Title:
BinEnhance: A Enhancement Framework Based on External Environment Semantics for Binary Code Search
Abstract
Binary code search plays a crucial role in applications like software reuse detection. Currently, existing models are typically based on either internal code semantics or a combination of function call graphs (CG) and internal code semantics. However, these models have limitations. Internal code semantic models only consider the semantics within the function, ignoring the inter-function semantics, making it difficult to handle situations such as function inlining. The combination of CG and internal code semantics is insufficient for addressing complex real-world scenarios. To address these limitations, we propose BinEnhance, a novel framework designed to leverage the inter-function semantics to enhance the expression of internal code semantics for binary code search. Specifically, BinEnhance constructs an External Environment Semantic Graph (EESG), which establishes a stable and analogous external environment for homologous functions by using different inter-function semantic relations (e.g., call, location, data-co-use). After the construction of EESG, we utilize the embeddings generated by existing internal code semantic models to initialize nodes of EESG. Finally, we design a Semantic Enhancement Model (SEM) that uses Relational Graph Convolutional Networks (RGCNs) and a residual block to learn valuable external semantics on the EESG for generating the enhanced semantics embedding. In addition, BinEnhance utilizes data feature similarity to refine the cosine similarity of semantic embeddings. We conduct experiments under six different tasks (e.g., under function inlining scenario) and the results illustrate the performance and robustness of BinEnhance. The application of BinEnhance to HermesSim, Asm2vec, TREX, Gemini, and Asteria on two public datasets results in an improvement of Mean Average Precision (MAP) from 53.6% to 69.7%. Moreover, the efficiency increases fourfold.
Title:
CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research
Abstract
This research addresses command-line embedding in cybersecurity, a field obstructed by the lack of comprehensive datasets due to privacy and regulation concerns. We propose the first dataset of similar command lines, named CyPHER, for training and unbiased evaluation. The training set is generated using a set of large language models (LLMs) comprising 28,520 similar command-line pairs. Our testing dataset consists of 2,807 similar command-line pairs sourced from authentic command-line data. In addition, we propose a command-line embedding model named CmdCaliper, enabling the computation of semantic similarity with command lines. Performance evaluations demonstrate that the smallest version of CmdCaliper (30 million parameters) suppresses state-of-the-art (SOTA) sentence embedding models with ten times more parameters across various tasks (e.g., malicious command-line detection and similar command-line retrieval). Our study explores the feasibility of data generation using LLMs in the cybersecurity domain. Furthermore, we release our proposed command-line dataset, embedding models' weights and all program codes to the public. This advancement paves the way for more effective command-line embedding for future researchers.
Title:
$B^4$: A Black-Box Scrubbing Attack on LLM Watermarks
Authors: Baizhou Huang, Xiao Pu, Xiaojun Wan
Subjects: Subjects:
Computation and Language (cs.CL)
Abstract
Watermarking has emerged as a prominent technique for LLM-generated content detection by embedding imperceptible patterns. Despite supreme performance, its robustness against adversarial attacks remains underexplored. Previous work typically considers a grey-box attack setting, where the specific type of watermark is already known. Some even necessitates knowledge about hyperparameters of the watermarking method. Such prerequisites are unattainable in real-world scenarios. Targeting at a more realistic black-box threat model with fewer assumptions, we here propose $\mathcal{B}^4$, a black-box scrubbing attack on watermarks. Specifically, we formulate the watermark scrubbing attack as a constrained optimization problem by capturing its objectives with two distributions, a Watermark Distribution and a Fidelity Distribution. This optimization problem can be approximately solved using two proxy distributions. Experimental results across 12 different settings demonstrate the superior performance of $\mathcal{B}^4$ compared with other baselines.
Title:
MonoPlane: Exploiting Monocular Geometric Cues for Generalizable 3D Plane Reconstruction
Authors: Wang Zhao, Jiachen Liu, Sheng Zhang, Yishu Li, Sili Chen, Sharon X Huang, Yong-Jin Liu, Hengkai Guo
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract
This paper presents a generalizable 3D plane detection and reconstruction framework named MonoPlane. Unlike previous robust estimator-based works (which require multiple images or RGB-D input) and learning-based works (which suffer from domain shift), MonoPlane combines the best of two worlds and establishes a plane reconstruction pipeline based on monocular geometric cues, resulting in accurate, robust and scalable 3D plane detection and reconstruction in the wild. Specifically, we first leverage large-scale pre-trained neural networks to obtain the depth and surface normals from a single image. These monocular geometric cues are then incorporated into a proximity-guided RANSAC framework to sequentially fit each plane instance. We exploit effective 3D point proximity and model such proximity via a graph within RANSAC to guide the plane fitting from noisy monocular depths, followed by image-level multi-plane joint optimization to improve the consistency among all plane instances. We further design a simple but effective pipeline to extend this single-view solution to sparse-view 3D plane reconstruction. Extensive experiments on a list of datasets demonstrate our superior zero-shot generalizability over baselines, achieving state-of-the-art plane reconstruction performance in a transferring setting. Our code is available at this https URL .
Title:
Strengthening DeFi Security: A Static Analysis Approach to Flash Loan Vulnerabilities
Authors: Ka Wai Wu
Subjects: Subjects:
Cryptography and Security (cs.CR)
Abstract
The rise of Decentralized Finance (DeFi) has brought novel financial opportunities but also exposed serious security vulnerabilities, with flash loans frequently exploited for price manipulation attacks. These attacks, leveraging the atomic nature of flash loans, allow malicious actors to manipulate DeFi protocol oracles and pricing mechanisms within a single transaction, causing substantial financial losses. Traditional smart contract analysis tools address some security risks but often struggle to detect the complex, inter-contract dependencies that make flash loan attacks challenging to identify. In response, we introduce FlashDeFier, an advanced detection framework that enhances static taint analysis to target price manipulation vulnerabilities arising from flash loans. FlashDeFier expands the scope of taint sources and sinks, enabling comprehensive analysis of data flows across DeFi protocols. The framework constructs detailed inter-contract call graphs to capture sophisticated data flow patterns, significantly improving detection accuracy. Tested against a dataset of high-profile DeFi incidents, FlashDeFier identifies 76.4% of price manipulation vulnerabilities, marking a 30% improvement over DeFiTainter. These results highlight the importance of adaptive detection frameworks that evolve alongside DeFi threats, underscoring the need for hybrid approaches combining static, dynamic, and symbolic analysis methods for resilient DeFi security.
Title:
Confidence Aware Learning for Reliable Face Anti-spoofing
Authors: Xingming Long, Jie Zhang, Shiguang Shan
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Current Face Anti-spoofing (FAS) models tend to make overly confident predictions even when encountering unfamiliar scenarios or unknown presentation attacks, which leads to serious potential risks. To solve this problem, we propose a Confidence Aware Face Anti-spoofing (CA-FAS) model, which is aware of its capability boundary, thus achieving reliable liveness detection within this boundary. To enable the CA-FAS to "know what it doesn't know", we propose to estimate its confidence during the prediction of each sample. Specifically, we build Gaussian distributions for both the live faces and the known attacks. The prediction confidence for each sample is subsequently assessed using the Mahalanobis distance between the sample and the Gaussians for the "known data". We further introduce the Mahalanobis distance-based triplet mining to optimize the parameters of both the model and the constructed Gaussians as a whole. Extensive experiments show that the proposed CA-FAS can effectively recognize samples with low prediction confidence and thus achieve much more reliable performance than other FAS models by filtering out samples that are beyond its reliable range.
Title:
An Innovative CGL-MHA Model for Sarcasm Sentiment Recognition Using the MindSpore Framework
Authors: Zhenkai Qin, Qining Luo, Xunyi Nong
Subjects: Subjects:
Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
The pervasive use of the Internet and social media introduces significant challenges to automated sentiment analysis, particularly for sarcastic expressions in user-generated content. Sarcasm conveys negative emotions through ostensibly positive or exaggerated language, complicating its detection within natural language processing tasks. To address this, we propose an innovative sarcasm detection model integrating Convolutional Neural Networks (CNN), Gated Recurrent Units (GRU), Long Short-Term Memory (LSTM), and Multi-Head Attention mechanisms. The CNN component captures local n-gram features, while GRU and LSTM layers model sequential dependencies and contextual information. Multi-Head Attention enhances the model's focus on relevant parts of the input, improving interpretability. Experiments on two sarcasm detection datasets, Headlines and Riloff, demonstrate that the model achieves an accuracy of 81.20% and an F1 score of 80.77% on Headlines, and an accuracy of 79.72% with an F1 score of 61.39% on Riloff, outperforming traditional models. These results validate the effectiveness of our hybrid approach for sarcasm detection in social media texts.
Title:
PARIS: A Practical, Adaptive Trace-Fetching and Real-Time Malicious Behavior Detection System
Abstract
The escalating sophistication of cyber-attacks and the widespread utilization of stealth tactics have led to significant security threats globally. Nevertheless, the existing static detection methods exhibit limited coverage, and traditional dynamic monitoring approaches encounter challenges in bypassing evasion techniques. Thus, it has become imperative to implement nuanced and dynamic analysis to achieve precise behavior detection in real time. There are two pressing concerns associated with current dynamic malware behavior detection solutions. Firstly, the collection and processing of data entail a significant amount of overhead, making it challenging to be employed for real-time detection on the end host. Secondly, these approaches tend to treat malware as a singular entity, thereby overlooking varied behaviors within one instance. To fill these gaps, we propose PARIS, an adaptive trace fetching, lightweight, real-time malicious behavior detection system. Specifically, we monitor malicious behavior with Event Tracing for Windows (ETW) and learn to selectively collect maliciousness-related APIs or call stacks, significantly reducing the data collection overhead. As a result, we can monitor a wider range of APIs and detect more intricate attack behavior. We implemented a prototype of PARIS and evaluated the system overhead, the accuracy of comparative behavior recognition, and the impact of different models and parameters. The result demonstrates that PARIS can reduce over 98.8% of data compared to the raw ETW trace and hence decreases the overhead on the host in terms of memory, bandwidth, and CPU usage with a similar detection accuracy to the baselines that suffer from the high overhead. Furthermore, a breakdown evaluation shows that 80% of the memory and bandwidth savings and a complete reduction in CPU usage can be attributed to our adaptive trace-fetching collector.
Title:
ECG-PPS: Privacy Preserving Disease Diagnosis and Monitoring System for Real-Time ECG Signal
Abstract
This study introduces the development of a state of the art, real time ECG monitoring and analysis system, incorporating cutting edge medical technology and innovative data security measures. Our system performs three distinct functions thaat real time ECG monitoring and disease detection, encrypted storage and synchronized visualization, and statistical analysis on encrypted data. At its core, the system uses a three lead ECG preamplifier connected through a serial port to capture, display, and record real time ECG data. These signals are securely stored in the cloud using robust encryption methods. Authorized medical personnel can access and decrypt this data on their computers, with AES encryption ensuring synchronized real time data tracking and visualization. Furthermore, the system performs statistical operations on the ECG data stored in the cloud without decrypting it, using Fully Homomorphic Encryption (FHE). This enables privacy preserving data analysis while ensuring the security and confidentiality of patient information. By integrating these independent functions, our system significantly enhances the security and efficiency of health monitoring. It supports critical tasks such as disease detection, patient monitoring, and preliminary intervention, all while upholding stringent data privacy standards. We provided detailed discussions on the system's architecture, hardware configuration, software implementation, and clinical performance. The results highlight the potential of this system to improve patient care through secure and efficient ECG monitoring and analysis. This work represents a significant leap forward in medical technology. By incorporating FHE into both data transmission and storage processes, we ensure continuous encryption of data throughout its lifecycle while enabling real time disease diagnosis.
Title:
Advancing Biomedical Signal Security: Real-Time ECG Monitoring with Chaotic Encryption
Abstract
The real time analysis and secure transmission of electrocardiogram (ECG) signals are critical for ensuring both effective medical diagnosis and patient data privacy. In this study, we developed a real time ECG monitoring system that integrates chaotic encryption to protect the integrity and confidentiality of ECG signals during acquisition, transmission, and storage. By leveraging the logistic map as the chaotic function for encryption, our system offers a highly secure framework that dynamically encrypts ECG signals without adding significant latency. To validate the system's reliability, we applied a series of security tests. The results demonstrate that chaotic encryption is effective in enhancing data security, as evidenced by high entropy values and strong key sensitivity, ensuring protection against common cryptographic attacks. Additionally, the system's real time disease detection model, based on deep learning, operates seamlessly with encrypted data, providing accurate diagnosis without compromising security. Our findings indicate that chaotic encryption, paired with real time analysis, is a powerful method for protecting sensitive medical data, making this approach particularly relevant for telemedicine and remote patient monitoring applications. The success of this system highlights its potential for broader application to other biomedical signals, providing a secure infrastructure for the future of digital health.
Title:
False Data Injection Attack Detection in Edge-based Smart Metering Networks with Federated Learning
Authors: Md Raihan Uddin, Ratun Rahman, Dinh C. Nguyen
Abstract
Smart metering networks are increasingly susceptible to cyber threats, where false data injection (FDI) appears as a critical attack. Data-driven-based machine learning (ML) methods have shown immense benefits in detecting FDI attacks via data learning and prediction abilities. Literature works have mostly focused on centralized learning and deploying FDI attack detection models at the control center, which requires data collection from local utilities like meters and transformers. However, this data sharing may raise privacy concerns due to the potential disclosure of household information like energy usage patterns. This paper proposes a new privacy-preserved FDI attack detection by developing an efficient federated learning (FL) framework in the smart meter network with edge computing. Distributed edge servers located at the network edge run an ML-based FDI attack detection model and share the trained model with the grid operator, aiming to build a strong FDI attack detection model without data sharing. Simulation results demonstrate the efficiency of our proposed FL method over the conventional method without collaboration.
Title:
Optimizing Violence Detection in Video Classification Accuracy through 3D Convolutional Neural Networks
Authors: Aarjav Kavathia, Simeon Sayer
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
As violent crimes continue to happen, it becomes necessary to have security cameras that can rapidly identify moments of violence with excellent accuracy. The purpose of this study is to identify how many frames should be analyzed at a time in order to optimize a violence detection model's accuracy as a parameter of the depth of a 3D convolutional network. Previous violence classification models have been created, but their application to live footage may be flawed. In this project, a convolutional neural network was created to analyze optical flow frames of each video. The number of frames analyzed at a time would vary with one, two, three, ten, and twenty frames, and each model would be trained for 20 epochs. The greatest validation accuracy was 94.87% and occurred with the model that analyzed three frames at a time. This means that machine learning models to detect violence may function better when analyzing three frames at a time for this dataset. The methodology used to identify the optimal number of frames to analyze at a time could be used in other applications of video classification, especially those of complex or abstract actions, such as violence.
Title:
Study of Iterative Detection and Decoding for RIS-Aided Multiuser Multi-Antenna Systems
Authors: R. Porto, R. C. de Lamare
Subjects: Subjects:
Information Theory (cs.IT); Signal Processing (eess.SP)
Abstract
We present a novel iterative detection and decoding (IDD) scheme for Reconfigurable Intelligent Surface (RIS)-assisted multiuser multiple-antenna systems. The proposed approach introduces a joint iterative detection strategy that integrates Low-Density Parity-Check (LDPC) codes, RIS processing and iterative detection and decoding. In particular, we employ a minimum mean square error receive filter that performs truncation at the RIS and soft interference cancelation at the receiver. Simulation results evaluate the system's overall capacity and bit error rate, and demonstrate substantial improvements in bit error rate across block-fading channels.
Title:
Centrality in Collaboration: A Novel Algorithm for Social Partitioning Gradients in Community Detection for Multiple Oncology Clinical Trial Enrollments
Authors: Benjamin Smith, Tyler Pittman, Wei Xu
Subjects: Subjects:
Social and Information Networks (cs.SI); Methodology (stat.ME); Other Statistics (stat.OT)
Abstract
Patients at a comprehensive cancer center who do not achieve cure or remission following standard treatments often become candidates for clinical trials. Patients who participate in a clinical trial may be suitable for other studies. A key factor influencing patient enrollment in subsequent clinical trials is the structured collaboration between oncologists and most responsible physicians. Possible identification of these collaboration networks can be achieved through the analysis of patient movements between clinical trial intervention types with social network analysis and community detection algorithms. In the detection of oncologist working groups, the present study evaluates three community detection algorithms: Girvan-Newman, Louvain and an algorithm developed by the author. Girvan-Newman identifies each intervention as their own community, while Louvain groups interventions in a manner that is difficult to interpret. In contrast, the author's algorithm groups interventions in a way that is both intuitive and informative, with a gradient effect that is particularly useful for epidemiological research. This lays the groundwork for future subgroup analysis of clustered interventions.
Title:
Mapping Global Floods with 10 Years of Satellite Radar Data
Authors: Amit Misra, Kevin White, Simone Fobi Nsutezo, William Straka, Juan Lavista
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Floods cause extensive global damage annually, making effective monitoring essential. While satellite observations have proven invaluable for flood detection and tracking, comprehensive global flood datasets spanning extended time periods remain scarce. In this study, we introduce a novel deep learning flood detection model that leverages the cloud-penetrating capabilities of Sentinel-1 Synthetic Aperture Radar (SAR) satellite imagery, enabling consistent flood extent mapping in any weather condition. By applying this model to nearly 10 years of SAR data, we create a unique, longitudinal global flood extent dataset with predictions unaffected by cloud coverage, offering comprehensive and consistent insights into historically flood-prone areas over the past decade. We use our model predictions to identify historically flood-prone areas in Ethiopia and demonstrate real-time disaster response capabilities during the May 2024 floods in Kenya. Additionally, our longitudinal analysis reveals potential increasing trends in global flood extent over time, although further validation is required to explore links to climate change. To maximize impact, we provide public access to both our model predictions and a code repository, empowering researchers and practitioners worldwide to advance flood monitoring and enhance disaster response strategies.
Title:
Effective Community Detection Over Streaming Bipartite Networks (Technical Report)
Authors: Nan Zhang, Yutong Ye, Yuyang Wang Xiang Lian, Mingsong Chen
Subjects: Subjects:
Social and Information Networks (cs.SI); Databases (cs.DB)
Abstract
The streaming bipartite graph is extensively used to model the dynamic relationship between two types of entities in many real-world applications, such as movie recommendations, location-based services, and online shopping. Since it contains abundant information, discovering the dense subgraph with high structural cohesiveness (i.e., community detection) in the bipartite streaming graph is becoming a valuable problem. Inspired by this, in this paper, we study the structure of community on the butterfly motif in the bipartite graph. We propose a novel problem, named Community Detection over Streaming Bipartite Network (CD-SBN), which aims to retrieve qualified communities with user-specific query keywords and high structural cohesiveness at snapshot and continuous scenarios. In particular, we formulate the user relationship score in the weighted bipartite network via the butterfly pattern and define a novel $(k,r,\sigma)$-bitruss as the community structure. To efficiently tackle the CD-SBN problem, we design effective pruning strategies to rule out false alarms of $(k,r,\sigma)$-bitruss and propose a hierarchical synopsis to facilitate the CD-SBN processing. Due to the dynamic of streaming bipartite networks, we devise an efficient procedure for incremental graph maintenance. We develop an efficient algorithm to answer the snapshot and continuous CD-SBN query by traversing the synopsis and applying the pruning strategies. With extensive experiments, we demonstrate the efficiency and effectiveness of our proposed CD-SBN processing approach over real/synthetic streaming bipartite networks.
Title:
Efficient Deep Learning Infrastructures for Embedded Computing Systems: A Comprehensive Survey and Future Envision
Authors: Xiangzhong Luo, Di Liu, Hao Kong, Shuo Huai, Hui Chen, Guochu Xiong, Weichen Liu
Abstract
Deep neural networks (DNNs) have recently achieved impressive success across a wide range of real-world vision and language processing tasks, spanning from image classification to many other downstream vision tasks, such as object detection, tracking, and segmentation. However, previous well-established DNNs, despite being able to maintain superior accuracy, have also been evolving to be deeper and wider and thus inevitably necessitate prohibitive computational resources for both training and inference. This trend further enlarges the computational gap between computation-intensive DNNs and resource-constrained embedded computing systems, making it challenging to deploy powerful DNNs upon real-world embedded computing systems towards ubiquitous embedded intelligence. To alleviate the above computational gap and enable ubiquitous embedded intelligence, we, in this survey, focus on discussing recent efficient deep learning infrastructures for embedded computing systems, spanning from training to inference, from manual to automated, from convolutional neural networks to transformers, from transformers to vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. Specifically, we discuss recent efficient deep learning infrastructures for embedded computing systems from the lens of (1) efficient manual network design for embedded computing systems, (2) efficient automated network design for embedded computing systems, (3) efficient network compression for embedded computing systems, (4) efficient on-device learning for embedded computing systems, (5) efficient large language models for embedded computing systems, (6) efficient deep learning software and hardware for embedded computing systems, and (7) efficient intelligent applications for embedded computing systems.
Title:
A Visual Question Answering Method for SAR Ship: Breaking the Requirement for Multimodal Dataset Construction and Model Fine-Tuning
Abstract
Current visual question answering (VQA) tasks often require constructing multimodal datasets and fine-tuning visual language models, which demands significant time and resources. This has greatly hindered the application of VQA to downstream tasks, such as ship information analysis based on Synthetic Aperture Radar (SAR) imagery. To address this challenge, this letter proposes a novel VQA approach that integrates object detection networks with visual language models, specifically designed for analyzing ships in SAR images. This integration aims to enhance the capabilities of VQA systems, focusing on aspects such as ship location, density, and size analysis, as well as risk behavior detection. Initially, we conducted baseline experiments using YOLO networks on two representative SAR ship detection datasets, SSDD and HRSID, to assess each model's performance in terms of detection accuracy. Based on these results, we selected the optimal model, YOLOv8n, as the most suitable detection network for this task. Subsequently, leveraging the vision-language model Qwen2-VL, we designed and implemented a VQA task specifically for SAR scenes. This task employs the ship location and size information output by the detection network to generate multi-turn dialogues and scene descriptions for SAR imagery. Experimental results indicate that this method not only enables fundamental SAR scene question-answering without the need for additional datasets or fine-tuning but also dynamically adapts to complex, multi-turn dialogue requirements, demonstrating robust semantic understanding and adaptability.
Abstract
Federated learning (FL), with the growing IoT and edge computing, is seen as a promising solution for applications that are latency- and privacy-aware. However, due to the widespread dispersion of data across many clients, it is challenging to monitor client anomalies caused by malfunctioning devices or unexpected events. The majority of FL solutions now in use concentrate on the classification problem, ignoring situations in which anomaly detection may also necessitate privacy preservation and effectiveness. The system in federated learning is unable to manage the potentially flawed behavior of its clients completely. These behaviors include sharing arbitrary parameter values and causing a delay in convergence since clients are chosen at random without knowing the malfunctioning behavior of the client. Client selection is crucial in terms of the efficiency of the federated learning framework. The challenges such as client drift and handling slow clients with low computational capability are well-studied in FL. However, the detection of anomalous clients either for security or for overall performance in the FL frameworks is hardly studied in the literature. In this paper, we propose an anomaly client detection algorithm to overcome malicious client attacks and client drift in FL frameworks. Instead of random client selection, our proposed method utilizes anomaly client detection to remove clients from the FL framework, thereby enhancing the security and efficiency of the overall system. This proposed method improves the global model convergence in almost 50\% fewer communication rounds compared with widely used random client selection using the MNIST dataset.
Title:
Polar R-CNN: End-to-End Lane Detection with Fewer Anchors
Authors: Shengqi Wang, Junmin Liu, Xiangyong Cao, Zengjie Song, Kai Sun
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Lane detection is a critical and challenging task in autonomous driving, particularly in real-world scenarios where traffic lanes can be slender, lengthy, and often obscured by other vehicles, complicating detection efforts. Existing anchor-based methods typically rely on prior lane anchors to extract features and subsequently refine the location and shape of lanes. While these methods achieve high performance, manually setting prior anchors is cumbersome, and ensuring sufficient coverage across diverse datasets often requires a large amount of dense anchors. Furthermore, the use of Non-Maximum Suppression (NMS) to eliminate redundant predictions complicates real-world deployment and may underperform in complex scenarios. In this paper, we propose Polar R-CNN, an end-to-end anchor-based method for lane detection. By incorporating both local and global polar coordinate systems, Polar R-CNN facilitates flexible anchor proposals and significantly reduces the number of anchors required without compromising this http URL, we introduce a triplet head with heuristic structure that supports NMS-free paradigm, enhancing deployment efficiency and performance in scenarios with dense this http URL method achieves competitive results on five popular lane detection benchmarks--Tusimple, CULane,LLAMAS, CurveLanes, and DL-Rai--while maintaining a lightweight design and straightforward structure. Our source code is available at this https URL.
Title:
Building the Self-Improvement Loop: Error Detection and Correction in Goal-Oriented Semantic Communications
Authors: Peizheng Li, Xinyi Lin, Adnan Aijaz
Subjects: Subjects:
Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Abstract
Error detection and correction are essential for ensuring robust and reliable operation in modern communication systems, particularly in complex transmission environments. However, discussions on these topics have largely been overlooked in semantic communication (SemCom), which focuses on transmitting meaning rather than symbols, leading to significant improvements in communication efficiency. Despite these advantages, semantic errors -- stemming from discrepancies between transmitted and received meanings -- present a major challenge to system reliability. This paper addresses this gap by proposing a comprehensive framework for detecting and correcting semantic errors in SemCom systems. We formally define semantic error, detection, and correction mechanisms, and identify key sources of semantic errors. To address these challenges, we develop a Gaussian process (GP)-based method for latent space monitoring to detect errors, alongside a human-in-the-loop reinforcement learning (HITL-RL) approach to optimize semantic model configurations using user feedback. Experimental results validate the effectiveness of the proposed methods in mitigating semantic errors under various conditions, including adversarial attacks, input feature changes, physical channel variations, and user preference shifts. This work lays the foundation for more reliable and adaptive SemCom systems with robust semantic error management techniques.
Title:
One for All: Multi-Domain Joint Training for Point Cloud Based 3D Object Detection
Authors: Zhenyu Wang, Yali Li, Hengshuang Zhao, Shengjin Wang
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
The current trend in computer vision is to utilize one universal model to address all various tasks. Achieving such a universal model inevitably requires incorporating multi-domain data for joint training to learn across multiple problem scenarios. In point cloud based 3D object detection, however, such multi-domain joint training is highly challenging, because large domain gaps among point clouds from different datasets lead to the severe domain-interference problem. In this paper, we propose \textbf{OneDet3D}, a universal one-for-all model that addresses 3D detection across different domains, including diverse indoor and outdoor scenes, within the \emph{same} framework and only \emph{one} set of parameters. We propose the domain-aware partitioning in scatter and context, guided by a routing mechanism, to address the data interference issue, and further incorporate the text modality for a language-guided classification to unify the multi-dataset label spaces and mitigate the category interference issue. The fully sparse structure and anchor-free head further accommodate point clouds with significant scale disparities. Extensive experiments demonstrate the strong universal ability of OneDet3D to utilize only one trained model for addressing almost all 3D object detection tasks.
Abstract
Current mainstream SAR image object detection methods still lack robustness when dealing with unknown objects in open environments. Open-set detection aims to enable detectors trained on a closed set to detect all known objects and identify unknown objects in open-set environments. The key challenges are how to improve the generalization to potential unknown objects and reduce the empirical classification risk of known categories under strong supervision. To address these challenges, a novel open-set aircraft detector for SAR images is proposed, named Open-Set Aircraft Detection (OSAD), which is equipped with three dedicated components: global context modeling (GCM), location quality-driven pseudo labeling generation (LPG), and prototype contrastive learning (PCL). GCM effectively enhances the network's representation of objects by attention maps which is formed through the capture of long sequential positional relationships. LPG leverages clues about object positions and shapes to optimize localization quality, avoiding overfitting to known category information and enhancing generalization to potential unknown objects. PCL employs prototype-based contrastive encoding loss to promote instance-level intra-class compactness and inter-class variance, aiming to minimize the overlap between known and unknown distributions and reduce the empirical classification risk of known categories. Extensive experiments have demonstrated that the proposed method can effectively detect unknown objects and exhibit competitive performance without compromising closed-set performance. The highest absolute gain which ranges from 0 to 18.36% can be achieved on the average precision of unknown objects.
Title:
ROAD-Waymo: Action Awareness at Scale for Autonomous Driving
Authors: Salman Khan, Izzeddin Teeti, Reza Javanmard Alitappeh, Mihaela C. Stoian, Eleonora Giunchiglia, Gurkirt Singh, Andrew Bradley, Fabio Cuzzolin
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Autonomous Vehicle (AV) perception systems require more than simply seeing, via e.g., object detection or scene segmentation. They need a holistic understanding of what is happening within the scene for safe interaction with other road users. Few datasets exist for the purpose of developing and training algorithms to comprehend the actions of other road users. This paper presents ROAD-Waymo, an extensive dataset for the development and benchmarking of techniques for agent, action, location and event detection in road scenes, provided as a layer upon the (US) Waymo Open dataset. Considerably larger and more challenging than any existing dataset (and encompassing multiple cities), it comes with 198k annotated video frames, 54k agent tubes, 3.9M bounding boxes and a total of 12.4M labels. The integrity of the dataset has been confirmed and enhanced via a novel annotation pipeline designed for automatically identifying violations of requirements specifically designed for this dataset. As ROAD-Waymo is compatible with the original (UK) ROAD dataset, it provides the opportunity to tackle domain adaptation between real-world road scenarios in different countries within a novel benchmark: ROAD++.
Title:
TabSec: A Collaborative Framework for Novel Insider Threat Detection
Abstract
In the era of the Internet of Things (IoT) and data sharing, users frequently upload their personal information to enterprise databases to enjoy enhanced service experiences provided by various online services. However, the widespread presence of system vulnerabilities, remote network intrusions, and insider threats significantly increases the exposure of private enterprise data on the internet. If such data is stolen or leaked by attackers, it can result in severe asset losses and business operation disruptions. To address these challenges, this paper proposes a novel threat detection framework, TabITD. This framework integrates Intrusion Detection Systems (IDS) with User and Entity Behavior Analytics (UEBA) strategies to form a collaborative detection system that bridges the gaps in existing systems' capabilities. It effectively addresses the blurred boundaries between external and insider threats caused by the diversification of attack methods, thereby enhancing the model's learning ability and overall detection performance. Moreover, the proposed method leverages the TabNet architecture, which employs a sparse attention feature selection mechanism that allows TabNet to select the most relevant features at each decision step, thereby improving the detection of rare-class attacks. We evaluated our proposed solution on two different datasets, achieving average accuracies of 96.71% and 97.25%, respectively. The results demonstrate that this approach can effectively detect malicious behaviors such as masquerade attacks and external threats, significantly enhancing network security defenses and the efficiency of network attack detection.
Title:
Minder: Faulty Machine Detection for Large-scale Distributed Model Training
Authors: Yangtao Deng, Xiang Shi, Zhuo Jiang, Xingjian Zhang, Lei Zhang, Zhang Zhang, Bo Li, Zuquan Song, Hang Zhu, Gaohong Liu, Fuliang Li, Shuguang Wang, Haibin Lin, Jianxi Ye, Minlan Yu
Abstract
Large-scale distributed model training requires simultaneous training on up to thousands of machines. Faulty machine detection is critical when an unexpected fault occurs in a machine. From our experience, a training task can encounter two faults per day on average, possibly leading to a halt for hours. To address the drawbacks of the time-consuming and labor-intensive manual scrutiny, we propose Minder, an automatic faulty machine detector for distributed training tasks. The key idea of Minder is to automatically and efficiently detect faulty distinctive monitoring metric patterns, which could last for a period before the entire training task comes to a halt. Minder has been deployed in our production environment for over one year, monitoring daily distributed training tasks where each involves up to thousands of machines. In our real-world fault detection scenarios, Minder can accurately and efficiently react to faults within 3.6 seconds on average, with a precision of 0.904 and F1-score of 0.893.
Title:
High-Pass Graph Convolutional Network for Enhanced Anomaly Detection: A Novel Approach
Abstract
Graph Convolutional Network (GCN) are widely used in Graph Anomaly Detection (GAD) due to their natural compatibility with graph structures, resulting in significant performance improvements. However, most researchers approach GAD as a graph node classification task and often rely on low-pass filters or feature aggregation from neighboring nodes. This paper proposes a novel approach by introducing a High-Pass Graph Convolution Network (HP-GCN) for GAD. The proposed HP-GCN leverages high-frequency components to detect anomalies, as anomalies tend to increase high-frequency signals within the network of normal nodes. Additionally, isolated nodes, which lack interactions with other nodes, present a challenge for Graph Neural Network (GNN). To address this, the model segments the graph into isolated nodes and nodes within connected subgraphs. Isolated nodes learn their features through Multi-Layer Perceptron (MLP), enhancing detection accuracy. The model is evaluated and validated on YelpChi, Amazon, T-Finance, and T-Social datasets. The results showed that the proposed HP-GCN can achieve anomaly detection accuracy of 96.10%, 98.16%, 96.46%, and 98.94%, respectively. The findings demonstrate that the HP-GCN outperforms existing GAD methods based on spatial domain GNN as well as those using low-pass and band-pass filters in spectral domain GCN. The findings underscore the effectiveness of this method in improving anomaly detection performance. Source code can be found at: this https URL.
Title:
TriG-NER: Triplet-Grid Framework for Discontinuous Named Entity Recognition
Abstract
Discontinuous Named Entity Recognition (DNER) presents a challenging problem where entities may be scattered across multiple non-adjacent tokens, making traditional sequence labelling approaches inadequate. Existing methods predominantly rely on custom tagging schemes to handle these discontinuous entities, resulting in models tightly coupled to specific tagging strategies and lacking generalisability across diverse datasets. To address these challenges, we propose TriG-NER, a novel Triplet-Grid Framework that introduces a generalisable approach to learning robust token-level representations for discontinuous entity extraction. Our framework applies triplet loss at the token level, where similarity is defined by word pairs existing within the same entity, effectively pulling together similar and pushing apart dissimilar ones. This approach enhances entity boundary detection and reduces the dependency on specific tagging schemes by focusing on word-pair relationships within a flexible grid structure. We evaluate TriG-NER on three benchmark DNER datasets and demonstrate significant improvements over existing grid-based architectures. These results underscore our framework's effectiveness in capturing complex entity structures and its adaptability to various tagging schemes, setting a new benchmark for discontinuous entity extraction.
Title:
DeMod: A Holistic Tool with Explainable Detection and Personalized Modification for Toxicity Censorship
Authors: Yaqiong Li, Peng Zhang, Hansu Gu, Tun Lu, Siyuan Qiao, Yubo Shu, Yiyang Shao, Ning Gu
Abstract
Although there have been automated approaches and tools supporting toxicity censorship for social posts, most of them focus on detection. Toxicity censorship is a complex process, wherein detection is just an initial task and a user can have further needs such as rationale understanding and content modification. For this problem, we conduct a needfinding study to investigate people's diverse needs in toxicity censorship and then build a ChatGPT-based censorship tool named DeMod accordingly. DeMod is equipped with the features of explainable Detection and personalized Modification, providing fine-grained detection results, detailed explanations, and personalized modification suggestions. We also implemented the tool and recruited 35 Weibo users for evaluation. The results suggest DeMod's multiple strengths like the richness of functionality, the accuracy of censorship, and ease of use. Based on the findings, we further propose several insights into the design of content censorship systems.
Title:
KptLLM: Unveiling the Power of Large Language Model for Keypoint Comprehension
Authors: Jie Yang, Wang Zeng, Sheng Jin, Lumin Xu, Wentao Liu, Chen Qian, Ruimao Zhang
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have greatly improved their abilities in image understanding. However, these models often struggle with grasping pixel-level semantic details, e.g., the keypoints of an object. To bridge this gap, we introduce the novel challenge of Semantic Keypoint Comprehension, which aims to comprehend keypoints across different task scenarios, including keypoint semantic understanding, visual prompt-based keypoint detection, and textual prompt-based keypoint detection. Moreover, we introduce KptLLM, a unified multimodal model that utilizes an identify-then-detect strategy to effectively address these challenges. KptLLM underscores the initial discernment of semantics in keypoints, followed by the precise determination of their positions through a chain-of-thought process. With several carefully designed modules, KptLLM adeptly handles various modality inputs, facilitating the interpretation of both semantic contents and keypoint locations. Our extensive experiments demonstrate KptLLM's superiority in various keypoint detection benchmarks and its unique semantic capabilities in interpreting keypoints.
Title:
Silver medal Solution for Image Matching Challenge 2024
Abstract
Image Matching Challenge 2024 is a competition focused on building 3D maps from diverse image sets, requiring participants to solve fundamental computer vision challenges in image matching across varying angles, lighting, and seasonal changes. This project develops a Pipeline method that combines multiple advanced techniques: using pre-trained EfficientNet-B7 for initial feature extraction and cosine distance-based image pair filtering, employing both KeyNetAffNetHardNet and SuperPoint for keypoint feature extraction, utilizing AdaLAM and SuperGlue for keypoint matching, and finally applying Pycolmap for 3D spatial analysis. The methodology achieved an excellent score of 0.167 on the private leaderboard, with experimental results demonstrating that the combination of KeyNetAffNetHardNet and SuperPoint provides significant advantages in keypoint detection and matching, particularly when dealing with challenging variations in surface texture and environmental conditions that typically degrade traditional algorithm performance.
Title:
LiDAttack: Robust Black-box Attack on LiDAR-based Object Detection
Abstract
Since DNN is vulnerable to carefully crafted adversarial examples, adversarial attack on LiDAR sensors have been extensively studied. We introduce a robust black-box attack dubbed LiDAttack. It utilizes a genetic algorithm with a simulated annealing strategy to strictly limit the location and number of perturbation points, achieving a stealthy and effective attack. And it simulates scanning deviations, allowing it to adapt to dynamic changes in real world scenario variations. Extensive experiments are conducted on 3 datasets (i.e., KITTI, nuScenes, and self-constructed data) with 3 dominant object detection models (i.e., PointRCNN, PointPillar, and PV-RCNN++). The results reveal the efficiency of the LiDAttack when targeting a wide range of object detection models, with an attack success rate (ASR) up to 90%.
Title:
Exploiting Contextual Uncertainty of Visual Data for Efficient Training of Deep Models
Authors: Sharat Agarwal
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Objects, in the real world, rarely occur in isolation and exhibit typical arrangements governed by their independent utility, and their expected interaction with humans and other objects in the context. For example, a chair is expected near a table, and a computer is expected on top. Humans use this spatial context and relative placement as an important cue for visual recognition in case of ambiguities. Similar to human's, DNN's exploit contextual information from data to learn representations. Our research focuses on harnessing the contextual aspects of visual data to optimize data annotation and enhance the training of deep networks. Our contributions can be summarized as follows: (1) We introduce the notion of contextual diversity for active learning CDAL and show its applicability in three different visual tasks semantic segmentation, object detection and image classification, (2) We propose a data repair algorithm to curate contextually fair data to reduce model bias, enabling the model to detect objects out of their obvious context, (3) We propose Class-based annotation, where contextually relevant classes are selected that are complementary for model training under domain shift. Understanding the importance of well-curated data, we also emphasize the necessity of involving humans in the loop to achieve accurate annotations and to develop novel interaction strategies that allow humans to serve as fact-checkers. In line with this we are working on developing image retrieval system for wildlife camera trap images and reliable warning system for poor quality rural roads. For large-scale annotation, we are employing a strategic combination of human expertise and zero-shot models, while also integrating human input at various stages for continuous feedback.
Title:
HACD: Harnessing Attribute Semantics and Mesoscopic Structure for Community Detection
Authors: Anran Zhang, Xingfen Wang, Yuhan Zhao
Subjects: Subjects:
Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Abstract
Community detection plays a pivotal role in uncovering closely connected subgraphs, aiding various real-world applications such as recommendation systems and anomaly detection. With the surge of rich information available for entities in real-world networks, the community detection problem in attributed networks has attracted widespread attention. While previous research has effectively leveraged network topology and attribute information for attributed community detection, these methods overlook two critical issues: (i) the semantic similarity between node attributes within the community, and (ii) the inherent mesoscopic structure, which differs from the pairwise connections of the micro-structure. To address these limitations, we propose HACD, a novel attributed community detection model based on heterogeneous graph attention networks. HACD treats node attributes as another type of node, constructs attributed networks into heterogeneous graph structures and employs attribute-level attention mechanisms to capture semantic similarity. Furthermore, HACD introduces a community membership function to explore mesoscopic community structures, enhancing the robustness of detected communities. Extensive experiments demonstrate the effectiveness and efficiency of HACD, outperforming state-of-the-art methods in attributed community detection tasks. Our code is publicly available at this https URL.
Title:
Deep Learning for Leopard Individual Identification: An Adaptive Angular Margin Approach
Authors: David Colomer Matachana
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Accurate identification of individual leopards across camera trap images is critical for population monitoring and ecological studies. This paper introduces a deep learning framework to distinguish between individual leopards based on their unique spot patterns. This approach employs a novel adaptive angular margin method in the form of a modified CosFace architecture. In addition, I propose a preprocessing pipeline that combines RGB channels with an edge detection channel to underscore the critical features learned by the model. This approach significantly outperforms the Triplet Network baseline, achieving a Dynamic Top-5 Average Precision of 0.8814 and a Top-5 Rank Match Detection of 0.9533, demonstrating its potential for open-set learning in wildlife identification. While not surpassing the performance of the SIFT-based Hotspotter algorithm, this method represents a substantial advancement in applying deep learning to patterned wildlife identification. This research contributes to the field of computer vision and provides a valuable tool for biologists aiming to study and protect leopard populations. It also serves as a stepping stone for applying the power of deep learning in Capture-Recapture studies for other patterned species.
Title:
V-CAS: A Realtime Vehicle Anti Collision System Using Vision Transformer on Multi-Camera Streams
Authors: Muhammad Waqas Ashraf, Ali Hassan, Imad Ali Shah
Abstract
This paper introduces a real-time Vehicle Collision Avoidance System (V-CAS) designed to enhance vehicle safety through adaptive braking based on environmental perception. V-CAS leverages the advanced vision-based transformer model RT-DETR, DeepSORT tracking, speed estimation, brake light detection, and an adaptive braking mechanism. It computes a composite collision risk score based on vehicles' relative accelerations, distances, and detected braking actions, using brake light signals and trajectory data from multiple camera streams to improve scene perception. Implemented on the Jetson Orin Nano, V-CAS enables real-time collision risk assessment and proactive mitigation through adaptive braking. A comprehensive training process was conducted on various datasets for comparative analysis, followed by fine-tuning the selected object detection model using transfer learning. The system's effectiveness was rigorously evaluated on the Car Crash Dataset (CCD) from YouTube and through real-time experiments, achieving over 98% accuracy with an average proactive alert time of 1.13 seconds. Results indicate significant improvements in object detection and tracking, enhancing collision avoidance compared to traditional single-camera methods. This research demonstrates the potential of low-cost, multi-camera embedded vision transformer systems to advance automotive safety through enhanced environmental perception and proactive collision avoidance mechanisms.
Abstract
Deep neural networks (DNNs) often suffer from the overconfidence issue, where incorrect predictions are made with high confidence scores, hindering the applications in critical systems. In this paper, we propose a novel approach called Typicalness-Aware Learning (TAL) to address this issue and improve failure detection performance. We observe that, with the cross-entropy loss, model predictions are optimized to align with the corresponding labels via increasing logit magnitude or refining logit direction. However, regarding atypical samples, the image content and their labels may exhibit disparities. This discrepancy can lead to overfitting on atypical samples, ultimately resulting in the overconfidence issue that we aim to address. To tackle the problem, we have devised a metric that quantifies the typicalness of each sample, enabling the dynamic adjustment of the logit magnitude during the training process. By allowing atypical samples to be adequately fitted while preserving reliable logit direction, the problem of overconfidence can be mitigated. TAL has been extensively evaluated on benchmark datasets, and the results demonstrate its superiority over existing failure detection methods. Specifically, TAL achieves a more than 5% improvement on CIFAR100 in terms of the Area Under the Risk-Coverage Curve (AURC) compared to the state-of-the-art. Code is available at this https URL.
Title:
Tree level change detection over Ahmedabad city using very high resolution satellite images and Deep Learning
Authors: Jai G Singla, Gautam Jaiswal
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
In this study, 0.5m high resolution satellite datasets over Indian urban region was used to demonstrate the applicability of deep learning models over Ahmedabad, India. Here, YOLOv7 instance segmentation model was trained on well curated trees canopy dataset (6500 images) in order to carry out the change detection. During training, evaluation metrics such as bounding box regression and mask regression loss, mean average precision (mAP) and stochastic gradient descent algorithm were used for evaluating and optimizing the performance of model. After the 500 epochs, the mAP of 0.715 and 0.699 for individual tree detection and tree canopy mask segmentation were obtained. However, by further tuning hyper parameters of the model, maximum accuracy of 80 % of trees detection with false segmentation rate of 2% on data was obtained.
Title:
Exploiting Unlabeled Data with Multiple Expert Teachers for Open Vocabulary Aerial Object Detection and Its Orientation Adaptation
Authors: Yan Li, Weiwei Guo, Xue Yang, Ning Liao, Shaofeng Zhang, Yi Yu, Wenxian Yu, Junchi Yan
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
In recent years, aerial object detection has been increasingly pivotal in various earth observation applications. However, current algorithms are limited to detecting a set of pre-defined object categories, demanding sufficient annotated training samples, and fail to detect novel object categories. In this paper, we put forth a novel formulation of the aerial object detection problem, namely open-vocabulary aerial object detection (OVAD), which can detect objects beyond training categories without costly collecting new labeled data. We propose CastDet, a CLIP-activated student-teacher detection framework that serves as the first OVAD detector specifically designed for the challenging aerial scenario, where objects often exhibit weak appearance features and arbitrary orientations. Our framework integrates a robust localization teacher along with several box selection strategies to generate high-quality proposals for novel objects. Additionally, the RemoteCLIP model is adopted as an omniscient teacher, which provides rich knowledge to enhance classification capabilities for novel categories. A dynamic label queue is devised to maintain high-quality pseudo-labels during training. By doing so, the proposed CastDet boosts not only novel object proposals but also classification. Furthermore, we extend our approach from horizontal OVAD to oriented OVAD with tailored algorithm designs to effectively manage bounding box representation and pseudo-label generation. Extensive experiments for both tasks on multiple existing aerial object detection datasets demonstrate the effectiveness of our approach. The code is available at this https URL.
Title:
Differentially Private Integrated Decision Gradients (IDG-DP) for Radar-based Human Activity Recognition
Authors: Idris Zakariyya, Linda Tran, Kaushik Bhargav Sivangi, Paul Henderson, Fani Deligianni
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract
Human motion analysis offers significant potential for healthcare monitoring and early detection of diseases. The advent of radar-based sensing systems has captured the spotlight for they are able to operate without physical contact and they can integrate with pre-existing Wi-Fi networks. They are also seen as less privacy-invasive compared to camera-based systems. However, recent research has shown high accuracy in recognizing subjects or gender from radar gait patterns, raising privacy concerns. This study addresses these issues by investigating privacy vulnerabilities in radar-based Human Activity Recognition (HAR) systems and proposing a novel method for privacy preservation using Differential Privacy (DP) driven by attributions derived with Integrated Decision Gradient (IDG) algorithm. We investigate Black-box Membership Inference Attack (MIA) Models in HAR settings across various levels of attacker-accessible information. We extensively evaluated the effectiveness of the proposed IDG-DP method by designing a CNN-based HAR model and rigorously assessing its resilience against MIAs. Experimental results demonstrate the potential of IDG-DP in mitigating privacy attacks while maintaining utility across all settings, particularly excelling against label-only and shadow model black-box MIA attacks. This work represents a crucial step towards balancing the need for effective radar-based HAR with robust privacy protection in healthcare environments.
Title:
Unsupervised detection of semantic correlations in big data
Authors: Santiago Acevedo, Alex Rodriguez, Alessandro Laio
Abstract
In real-world data, information is stored in extremely large feature vectors. These variables are typically correlated due to complex interactions involving many features simultaneously. Such correlations qualitatively correspond to semantic roles and are naturally recognized by both the human brain and artificial neural networks. This recognition enables, for instance, the prediction of missing parts of an image or text based on their context. We present a method to detect these correlations in high-dimensional data represented as binary numbers. We estimate the binary intrinsic dimension of a dataset, which quantifies the minimum number of independent coordinates needed to describe the data, and is therefore a proxy of semantic complexity. The proposed algorithm is largely insensitive to the so-called curse of dimensionality, and can therefore be used in big data analysis. We test this approach identifying phase transitions in model magnetic systems and we then apply it to the detection of semantic correlations of images and text inside deep neural networks.
Title:
Advanced computer vision for extracting georeferenced vehicle trajectories from drone imagery
Authors: Robert Fonod, Haechan Cho, Hwasoo Yeo, Nikolas Geroliminis
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
This paper presents a framework for extracting georeferenced vehicle trajectories from high-altitude drone footage, addressing key challenges in urban traffic monitoring and limitations of traditional ground-based systems. We employ state-of-the-art computer vision and deep learning to create an end-to-end pipeline that enhances vehicle detection, tracking, and trajectory stabilization. Conducted in the Songdo International Business District, South Korea, the study used a multi-drone experiment over 20 intersections, capturing approximately 12TB of 4K video data over four days. We developed a novel track stabilization method that uses detected vehicle bounding boxes as exclusion masks during image registration, which, combined with advanced georeferencing techniques, accurately transforms vehicle coordinates into real-world geographical data. Additionally, our framework includes robust vehicle dimension estimation and detailed road segmentation for in-depth traffic analysis. The framework produced two high-quality datasets: the Songdo Traffic dataset, comprising nearly 1 million unique vehicle trajectories, and the Songdo Vision dataset, containing over 5,000 human-annotated frames with about 300,000 vehicle instances in four classes. Comparisons between drone-derived data and high-precision sensor data from an instrumented probe vehicle highlight the accuracy and consistency of our framework's extraction in dense urban settings. By publicly releasing these datasets and the pipeline source code, this work sets new benchmarks for data quality, reproducibility, and scalability in traffic research. Results demonstrate the potential of integrating drone technology with advanced computer vision for precise, cost-effective urban traffic monitoring, providing valuable resources for the research community to develop intelligent transportation systems and improve traffic management strategies.
Title:
SIRA: Scalable Inter-frame Relation and Association for Radar Perception
Authors: Ryoma Yataka, Pu Perry Wang, Petros Boufounos, Ryuhei Takahashi
Abstract
Conventional radar feature extraction faces limitations due to low spatial resolution, noise, multipath reflection, the presence of ghost targets, and motion blur. Such limitations can be exacerbated by nonlinear object motion, particularly from an ego-centric viewpoint. It becomes evident that to address these challenges, the key lies in exploiting temporal feature relation over an extended horizon and enforcing spatial motion consistency for effective association. To this end, this paper proposes SIRA (Scalable Inter-frame Relation and Association) with two designs. First, inspired by Swin Transformer, we introduce extended temporal relation, generalizing the existing temporal relation layer from two consecutive frames to multiple inter-frames with temporally regrouped window attention for scalability. Second, we propose motion consistency track with the concept of a pseudo-tracklet generated from observational data for better trajectory prediction and subsequent object association. Our approach achieves 58.11 mAP@0.5 for oriented object detection and 47.79 MOTA for multiple object tracking on the Radiate dataset, surpassing previous state-of-the-art by a margin of +4.11 mAP@0.5 and +9.94 MOTA, respectively.
Title:
Optimization Models to Meet the Conditions of Order Preservation in the Analytic Hierarchy Process
Abstract
Deriving a priority vector from a pairwise comparison matrix (PCM) is a crucial step in the Analytical Hierarchy Process (AHP). Although there exists a priority vector that satisfies the conditions of order preservation (COP), the priority vectors obtained through existing prioritization methods frequently violate these conditions, resulting in numerous COP violations. To address this issue, this paper introduces a novel procedure to manage COP violations in AHP. Firstly, we prove that the index-exchangeability condition is both a necessary and sufficient condition for determining whether a priority vector satisfies COP. This enables the direct detection of COP violations, relying solely on the pairwise comparison preferences of decision-makers, rather than the prioritization methods utilized. Subsequently, we propose the Minimal Number of Violations and Deviations Method (MNVDM) model, which aims to derive a priority vector with the minimal number of COP violations. In particular, the MNVDM can obtain a violation-free priority vector when the PCM meets the index exchangeability conditions. Furthermore, an optimization model based on minimizing information loss is designed to ensure the COP by revising the preferences when the index-exchangeability conditions are violated. Finally, the feasibility and efficiency of the proposed models are validated through numerical examples and Monte Carlo simulation experiments. Our implementation is available at: this https URL.
Title:
Advancing Cyber-Attack Detection in Power Systems: A Comparative Study of Machine Learning and Graph Neural Network Approaches
Abstract
This paper explores the detection and localization of cyber-attacks on time-series measurements data in power systems, focusing on comparing conventional machine learning (ML) like k-means, deep learning method like autoencoder, and graph neural network (GNN)-based techniques. We assess the detection accuracy of these approaches and their potential to pinpoint the locations of specific sensor measurements under attack. Given the demonstrated success of GNNs in other time-series anomaly detection applications, we aim to evaluate their performance within the context of power systems cyber-attacks on sensor measurements. Utilizing the IEEE 68-bus system, we simulated four types of false data attacks, including scaling attacks, additive attacks, and their combinations, to test the selected approaches. Our results indicate that GNN-based methods outperform k-means and autoencoder in detection. Additionally, GNNs show promise in accurately localizing attacks for simple scenarios, although they still face challenges in more complex cases, especially ones that involve combinations of scaling and additive attacks.
Title:
Memory-Efficient Community Detection on Large Graphs Using Weighted Sketches
Authors: Subhajit Sahu
Subjects: Subjects:
Social and Information Networks (cs.SI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract
Community detection in graphs identifies groups of nodes with denser connections within the groups than between them, and while existing studies often focus on optimizing detection performance, memory constraints become critical when processing large graphs on shared-memory systems. We recently proposed efficient implementations of the Louvain, Leiden, and Label Propagation Algorithms (LPA) for community detection. However, these incur significant memory overhead from the use of collision-free per-thread hashtables. To address this, we introduce memory-efficient alternatives using weighted Misra-Gries (MG) sketches, which replace the per-thread hashtables, and reduce memory demands in Louvain, Leiden, and LPA implementations - while incurring only a minor quality drop (up to 1%) and moderate runtime penalties. We believe that these approaches, though slightly slower, are well-suited for parallel processing and could outperform current memory-intensive techniques on systems with many threads.
Title:
Simulation of Nanorobots with Artificial Intelligence and Reinforcement Learning for Advanced Cancer Cell Detection and Tracking
Authors: Shahab Kavousinejad
Subjects: Subjects:
Robotics (cs.RO); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph); Other Quantitative Biology (q-bio.OT)
Abstract
Nanorobots are a promising development in targeted drug delivery and the treatment of neurological disorders, with potential for crossing the blood-brain barrier (BBB). These small devices leverage advancements in nanotechnology and bioengineering for precise navigation and targeted payload delivery, particularly for conditions like brain tumors, Alzheimer's disease, and Parkinson's disease. Recent progress in artificial intelligence (AI) and machine learning (ML) has improved the navigation and effectiveness of nanorobots, allowing them to detect and interact with cancer cells through biomarker analysis. This study presents a new reinforcement learning (RL) framework for optimizing nanorobot navigation in complex biological environments, focusing on cancer cell detection by analyzing the concentration gradients of surrounding biomarkers. We utilize a computer simulation model to explore the behavior of nanorobots in a three-dimensional space with cancer cells and biological barriers. The proposed method uses Q-learning to refine movement strategies based on real-time biomarker concentration data, enabling nanorobots to autonomously navigate to cancerous tissues for targeted drug delivery. This research lays the groundwork for future laboratory experiments and clinical applications, with implications for personalized medicine and less invasive cancer treatments. The integration of intelligent nanorobots could revolutionize therapeutic strategies, reducing side effects and enhancing treatment effectiveness for cancer patients. Further research will investigate the practical deployment of these technologies in medical settings, aiming to unlock the full potential of nanorobotics in healthcare.
Title:
Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models
Authors: Guangzhi Xiong, Eric Xie, Amir Hassan Shariatmadari, Sikun Guo, Stefan Bekiranov, Aidong Zhang
Subjects: Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in various scientific domains, from natural language processing to complex problem-solving tasks. Their ability to understand and generate human-like text has opened up new possibilities for advancing scientific research, enabling tasks such as data analysis, literature review, and even experimental design. One of the most promising applications of LLMs in this context is hypothesis generation, where they can identify novel research directions by analyzing existing knowledge. However, despite their potential, LLMs are prone to generating ``hallucinations'', outputs that are plausible-sounding but factually incorrect. Such a problem presents significant challenges in scientific fields that demand rigorous accuracy and verifiability, potentially leading to erroneous or misleading conclusions. To overcome these challenges, we propose KG-CoI (Knowledge Grounded Chain of Ideas), a novel system that enhances LLM hypothesis generation by integrating external, structured knowledge from knowledge graphs (KGs). KG-CoI guides LLMs through a structured reasoning process, organizing their output as a chain of ideas (CoI), and includes a KG-supported module for the detection of hallucinations. With experiments on our newly constructed hypothesis generation dataset, we demonstrate that KG-CoI not only improves the accuracy of LLM-generated hypotheses but also reduces the hallucination in their reasoning chains, highlighting its effectiveness in advancing real-world scientific research.
Keyword: face recognition
Title:
Randomized coupled decompositions
Authors: Erna Begovic, Anita Carevic, Ivana Sain Glibic
Abstract
Coupled decompositions are a widely used tool for data fusion. This paper studies the coupled matrix factorization (CMF) where two matrices $X$ and $Y$ are represented in a low-rank format sharing one common factor, as well as the coupled matrix and tensor factorization (CMTF) where a matrix $Y$ and a tensor $\mathcal{X}$ are represented in a low-rank format sharing a factor matrix. We show that these problems are equivalent to the low-rank approximation of the matrix $[X \ Y]$ for CMF, that is $[X_{(1)} \ Y]$ for CMTF. Then, in order to speed up computation process, we adapt several randomization techniques, namely, randomized SVD, randomized subspace iteration, and randomized block Krylov iteration to the algorithms for coupled decompositions. We present extensive results of the numerical tests. Furthermore, as a novel approach and with a high success rate, we apply our randomized algorithms to the face recognition problem.
Title:
Digi2Real: Bridging the Realism Gap in Synthetic Data Face Recognition via Foundation Models
Authors: Anjith George, Sebastien Marcel
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
The accuracy of face recognition systems has improved significantly in the past few years, thanks to the large amount of data collected and the advancement in neural network architectures. However, these large-scale datasets are often collected without explicit consent, raising ethical and privacy concerns. To address this, there have been proposals to use synthetic datasets for training face recognition models. Yet, such models still rely on real data to train the generative models and generally exhibit inferior performance compared to those trained on real datasets. One of these datasets, DigiFace, uses a graphics pipeline to generate different identities and different intra-class variations without using real data in training the models. However, the performance of this approach is poor on face recognition benchmarks, possibly due to the lack of realism in the images generated from the graphics pipeline. In this work, we introduce a novel framework for realism transfer aimed at enhancing the realism of synthetically generated face images. Our method leverages the large-scale face foundation model, and we adapt the pipeline for realism enhancement. By integrating the controllable aspects of the graphics pipeline with our realism enhancement technique, we generate a large amount of realistic variations-combining the advantages of both approaches. Our empirical evaluations demonstrate that models trained using our enhanced dataset significantly improve the performance of face recognition systems over the baseline. The source code and datasets will be made available publicly.
Keyword: augmentation
Title:
Integrating Symbolic Neural Networks with Building Physics: A Study and Proposal
Authors: Xia Chen, Guoquan Lv, Xinwei Zhuang, Carlos Duarte, Stefano Schiavon, Philipp Geyer
Abstract
Symbolic neural networks, such as Kolmogorov-Arnold Networks (KAN), offer a promising approach for integrating prior knowledge with data-driven methods, making them valuable for addressing inverse problems in scientific and engineering domains. This study explores the application of KAN in building physics, focusing on predictive modeling, knowledge discovery, and continuous learning. Through four case studies, we demonstrate KAN's ability to rediscover fundamental equations, approximate complex formulas, and capture time-dependent dynamics in heat transfer. While there are challenges in extrapolation and interpretability, we highlight KAN's potential to combine advanced modeling methods for knowledge augmentation, which benefits energy efficiency, system optimization, and sustainability assessments beyond the personal knowledge constraints of the modelers. Additionally, we propose a model selection decision tree to guide practitioners in appropriate applications for building physics.
Title:
Saliency-Based diversity and fairness Metric and FaceKeepOriginalAugment: A Novel Approach for Enhancing Fairness and Diversity
Abstract
Data augmentation has become a pivotal tool in enhancing the performance of computer vision tasks, with the KeepOriginalAugment method emerging as a standout technique for its intelligent incorporation of salient regions within less prominent areas, enabling augmentation in both regions. Despite its success in image classification, its potential in addressing biases remains unexplored. In this study, we introduce an extension of the KeepOriginalAugment method, termed FaceKeepOriginalAugment, which explores various debiasing aspects-geographical, gender, and stereotypical biases-in computer vision models. By maintaining a delicate balance between data diversity and information preservation, our approach empowers models to exploit both diverse salient and non-salient regions, thereby fostering increased diversity and debiasing effects. We investigate multiple strategies for determining the placement of the salient region and swapping perspectives to decide which part undergoes augmentation. Leveraging the Image Similarity Score (ISS), we quantify dataset diversity across a range of datasets, including Flickr Faces HQ (FFHQ), WIKI, IMDB, Labelled Faces in the Wild (LFW), UTK Faces, and Diverse Dataset. We evaluate the effectiveness of FaceKeepOriginalAugment in mitigating gender bias across CEO, Engineer, Nurse, and School Teacher datasets, utilizing the Image-Image Association Score (IIAS) in convolutional neural networks (CNNs) and vision transformers (ViTs). Our findings shows the efficacy of FaceKeepOriginalAugment in promoting fairness and inclusivity within computer vision models, demonstrated by reduced gender bias and enhanced overall fairness. Additionally, we introduce a novel metric, Saliency-Based Diversity and Fairness Metric, which quantifies both diversity and fairness while handling data imbalance across various datasets.
Title:
Enhancing Question Answering Precision with Optimized Vector Retrieval and Instructions
Authors: Lixiao Yang, Mengyang Xu, Weimao Ke
Subjects: Subjects:
Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
Question-answering (QA) is an important application of Information Retrieval (IR) and language models, and the latest trend is toward pre-trained large neural networks with embedding parameters. Augmenting QA performances with these LLMs requires intensive computational resources for fine-tuning. We propose an innovative approach to improve QA task performances by integrating optimized vector retrievals and instruction methodologies. Based on retrieval augmentation, the process involves document embedding, vector retrieval, and context construction for optimal QA results. We experiment with different combinations of text segmentation techniques and similarity functions, and analyze their impacts on QA performances. Results show that the model with a small chunk size of 100 without any overlap of the chunks achieves the best result and outperforms the models based on semantic segmentation using sentences. We discuss related QA examples and offer insight into how model performances are improved within the two-stage framework.
Title:
AquaFuse: Waterbody Fusion for Physics Guided View Synthesis of Underwater Scenes
Authors: Md Abu Bakr Siddique, Jiayi Wu, Ioannis Rekleitis, Md Jahidul Islam
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Abstract
We introduce the idea of AquaFuse, a physics-based method for synthesizing waterbody properties in underwater imagery. We formulate a closed-form solution for waterbody fusion that facilitates realistic data augmentation and geometrically consistent underwater scene rendering. AquaFuse leverages the physical characteristics of light propagation underwater to synthesize the waterbody from one scene to the object contents of another. Unlike data-driven style transfer, AquaFuse preserves the depth consistency and object geometry in an input scene. We validate this unique feature by comprehensive experiments over diverse underwater scenes. We find that the AquaFused images preserve over 94% depth consistency and 90-95% structural similarity of the input scenes. We also demonstrate that it generates accurate 3D view synthesis by preserving object geometry while adapting to the inherent waterbody fusion process. AquaFuse opens up a new research direction in data augmentation by geometry-preserving style transfer for underwater imaging and robot vision applications.
Title:
OnlineTAS: An Online Baseline for Temporal Action Segmentation
Authors: Qing Zhong, Guodong Ding, Angela Yao
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Temporal context plays a significant role in temporal action segmentation. In an offline setting, the context is typically captured by the segmentation network after observing the entire sequence. However, capturing and using such context information in an online setting remains an under-explored problem. This work presents the an online framework for temporal action segmentation. At the core of the framework is an adaptive memory designed to accommodate dynamic changes in context over time, alongside a feature augmentation module that enhances the frames with the memory. In addition, we propose a post-processing approach to mitigate the severe over-segmentation in the online setting. On three common segmentation benchmarks, our approach achieves state-of-the-art performance.
Title:
RLE: A Unified Perspective of Data Augmentation for Cross-Spectral Re-identification
Authors: Lei Tan, Yukang Zhang, Keke Han, Pingyang Dai, Yan Zhang, Yongjian Wu, Rongrong Ji
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
This paper makes a step towards modeling the modality discrepancy in the cross-spectral re-identification task. Based on the Lambertain model, we observe that the non-linear modality discrepancy mainly comes from diverse linear transformations acting on the surface of different materials. From this view, we unify all data augmentation strategies for cross-spectral re-identification by mimicking such local linear transformations and categorizing them into moderate transformation and radical transformation. By extending the observation, we propose a Random Linear Enhancement (RLE) strategy which includes Moderate Random Linear Enhancement (MRLE) and Radical Random Linear Enhancement (RRLE) to push the boundaries of both types of transformation. Moderate Random Linear Enhancement is designed to provide diverse image transformations that satisfy the original linear correlations under constrained conditions, whereas Radical Random Linear Enhancement seeks to generate local linear transformations directly without relying on external information. The experimental results not only demonstrate the superiority and effectiveness of RLE but also confirm its great potential as a general-purpose data augmentation for cross-spectral re-identification. The code is available at \textcolor{magenta}{\url{this https URL}}.
Title:
Combining Financial Data and News Articles for Stock Price Movement Prediction Using Large Language Models
Authors: Ali Elahi, Fatemeh Taghvaei
Subjects: Subjects:
Information Retrieval (cs.IR); Computational Finance (q-fin.CP)
Abstract
Predicting financial markets and stock price movements requires analyzing a company's performance, historic price movements, industry-specific events alongside the influence of human factors such as social media and press coverage. We assume that financial reports (such as income statements, balance sheets, and cash flow statements), historical price data, and recent news articles can collectively represent aforementioned factors. We combine financial data in tabular format with textual news articles and employ pre-trained Large Language Models (LLMs) to predict market movements. Recent research in LLMs has demonstrated that they are able to perform both tabular and text classification tasks, making them our primary model to classify the multi-modal data. We utilize retrieval augmentation techniques to retrieve and attach relevant chunks of news articles to financial metrics related to a company and prompt the LLMs in zero, two, and four-shot settings. Our dataset contains news articles collected from different sources, historic stock price, and financial report data for 20 companies with the highest trading volume across different industries in the stock market. We utilized recently released language models for our LLM-based classifier, including GPT- 3 and 4, and LLaMA- 2 and 3 models. We introduce an LLM-based classifier capable of performing classification tasks using combination of tabular (structured) and textual (unstructured) data. By using this model, we predicted the movement of a given stock's price in our dataset with a weighted F1-score of 58.5% and 59.1% and Matthews Correlation Coefficient of 0.175 for both 3-month and 6-month periods.
Title:
Online Relational Inference for Evolving Multi-agent Interacting Systems
Abstract
We introduce a novel framework, Online Relational Inference (ORI), designed to efficiently identify hidden interaction graphs in evolving multi-agent interacting systems using streaming data. Unlike traditional offline methods that rely on a fixed training set, ORI employs online backpropagation, updating the model with each new data point, thereby allowing it to adapt to changing environments in real-time. A key innovation is the use of an adjacency matrix as a trainable parameter, optimized through a new adaptive learning rate technique called AdaRelation, which adjusts based on the historical sensitivity of the decoder to changes in the interaction graph. Additionally, a data augmentation method named Trajectory Mirror (TM) is introduced to improve generalization by exposing the model to varied trajectory patterns. Experimental results on both synthetic datasets and real-world data (CMU MoCap for human motion) demonstrate that ORI significantly improves the accuracy and adaptability of relational inference in dynamic settings compared to existing methods. This approach is model-agnostic, enabling seamless integration with various neural relational inference (NRI) architectures, and offers a robust solution for real-time applications in complex, evolving systems.
Title:
Capsule Vision Challenge 2024: Multi-Class Abnormality Classification for Video Capsule Endoscopy
Abstract
This study presents an approach to developing a model for classifying abnormalities in video capsule endoscopy (VCE) frames. Given the challenges of data imbalance, we implemented a tiered augmentation strategy using the albumentations library to enhance minority class representation. Additionally, we addressed learning complexities by progressively structuring training tasks, allowing the model to differentiate between normal and abnormal cases and then gradually adding more specific classes based on data availability. Our pipeline, developed in PyTorch, employs a flexible architecture enabling seamless adjustments to classification complexity. We tested our approach using ResNet50 and a custom ViT-CNN hybrid model, with training conducted on the Kaggle platform. This work demonstrates a scalable approach to abnormality classification in VCE.
Title:
Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image Segmentation
Authors: Seongsu Ha, Chaeyun Kim, Donghwa Kim, Junho Lee, Sangho Lee, Joonseok Lee
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Referring Image Segmentation is a comprehensive task to segment an object referred by a textual query from an image. In nature, the level of difficulty in this task is affected by the existence of similar objects and the complexity of the referring expression. Recent RIS models still show a significant performance gap between easy and hard scenarios. We pose that the bottleneck exists in the data, and propose a simple but powerful data augmentation method, Negative-mined Mosaic Augmentation (NeMo). This method augments a training image into a mosaic with three other negative images carefully curated by a pretrained multimodal alignment model, e.g., CLIP, to make the sample more challenging. We discover that it is critical to properly adjust the difficulty level, neither too ambiguous nor too trivial. The augmented training data encourages the RIS model to recognize subtle differences and relationships between similar visual entities and to concretely understand the whole expression to locate the right target better. Our approach shows consistent improvements on various datasets and models, verified by extensive experiments.
Title:
PreCM: The Padding-based Rotation Equivariant Convolution Mode for Semantic Segmentation
Abstract
Semantic segmentation is an important branch of image processing and computer vision. With the popularity of deep learning, various deep semantic segmentation networks have been proposed for pixel-level classification and segmentation tasks. However, the imaging angles are often arbitrary in real world, such as water body images in remote sensing, and capillary and polyp images in medical field, and we usually cannot obtain prior orientation information to guide these networks to extract more effective features. Additionally, learning the features of objects with multiple orientation information is also challenging, as most CNN-based semantic segmentation networks do not have rotation equivariance to resist the disturbance from orientation information. To address the same, in this paper, we first establish a universal convolution-group framework to more fully utilize the orientation information and make the networks rotation equivariant. Then, we mathematically construct the padding-based rotation equivariant convolution mode (PreCM), which can be used not only for multi-scale images and convolution kernels, but also as a replacement component to replace multiple convolutions, like dilated convolution, transposed convolution, variable stride convolution, etc. In order to verify the realization of rotation equivariance, a new evaluation metric named rotation difference (RD) is finally proposed. The experiments carried out on the datesets Satellite Images of Water Bodies, DRIVE and Floodnet show that the PreCM-based networks can achieve better segmentation performance than the original and data augmentation-based networks. In terms of the average RD value, the former is 0% and the latter two are respectively 7.0503% and 3.2606%. Last but not least, PreCM also effectively enhances the robustness of networks to rotation perturbations.
Title:
Quantum Rationale-Aware Graph Contrastive Learning for Jet Discrimination
Authors: Md Abrar Jahin, Md. Akmol Masud, M. F. Mridha, Nilanjan Dey
Subjects: Subjects:
Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
Abstract
In high-energy physics, particle jet tagging plays a pivotal role in distinguishing quark from gluon jets using data from collider experiments. While graph-based deep learning methods have advanced this task beyond traditional feature-engineered approaches, the complex data structure and limited labeled samples present ongoing challenges. However, existing contrastive learning (CL) frameworks struggle to leverage rationale-aware augmentations effectively, often lacking supervision signals that guide the extraction of salient features and facing computational efficiency issues such as high parameter counts. In this study, we demonstrate that integrating a quantum rationale generator (QRG) within our proposed Quantum Rationale-aware Graph Contrastive Learning (QRGCL) framework significantly enhances jet discrimination performance, reducing reliance on labeled data and capturing discriminative features. Evaluated on the quark-gluon jet dataset, QRGCL achieves an AUC score of 77.53% while maintaining a compact architecture of only 45 QRG parameters, outperforming classical, quantum, and hybrid GCL and GNN benchmarks. These results highlight QRGCL's potential to advance jet tagging and other complex classification tasks in high-energy physics, where computational efficiency and feature extraction limitations persist.
Title:
Data Augmentations Go Beyond Encoding Invariances: A Theoretical Study on Self-Supervised Learning
Authors: Shlomo Libo Feigin, Maximilian Fleissner, Debarghya Ghoshdastidar
Abstract
Understanding the role of data augmentations is critical for applying Self-Supervised Learning (SSL) methods in new domains. Data augmentations are commonly understood as encoding invariances into the learned representations. This interpretation suggests that SSL would require diverse augmentations that resemble the original data. However, in practice, augmentations do not need to be similar to the original data nor be diverse, and can be neither at the same time. We provide a theoretical insight into this phenomenon. We show that for different SSL losses, any non-redundant representation can be learned with a single suitable augmentation. We provide an algorithm to reconstruct such augmentations and give insights into augmentation choices in SSL.
Title:
Learning predictable and robust neural representations by straightening image sequences
Authors: Xueyan Niu, Cristina Savin, Eero P. Simoncelli
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Prediction is a fundamental capability of all living organisms, and has been proposed as an objective for learning sensory representations. Recent work demonstrates that in primate visual systems, prediction is facilitated by neural representations that follow straighter temporal trajectories than their initial photoreceptor encoding, which allows for prediction by linear extrapolation. Inspired by these experimental findings, we develop a self-supervised learning (SSL) objective that explicitly quantifies and promotes straightening. We demonstrate the power of this objective in training deep feedforward neural networks on smoothly-rendered synthetic image sequences that mimic commonly-occurring properties of natural videos. The learned model contains neural embeddings that are predictive, but also factorize the geometric, photometric, and semantic attributes of objects. The representations also prove more robust to noise and adversarial attacks compared to previous SSL methods that optimize for invariance to random augmentations. Moreover, these beneficial properties can be transferred to other training procedures by using the straightening objective as a regularizer, suggesting a broader utility for straightening as a principle for robust unsupervised learning.
Title:
Improving Domain Generalization in Self-supervised Monocular Depth Estimation via Stabilized Adversarial Training
Authors: Yuanqi Yao, Gang Wu, Kui Jiang, Siao Liu, Jian Kuai, Xianming Liu, Junjun Jiang
Subjects: Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Abstract
Learning a self-supervised Monocular Depth Estimation (MDE) model with great generalization remains significantly challenging. Despite the success of adversarial augmentation in the supervised learning generalization, naively incorporating it into self-supervised MDE models potentially causes over-regularization, suffering from severe performance degradation. In this paper, we conduct qualitative analysis and illuminate the main causes: (i) inherent sensitivity in the UNet-alike depth network and (ii) dual optimization conflict caused by over-regularization. To tackle these issues, we propose a general adversarial training framework, named Stabilized Conflict-optimization Adversarial Training (SCAT), integrating adversarial data augmentation into self-supervised MDE methods to achieve a balance between stability and generalization. Specifically, we devise an effective scaling depth network that tunes the coefficients of long skip connection and effectively stabilizes the training process. Then, we propose a conflict gradient surgery strategy, which progressively integrates the adversarial gradient and optimizes the model toward a conflict-free direction. Extensive experiments on five benchmarks demonstrate that SCAT can achieve state-of-the-art performance and significantly improve the generalization capability of existing self-supervised MDE methods.
Keyword: detection
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Keyword: face recognition
Title:
Title:
Keyword: augmentation
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title:
Title: