New submissions for Mon, 25 Mar 24

Keyword: detection

A Synergistic Approach to Wildfire Prevention and Management Using AI, ML, and 5G Technology in the United States

Authors: Authors: Stanley Chinedu Okoro, Alexander Lopez, Austine Unuriode
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2403.14657
Pdf link: https://arxiv.org/pdf/2403.14657
Abstract Over the past few years, wildfires have become a worldwide environmental emergency, resulting in substantial harm to natural habitats and playing a part in the acceleration of climate change. Wildfire management methods involve prevention, response, and recovery efforts. Despite improvements in detection techniques, the rising occurrence of wildfires demands creative solutions for prompt identification and effective control. This research investigates proactive methods for detecting and handling wildfires in the United States, utilizing Artificial Intelligence (AI), Machine Learning (ML), and 5G technology. The specific objective of this research covers proactive detection and prevention of wildfires using advanced technology; Active monitoring and mapping with remote sensing and signaling leveraging on 5G technology; and Advanced response mechanisms to wildfire using drones and IOT devices. This study was based on secondary data collected from government databases and analyzed using descriptive statistics. In addition, past publications were reviewed through content analysis, and narrative synthesis was used to present the observations from various studies. The results showed that developing new technology presents an opportunity to detect and manage wildfires proactively. Utilizing advanced technology could save lives and prevent significant economic losses caused by wildfires. Various methods, such as AI-enabled remote sensing and 5G-based active monitoring, can enhance proactive wildfire detection and management. In addition, super intelligent drones and IOT devices can be used for safer responses to wildfires. This forms the core of the recommendation to the fire Management Agencies and the government.
Towards a Framework for Deep Learning Certification in Safety-Critical Applications Using Inherently Safe Design and Run-Time Error Detection
Authors: Authors: Romeo Valentin
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.14678
Pdf link: https://arxiv.org/pdf/2403.14678
Abstract Although an ever-growing number of applications employ deep learning based systems for prediction, decision-making, or state estimation, almost no certification processes have been established that would allow such systems to be deployed in safety-critical applications. In this work we consider real-world problems arising in aviation and other safety-critical areas, and investigate their requirements for a certified model. To this end, we investigate methodologies from the machine learning research community aimed towards verifying robustness and reliability of deep learning systems, and evaluate these methodologies with regard to their applicability to real-world problems. Then, we establish a new framework towards deep learning certification based on (i) inherently safe design, and (ii) run-time error detection. Using a concrete use case from aviation, we show how deep learning models can recover disentangled variables through the use of weakly-supervised representation learning. We argue that such a system design is inherently less prone to common model failures, and can be verified to encode underlying mechanisms governing the data. Then, we investigate four techniques related to the run-time safety of a model, namely (i) uncertainty quantification, (ii) out-of-distribution detection, (iii) feature collapse, and (iv) adversarial attacks. We evaluate each for their applicability and formulate a set of desiderata that a certified model should fulfill. Finally, we propose a novel model structure that exhibits all desired properties discussed in this work, and is able to make regression and uncertainty predictions, as well as detect out-of-distribution inputs, while requiring no regression labels to train. We conclude with a discussion of the current state and expected future progress of deep learning certification, and its industrial and social implications.
An AIC-based approach for articulating unpredictable problems in open complex environments
Authors: Authors: Haider AL-Shareefy, Michael Butler, Thai Son Hoang
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2403.14697
Pdf link: https://arxiv.org/pdf/2403.14697
Abstract This research paper presents an approach to enhancing the predictive capability of architects in the design and assurance of systems, focusing on systems operating in dynamic and unpredictable environments. By adopting a systems approach, we aim to improve architects' predictive capabilities in designing dependable systems (for example, ML-based systems). An aerospace case study is used to illustrate the approach. Multiple factors (challenges) influencing aircraft detection are identified, demonstrating the effectiveness of our approach in a complex operational setting. Our approach primarily aimed to enhance the architect's predictive capability.
Rule based Complex Event Processing for an Air Quality Monitoring System in Smart City
Authors: Authors: Shashi Shekhar Kumar, Ritesh Chandra, Sonali Agarwal
Subjects: Computers and Society (cs.CY); Databases (cs.DB)
Arxiv link: https://arxiv.org/abs/2403.14701
Pdf link: https://arxiv.org/pdf/2403.14701
Abstract In recent years, smart city-based development has gained momentum due to its versatile nature in architecture and planning for the systematic habitation of human beings. According to World Health Organization (WHO) report, air pollution causes serious respiratory diseases. Hence, it becomes necessary to real-time monitoring of air quality to minimize effect by taking time-bound decisions by the stakeholders. The air pollution comprises various compositions such as NH3, O3, SO2, NO2, etc., and their concentrations vary from location to location.The research work proposes an integrated framework for monitoring air quality using rule-based Complex Event Processing (CEP) and SPARQL queries. CEP works with the data stream based on predefined rules to detect the complex pattern, which helps in decision support for stakeholders. Initially, the dataset was collected from the Central Pollution Control Board (CPCB) of India and this data was then preprocessed and passed through Apache Kafka. Then a knowledge graph developed based on the air quality paradigm. Consequently, convert preprocessed data into Resource Description Framework (RDF) data, and integrate with Knowledge graph which is ingested to CEP engine using Apache Jena for enhancing the decision support . Simultaneously, rules are extracted using a decision tree, and some ground truth parameters of CPCB are added and ingested to the CEP engine to determine the complex patterns. Consequently, the SPARQL query is used on real-time RDF dataset for fetching the condition of air quality as good, poor, severe, hazardous etc based on complex events detection. For validating the proposed approach various chunks of RDF are used for the deployment of events to the CEP engine, and its performance is examined over time while performing simple and complex queries.
Safeguarding Marketing Research: The Generation, Identification, and Mitigation of AI-Fabricated Disinformation
Authors: Authors: Anirban Mukherjee
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2403.14706
Pdf link: https://arxiv.org/pdf/2403.14706
Abstract Generative AI has ushered in the ability to generate content that closely mimics human contributions, introducing an unprecedented threat: Deployed en masse, these models can be used to manipulate public opinion and distort perceptions, resulting in a decline in trust towards digital platforms. This study contributes to marketing literature and practice in three ways. First, it demonstrates the proficiency of AI in fabricating disinformative user-generated content (UGC) that mimics the form of authentic content. Second, it quantifies the disruptive impact of such UGC on marketing research, highlighting the susceptibility of analytics frameworks to even minimal levels of disinformation. Third, it proposes and evaluates advanced detection frameworks, revealing that standard techniques are insufficient for filtering out AI-generated disinformation. We advocate for a comprehensive approach to safeguarding marketing research that integrates advanced algorithmic solutions, enhanced human oversight, and a reevaluation of regulatory and ethical frameworks. Our study seeks to serve as a catalyst, providing a foundation for future research and policy-making aimed at navigating the intricate challenges at the nexus of technology, ethics, and marketing.
Synergy of Information in Multimodal IoT Systems -- Discovering the impact of daily behaviour routines on physical activity level
Authors: Authors: Mohsen Shirali, Zahra Ahmadi, Carlos Fernández-Llatas, Jose-Luis Bayo-Monton
Subjects: Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2403.14707
Pdf link: https://arxiv.org/pdf/2403.14707
Abstract The intricate connection between daily behaviours and health necessitates robust behaviour monitoring, particularly with the advent of IoT systems. This study introduces an innovative approach, exploiting the synergy of information from various IoT sources, to assess the alignment of behaviour routines with health guidelines. We grouped routines based on guideline compliance and used a clustering method to identify similarities in behaviours and key characteristics within each cluster. Applied to an elderly care case study, our approach unveils patterns leading to physical inactivity by categorising days based on recommended daily steps. Utilising data from wristbands, smartphones, and ambient sensors, the study provides insights not achievable with single-source data. Visualisation in a calendar view aids health experts in understanding patient behaviours, enabling precise interventions. Notably, the approach facilitates early detection of behaviour changes during events like COVID-19 and Ramadan, available in our dataset. This work signifies a promising path for behavioural analysis and discovering variations to empower smart healthcare, offering insights into patient health, personalised interventions, and healthier routines through continuous IoT-driven data analysis.
Human-in-the-Loop AI for Cheating Ring Detection
Authors: Authors: Yong-Siang Shih, Manqian Liao, Ruidong Liu, Mirza Basim Baig
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.14711
Pdf link: https://arxiv.org/pdf/2403.14711
Abstract Online exams have become popular in recent years due to their accessibility. However, some concerns have been raised about the security of the online exams, particularly in the context of professional cheating services aiding malicious test takers in passing exams, forming so-called "cheating rings". In this paper, we introduce a human-in-the-loop AI cheating ring detection system designed to detect and deter these cheating rings. We outline the underlying logic of this human-in-the-loop AI system, exploring its design principles tailored to achieve its objectives of detecting cheaters. Moreover, we illustrate the methodologies used to evaluate its performance and fairness, aiming to mitigate the unintended risks associated with the AI system. The design and development of the system adhere to Responsible AI (RAI) standards, ensuring that ethical considerations are integrated throughout the entire development process.
Bypassing LLM Watermarks with Color-Aware Substitutions
Authors: Authors: Qilong Wu, Varun Chandrasekaran
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.14719
Pdf link: https://arxiv.org/pdf/2403.14719
Abstract Watermarking approaches are proposed to identify if text being circulated is human or large language model (LLM) generated. The state-of-the-art watermarking strategy of Kirchenbauer et al. (2023a) biases the LLM to generate specific (green'') tokens. However, determining the robustness of this watermarking method is an open problem. Existing attack methods fail to evade detection for longer text segments. We overcome this limitation, and propose {\em Self Color Testing-based Substitution (SCTS)}, the firstcolor-aware'' attack. SCTS obtains color information by strategically prompting the watermarked LLM and comparing output tokens frequencies. It uses this information to determine token colors, and substitutes green tokens with non-green ones. In our experiments, SCTS successfully evades watermark detection using fewer number of edits than related work. Additionally, we show both theoretically and empirically that SCTS can remove the watermark for arbitrarily long watermarked text.
A task of anomaly detection for a smart satellite Internet of things system
Authors: Authors: Zilong Shao
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2403.14738
Pdf link: https://arxiv.org/pdf/2403.14738
Abstract When the equipment is working, real-time collection of environmental sensor data for anomaly detection is one of the key links to prevent industrial process accidents and network attacks and ensure system security. However, under the environment with specific real-time requirements, the anomaly detection for environmental sensors still faces the following difficulties: (1) The complex nonlinear correlation characteristics between environmental sensor data variables lack effective expression methods, and the distribution between the data is difficult to be captured. (2) it is difficult to ensure the real-time monitoring requirements by using complex machine learning models, and the equipment cost is too high. (3) Too little sample data leads to less labeled data in supervised learning. This paper proposes an unsupervised deep learning anomaly detection system. Based on the generative adversarial network and self-attention mechanism, considering the different feature information contained in the local subsequences, it automatically learns the complex linear and nonlinear dependencies between environmental sensor variables, and uses the anomaly score calculation method combining reconstruction error and discrimination error. It can monitor the abnormal points of real sensor data with high real-time performance and can run on the intelligent satellite Internet of things system, which is suitable for the real working environment. Anomaly detection outperforms baseline methods in most cases and has good interpretability, which can be used to prevent industrial accidents and cyber-attacks for monitoring environmental sensors.
Relaxed Clique Percolation and Disinformation-Resilient Domains for Social Commerce Networks
Authors: Authors: Himangshu Paul, Alexander Nikolaev
Subjects: Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2403.14767
Pdf link: https://arxiv.org/pdf/2403.14767
Abstract Must we trace and block all fake content in a social commerce network so that genuine users may enjoy fake-free information? Such efforts largely fail, because, as we get better at spam detection, spammers use the same advances for anti-detection. As a fundamentally new approach, we show that an online platform can aggregate and route user-generated content in a smart personalized way, which fosters and relies on "collective social responsibility". We introduce the notion of information aggregation domain, or simply, domain: composed for a given "central" node (user account), a domain is a connected set of nodes whose user-generated content is eligible to be used to meet the central node's information needs. Admitting malicious information sources - "bad citizen" nodes - into "good citizen" nodes' domains puts the good citizens at risk for disinformation attacks. We show how a platform can limit this risk by exploiting the social link structure between its nodes without the need to know which nodes are good or bad citizens. We introduce Relaxed Clique Percolation (RCP), a class of policies to compose personalized disinformation-resilient domains. Then, we define "RCP cores" and show how they can be used to efficiently compose resilient domains for all network nodes at once. Finally, we analyze the properties of RCP domains found in real-world social networks including Slashdot, Facebook, Flickr, and Yelp, to affirm that in practice, RCP domains turn out to be large and spatially diverse.
Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering
Authors: Authors: Bowen Jiang, Zhijun Zhuang, Shreyas S. Shivakumar, Dan Roth, Camillo J. Taylor
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2403.14783
Pdf link: https://arxiv.org/pdf/2403.14783
Abstract This work explores the zero-shot capabilities of foundation models in Visual Question Answering (VQA) tasks. We propose an adaptive multi-agent system, named Multi-Agent VQA, to overcome the limitations of foundation models in object detection and counting by using specialized agents as tools. Unlike existing approaches, our study focuses on the system's performance without fine-tuning it on specific VQA datasets, making it more practical and robust in the open world. We present preliminary experimental results under zero-shot scenarios and highlight some failure cases, offering new directions for future research.
Preventing Catastrophic Forgetting through Memory Networks in Continuous Detection
Authors: Authors: Gaurav Bhatt, James Ross, Leonid Sigal
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.14797
Pdf link: https://arxiv.org/pdf/2403.14797
Abstract Modern pre-trained architectures struggle to retain previous information while undergoing continuous fine-tuning on new tasks. Despite notable progress in continual classification, systems designed for complex vision tasks such as detection or segmentation still struggle to attain satisfactory performance. In this work, we introduce a memory-based detection transformer architecture to adapt a pre-trained DETR-style detector to new tasks while preserving knowledge from previous tasks. We propose a novel localized query function for efficient information retrieval from memory units, aiming to minimize forgetting. Furthermore, we identify a fundamental challenge in continual detection referred to as background relegation. This arises when object categories from earlier tasks reappear in future tasks, potentially without labels, leading them to be implicitly treated as background. This is an inevitable issue in continual detection or segmentation. The introduced continual optimization technique effectively tackles this challenge. Finally, we assess the performance of our proposed system on continual detection benchmarks and demonstrate that our approach surpasses the performance of existing state-of-the-art resulting in 5-7% improvements on MS-COCO and PASCAL-VOC on the task of continual detection.
Deep Active Learning: A Reality Check
Authors: Authors: Edrina Gashi, Jiankang Deng, Ismail Elezi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.14800
Pdf link: https://arxiv.org/pdf/2403.14800
Abstract We conduct a comprehensive evaluation of state-of-the-art deep active learning methods. Surprisingly, under general settings, no single-model method decisively outperforms entropy-based active learning, and some even fall short of random sampling. We delve into overlooked aspects like starting budget, budget step, and pretraining's impact, revealing their significance in achieving superior results. Additionally, we extend our evaluation to other tasks, exploring the active learning effectiveness in combination with semi-supervised learning, and object detection. Our experiments provide valuable insights and concrete recommendations for future active learning studies. By uncovering the limitations of current methods and understanding the impact of different experimental settings, we aim to inspire more efficient training of deep learning models in real-world scenarios with limited annotation budgets. This work contributes to advancing active learning's efficacy in deep learning and empowers researchers to make informed decisions when applying active learning to their tasks.
DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation
Authors: Authors: Zeeshan Hayder, Xuming He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.14886
Pdf link: https://arxiv.org/pdf/2403.14886
Abstract Scene graph generation aims to capture detailed spatial and semantic relationships between objects in an image, which is challenging due to incomplete labelling, long-tailed relationship categories, and relational semantic overlap. Existing Transformer-based methods either employ distinct queries for objects and predicates or utilize holistic queries for relation triplets and hence often suffer from limited capacity in learning low-frequency relationships. In this paper, we present a new Transformer-based method, called DSGG, that views scene graph detection as a direct graph prediction problem based on a unique set of graph-aware queries. In particular, each graph-aware query encodes a compact representation of both the node and all of its relations in the graph, acquired through the utilization of a relaxed sub-graph matching during the training process. Moreover, to address the problem of relational semantic overlap, we utilize a strategy for relation distillation, aiming to efficiently learn multiple instances of semantic relationships. Extensive experiments on the VG and the PSG datasets show that our model achieves state-of-the-art results, showing a significant improvement of 3.5\% and 6.7\% in mR@50 and mR@100 for the scene-graph generation task and achieves an even more substantial improvement of 8.5\% and 10.3\% in mR@50 and mR@100 for the panoptic scene graph generation task. Code is available at \url{https://github.com/zeeshanhayder/DSGG}.
Unraveling Contagion Origins: Optimal Estimation through Maximum-Likelihood and Starlike Tree Approximation in Markovian Spreading Models
Authors: Authors: Pei-Duo Yu, Chee Wei Tan, Liang Zheng, Chao Zhao
Subjects: Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2403.14890
Pdf link: https://arxiv.org/pdf/2403.14890
Abstract Identifying the source of epidemic-like spread in networks is crucial for tasks like removing internet viruses or finding the rumor source in online social networks. The challenge lies in tracing the source from a snapshot observation of infected nodes. How do we accurately pinpoint the source? Utilizing snapshot data, we apply a probabilistic approach, focusing on the graph boundary and the observed time, to detect sources via an effective maximum likelihood algorithm. A novel starlike tree approximation extends applicability to general graphs, demonstrating versatility. We highlight the utility of the Gamma function for analyzing the asymptotic behavior of the likelihood ratio between nodes. Comprehensive evaluations confirm algorithmic effectiveness in diverse network scenarios, advancing rumor source detection in large-scale network analysis and information dissemination strategies.
Stance Reasoner: Zero-Shot Stance Detection on Social Media with Explicit Reasoning
Authors: Authors: Maksym Taranukhin, Vered Shwartz, Evangelos Milios
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2403.14895
Pdf link: https://arxiv.org/pdf/2403.14895
Abstract Social media platforms are rich sources of opinionated content. Stance detection allows the automatic extraction of users' opinions on various topics from such content. We focus on zero-shot stance detection, where the model's success relies on (a) having knowledge about the target topic; and (b) learning general reasoning strategies that can be employed for new topics. We present Stance Reasoner, an approach to zero-shot stance detection on social media that leverages explicit reasoning over background knowledge to guide the model's inference about the document's stance on a target. Specifically, our method uses a pre-trained language model as a source of world knowledge, with the chain-of-thought in-context learning approach to generate intermediate reasoning steps. Stance Reasoner outperforms the current state-of-the-art models on 3 Twitter datasets, including fully supervised models. It can better generalize across targets, while at the same time providing explicit and interpretable explanations for its predictions.
Investigating Bias in LLM-Based Bias Detection: Disparities between LLMs and Human Perception
Authors: Authors: Luyang Lin, Lingzhi Wang, Jinsong Guo, Kam-Fai Wong
Subjects: Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2403.14896
Pdf link: https://arxiv.org/pdf/2403.14896
Abstract The pervasive spread of misinformation and disinformation in social media underscores the critical importance of detecting media bias. While robust Large Language Models (LLMs) have emerged as foundational tools for bias prediction, concerns about inherent biases within these models persist. In this work, we investigate the presence and nature of bias within LLMs and its consequential impact on media bias detection. Departing from conventional approaches that focus solely on bias detection in media content, we delve into biases within the LLM systems themselves. Through meticulous examination, we probe whether LLMs exhibit biases, particularly in political bias prediction and text continuation tasks. Additionally, we explore bias across diverse topics, aiming to uncover nuanced variations in bias expression within the LLM framework. Importantly, we propose debiasing strategies, including prompt engineering and model fine-tuning. Extensive analysis of bias tendencies across different LLMs sheds light on the broader landscape of bias propagation in language models. This study advances our understanding of LLM bias, offering critical insights into its implications for bias detection tasks and paving the way for more robust and equitable AI systems
Web-based Melanoma Detection
Authors: Authors: SangHyuk Kim, Edward Gaibor, Daniel Haehn
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.14898
Pdf link: https://arxiv.org/pdf/2403.14898
Abstract Melanoma is the most aggressive form of skin cancer, and early detection can significantly increase survival rates and prevent cancer spread. However, developing reliable automated detection techniques is difficult due to the lack of standardized datasets and evaluation methods. This study introduces a unified melanoma classification approach that supports 54 combinations of 11 datasets and 24 state-of-the-art deep learning architectures. It enables a fair comparison of 1,296 experiments and results in a lightweight model deployable to the web-based MeshNet architecture named Mela-D. This approach can run up to 33x faster by reducing parameters 24x to yield an analogous 88.8\% accuracy comparable with ResNet50 on previously unseen images. This allows efficient and accurate melanoma detection in real-world settings that can run on consumer-level hardware.
Addressing Concept Shift in Online Time Series Forecasting: Detect-then-Adapt
Authors: Authors: YiFan Zhang, Weiqi Chen, Zhaoyang Zhu, Dalin Qin, Liang Sun, Xue Wang, Qingsong Wen, Zhang Zhang, Liang Wang, Rong Jin
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.14949
Pdf link: https://arxiv.org/pdf/2403.14949
Abstract Online updating of time series forecasting models aims to tackle the challenge of concept drifting by adjusting forecasting models based on streaming data. While numerous algorithms have been developed, most of them focus on model design and updating. In practice, many of these methods struggle with continuous performance regression in the face of accumulated concept drifts over time. To address this limitation, we present a novel approach, Concept \textbf{D}rift \textbf{D}etection an\textbf{D} \textbf{A}daptation (D3A), that first detects drifting conception and then aggressively adapts the current model to the drifted concepts after the detection for rapid adaption. To best harness the utility of historical data for model adaptation, we propose a data augmentation strategy introducing Gaussian noise into existing training instances. It helps mitigate the data distribution gap, a critical factor contributing to train-test performance inconsistency. The significance of our data augmentation process is verified by our theoretical analysis. Our empirical studies across six datasets demonstrate the effectiveness of D3A in improving model adaptation capability. Notably, compared to a simple Temporal Convolutional Network (TCN) baseline, D3A reduces the average Mean Squared Error (MSE) by $43.9\%$. For the state-of-the-art (SOTA) model, the MSE is reduced by $33.3\%$.
AVT2-DWF: Improving Deepfake Detection with Audio-Visual Fusion and Dynamic Weighting Strategies
Authors: Authors: Rui Wang, Dengpan Ye, Long Tang, Yunming Zhang, Jiacheng Deng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.14974
Pdf link: https://arxiv.org/pdf/2403.14974
Abstract With the continuous improvements of deepfake methods, forgery messages have transitioned from single-modality to multi-modal fusion, posing new challenges for existing forgery detection algorithms. In this paper, we propose AVT2-DWF, the Audio-Visual dual Transformers grounded in Dynamic Weight Fusion, which aims to amplify both intra- and cross-modal forgery cues, thereby enhancing detection capabilities. AVT2-DWF adopts a dual-stage approach to capture both spatial characteristics and temporal dynamics of facial expressions. This is achieved through a face transformer with an n-frame-wise tokenization strategy encoder and an audio transformer encoder. Subsequently, it uses multi-modal conversion with dynamic weight fusion to address the challenge of heterogeneous information fusion between audio and visual modalities. Experiments on DeepfakeTIMIT, FakeAVCeleb, and DFDC datasets indicate that AVT2-DWF achieves state-of-the-art performance intra- and cross-dataset Deepfake detection. Code is available at https://github.com/raining-dev/AVT2-DWF.
MasonTigers at SemEval-2024 Task 8: Performance Analysis of Transformer-based Models on Machine-Generated Text Detection
Authors: Authors: Sadiya Sayara Chowdhury Puspo, Md Nishat Raihan, Dhiman Goswami, Al Nahian Bin Emran, Amrita Ganguly, Ozlem Uzuner
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2403.14989
Pdf link: https://arxiv.org/pdf/2403.14989
Abstract This paper presents the MasonTigers entry to the SemEval-2024 Task 8 - Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection. The task encompasses Binary Human-Written vs. Machine-Generated Text Classification (Track A), Multi-Way Machine-Generated Text Classification (Track B), and Human-Machine Mixed Text Detection (Track C). Our best performing approaches utilize mainly the ensemble of discriminator transformer models along with sentence transformer and statistical machine learning approaches in specific cases. Moreover, zero-shot prompting and fine-tuning of FLAN-T5 are used for Track A and B.
Cell Tracking according to Biological Needs -- Strong Mitosis-aware Random-finite Sets Tracker with Aleatoric Uncertainty
Authors: Authors: Timo Kaiser, Maximilian Schier, Bodo Rosenhahn
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.15011
Pdf link: https://arxiv.org/pdf/2403.15011
Abstract Cell tracking and segmentation assist biologists in extracting insights from large-scale microscopy time-lapse data. Driven by local accuracy metrics, current tracking approaches often suffer from a lack of long-term consistency. To address this issue, we introduce an uncertainty estimation technique for neural tracking-by-regression frameworks and incorporate it into our novel extended Poisson multi-Bernoulli mixture tracker. Our uncertainty estimation identifies uncertain associations within high-performing tracking-by-regression methods using problem-specific test-time augmentations. Leveraging this uncertainty, along with a novel mitosis-aware assignment problem formulation, our tracker resolves false associations and mitosis detections stemming from long-term conflicts. We evaluate our approach on nine competitive datasets and demonstrate that it outperforms the current state-of-the-art on biologically relevant metrics substantially, achieving improvements by a factor of approximately $5.75$. Furthermore, we uncover new insights into the behavior of tracking-by-regression uncertainty.
Extracting Human Attention through Crowdsourced Patch Labeling
Authors: Authors: Minsuk Chang, Seokhyeon Park, Hyeon Jeon, Aeri Cho, Soohyun Lee, Jinwook Seo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2403.15013
Pdf link: https://arxiv.org/pdf/2403.15013
Abstract In image classification, a significant problem arises from bias in the datasets. When it contains only specific types of images, the classifier begins to rely on shortcuts - simplistic and erroneous rules for decision-making. This leads to high performance on the training dataset but inferior results on new, varied images, as the classifier's generalization capability is reduced. For example, if the images labeled as mustache consist solely of male figures, the model may inadvertently learn to classify images by gender rather than the presence of a mustache. One approach to mitigate such biases is to direct the model's attention toward the target object's location, usually marked using bounding boxes or polygons for annotation. However, collecting such annotations requires substantial time and human effort. Therefore, we propose a novel patch-labeling method that integrates AI assistance with crowdsourcing to capture human attention from images, which can be a viable solution for mitigating bias. Our method consists of two steps. First, we extract the approximate location of a target using a pre-trained saliency detection model supplemented by human verification for accuracy. Then, we determine the human-attentive area in the image by iteratively dividing the image into smaller patches and employing crowdsourcing to ascertain whether each patch can be classified as the target object. We demonstrated the effectiveness of our method in mitigating bias through improved classification accuracy and the refined focus of the model. Also, crowdsourced experiments validate that our method collects human annotation up to 3.4 times faster than annotating object locations with polygons, significantly reducing the need for human resources. We conclude the paper by discussing the advantages of our method in a crowdsourcing context, mainly focusing on aspects of human errors and accessibility.
Vehicle Detection Performance in Nordic Region
Authors: Authors: Hamam Mokayed, Rajkumar Saini, Oluwatosin Adewumi, Lama Alkhaled, Bjorn Backe, Palaiahnakote Shivakumara, Olle Hagner, Yan Chai Hum
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.15017
Pdf link: https://arxiv.org/pdf/2403.15017
Abstract This paper addresses the critical challenge of vehicle detection in the harsh winter conditions in the Nordic regions, characterized by heavy snowfall, reduced visibility, and low lighting. Due to their susceptibility to environmental distortions and occlusions, traditional vehicle detection methods have struggled in these adverse conditions. The advanced proposed deep learning architectures brought promise, yet the unique difficulties of detecting vehicles in Nordic winters remain inadequately addressed. This study uses the Nordic Vehicle Dataset (NVD), which has UAV images from northern Sweden, to evaluate the performance of state-of-the-art vehicle detection algorithms under challenging weather conditions. Our methodology includes a comprehensive evaluation of single-stage, two-stage, and transformer-based detectors against the NVD. We propose a series of enhancements tailored to each detection framework, including data augmentation, hyperparameter tuning, transfer learning, and novel strategies designed explicitly for the DETR model. Our findings not only highlight the limitations of current detection systems in the Nordic environment but also offer promising directions for enhancing these algorithms for improved robustness and accuracy in vehicle detection amidst the complexities of winter landscapes. The code and the dataset are available at https://nvd.ltu-ai.dev
VRSO: Visual-Centric Reconstruction for Static Object Annotation
Authors: Authors: Chenyao Yu, Yingfeng Cai, Jiaxin Zhang, Hui Kong, Wei Sui, Cong Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.15026
Pdf link: https://arxiv.org/pdf/2403.15026
Abstract As a part of the perception results of intelligent driving systems, static object detection (SOD) in 3D space provides crucial cues for driving environment understanding. With the rapid deployment of deep neural networks for SOD tasks, the demand for high-quality training samples soars. The traditional, also reliable, way is manual labeling over the dense LiDAR point clouds and reference images. Though most public driving datasets adopt this strategy to provide SOD ground truth (GT), it is still expensive (requires LiDAR scanners) and low-efficient (time-consuming and unscalable) in practice. This paper introduces VRSO, a visual-centric approach for static object annotation. VRSO is distinguished in low cost, high efficiency, and high quality: (1) It recovers static objects in 3D space with only camera images as input, and (2) manual labeling is barely involved since GT for SOD tasks is generated based on an automatic reconstruction and annotation pipeline. (3) Experiments on the Waymo Open Dataset show that the mean reprojection error from VRSO annotation is only 2.6 pixels, around four times lower than the Waymo labeling (10.6 pixels). Source code is available at: https://github.com/CaiYingFeng/VRSO.
An Integrated Neighborhood and Scale Information Network for Open-Pit Mine Change Detection in High-Resolution Remote Sensing Images
Authors: Authors: Zilin Xie, Kangning Li, Jinbao Jiang, Jinzhong Yang, Xiaojun Qiao, Deshuai Yuan, Cheng Nie
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.15032
Pdf link: https://arxiv.org/pdf/2403.15032
Abstract Open-pit mine change detection (CD) in high-resolution (HR) remote sensing images plays a crucial role in mineral development and environmental protection. Significant progress has been made in this field in recent years, largely due to the advancement of deep learning techniques. However, existing deep-learning-based CD methods encounter challenges in effectively integrating neighborhood and scale information, resulting in suboptimal performance. Therefore, by exploring the influence patterns of neighborhood and scale information, this paper proposes an Integrated Neighborhood and Scale Information Network (INSINet) for open-pit mine CD in HR remote sensing images. Specifically, INSINet introduces 8-neighborhood-image information to acquire a larger receptive field, improving the recognition of center image boundary regions. Drawing on techniques of skip connection, deep supervision, and attention mechanism, the multi-path deep supervised attention (MDSA) module is designed to enhance multi-scale information fusion and change feature extraction. Experimental analysis reveals that incorporating neighborhood and scale information enhances the F1 score of INSINet by 6.40%, with improvements of 3.08% and 3.32% respectively. INSINet outperforms existing methods with an Overall Accuracy of 97.69%, Intersection over Union of 71.26%, and F1 score of 83.22%. INSINet shows significance for open-pit mine CD in HR remote sensing images.
Toward Tiny and High-quality Facial Makeup with Data Amplify Learning
Authors: Authors: Qiaoqiao Jin, Xuanhong Chen, Meiguang Jin, Ying Cheng, Rui Shi, Yucheng Zheng, Yupeng Zhu, Bingbing Ni
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.15033
Pdf link: https://arxiv.org/pdf/2403.15033
Abstract Contemporary makeup approaches primarily hinge on unpaired learning paradigms, yet they grapple with the challenges of inaccurate supervision (e.g., face misalignment) and sophisticated facial prompts (including face parsing, and landmark detection). These challenges prohibit low-cost deployment of facial makeup models, especially on mobile devices. To solve above problems, we propose a brand-new learning paradigm, termed "Data Amplify Learning (DAL)," alongside a compact makeup model named "TinyBeauty." The core idea of DAL lies in employing a Diffusion-based Data Amplifier (DDA) to "amplify" limited images for the model training, thereby enabling accurate pixel-to-pixel supervision with merely a handful of annotations. Two pivotal innovations in DDA facilitate the above training approach: (1) A Residual Diffusion Model (RDM) is designed to generate high-fidelity detail and circumvent the detail vanishing problem in the vanilla diffusion models; (2) A Fine-Grained Makeup Module (FGMM) is proposed to achieve precise makeup control and combination while retaining face identity. Coupled with DAL, TinyBeauty necessitates merely 80K parameters to achieve a state-of-the-art performance without intricate face prompts. Meanwhile, TinyBeauty achieves a remarkable inference speed of up to 460 fps on the iPhone 13. Extensive experiments show that DAL can produce highly competitive makeup models using only 5 image pairs.
Cartoon Hallucinations Detection: Pose-aware In Context Visual Learning
Authors: Authors: Bumsoo Kim, Wonseop Shin, Kyuchul Lee, Sanghyun Seo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2403.15048
Pdf link: https://arxiv.org/pdf/2403.15048
Abstract Large-scale Text-to-Image (TTI) models have become a common approach for generating training data in various generative fields. However, visual hallucinations, which contain perceptually critical defects, remain a concern, especially in non-photorealistic styles like cartoon characters. We propose a novel visual hallucination detection system for cartoon character images generated by TTI models. Our approach leverages pose-aware in-context visual learning (PA-ICVL) with Vision-Language Models (VLMs), utilizing both RGB images and pose information. By incorporating pose guidance from a fine-tuned pose estimator, we enable VLMs to make more accurate decisions. Experimental results demonstrate significant improvements in identifying visual hallucinations compared to baseline methods relying solely on RGB images. This research advances TTI models by mitigating visual hallucinations, expanding their potential in non-photorealistic domains.
Rethinking 6-Dof Grasp Detection: A Flexible Framework for High-Quality Grasping
Authors: Authors: Wei Tang, Siang Chen, Pengwei Xie, Dingchang Hu, Wenming Yang, Guijin Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2403.15054
Pdf link: https://arxiv.org/pdf/2403.15054
Abstract Robotic grasping is a primitive skill for complex tasks and is fundamental to intelligence. For general 6-Dof grasping, most previous methods directly extract scene-level semantic or geometric information, while few of them consider the suitability for various downstream applications, such as target-oriented grasping. Addressing this issue, we rethink 6-Dof grasp detection from a grasp-centric view and propose a versatile grasp framework capable of handling both scene-level and target-oriented grasping. Our framework, FlexLoG, is composed of a Flexible Guidance Module and a Local Grasp Model. Specifically, the Flexible Guidance Module is compatible with both global (e.g., grasp heatmap) and local (e.g., visual grounding) guidance, enabling the generation of high-quality grasps across various tasks. The Local Grasp Model focuses on object-agnostic regional points and predicts grasps locally and intently. Experiment results reveal that our framework achieves over 18% and 23% improvement on unseen splits of the GraspNet-1Billion Dataset. Furthermore, real-world robotic tests in three distinct settings yield a 95% success rate.
Testing for Fault Diversity in Reinforcement Learning
Authors: Authors: Quentin Mazouni, Helge Spieker, Arnaud Gotlieb, Mathieu Acher
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2403.15065
Pdf link: https://arxiv.org/pdf/2403.15065
Abstract Reinforcement Learning is the premier technique to approach sequential decision problems, including complex tasks such as driving cars and landing spacecraft. Among the software validation and verification practices, testing for functional fault detection is a convenient way to build trustworthiness in the learned decision model. While recent works seek to maximise the number of detected faults, none consider fault characterisation during the search for more diversity. We argue that policy testing should not find as many failures as possible (e.g., inputs that trigger similar car crashes) but rather aim at revealing as informative and diverse faults as possible in the model. In this paper, we explore the use of quality diversity optimisation to solve the problem of fault diversity in policy testing. Quality diversity (QD) optimisation is a type of evolutionary algorithm to solve hard combinatorial optimisation problems where high-quality diverse solutions are sought. We define and address the underlying challenges of adapting QD optimisation to the test of action policies. Furthermore, we compare classical QD optimisers to state-of-the-art frameworks dedicated to policy testing, both in terms of search efficiency and fault diversity. We show that QD optimisation, while being conceptually simple and generally applicable, finds effectively more diverse faults in the decision model, and conclude that QD-based policy testing is a promising approach.
Real-time Threat Detection Strategies for Resource-constrained Devices
Authors: Authors: Mounia Hamidouche, Biniam Fisseha Demissie, Bilel Cherif
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2403.15078
Pdf link: https://arxiv.org/pdf/2403.15078
Abstract As more devices connect to the internet, it becomes crucial to address their limitations and basic security needs. While much research focuses on utilizing ML and DL to tackle security challenges, there is often a tendency to overlook the practicality and feasibility of implementing these methods in real-time settings. This oversight stems from the constrained processing power and memory of certain devices (IoT devices), as well as concerns about the generalizability of these approaches. Focusing on the detection of DNS-tunneling attacks in a router as a case study, we present an end-to-end process designed to effectively address these challenges. The process spans from developing a lightweight DNS-tunneling detection model to integrating it into a resource-constrained device for real-time detection. Through our experiments, we demonstrate that utilizing stateless features for training the ML model, along with features chosen to be independent of the network configuration, leads to highly accurate results. The deployment of this carefully crafted model, optimized for embedded devices across diverse environments, resulted in high DNS-tunneling attack detection with minimal latency. With this work, we aim to encourage solutions that strike a balance between theoretical advancements and the practical applicability of ML approaches in the ever-evolving landscape of device security.
Gradient-based Sampling for Class Imbalanced Semi-supervised Object Detection
Authors: Authors: Jiaming Li, Xiangru Lin, Wei Zhang, Xiao Tan, Yingying Li, Junyu Han, Errui Ding, Jingdong Wang, Guanbin Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.15127
Pdf link: https://arxiv.org/pdf/2403.15127
Abstract Current semi-supervised object detection (SSOD) algorithms typically assume class balanced datasets (PASCAL VOC etc.) or slightly class imbalanced datasets (MS-COCO, etc). This assumption can be easily violated since real world datasets can be extremely class imbalanced in nature, thus making the performance of semi-supervised object detectors far from satisfactory. Besides, the research for this problem in SSOD is severely under-explored. To bridge this research gap, we comprehensively study the class imbalance problem for SSOD under more challenging scenarios, thus forming the first experimental setting for class imbalanced SSOD (CI-SSOD). Moreover, we propose a simple yet effective gradient-based sampling framework that tackles the class imbalance problem from the perspective of two types of confirmation biases. To tackle confirmation bias towards majority classes, the gradient-based reweighting and gradient-based thresholding modules leverage the gradients from each class to fully balance the influence of the majority and minority classes. To tackle the confirmation bias from incorrect pseudo labels of minority classes, the class-rebalancing sampling module resamples unlabeled data following the guidance of the gradient-based reweighting module. Experiments on three proposed sub-tasks, namely MS-COCO, MS-COCO to Object365 and LVIS, suggest that our method outperforms current class imbalanced object detectors by clear margins, serving as a baseline for future research in CI-SSOD. Code will be available at https://github.com/nightkeepers/CI-SSOD.
An In-Depth Analysis of Data Reduction Methods for Sustainable Deep Learning
Authors: Authors: Víctor Toscano-Durán, Javier Perera-Lago, Eduardo Paluzo-Hidalgo, Rocío Gonzalez-Diaz, Miguel Ángel Gutierrez-Naranjo, Matteo Rucco
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.15150
Pdf link: https://arxiv.org/pdf/2403.15150
Abstract In recent years, Deep Learning has gained popularity for its ability to solve complex classification tasks, increasingly delivering better results thanks to the development of more accurate models, the availability of huge volumes of data and the improved computational capabilities of modern computers. However, these improvements in performance also bring efficiency problems, related to the storage of datasets and models, and to the waste of energy and time involved in both the training and inference processes. In this context, data reduction can help reduce energy consumption when training a deep learning model. In this paper, we present up to eight different methods to reduce the size of a tabular training dataset, and we develop a Python package to apply them. We also introduce a representativeness metric based on topology to measure how similar are the reduced datasets and the full training dataset. Additionally, we develop a methodology to apply these data reduction methods to image datasets for object detection tasks. Finally, we experimentally compare how these data reduction methods affect the representativeness of the reduced dataset, the energy consumption and the predictive performance of the model.
Exploring the Task-agnostic Trait of Self-supervised Learning in the Context of Detecting Mental Disorders
Authors: Authors: Rohan Kumar Gupta, Rohit Sinha
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2403.15170
Pdf link: https://arxiv.org/pdf/2403.15170
Abstract Self-supervised learning (SSL) has been investigated to generate task-agnostic representations across various domains. However, such investigation has not been conducted for detecting multiple mental disorders. The rationale behind the existence of a task-agnostic representation lies in the overlapping symptoms among multiple mental disorders. Consequently, the behavioural data collected for mental health assessment may carry a mixed bag of attributes related to multiple disorders. Motivated by that, in this study, we explore a task-agnostic representation derived through SSL in the context of detecting major depressive disorder (MDD) and post-traumatic stress disorder (PTSD) using audio and video data collected during interactive sessions. This study employs SSL models trained by predicting multiple fixed targets or masked frames. We propose a list of fixed targets to make the generated representation more efficient for detecting MDD and PTSD. Furthermore, we modify the hyper-parameters of the SSL encoder predicting fixed targets to generate global representations that capture varying temporal contexts. Both these innovations are noted to yield improved detection performances for considered mental disorders and exhibit task-agnostic traits. In the context of the SSL model predicting masked frames, the generated global representations are also noted to exhibit task-agnostic traits.
CRPlace: Camera-Radar Fusion with BEV Representation for Place Recognition
Authors: Authors: Shaowei Fu, Yifan Duan, Yao Li, Chengzhen Meng, Yingjie Wang, Jianmin Ji, Yanyong Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2403.15183
Pdf link: https://arxiv.org/pdf/2403.15183
Abstract The integration of complementary characteristics from camera and radar data has emerged as an effective approach in 3D object detection. However, such fusion-based methods remain unexplored for place recognition, an equally important task for autonomous systems. Given that place recognition relies on the similarity between a query scene and the corresponding candidate scene, the stationary background of a scene is expected to play a crucial role in the task. As such, current well-designed camera-radar fusion methods for 3D object detection can hardly take effect in place recognition because they mainly focus on dynamic foreground objects. In this paper, a background-attentive camera-radar fusion-based method, named CRPlace, is proposed to generate background-attentive global descriptors from multi-view images and radar point clouds for accurate place recognition. To extract stationary background features effectively, we design an adaptive module that generates the background-attentive mask by utilizing the camera BEV feature and radar dynamic points. With the guidance of a background mask, we devise a bidirectional cross-attention-based spatial fusion strategy to facilitate comprehensive spatial interaction between the background information of the camera BEV feature and the radar BEV feature. As the first camera-radar fusion-based place recognition network, CRPlace has been evaluated thoroughly on the nuScenes dataset. The results show that our algorithm outperforms a variety of baseline methods across a comprehensive set of metrics (recall@1 reaches 91.2%).
SFOD: Spiking Fusion Object Detector
Authors: Authors: Yimeng Fan, Wei Zhang, Changsong Liu, Mingyang Li, Wenrui Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2403.15192
Pdf link: https://arxiv.org/pdf/2403.15192
Abstract Event cameras, characterized by high temporal resolution, high dynamic range, low power consumption, and high pixel bandwidth, offer unique capabilities for object detection in specialized contexts. Despite these advantages, the inherent sparsity and asynchrony of event data pose challenges to existing object detection algorithms. Spiking Neural Networks (SNNs), inspired by the way the human brain codes and processes information, offer a potential solution to these difficulties. However, their performance in object detection using event cameras is limited in current implementations. In this paper, we propose the Spiking Fusion Object Detector (SFOD), a simple and efficient approach to SNN-based object detection. Specifically, we design a Spiking Fusion Module, achieving the first-time fusion of feature maps from different scales in SNNs applied to event cameras. Additionally, through integrating our analysis and experiments conducted during the pretraining of the backbone network on the NCAR dataset, we delve deeply into the impact of spiking decoding strategies and loss functions on model performance. Thereby, we establish state-of-the-art classification results based on SNNs, achieving 93.7\% accuracy on the NCAR dataset. Experimental results on the GEN1 detection dataset demonstrate that the SFOD achieves a state-of-the-art mAP of 32.1\%, outperforming existing SNN-based approaches. Our research not only underscores the potential of SNNs in object detection with event cameras but also propels the advancement of SNNs. Code is available at https://github.com/yimeng-fan/SFOD.
Cryptic Bytes: WebAssembly Obfuscation for Evading Cryptojacking Detection
Authors: Authors: Håkon Harnes, Donn Morrison
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2403.15197
Pdf link: https://arxiv.org/pdf/2403.15197
Abstract WebAssembly has gained significant traction as a high-performance, secure, and portable compilation target for the Web and beyond. However, its growing adoption has also introduced new security challenges. One such threat is cryptojacking, where websites mine cryptocurrencies on visitors' devices without their knowledge or consent, often through the use of WebAssembly. While detection methods have been proposed, research on circumventing them remains limited. In this paper, we present the most comprehensive evaluation of code obfuscation techniques for WebAssembly to date, assessing their effectiveness, detectability, and overhead across multiple abstraction levels. We obfuscate a diverse set of applications, including utilities, games, and crypto miners, using state-of-the-art obfuscation tools like Tigress and wasm-mutate, as well as our novel tool, emcc-obf. Our findings suggest that obfuscation can effectively produce dissimilar WebAssembly binaries, with Tigress proving most effective, followed by emcc-obf and wasm-mutate. The impact on the resulting native code is also significant, although the V8 engine's TurboFan optimizer can reduce native code size by 30\% on average. Notably, we find that obfuscation can successfully evade state-of-the-art cryptojacking detectors. Although obfuscation can introduce substantial performance overheads, we demonstrate how obfuscation can be used for evading detection with minimal overhead in real-world scenarios by strategically applying transformations. These insights are valuable for researchers, providing a foundation for developing more robust detection methods. Additionally, we make our dataset of over 20,000 obfuscated WebAssembly binaries and the emcc-obf tool publicly available to stimulate further research.
DITTO: Demonstration Imitation by Trajectory Transformation
Authors: Authors: Nick Heppert, Max Argus, Tim Welschehold, Thomas Brox, Abhinav Valada
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.15203
Pdf link: https://arxiv.org/pdf/2403.15203
Abstract Teaching robots new skills quickly and conveniently is crucial for the broader adoption of robotic systems. In this work, we address the problem of one-shot imitation from a single human demonstration, given by an RGB-D video recording through a two-stage process. In the first stage which is offline, we extract the trajectory of the demonstration. This entails segmenting manipulated objects and determining their relative motion in relation to secondary objects such as containers. Subsequently, in the live online trajectory generation stage, we first \mbox{re-detect} all objects, then we warp the demonstration trajectory to the current scene, and finally, we trace the trajectory with the robot. To complete these steps, our method makes leverages several ancillary models, including those for segmentation, relative object pose estimation, and grasp prediction. We systematically evaluate different combinations of correspondence and re-detection methods to validate our design decision across a diverse range of tasks. Specifically, we collect demonstrations of ten different tasks including pick-and-place tasks as well as articulated object manipulation. Finally, we perform extensive evaluations on a real robot system to demonstrate the effectiveness and utility of our approach in real-world scenarios. We make the code publicly available at this http URL
MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection
Authors: Authors: Taeheon Kim, Sangyun Chung, Damin Yeom, Youngjoon Yu, Hak Gu Kim, Yong Man Ro
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.15209
Pdf link: https://arxiv.org/pdf/2403.15209
Abstract Multispectral pedestrian detection is attractive for around-the-clock applications due to the complementary information between RGB and thermal modalities. However, current models often fail to detect pedestrians in obvious cases, especially due to the modality bias learned from statistically biased datasets. From these problems, we anticipate that maybe understanding the complementary information itself is difficult to achieve from vision-only models. Accordingly, we propose a novel Multispectral Chain-of-Thought Detection (MSCoTDet) framework, which incorporates Large Language Models (LLMs) to understand the complementary information at the semantic level and further enhance the fusion process. Specifically, we generate text descriptions of the pedestrian in each RGB and thermal modality and design a Multispectral Chain-of-Thought (MSCoT) prompting, which models a step-by-step process to facilitate cross-modal reasoning at the semantic level and perform accurate detection. Moreover, we design a Language-driven Multi-modal Fusion (LMF) strategy that enables fusing vision-driven and language-driven detections. Extensive experiments validate that MSCoTDet improves multispectral pedestrian detection.
InstaSynth: Opportunities and Challenges in Generating Synthetic Instagram Data with ChatGPT for Sponsored Content Detection
Authors: Authors: Thales Bertaglia, Lily Heisig, Rishabh Kaushal, Adriana Iamnitchi
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2403.15214
Pdf link: https://arxiv.org/pdf/2403.15214
Abstract Large Language Models (LLMs) raise concerns about lowering the cost of generating texts that could be used for unethical or illegal purposes, especially on social media. This paper investigates the promise of such models to help enforce legal requirements related to the disclosure of sponsored content online. We investigate the use of LLMs for generating synthetic Instagram captions with two objectives: The first objective (fidelity) is to produce realistic synthetic datasets. For this, we implement content-level and network-level metrics to assess whether synthetic captions are realistic. The second objective (utility) is to create synthetic data that is useful for sponsored content detection. For this, we evaluate the effectiveness of the generated synthetic data for training classifiers to identify undisclosed advertisements on Instagram. Our investigations show that the objectives of fidelity and utility may conflict and that prompt engineering is a useful but insufficient strategy. Additionally, we find that while individual synthetic posts may appear realistic, collectively they lack diversity, topic connectivity, and realistic user interaction patterns.
TriHelper: Zero-Shot Object Navigation with Dynamic Assistance
Authors: Authors: Lingfeng Zhang, Qiang Zhang, Hao Wang, Erjia Xiao, Zixuan Jiang, Honglei Chen, Renjing Xu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2403.15223
Pdf link: https://arxiv.org/pdf/2403.15223
Abstract Navigating toward specific objects in unknown environments without additional training, known as Zero-Shot object navigation, poses a significant challenge in the field of robotics, which demands high levels of auxiliary information and strategic planning. Traditional works have focused on holistic solutions, overlooking the specific challenges agents encounter during navigation such as collision, low exploration efficiency, and misidentification of targets. To address these challenges, our work proposes TriHelper, a novel framework designed to assist agents dynamically through three primary navigation challenges: collision, exploration, and detection. Specifically, our framework consists of three innovative components: (i) Collision Helper, (ii) Exploration Helper, and (iii) Detection Helper. These components work collaboratively to solve these challenges throughout the navigation process. Experiments on the Habitat-Matterport 3D (HM3D) and Gibson datasets demonstrate that TriHelper significantly outperforms all existing baseline methods in Zero-Shot object navigation, showcasing superior success rates and exploration efficiency. Our ablation studies further underscore the effectiveness of each helper in addressing their respective challenges, notably enhancing the agent's navigation capabilities. By proposing TriHelper, we offer a fresh perspective on advancing the object navigation task, paving the way for future research in the domain of Embodied AI and visual-based navigation.
Information Rates of Successive Interference Cancellation for Optical Fiber
Authors: Authors: Alex Jäger, Gerhard Kramer
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2403.15240
Pdf link: https://arxiv.org/pdf/2403.15240
Abstract Successive interference cancellation (SIC) is used to approach the achievable information rates (AIRs) of joint detection and decoding for long-haul optical fiber links. The AIRs of memoryless ring constellations are compared to those of circularly symmetric complex Gaussian modulation for surrogate channel models with correlated phase noise. Simulations are performed for 1000 km of standard single-mode fiber with ideal Raman amplification. In this setup, 32 rings and 16 SIC-stages with Gaussian message-passing receivers achieve the AIR peaks of previous work. The computational complexity scales in proportion to the number of SIC-stages, where one stage has the complexity of separate detection and decoding.
IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection
Authors: Authors: Junbo Yin, Jianbing Shen, Runnan Chen, Wei Li, Ruigang Yang, Pascal Frossard, Wenguan Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.15241
Pdf link: https://arxiv.org/pdf/2403.15241
Abstract Bird's eye view (BEV) representation has emerged as a dominant solution for describing 3D space in autonomous driving scenarios. However, objects in the BEV representation typically exhibit small sizes, and the associated point cloud context is inherently sparse, which leads to great challenges for reliable 3D perception. In this paper, we propose IS-Fusion, an innovative multimodal fusion framework that jointly captures the Instance- and Scene-level contextual information. IS-Fusion essentially differs from existing approaches that only focus on the BEV scene-level fusion by explicitly incorporating instance-level multimodal information, thus facilitating the instance-centric tasks like 3D object detection. It comprises a Hierarchical Scene Fusion (HSF) module and an Instance-Guided Fusion (IGF) module. HSF applies Point-to-Grid and Grid-to-Region transformers to capture the multimodal scene context at different granularities. IGF mines instance candidates, explores their relationships, and aggregates the local multimodal context for each instance. These instances then serve as guidance to enhance the scene feature and yield an instance-aware BEV representation. On the challenging nuScenes benchmark, IS-Fusion outperforms all the published multimodal works to date. Code is available at: https://github.com/yinjunbo/IS-Fusion.
Hyperbolic Metric Learning for Visual Outlier Detection
Authors: Authors: Alvaro Gonzalez-Jimenez, Simone Lionetti, Dena Bazazian, Philippe Gottfrois, Fabian Gröger, Marc Pouly, Alexander Navarini
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.15260
Pdf link: https://arxiv.org/pdf/2403.15260
Abstract Out-Of-Distribution (OOD) detection is critical to deploy deep learning models in safety-critical applications. However, the inherent hierarchical concept structure of visual data, which is instrumental to OOD detection, is often poorly captured by conventional methods based on Euclidean geometry. This work proposes a metric framework that leverages the strengths of Hyperbolic geometry for OOD detection. Inspired by previous works that refine the decision boundary for OOD data with synthetic outliers, we extend this method to Hyperbolic space. Interestingly, we find that synthetic outliers do not benefit OOD detection in Hyperbolic space as they do in Euclidean space. Furthermore we explore the relationship between OOD detection performance and Hyperbolic embedding dimension, addressing practical concerns in resource-constrained environments. Extensive experiments show that our framework improves the FPR95 for OOD detection from 22\% to 15\% and from 49% to 28% on CIFAR-10 and CIFAR-100 respectively compared to Euclidean methods.
CR3DT: Camera-RADAR Fusion for 3D Detection and Tracking
Authors: Authors: Nicolas Baumann, Michael Baumgartner, Edoardo Ghignone, Jonas Kühne, Tobias Fischer, Yung-Hsu Yang, Marc Pollefeys, Michele Magno
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2403.15313
Pdf link: https://arxiv.org/pdf/2403.15313
Abstract Accurate detection and tracking of surrounding objects is essential to enable self-driving vehicles. While Light Detection and Ranging (LiDAR) sensors have set the benchmark for high performance, the appeal of camera-only solutions lies in their cost-effectiveness. Notably, despite the prevalent use of Radio Detection and Ranging (RADAR) sensors in automotive systems, their potential in 3D detection and tracking has been largely disregarded due to data sparsity and measurement noise. As a recent development, the combination of RADARs and cameras is emerging as a promising solution. This paper presents Camera-RADAR 3D Detection and Tracking (CR3DT), a camera-RADAR fusion model for 3D object detection, and Multi-Object Tracking (MOT). Building upon the foundations of the State-of-the-Art (SotA) camera-only BEVDet architecture, CR3DT demonstrates substantial improvements in both detection and tracking capabilities, by incorporating the spatial and velocity information of the RADAR sensor. Experimental results demonstrate an absolute improvement in detection performance of 5.3% in mean Average Precision (mAP) and a 14.9% increase in Average Multi-Object Tracking Accuracy (AMOTA) on the nuScenes dataset when leveraging both modalities. CR3DT bridges the gap between high-performance and cost-effective perception systems in autonomous driving, by capitalizing on the ubiquitous presence of RADAR in automotive applications.
Point-DETR3D: Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection
Authors: Authors: Hongzhi Gao, Zheng Chen, Zehui Chen, Lin Chen, Jiaming Liu, Shanghang Zhang, Feng Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2403.15317
Pdf link: https://arxiv.org/pdf/2403.15317
Abstract Training high-accuracy 3D detectors necessitates massive labeled 3D annotations with 7 degree-of-freedom, which is laborious and time-consuming. Therefore, the form of point annotations is proposed to offer significant prospects for practical applications in 3D detection, which is not only more accessible and less expensive but also provides strong spatial information for object localization.In this paper, we empirically discover that it is non-trivial to merely adapt Point-DETR to its 3D form, encountering two main bottlenecks: 1) it fails to encode strong 3D prior into the model, and 2) it generates low-quality pseudo labels in distant regions due to the extreme sparsity of LiDAR points. To overcome these challenges, we introduce Point-DETR3D, a teacher-student framework for weakly semi-supervised 3D detection, designed to fully capitalize on point-wise supervision within a constrained instance-wise annotation budget.Different from Point-DETR which encodes 3D positional information solely through a point encoder, we propose an explicit positional query initialization strategy to enhance the positional prior. Considering the low quality of pseudo labels at distant regions produced by the teacher model, we enhance the detector's perception by incorporating dense imagery data through a novel Cross-Modal Deformable RoI Fusion (D-RoI).Moreover, an innovative point-guided self-supervised learning technique is proposed to allow for fully exploiting point priors, even in student models.Extensive experiments on representative nuScenes dataset demonstrate our Point-DETR3D obtains significant improvements compared to previous works. Notably, with only 5% of labeled data, Point-DETR3D achieves over 90% performance of its fully supervised counterpart.
Gesture-Controlled Aerial Robot Formation for Human-Swarm Interaction in Safety Monitoring Applications
Authors: Authors: Vít Krátký, Giuseppe Silano, Matouš Vrba, Christos Papaioannidis, Ioannis Mademlis, Robert Pěnička, Ioannis Pitas, Martin Saska
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2403.15333
Pdf link: https://arxiv.org/pdf/2403.15333
Abstract This paper presents a formation control approach for contactless gesture-based Human-Swarm Interaction (HSI) between a team of multi-rotor Unmanned Aerial Vehicles (UAVs) and a human worker. The approach is intended for monitoring the safety of human workers, especially those working at heights. In the proposed dynamic formation scheme, one UAV acts as the leader of the formation and is equipped with sensors for human worker detection and gesture recognition. The follower UAVs maintain a predetermined formation relative to the worker's position, thereby providing additional perspectives of the monitored scene. Hand gestures allow the human worker to specify movements and action commands for the UAV team and initiate other mission-related commands without the need for an additional communication channel or specific markers. Together with a novel unified human detection and tracking algorithm, human pose estimation approach and gesture detection pipeline, the proposed approach forms a first instance of an HSI system incorporating all these modules onboard real-world UAVs. Simulations and field experiments with three UAVs and a human worker in a mock-up scenario showcase the effectiveness and responsiveness of the proposed approach.
Towards Knowledge-Grounded Natural Language Understanding and Generation
Authors: Authors: Chenxi Whitehouse
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2403.15364
Pdf link: https://arxiv.org/pdf/2403.15364
Abstract This thesis investigates how natural language understanding and generation with transformer models can benefit from grounding the models with knowledge representations and addresses the following key research questions: (i) Can knowledge of entities extend its benefits beyond entity-centric tasks, such as entity linking? (ii) How can we faithfully and effectively extract such structured knowledge from raw text, especially noisy web text? (iii) How do other types of knowledge, beyond structured knowledge, contribute to improving NLP tasks? Studies in this thesis find that incorporating relevant and up-to-date knowledge of entities benefits fake news detection, and entity-focused code-switching significantly enhances zero-shot cross-lingual transfer on entity-centric tasks. In terms of effective and faithful approaches to extracting structured knowledge, it is observed that integrating negative examples and training with entity planning significantly improves performance. Additionally, it is established that other general forms of knowledge, such as parametric and distilled knowledge, enhance multimodal and multilingual knowledge-intensive tasks. This research shows the tangible benefits of diverse knowledge integration and motivates further exploration in this direction.
A Transfer Attack to Image Watermarks
Authors: Authors: Yuepeng Hu, Zhengyuan Jiang, Moyang Guo, Neil Gong
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.15365
Pdf link: https://arxiv.org/pdf/2403.15365
Abstract Watermark has been widely deployed by industry to detect AI-generated images. The robustness of such watermark-based detector against evasion attacks in the white-box and black-box settings is well understood in the literature. However, the robustness in the no-box setting is much less understood. In particular, multiple studies claimed that image watermark is robust in such setting. In this work, we propose a new transfer evasion attack to image watermark in the no-box setting. Our transfer attack adds a perturbation to a watermarked image to evade multiple surrogate watermarking models trained by the attacker itself, and the perturbed watermarked image also evades the target watermarking model. Our major contribution is to show that, both theoretically and empirically, watermark-based AI-generated image detector is not robust to evasion attacks even if the attacker does not have access to the watermarking model nor the detection API.
Keyword: face recognition

KeyPoint Relative Position Encoding for Face Recognition
Authors: Authors: Minchul Kim, Yiyang Su, Feng Liu, Anil Jain, Xiaoming Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.14852
Pdf link: https://arxiv.org/pdf/2403.14852
Abstract In this paper, we address the challenge of making ViT models more robust to unseen affine transformations. Such robustness becomes useful in various recognition tasks such as face recognition when image alignment failures occur. We propose a novel method called KP-RPE, which leverages key points (e.g.~facial landmarks) to make ViT more resilient to scale, translation, and pose variations. We begin with the observation that Relative Position Encoding (RPE) is a good way to bring affine transform generalization to ViTs. RPE, however, can only inject the model with prior knowledge that nearby pixels are more important than far pixels. Keypoint RPE (KP-RPE) is an extension of this principle, where the significance of pixels is not solely dictated by their proximity but also by their relative positions to specific keypoints within the image. By anchoring the significance of pixels around keypoints, the model can more effectively retain spatial relationships, even when those relationships are disrupted by affine transformations. We show the merit of KP-RPE in face and gait recognition. The experimental results demonstrate the effectiveness in improving face recognition performance from low-quality images, particularly where alignment is prone to failure. Code and pre-trained models are available.
Cell Variational Information Bottleneck Network
Authors: Authors: Zhonghua Zhai, Chen Ju, Jinsong Lan, Shuai Xiao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.15082
Pdf link: https://arxiv.org/pdf/2403.15082
Abstract In this work, we propose Cell Variational Information Bottleneck Network (cellVIB), a convolutional neural network using information bottleneck mechanism, which can be combined with the latest feedforward network architecture in an end-to-end training method. Our Cell Variational Information Bottleneck Network is constructed by stacking VIB cells, which generate feature maps with uncertainty. As layers going deeper, the regularization effect will gradually increase, instead of directly adding excessive regular constraints to the output layer of the model as in Deep VIB. Under each VIB cell, the feedforward process learns an independent mean term and an standard deviation term, and predicts the Gaussian distribution based on them. The feedback process is based on reparameterization trick for effective training. This work performs an extensive analysis on MNIST dataset to verify the effectiveness of each VIB cells, and provides an insightful analysis on how the VIB cells affect mutual information. Experiments conducted on CIFAR-10 also prove that our cellVIB is robust against noisy labels during training and against corrupted images during testing. Then, we validate our method on PACS dataset, whose results show that the VIB cells can significantly improve the generalization performance of the basic model. Finally, in a more complex representation learning task, face recognition, our network structure has also achieved very competitive results.
Keyword: augmentation

Robust Model Based Reinforcement Learning Using $\mathcal{L}_1$ Adaptive Control
Authors: Authors: Minjun Sung, Sambhu H. Karumanchi, Aditya Gahlawat, Naira Hovakimyan
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.14860
Pdf link: https://arxiv.org/pdf/2403.14860
Abstract We introduce $\mathcal{L}_1$-MBRL, a control-theoretic augmentation scheme for Model-Based Reinforcement Learning (MBRL) algorithms. Unlike model-free approaches, MBRL algorithms learn a model of the transition function using data and use it to design a control input. Our approach generates a series of approximate control-affine models of the learned transition function according to the proposed switching law. Using the approximate model, control input produced by the underlying MBRL is perturbed by the $\mathcal{L}_1$ adaptive control, which is designed to enhance the robustness of the system against uncertainties. Importantly, this approach is agnostic to the choice of MBRL algorithm, enabling the use of the scheme with various MBRL algorithms. MBRL algorithms with $\mathcal{L}_1$ augmentation exhibit enhanced performance and sample efficiency across multiple MuJoCo environments, outperforming the original MBRL algorithms, both with and without system noise.
Addressing Concept Shift in Online Time Series Forecasting: Detect-then-Adapt
Authors: Authors: YiFan Zhang, Weiqi Chen, Zhaoyang Zhu, Dalin Qin, Liang Sun, Xue Wang, Qingsong Wen, Zhang Zhang, Liang Wang, Rong Jin
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.14949
Pdf link: https://arxiv.org/pdf/2403.14949
Abstract Online updating of time series forecasting models aims to tackle the challenge of concept drifting by adjusting forecasting models based on streaming data. While numerous algorithms have been developed, most of them focus on model design and updating. In practice, many of these methods struggle with continuous performance regression in the face of accumulated concept drifts over time. To address this limitation, we present a novel approach, Concept \textbf{D}rift \textbf{D}etection an\textbf{D} \textbf{A}daptation (D3A), that first detects drifting conception and then aggressively adapts the current model to the drifted concepts after the detection for rapid adaption. To best harness the utility of historical data for model adaptation, we propose a data augmentation strategy introducing Gaussian noise into existing training instances. It helps mitigate the data distribution gap, a critical factor contributing to train-test performance inconsistency. The significance of our data augmentation process is verified by our theoretical analysis. Our empirical studies across six datasets demonstrate the effectiveness of D3A in improving model adaptation capability. Notably, compared to a simple Temporal Convolutional Network (TCN) baseline, D3A reduces the average Mean Squared Error (MSE) by $43.9\%$. For the state-of-the-art (SOTA) model, the MSE is reduced by $33.3\%$.
Cell Tracking according to Biological Needs -- Strong Mitosis-aware Random-finite Sets Tracker with Aleatoric Uncertainty
Authors: Authors: Timo Kaiser, Maximilian Schier, Bodo Rosenhahn
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.15011
Pdf link: https://arxiv.org/pdf/2403.15011
Abstract Cell tracking and segmentation assist biologists in extracting insights from large-scale microscopy time-lapse data. Driven by local accuracy metrics, current tracking approaches often suffer from a lack of long-term consistency. To address this issue, we introduce an uncertainty estimation technique for neural tracking-by-regression frameworks and incorporate it into our novel extended Poisson multi-Bernoulli mixture tracker. Our uncertainty estimation identifies uncertain associations within high-performing tracking-by-regression methods using problem-specific test-time augmentations. Leveraging this uncertainty, along with a novel mitosis-aware assignment problem formulation, our tracker resolves false associations and mitosis detections stemming from long-term conflicts. We evaluate our approach on nine competitive datasets and demonstrate that it outperforms the current state-of-the-art on biologically relevant metrics substantially, achieving improvements by a factor of approximately $5.75$. Furthermore, we uncover new insights into the behavior of tracking-by-regression uncertainty.
Vehicle Detection Performance in Nordic Region
Authors: Authors: Hamam Mokayed, Rajkumar Saini, Oluwatosin Adewumi, Lama Alkhaled, Bjorn Backe, Palaiahnakote Shivakumara, Olle Hagner, Yan Chai Hum
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.15017
Pdf link: https://arxiv.org/pdf/2403.15017
Abstract This paper addresses the critical challenge of vehicle detection in the harsh winter conditions in the Nordic regions, characterized by heavy snowfall, reduced visibility, and low lighting. Due to their susceptibility to environmental distortions and occlusions, traditional vehicle detection methods have struggled in these adverse conditions. The advanced proposed deep learning architectures brought promise, yet the unique difficulties of detecting vehicles in Nordic winters remain inadequately addressed. This study uses the Nordic Vehicle Dataset (NVD), which has UAV images from northern Sweden, to evaluate the performance of state-of-the-art vehicle detection algorithms under challenging weather conditions. Our methodology includes a comprehensive evaluation of single-stage, two-stage, and transformer-based detectors against the NVD. We propose a series of enhancements tailored to each detection framework, including data augmentation, hyperparameter tuning, transfer learning, and novel strategies designed explicitly for the DETR model. Our findings not only highlight the limitations of current detection systems in the Nordic environment but also offer promising directions for enhancing these algorithms for improved robustness and accuracy in vehicle detection amidst the complexities of winter landscapes. The code and the dataset are available at https://nvd.ltu-ai.dev
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
Authors: Authors: Nicholas Lee, Thanakul Wattanawong, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipali, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2403.15042
Pdf link: https://arxiv.org/pdf/2403.15042
Abstract Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. While many real-world applications still require fine-tuning to reach satisfactory levels of performance, many of them are in the low-data regime, making fine-tuning challenging. To address this, we propose LLM2LLM, a targeted and iterative data augmentation strategy that uses a teacher LLM to enhance a small seed dataset by augmenting additional data that can be used for fine-tuning on a specific task. LLM2LLM (1) fine-tunes a baseline student LLM on the initial seed data, (2) evaluates and extracts data points that the model gets wrong, and (3) uses a teacher LLM to generate synthetic data based on these incorrect data points, which are then added back into the training data. This approach amplifies the signal from incorrectly predicted data points by the LLM during training and reintegrates them into the dataset to focus on more challenging examples for the LLM. Our results show that LLM2LLM significantly enhances the performance of LLMs in the low-data regime, outperforming both traditional fine-tuning and other data augmentation baselines. LLM2LLM reduces the dependence on labor-intensive data curation and paves the way for more scalable and performant LLM solutions, allowing us to tackle data-constrained domains and tasks. We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime using a LLaMA2-7B student model.
Transfer CLIP for Generalizable Image Denoising
Authors: Authors: Jun Cheng, Dong Liang, Shan Tan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2403.15132
Pdf link: https://arxiv.org/pdf/2403.15132
Abstract Image denoising is a fundamental task in computer vision. While prevailing deep learning-based supervised and self-supervised methods have excelled in eliminating in-distribution noise, their susceptibility to out-of-distribution (OOD) noise remains a significant challenge. The recent emergence of contrastive language-image pre-training (CLIP) model has showcased exceptional capabilities in open-world image recognition and segmentation. Yet, the potential for leveraging CLIP to enhance the robustness of low-level tasks remains largely unexplored. This paper uncovers that certain dense features extracted from the frozen ResNet image encoder of CLIP exhibit distortion-invariant and content-related properties, which are highly desirable for generalizable denoising. Leveraging these properties, we devise an asymmetrical encoder-decoder denoising network, which incorporates dense features including the noisy image and its multi-scale features from the frozen ResNet encoder of CLIP into a learnable image decoder to achieve generalizable denoising. The progressive feature augmentation strategy is further proposed to mitigate feature overfitting and improve the robustness of the learnable decoder. Extensive experiments and comparisons conducted across diverse OOD noises, including synthetic noise, real-world sRGB noise, and low-dose CT image noise, demonstrate the superior generalization ability of our method.
Your Image is My Video: Reshaping the Receptive Field via Image-To-Video Differentiable AutoAugmentation and Fusion
Authors: Authors: Sofia Casarin, Cynthia I. Ugwu, Sergio Escalera, Oswald Lanz
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.15194
Pdf link: https://arxiv.org/pdf/2403.15194
Abstract The landscape of deep learning research is moving towards innovative strategies to harness the true potential of data. Traditionally, emphasis has been on scaling model architectures, resulting in large and complex neural networks, which can be difficult to train with limited computational resources. However, independently of the model size, data quality (i.e. amount and variability) is still a major factor that affects model generalization. In this work, we propose a novel technique to exploit available data through the use of automatic data augmentation for the tasks of image classification and semantic segmentation. We introduce the first Differentiable Augmentation Search method (DAS) to generate variations of images that can be processed as videos. Compared to previous approaches, DAS is extremely fast and flexible, allowing the search on very large search spaces in less than a GPU day. Our intuition is that the increased receptive field in the temporal dimension provided by DAS could lead to benefits also to the spatial receptive field. More specifically, we leverage DAS to guide the reshaping of the spatial receptive field by selecting task-dependant transformations. As a result, compared to standard augmentation alternatives, we improve in terms of accuracy on ImageNet, Cifar10, Cifar100, Tiny-ImageNet, Pascal-VOC-2012 and CityScapes datasets when plugging-in our DAS over different light-weight video backbones.

LeeKyungwook / get-arxiv-noti

New submissions for Mon, 25 Mar 24 #1032

Keyword: detection

A Synergistic Approach to Wildfire Prevention and Management Using AI, ML, and 5G Technology in the United States

Towards a Framework for Deep Learning Certification in Safety-Critical Applications Using Inherently Safe Design and Run-Time Error Detection

An AIC-based approach for articulating unpredictable problems in open complex environments

Rule based Complex Event Processing for an Air Quality Monitoring System in Smart City

Safeguarding Marketing Research: The Generation, Identification, and Mitigation of AI-Fabricated Disinformation

Synergy of Information in Multimodal IoT Systems -- Discovering the impact of daily behaviour routines on physical activity level

Human-in-the-Loop AI for Cheating Ring Detection

Bypassing LLM Watermarks with Color-Aware Substitutions

A task of anomaly detection for a smart satellite Internet of things system

Relaxed Clique Percolation and Disinformation-Resilient Domains for Social Commerce Networks

Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering

Preventing Catastrophic Forgetting through Memory Networks in Continuous Detection

Deep Active Learning: A Reality Check

DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation

Unraveling Contagion Origins: Optimal Estimation through Maximum-Likelihood and Starlike Tree Approximation in Markovian Spreading Models

Stance Reasoner: Zero-Shot Stance Detection on Social Media with Explicit Reasoning

Investigating Bias in LLM-Based Bias Detection: Disparities between LLMs and Human Perception

Web-based Melanoma Detection

Addressing Concept Shift in Online Time Series Forecasting: Detect-then-Adapt

AVT2-DWF: Improving Deepfake Detection with Audio-Visual Fusion and Dynamic Weighting Strategies

MasonTigers at SemEval-2024 Task 8: Performance Analysis of Transformer-based Models on Machine-Generated Text Detection

Cell Tracking according to Biological Needs -- Strong Mitosis-aware Random-finite Sets Tracker with Aleatoric Uncertainty

Extracting Human Attention through Crowdsourced Patch Labeling

Vehicle Detection Performance in Nordic Region

VRSO: Visual-Centric Reconstruction for Static Object Annotation

An Integrated Neighborhood and Scale Information Network for Open-Pit Mine Change Detection in High-Resolution Remote Sensing Images

Toward Tiny and High-quality Facial Makeup with Data Amplify Learning

Cartoon Hallucinations Detection: Pose-aware In Context Visual Learning

Rethinking 6-Dof Grasp Detection: A Flexible Framework for High-Quality Grasping

Testing for Fault Diversity in Reinforcement Learning

Real-time Threat Detection Strategies for Resource-constrained Devices

Gradient-based Sampling for Class Imbalanced Semi-supervised Object Detection

An In-Depth Analysis of Data Reduction Methods for Sustainable Deep Learning

Exploring the Task-agnostic Trait of Self-supervised Learning in the Context of Detecting Mental Disorders

CRPlace: Camera-Radar Fusion with BEV Representation for Place Recognition

SFOD: Spiking Fusion Object Detector

Cryptic Bytes: WebAssembly Obfuscation for Evading Cryptojacking Detection

DITTO: Demonstration Imitation by Trajectory Transformation

MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection

InstaSynth: Opportunities and Challenges in Generating Synthetic Instagram Data with ChatGPT for Sponsored Content Detection

TriHelper: Zero-Shot Object Navigation with Dynamic Assistance

Information Rates of Successive Interference Cancellation for Optical Fiber

IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection

Hyperbolic Metric Learning for Visual Outlier Detection

CR3DT: Camera-RADAR Fusion for 3D Detection and Tracking

Point-DETR3D: Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection

Gesture-Controlled Aerial Robot Formation for Human-Swarm Interaction in Safety Monitoring Applications

Towards Knowledge-Grounded Natural Language Understanding and Generation

A Transfer Attack to Image Watermarks

Keyword: face recognition

KeyPoint Relative Position Encoding for Face Recognition

Cell Variational Information Bottleneck Network

Keyword: augmentation

Robust Model Based Reinforcement Learning Using $\mathcal{L}_1$ Adaptive Control

Addressing Concept Shift in Online Time Series Forecasting: Detect-then-Adapt

Cell Tracking according to Biological Needs -- Strong Mitosis-aware Random-finite Sets Tracker with Aleatoric Uncertainty

Vehicle Detection Performance in Nordic Region

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

Transfer CLIP for Generalizable Image Denoising

Your Image is My Video: Reshaping the Receptive Field via Image-To-Video Differentiable AutoAugmentation and Fusion