New submissions for Wed, 27 Mar 24

Keyword: detection

SynFog: A Photo-realistic Synthetic Fog Dataset based on End-to-end Imaging Simulation for Advancing Real-World Defogging in Autonomous Driving

Authors: Authors: Yiming Xie, Henglu Wei, Zhenyi Liu, Xiaoyu Wang, Xiangyang Ji
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.17094
Pdf link: https://arxiv.org/pdf/2403.17094
Abstract To advance research in learning-based defogging algorithms, various synthetic fog datasets have been developed. However, existing datasets created using the Atmospheric Scattering Model (ASM) or real-time rendering engines often struggle to produce photo-realistic foggy images that accurately mimic the actual imaging process. This limitation hinders the effective generalization of models from synthetic to real data. In this paper, we introduce an end-to-end simulation pipeline designed to generate photo-realistic foggy images. This pipeline comprehensively considers the entire physically-based foggy scene imaging process, closely aligning with real-world image capture methods. Based on this pipeline, we present a new synthetic fog dataset named SynFog, which features both sky light and active lighting conditions, as well as three levels of fog density. Experimental results demonstrate that models trained on SynFog exhibit superior performance in visual perception and detection accuracy compared to others when applied to real-world foggy images.
Task-Agnostic Detector for Insertion-Based Backdoor Attacks
Authors: Authors: Weimin Lyu, Xiao Lin, Songzhu Zheng, Lu Pang, Haibin Ling, Susmit Jha, Chao Chen
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2403.17155
Pdf link: https://arxiv.org/pdf/2403.17155
Abstract Textual backdoor attacks pose significant security threats. Current detection approaches, typically relying on intermediate feature representation or reconstructing potential triggers, are task-specific and less effective beyond sentence classification, struggling with tasks like question answering and named entity recognition. We introduce TABDet (Task-Agnostic Backdoor Detector), a pioneering task-agnostic method for backdoor detection. TABDet leverages final layer logits combined with an efficient pooling technique, enabling unified logit representation across three prominent NLP tasks. TABDet can jointly learn from diverse task-specific models, demonstrating superior detection efficacy over traditional task-specific methods.
LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning
Authors: Authors: Siyuan Cheng, Guanhong Tao, Yingqi Liu, Guangyu Shen, Shengwei An, Shiwei Feng, Xiangzhe Xu, Kaiyuan Zhang, Shiqing Ma, Xiangyu Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2403.17188
Pdf link: https://arxiv.org/pdf/2403.17188
Abstract Backdoor attack poses a significant security threat to Deep Learning applications. Existing attacks are often not evasive to established backdoor detection techniques. This susceptibility primarily stems from the fact that these attacks typically leverage a universal trigger pattern or transformation function, such that the trigger can cause misclassification for any input. In response to this, recent papers have introduced attacks using sample-specific invisible triggers crafted through special transformation functions. While these approaches manage to evade detection to some extent, they reveal vulnerability to existing backdoor mitigation techniques. To address and enhance both evasiveness and resilience, we introduce a novel backdoor attack LOTUS. Specifically, it leverages a secret function to separate samples in the victim class into a set of partitions and applies unique triggers to different partitions. Furthermore, LOTUS incorporates an effective trigger focusing mechanism, ensuring only the trigger corresponding to the partition can induce the backdoor behavior. Extensive experimental results show that LOTUS can achieve high attack success rate across 4 datasets and 7 model structures, and effectively evading 13 backdoor detection and mitigation techniques. The code is available at https://github.com/Megum1/LOTUS.
A Comprehensive Study of the Capabilities of Large Language Models for Vulnerability Detection
Authors: Authors: Benjamin Steenhoek, Md Mahbubur Rahman, Monoshi Kumar Roy, Mirza Sanjida Alam, Earl T. Barr, Wei Le
Subjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.17218
Pdf link: https://arxiv.org/pdf/2403.17218
Abstract Large Language Models (LLMs) have demonstrated great potential for code generation and other software engineering tasks. Vulnerability detection is of crucial importance to maintaining the security, integrity, and trustworthiness of software systems. Precise vulnerability detection requires reasoning about the code, making it a good case study for exploring the limits of LLMs' reasoning capabilities. Although recent work has applied LLMs to vulnerability detection using generic prompting techniques, their full capabilities for this task and the types of errors they make when explaining identified vulnerabilities remain unclear. In this paper, we surveyed eleven LLMs that are state-of-the-art in code generation and commonly used as coding assistants, and evaluated their capabilities for vulnerability detection. We systematically searched for the best-performing prompts, incorporating techniques such as in-context learning and chain-of-thought, and proposed three of our own prompting methods. Our results show that while our prompting methods improved the models' performance, LLMs generally struggled with vulnerability detection. They reported 0.5-0.63 Balanced Accuracy and failed to distinguish between buggy and fixed versions of programs in 76% of cases on average. By comprehensively analyzing and categorizing 287 instances of model reasoning, we found that 57% of LLM responses contained errors, and the models frequently predicted incorrect locations of buggy code and misidentified bug types. LLMs only correctly localized 6 out of 27 bugs in DbgBench, and these 6 bugs were predicted correctly by 70-100% of human participants. These findings suggest that despite their potential for other tasks, LLMs may fail to properly comprehend critical code structures and security-related concepts. Our data and code are available at https://figshare.com/s/78fe02e56e09ec49300b.
SPLICE: A Singleton-Enhanced PipeLIne for Coreference REsolution
Authors: Authors: Yilun Zhu, Siyao Peng, Sameer Pradhan, Amir Zeldes
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2403.17245
Pdf link: https://arxiv.org/pdf/2403.17245
Abstract Singleton mentions, i.e.~entities mentioned only once in a text, are important to how humans understand discourse from a theoretical perspective. However previous attempts to incorporate their detection in end-to-end neural coreference resolution for English have been hampered by the lack of singleton mention spans in the OntoNotes benchmark. This paper addresses this limitation by combining predicted mentions from existing nested NER systems and features derived from OntoNotes syntax trees. With this approach, we create a near approximation of the OntoNotes dataset with all singleton mentions, achieving ~94% recall on a sample of gold singletons. We then propose a two-step neural mention and coreference resolution system, named SPLICE, and compare its performance to the end-to-end approach in two scenarios: the OntoNotes test set and the out-of-domain (OOD) OntoGUM corpus. Results indicate that reconstructed singleton training yields results comparable to end-to-end systems for OntoNotes, while improving OOD stability (+1.1 avg. F1). We conduct error analysis for mention detection and delve into its impact on coreference clustering, revealing that precision improvements deliver more substantial benefits than increases in recall for resolving coreference chains.
Staircase Localization for Autonomous Exploration in Urban Environments
Authors: Authors: Jinrae Kim, Sunggoo Jung, Sung-Kyun Kim, Youdan Kim, Ali-akbar Agha-mohammadi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.17330
Pdf link: https://arxiv.org/pdf/2403.17330
Abstract A staircase localization method is proposed for robots to explore urban environments autonomously. The proposed method employs a modular design in the form of a cascade pipeline consisting of three modules of stair detection, line segment detection, and stair localization modules. The stair detection module utilizes an object detection algorithm based on deep learning to generate a region of interest (ROI). From the ROI, line segment features are extracted using a deep line segment detection algorithm. The extracted line segments are used to localize a staircase in terms of position, orientation, and stair direction. The stair detection and localization are performed only with a single RGB-D camera. Each component of the proposed pipeline does not need to be designed particularly for staircases, which makes it easy to maintain the whole pipeline and replace each component with state-of-the-art deep learning detection techniques. The results of real-world experiments show that the proposed method can perform accurate stair detection and localization during autonomous exploration for various structured and unstructured upstairs and downstairs with shadows, dirt, and occlusions by artificial and natural objects.
FedMIL: Federated-Multiple Instance Learning for Video Analysis with Optimized DPP Scheduling
Authors: Authors: Ashish Bastola, Hao Wang, Xiwen Chen, Abolfazl Razi
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2403.17331
Pdf link: https://arxiv.org/pdf/2403.17331
Abstract Many AI platforms, including traffic monitoring systems, use Federated Learning (FL) for decentralized sensor data processing for learning-based applications while preserving privacy and ensuring secured information transfer. On the other hand, applying supervised learning to large data samples, like high-resolution images requires intensive human labor to label different parts of a data sample. Multiple Instance Learning (MIL) alleviates this challenge by operating over labels assigned to the 'bag' of instances. In this paper, we introduce Federated Multiple-Instance Learning (FedMIL). This framework applies federated learning to boost the training performance in video-based MIL tasks such as vehicle accident detection using distributed CCTV networks. However, data sources in decentralized settings are not typically Independently and Identically Distributed (IID), making client selection imperative to collectively represent the entire dataset with minimal clients. To address this challenge, we propose DPPQ, a framework based on the Determinantal Point Process (DPP) with a quality-based kernel to select clients with the most diverse datasets that achieve better performance compared to both random selection and current DPP-based client selection methods even with less data utilization in the majority of non-IID cases. This offers a significant advantage for deployment on edge devices with limited computational resources, providing a reliable solution for training AI models in massive smart sensor networks.
AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving
Authors: Authors: Mingfu Liang, Jong-Chyi Su, Samuel Schulter, Sparsh Garg, Shiyu Zhao, Ying Wu, Manmohan Chandraker
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.17373
Pdf link: https://arxiv.org/pdf/2403.17373
Abstract Autonomous vehicle (AV) systems rely on robust perception models as a cornerstone of safety assurance. However, objects encountered on the road exhibit a long-tailed distribution, with rare or unseen categories posing challenges to a deployed perception model. This necessitates an expensive process of continuously curating and annotating data with significant human effort. We propose to leverage recent advances in vision-language and large language models to design an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios. This process operates iteratively, allowing for continuous self-improvement of the model. We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
Decoupled Pseudo-labeling for Semi-Supervised Monocular 3D Object Detection
Authors: Authors: Jiacheng Zhang, Jiaming Li, Xiangru Lin, Wei Zhang, Xiao Tan, Junyu Han, Errui Ding, Jingdong Wang, Guanbin Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.17387
Pdf link: https://arxiv.org/pdf/2403.17387
Abstract We delve into pseudo-labeling for semi-supervised monocular 3D object detection (SSM3OD) and discover two primary issues: a misalignment between the prediction quality of 3D and 2D attributes and the tendency of depth supervision derived from pseudo-labels to be noisy, leading to significant optimization conflicts with other reliable forms of supervision. We introduce a novel decoupled pseudo-labeling (DPL) approach for SSM3OD. Our approach features a Decoupled Pseudo-label Generation (DPG) module, designed to efficiently generate pseudo-labels by separately processing 2D and 3D attributes. This module incorporates a unique homography-based method for identifying dependable pseudo-labels in BEV space, specifically for 3D attributes. Additionally, we present a DepthGradient Projection (DGP) module to mitigate optimization conflicts caused by noisy depth supervision of pseudo-labels, effectively decoupling the depth gradient and removing conflicting gradients. This dual decoupling strategy-at both the pseudo-label generation and gradient levels-significantly improves the utilization of pseudo-labels in SSM3OD. Our comprehensive experiments on the KITTI benchmark demonstrate the superiority of our method over existing approaches.
SSF3D: Strict Semi-Supervised 3D Object Detection with Switching Filter
Authors: Authors: Songbur Wong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.17390
Pdf link: https://arxiv.org/pdf/2403.17390
Abstract SSF3D modified the semi-supervised 3D object detection (SS3DOD) framework, which designed specifically for point cloud data. Leveraging the characteristics of non-coincidence and weak correlation of target objects in point cloud, we adopt a strategy of retaining only the truth-determining pseudo labels and trimming the other fuzzy labels with points, instead of pursuing a balance between the quantity and quality of pseudo labels. Besides, we notice that changing the filter will make the model meet different distributed targets, which is beneficial to break the training bottleneck. Two mechanism are introduced to achieve above ideas: strict threshold and filter switching. The experiments are conducted to analyze the effectiveness of above approaches and their impact on the overall performance of the system. Evaluating on the KITTI dataset, SSF3D exhibits superior performance compared to the current state-of-the-art methods. The code will be released here.
LaRE^2: Latent Reconstruction Error Based Method for Diffusion-Generated Image Detection
Authors: Authors: Yunpeng Luo, Junlong Du, Ke Yan, Shouhong Ding
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2403.17465
Pdf link: https://arxiv.org/pdf/2403.17465
Abstract The evolution of Diffusion Models has dramatically improved image generation quality, making it increasingly difficult to differentiate between real and generated images. This development, while impressive, also raises significant privacy and security concerns. In response to this, we propose a novel Latent REconstruction error guided feature REfinement method (LaRE^2) for detecting the diffusion-generated images. We come up with the Latent Reconstruction Error (LaRE), the first reconstruction-error based feature in the latent space for generated image detection. LaRE surpasses existing methods in terms of feature extraction efficiency while preserving crucial cues required to differentiate between the real and the fake. To exploit LaRE, we propose an Error-Guided feature REfinement module (EGRE), which can refine the image feature guided by LaRE to enhance the discriminativeness of the feature. Our EGRE utilizes an align-then-refine mechanism, which effectively refines the image feature for generated-image detection from both spatial and channel perspectives. Extensive experiments on the large-scale GenImage benchmark demonstrate the superiority of our LaRE^2, which surpasses the best SoTA method by up to 11.9%/12.1% average ACC/AP across 8 different image generators. LaRE also surpasses existing methods in terms of feature extraction cost, delivering an impressive speed enhancement of 8 times.
Natural Language Requirements Testability Measurement Based on Requirement Smells
Authors: Authors: Morteza Zakeri-Nasrabadi, Saeed Parsa
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.17479
Pdf link: https://arxiv.org/pdf/2403.17479
Abstract Requirements form the basis for defining software systems' obligations and tasks. Testable requirements help prevent failures, reduce maintenance costs, and make it easier to perform acceptance tests. However, despite the importance of measuring and quantifying requirements testability, no automatic approach for measuring requirements testability has been proposed based on the requirements smells, which are at odds with the requirements testability. This paper presents a mathematical model to evaluate and rank the natural language requirements testability based on an extensive set of nine requirements smells, detected automatically, and acceptance test efforts determined by requirement length and its application domain. Most of the smells stem from uncountable adjectives, context-sensitive, and ambiguous words. A comprehensive dictionary is required to detect such words. We offer a neural word-embedding technique to generate such a dictionary automatically. Using the dictionary, we could automatically detect Polysemy smell (domain-specific ambiguity) for the first time in 10 application domains. Our empirical study on nearly 1000 software requirements from six well-known industrial and academic projects demonstrates that the proposed smell detection approach outperforms Smella, a state-of-the-art tool, in detecting requirements smells. The precision and recall of smell detection are improved with an average of 0.03 and 0.33, respectively, compared to the state-of-the-art. The proposed requirement testability model measures the testability of 985 requirements with a mean absolute error of 0.12 and a mean squared error of 0.03, demonstrating the model's potential for practical use.
FaultGuard: A Generative Approach to Resilient Fault Prediction in Smart Electrical Grids
Authors: Authors: Emad Efatinasab, Francesco Marchiori, Alessandro Brighente, Mirco Rampazzo, Mauro Conti
Subjects: Cryptography and Security (cs.CR); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2403.17494
Pdf link: https://arxiv.org/pdf/2403.17494
Abstract Predicting and classifying faults in electricity networks is crucial for uninterrupted provision and keeping maintenance costs at a minimum. Thanks to the advancements in the field provided by the smart grid, several data-driven approaches have been proposed in the literature to tackle fault prediction tasks. Implementing these systems brought several improvements, such as optimal energy consumption and quick restoration. Thus, they have become an essential component of the smart grid. However, the robustness and security of these systems against adversarial attacks have not yet been extensively investigated. These attacks can impair the whole grid and cause additional damage to the infrastructure, deceiving fault detection systems and disrupting restoration. In this paper, we present FaultGuard, the first framework for fault type and zone classification resilient to adversarial attacks. To ensure the security of our system, we employ an Anomaly Detection System (ADS) leveraging a novel Generative Adversarial Network training layer to identify attacks. Furthermore, we propose a low-complexity fault prediction model and an online adversarial training technique to enhance robustness. We comprehensively evaluate the framework's performance against various adversarial attacks using the IEEE13-AdvAttack dataset, which constitutes the state-of-the-art for resilient fault prediction benchmarking. Our model outclasses the state-of-the-art even without considering adversaries, with an accuracy of up to 0.958. Furthermore, our ADS shows attack detection capabilities with an accuracy of up to 1.000. Finally, we demonstrate how our novel training layers drastically increase performances across the whole framework, with a mean increase of 154% in ADS accuracy and 118% in model accuracy.
Detection of Deepfake Environmental Audio
Authors: Authors: Hafsa Ouajdi, Oussama Hadder, Modan Tailleur, Mathieu Lagrange, Laurie M. Heller
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2403.17529
Pdf link: https://arxiv.org/pdf/2403.17529
Abstract With the ever-rising quality of deep generative models, it is increasingly important to be able to discern whether the audio data at hand have been recorded or synthesized. Although the detection of fake speech signals has been studied extensively, this is not the case for the detection of fake environmental audio. We propose a simple and efficient pipeline for detecting fake environmental sounds based on the CLAP audio embedding. We evaluate this detector using audio data from the 2023 DCASE challenge task on Foley sound synthesis. Our experiments show that fake sounds generated by 44 state-of-the-art synthesizers can be detected on average with 98% accuracy. We show that using an audio embedding learned on environmental audio is beneficial over a standard VGGish one as it provides a 10% increase in detection performance. Informal listening to Incorrect Negative examples demonstrates audible features of fake sounds missed by the detector such as distortion and implausible background noise.
Practical Applications of Advanced Cloud Services and Generative AI Systems in Medical Image Analysis
Authors: Authors: Jingyu Xu, Binbin Wu, Jiaxin Huang, Yulu Gong, Yifan Zhang, Bo Liu
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.17549
Pdf link: https://arxiv.org/pdf/2403.17549
Abstract The medical field is one of the important fields in the application of artificial intelligence technology. With the explosive growth and diversification of medical data, as well as the continuous improvement of medical needs and challenges, artificial intelligence technology is playing an increasingly important role in the medical field. Artificial intelligence technologies represented by computer vision, natural language processing, and machine learning have been widely penetrated into diverse scenarios such as medical imaging, health management, medical information, and drug research and development, and have become an important driving force for improving the level and quality of medical services.The article explores the transformative potential of generative AI in medical imaging, emphasizing its ability to generate syntheticACM-2 data, enhance images, aid in anomaly detection, and facilitate image-to-image translation. Despite challenges like model complexity, the applications of generative models in healthcare, including Med-PaLM 2 technology, show promising results. By addressing limitations in dataset size and diversity, these models contribute to more accurate diagnoses and improved patient outcomes. However, ethical considerations and collaboration among stakeholders are essential for responsible implementation. Through experiments leveraging GANs to augment brain tumor MRI datasets, the study demonstrates how generative AI can enhance image quality and diversity, ultimately advancing medical diagnostics and patient care.
RuBia: A Russian Language Bias Detection Dataset
Authors: Authors: Veronika Grigoreva, Anastasiia Ivanova, Ilseyar Alimova, Ekaterina Artemova
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2403.17553
Pdf link: https://arxiv.org/pdf/2403.17553
Abstract Warning: this work contains upsetting or disturbing content. Large language models (LLMs) tend to learn the social and cultural biases present in the raw pre-training data. To test if an LLM's behavior is fair, functional datasets are employed, and due to their purpose, these datasets are highly language and culture-specific. In this paper, we address a gap in the scope of multilingual bias evaluation by presenting a bias detection dataset specifically designed for the Russian language, dubbed as RuBia. The RuBia dataset is divided into 4 domains: gender, nationality, socio-economic status, and diverse, each of the domains is further divided into multiple fine-grained subdomains. Every example in the dataset consists of two sentences with the first reinforcing a potentially harmful stereotype or trope and the second contradicting it. These sentence pairs were first written by volunteers and then validated by native-speaking crowdsourcing workers. Overall, there are nearly 2,000 unique sentence pairs spread over 19 subdomains in RuBia. To illustrate the dataset's purpose, we conduct a diagnostic evaluation of state-of-the-art or near-state-of-the-art LLMs and discuss the LLMs' predisposition to social biases.
Ultrafast Adaptive Primary Frequency Tuning and Secondary Frequency Identification for S/S WPT system
Authors: Authors: Chang Liu, Wei Han, Guangyu Yan, Bowang Zhang, Chunlin Li
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2403.17598
Pdf link: https://arxiv.org/pdf/2403.17598
Abstract Magnetic resonance wireless power transfer (WPT) technology is increasingly being adopted across diverse applications. However, its effectiveness can be significantly compromised by parameter shifts within the resonance network, owing to its high system quality factor. Such shifts are inherent and challenging to mitigate during the manufacturing process. In response, this article introduces a rapid frequency tuning approach. Leveraging switch-controlled capacitors (SCC) to adjust the resonance network and the primary side's operating frequency, alongside a current zero-crossing detection (ZCD) circuit for voltage-current phase determination, this method circumvents the need for intricate knowledge of WPT system parameters. Moreover, it obviates the necessity for inter-side communication for real-time identification of the secondary side resonance frequency. The swift response of SCC and two-step perturb-and-observe algorithm mitigate output disturbances, thereby expediting the frequency tuning process. Experimental validation on a 200W Series-Series compensated WPT (SS-WPT) system demonstrates that the proposed method achieves frequency recognition accuracy within 0.7kHz in less than 1ms, increasing system efficiency up to 9%.
Fake or JPEG? Revealing Common Biases in Generated Image Detection Datasets
Authors: Authors: Patrick Grommelt, Louis Weiss, Franz-Josef Pfreundt, Janis Keuper
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.17608
Pdf link: https://arxiv.org/pdf/2403.17608
Abstract The widespread adoption of generative image models has highlighted the urgent need to detect artificial content, which is a crucial step in combating widespread manipulation and misinformation. Consequently, numerous detectors and associated datasets have emerged. However, many of these datasets inadvertently introduce undesirable biases, thereby impacting the effectiveness and evaluation of detectors. In this paper, we emphasize that many datasets for AI-generated image detection contain biases related to JPEG compression and image size. Using the GenImage dataset, we demonstrate that detectors indeed learn from these undesired factors. Furthermore, we show that removing the named biases substantially increases robustness to JPEG compression and significantly alters the cross-generator performance of evaluated detectors. Specifically, it leads to more than 11 percentage points increase in cross-generator performance for ResNet50 and Swin-T detectors on the GenImage dataset, achieving state-of-the-art results. We provide the dataset and source codes of this paper on the anonymous website: https://www.unbiased-genimage.org
Denoising Table-Text Retrieval for Open-Domain Question Answering
Authors: Authors: Deokhyung Kang, Baikjin Jung, Yunsu Kim, Gary Geunbae Lee
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2403.17611
Pdf link: https://arxiv.org/pdf/2403.17611
Abstract In table-text open-domain question answering, a retriever system retrieves relevant evidence from tables and text to answer questions. Previous studies in table-text open-domain question answering have two common challenges: firstly, their retrievers can be affected by false-positive labels in training datasets; secondly, they may struggle to provide appropriate evidence for questions that require reasoning across the table. To address these issues, we propose Denoised Table-Text Retriever (DoTTeR). Our approach involves utilizing a denoised training dataset with fewer false positive labels by discarding instances with lower question-relevance scores measured through a false positive detection model. Subsequently, we integrate table-level ranking information into the retriever to assist in finding evidence for questions that demand reasoning across the table. To encode this ranking information, we fine-tune a rank-aware column encoder to identify minimum and maximum values within a column. Experimental results demonstrate that DoTTeR significantly outperforms strong baselines on both retrieval recall and downstream QA tasks. Our code is available at https://github.com/deokhk/DoTTeR.
UADA3D: Unsupervised Adversarial Domain Adaptation for 3D Object Detection with Sparse LiDAR and Large Domain Gaps
Authors: Authors: Maciej K Wozniak, Mattias Hansson, Marko Thiel, Patric Jensfelt
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2403.17633
Pdf link: https://arxiv.org/pdf/2403.17633
Abstract In this study, we address a gap in existing unsupervised domain adaptation approaches on LiDAR-based 3D object detection, which have predominantly concentrated on adapting between established, high-density autonomous driving datasets. We focus on sparser point clouds, capturing scenarios from different perspectives: not just from vehicles on the road but also from mobile robots on sidewalks, which encounter significantly different environmental conditions and sensor configurations. We introduce Unsupervised Adversarial Domain Adaptation for 3D Object Detection (UADA3D). UADA3D does not depend on pre-trained source models or teacher-student architectures. Instead, it uses an adversarial approach to directly learn domain-invariant features. We demonstrate its efficacy in various adaptation scenarios, showing significant improvements in both self-driving car and mobile robot domains. Our code is open-source and will be available soon.
PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition
Authors: Authors: Chenhongyi Yang, Zehui Chen, Miguel Espinosa, Linus Ericsson, Zhenyu Wang, Jiaming Liu, Elliot J. Crowley
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.17695
Pdf link: https://arxiv.org/pdf/2403.17695
Abstract We present PlainMamba: a simple non-hierarchical state space model (SSM) designed for general visual recognition. The recent Mamba model has shown how SSMs can be highly competitive with other architectures on sequential data and initial attempts have been made to apply it to images. In this paper, we further adapt the selective scanning process of Mamba to the visual domain, enhancing its ability to learn features from two-dimensional images by (i) a continuous 2D scanning process that improves spatial continuity by ensuring adjacency of tokens in the scanning sequence, and (ii) direction-aware updating which enables the model to discern the spatial relations of tokens by encoding directional information. Our architecture is designed to be easy to use and easy to scale, formed by stacking identical PlainMamba blocks, resulting in a model with constant width throughout all layers. The architecture is further simplified by removing the need for special tokens. We evaluate PlainMamba on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves performance gains over previous non-hierarchical models and is competitive with hierarchical alternatives. For tasks requiring high-resolution inputs, in particular, PlainMamba requires much less computing while maintaining high performance. Code and models are available at https://github.com/ChenhongyiYang/PlainMamba
The Solution for the CVPR 2023 1st foundation model challenge-Track2
Authors: Authors: Haonan Xu, Yurui Huang, Sishun Pan, Zhihao Guan, Yi Xu, Yang Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.17702
Pdf link: https://arxiv.org/pdf/2403.17702
Abstract In this paper, we propose a solution for cross-modal transportation retrieval. Due to the cross-domain problem of traffic images, we divide the problem into two sub-tasks of pedestrian retrieval and vehicle retrieval through a simple strategy. In pedestrian retrieval tasks, we use IRRA as the base model and specifically design an Attribute Classification to mine the knowledge implied by attribute labels. More importantly, We use the strategy of Inclusion Relation Matching to make the image-text pairs with inclusion relation have similar representation in the feature space. For the vehicle retrieval task, we use BLIP as the base model. Since aligning the color attributes of vehicles is challenging, we introduce attribute-based object detection techniques to add color patch blocks to vehicle images for color data augmentation. This serves as strong prior information, helping the model perform the image-text alignment. At the same time, we incorporate labeled attributes into the image-text alignment loss to learn fine-grained alignment and prevent similar images and texts from being incorrectly separated. Our approach ranked first in the final B-board test with a score of 70.9.
Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection
Authors: Authors: Jongha Kim, Jihwan Park, Jinyoung Park, Jinyoung Kim, Sehyung Kim, Hyunwoo J. Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.17709
Pdf link: https://arxiv.org/pdf/2403.17709
Abstract Visual Relationship Detection (VRD) has seen significant advancements with Transformer-based architectures recently. However, we identify two key limitations in a conventional label assignment for training Transformer-based VRD models, which is a process of mapping a ground-truth (GT) to a prediction. Under the conventional assignment, an unspecialized query is trained since a query is expected to detect every relation, which makes it difficult for a query to specialize in specific relations. Furthermore, a query is also insufficiently trained since a GT is assigned only to a single prediction, therefore near-correct or even correct predictions are suppressed by being assigned no relation as a GT. To address these issues, we propose Groupwise Query Specialization and Quality-Aware Multi-Assignment (SpeaQ). Groupwise Query Specialization trains a specialized query by dividing queries and relations into disjoint groups and directing a query in a specific query group solely toward relations in the corresponding relation group. Quality-Aware Multi-Assignment further facilitates the training by assigning a GT to multiple predictions that are significantly close to a GT in terms of a subject, an object, and the relation in between. Experimental results and analyses show that SpeaQ effectively trains specialized queries, which better utilize the capacity of a model, resulting in consistent performance gains with zero additional inference cost across multiple VRD models and benchmarks. Code is available at https://github.com/mlvlab/SpeaQ.
Invisible Gas Detection: An RGB-Thermal Cross Attention Network and A New Benchmark
Authors: Authors: Jue Wang, Yuxiang Lin, Qi Zhao, Dong Luo, Shuaibao Chen, Wei Chen, Xiaojiang Peng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.17712
Pdf link: https://arxiv.org/pdf/2403.17712
Abstract The widespread use of various chemical gases in industrial processes necessitates effective measures to prevent their leakage during transportation and storage, given their high toxicity. Thermal infrared-based computer vision detection techniques provide a straightforward approach to identify gas leakage areas. However, the development of high-quality algorithms has been challenging due to the low texture in thermal images and the lack of open-source datasets. In this paper, we present the RGB-Thermal Cross Attention Network (RT-CAN), which employs an RGB-assisted two-stream network architecture to integrate texture information from RGB images and gas area information from thermal images. Additionally, to facilitate the research of invisible gas detection, we introduce Gas-DB, an extensive open-source gas detection database including about 1.3K well-annotated RGB-thermal images with eight variant collection scenes. Experimental results demonstrate that our method successfully leverages the advantages of both modalities, achieving state-of-the-art (SOTA) performance among RGB-thermal methods, surpassing single-stream SOTA models in terms of accuracy, Intersection of Union (IoU), and F2 metrics by 4.86%, 5.65%, and 4.88%, respectively. The code and data will be made available soon.
Deep Learning for Segmentation of Cracks in High-Resolution Images of Steel Bridges
Authors: Authors: Andrii Kompanets, Gautam Pai, Remco Duits, Davide Leonetti, Bert Snijder
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2403.17725
Pdf link: https://arxiv.org/pdf/2403.17725
Abstract Automating the current bridge visual inspection practices using drones and image processing techniques is a prominent way to make these inspections more effective, robust, and less expensive. In this paper, we investigate the development of a novel deep-learning method for the detection of fatigue cracks in high-resolution images of steel bridges. First, we present a novel and challenging dataset comprising of images of cracks in steel bridges. Secondly, we integrate the ConvNext neural network with a previous state- of-the-art encoder-decoder network for crack segmentation. We study and report, the effects of the use of background patches on the network performance when applied to high-resolution images of cracks in steel bridges. Finally, we introduce a loss function that allows the use of more background patches for the training process, which yields a significant reduction in false positive rates.
Continual Few-shot Event Detection via Hierarchical Augmentation Networks
Authors: Authors: Chenlong Zhang, Pengfei Cao, Yubo Chen, Kang Liu, Zhiqiang Zhang, Mengshu Sun, Jun Zhao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2403.17733
Pdf link: https://arxiv.org/pdf/2403.17733
Abstract Traditional continual event detection relies on abundant labeled data for training, which is often impractical to obtain in real-world applications. In this paper, we introduce continual few-shot event detection (CFED), a more commonly encountered scenario when a substantial number of labeled samples are not accessible. The CFED task is challenging as it involves memorizing previous event types and learning new event types with few-shot samples. To mitigate these challenges, we propose a memory-based framework: Hierarchical Augmentation Networks (HANet). To memorize previous event types with limited memory, we incorporate prototypical augmentation into the memory set. For the issue of learning new event types in few-shot scenarios, we propose a contrastive augmentation module for token representations. Despite comparing with previous state-of-the-art methods, we also conduct comparisons with ChatGPT. Experiment results demonstrate that our method significantly outperforms all of these methods in multiple continual few-shot event detection tasks.
Out-of-distribution Rumor Detection via Test-Time Adaptation
Authors: Authors: Xiang Tao, Mingqing Zhang, Qiang Liu, Shu Wu, Liang Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2403.17735
Pdf link: https://arxiv.org/pdf/2403.17735
Abstract Due to the rapid spread of rumors on social media, rumor detection has become an extremely important challenge. Existing methods for rumor detection have achieved good performance, as they have collected enough corpus from the same data distribution for model training. However, significant distribution shifts between the training data and real-world test data occur due to differences in news topics, social media platforms, languages and the variance in propagation scale caused by news popularity. This leads to a substantial decline in the performance of these existing methods in Out-Of-Distribution (OOD) situations. To address this problem, we propose a simple and efficient method named Test-time Adaptation for Rumor Detection under distribution shifts (TARD). This method models the propagation of news in the form of a propagation graph, and builds propagation graph test-time adaptation framework, enhancing the model's adaptability and robustness when facing OOD problems. Extensive experiments conducted on two group datasets collected from real-world social platforms demonstrate that our framework outperforms the state-of-the-art methods in performance.
LiDAR-Based Crop Row Detection Algorithm for Over-Canopy Autonomous Navigation in Agriculture Fields
Authors: Authors: Ruiji Liu, Francisco Yandun, George Kantor
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2403.17774
Pdf link: https://arxiv.org/pdf/2403.17774
Abstract Autonomous navigation is crucial for various robotics applications in agriculture. However, many existing methods depend on RTK-GPS systems, which are expensive and susceptible to poor signal coverage. This paper introduces a state-of-the-art LiDAR-based navigation system that can achieve over-canopy autonomous navigation in row-crop fields, even when the canopy fully blocks the interrow spacing. Our crop row detection algorithm can detect crop rows across diverse scenarios, encompassing various crop types, growth stages, weed presence, and discontinuities within the crop rows. Without utilizing the global localization of the robot, our navigation system can perform autonomous navigation in these challenging scenarios, detect the end of the crop rows, and navigate to the next crop row autonomously, providing a crop-agnostic approach to navigate the whole row-crop field. This navigation system has undergone tests in various simulated agricultural fields, achieving an average of $2.98cm$ autonomous driving accuracy without human intervention on the custom Amiga robot. In addition, the qualitative results of our crop row detection algorithm from the actual soybean fields validate our LiDAR-based crop row detection algorithm's potential for practical agricultural applications.
Optical Flow Based Detection and Tracking of Moving Objects for Autonomous Vehicles
Authors: Authors: MReza Alipour Sormoli, Mehrdad Dianati, Sajjad Mozaffari, Roger woodman
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2403.17779
Pdf link: https://arxiv.org/pdf/2403.17779
Abstract Accurate velocity estimation of surrounding moving objects and their trajectories are critical elements of perception systems in Automated/Autonomous Vehicles (AVs) with a direct impact on their safety. These are non-trivial problems due to the diverse types and sizes of such objects and their dynamic and random behaviour. Recent point cloud based solutions often use Iterative Closest Point (ICP) techniques, which are known to have certain limitations. For example, their computational costs are high due to their iterative nature, and their estimation error often deteriorates as the relative velocities of the target objects increase (>2 m/sec). Motivated by such shortcomings, this paper first proposes a novel Detection and Tracking of Moving Objects (DATMO) for AVs based on an optical flow technique, which is proven to be computationally efficient and highly accurate for such problems. \textcolor{black}{This is achieved by representing the driving scenario as a vector field and applying vector calculus theories to ensure spatiotemporal continuity.} We also report the results of a comprehensive performance evaluation of the proposed DATMO technique, carried out in this study using synthetic and real-world data. The results of this study demonstrate the superiority of the proposed technique, compared to the DATMO techniques in the literature, in terms of estimation accuracy and processing time in a wide range of relative velocities of moving objects. Finally, we evaluate and discuss the sensitivity of the estimation error of the proposed DATMO technique to various system and environmental parameters, as well as the relative velocities of the moving objects.
A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities
Authors: Authors: Ibrahim Ethem Hamamci, Sezgin Er, Furkan Almas, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Irem Dogan, Muhammed Furkan Dasdelen, Bastian Wittmann, Enis Simsar, Mehmet Simsar, Emine Bensu Erdemir, Abdullah Alanbay, Anjany Sekuboyina, Berkan Lafci, Mehmet K. Ozdemir, Bjoern Menze
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.17834
Pdf link: https://arxiv.org/pdf/2403.17834
Abstract A major challenge in computational research in 3D medical imaging is the lack of comprehensive datasets. Addressing this issue, our study introduces CT-RATE, the first 3D medical imaging dataset that pairs images with textual reports. CT-RATE consists of 25,692 non-contrast chest CT volumes, expanded to 50,188 through various reconstructions, from 21,304 unique patients, along with corresponding radiology text reports. Leveraging CT-RATE, we developed CT-CLIP, a CT-focused contrastive language-image pre-training framework. As a versatile, self-supervised model, CT-CLIP is designed for broad application and does not require task-specific training. Remarkably, CT-CLIP outperforms state-of-the-art, fully supervised methods in multi-abnormality detection across all key metrics, thus eliminating the need for manual annotation. We also demonstrate its utility in case retrieval, whether using imagery or textual queries, thereby advancing knowledge dissemination. The open-source release of CT-RATE and CT-CLIP marks a significant advancement in medical AI, enhancing 3D imaging analysis and fostering innovation in healthcare.
Deepfake Generation and Detection: A Benchmark and Survey
Authors: Authors: Gan Pei, Jiangning Zhang, Menghan Hu, Guangtao Zhai, Chengjie Wang, Zhenyu Zhang, Jian Yang, Chunhua Shen, Dacheng Tao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.17881
Pdf link: https://arxiv.org/pdf/2403.17881
Abstract In addition to the advancements in deepfake generation, corresponding detection technologies need to continuously evolve to regulate the potential misuse of deepfakes, such as for privacy invasion and phishing attacks. This survey comprehensively reviews the latest developments in deepfake generation and detection, summarizing and analyzing the current state of the art in this rapidly evolving field. We first unify task definitions, comprehensively introduce datasets and metrics, and discuss the development of generation and detection technology frameworks. Then, we discuss the development of several related sub-fields and focus on researching four mainstream deepfake fields: popular face swap, face reenactment, talking face generation, and facial attribute editing, as well as foreign detection. Subsequently, we comprehensively benchmark representative methods on popular datasets for each field, fully evaluating the latest and influential works published in top conferences/journals. Finally, we analyze the challenges and future research directions of the discussed fields. We closely follow the latest developments in https://github.com/flyingby/Awesome-Deepfake-Generation-and-Detection.
Sen2Fire: A Challenging Benchmark Dataset for Wildfire Detection using Sentinel Data
Authors: Authors: Yonghao Xu, Amanda Berg, Leif Haglund
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.17884
Pdf link: https://arxiv.org/pdf/2403.17884
Abstract Utilizing satellite imagery for wildfire detection presents substantial potential for practical applications. To advance the development of machine learning algorithms in this domain, our study introduces the \textit{Sen2Fire} dataset--a challenging satellite remote sensing dataset tailored for wildfire detection. This dataset is curated from Sentinel-2 multi-spectral data and Sentinel-5P aerosol product, comprising a total of 2466 image patches. Each patch has a size of 512$\times$512 pixels with 13 bands. Given the distinctive sensitivities of various wavebands to wildfire responses, our research focuses on optimizing wildfire detection by evaluating different wavebands and employing a combination of spectral indices, such as normalized burn ratio (NBR) and normalized difference vegetation index (NDVI). The results suggest that, in contrast to using all bands for wildfire detection, selecting specific band combinations yields superior performance. Additionally, our study underscores the positive impact of integrating Sentinel-5 aerosol data for wildfire detection. The code and dataset are available online (https://zenodo.org/records/10881058).
Image-based Novel Fault Detection with Deep Learning Classifiers using Hierarchical Labels
Authors: Authors: Nurettin Sergin, Jiayu Huang, Tzyy-Shuh Chang, Hao Yan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2403.17891
Pdf link: https://arxiv.org/pdf/2403.17891
Abstract One important characteristic of modern fault classification systems is the ability to flag the system when faced with previously unseen fault types. This work considers the unknown fault detection capabilities of deep neural network-based fault classifiers. Specifically, we propose a methodology on how, when available, labels regarding the fault taxonomy can be used to increase unknown fault detection performance without sacrificing model performance. To achieve this, we propose to utilize soft label techniques to improve the state-of-the-art deep novel fault detection techniques during the training process and novel hierarchically consistent detection statistics for online novel fault detection. Finally, we demonstrated increased detection performance on novel fault detection in inspection images from the hot steel rolling process, with results well replicated across multiple scenarios and baseline detection methods.
Multi-Agent Resilient Consensus under Intermittent Faulty and Malicious Transmissions (Extended Version)
Authors: Authors: Sarper Aydın, Orhan Eren Akgün, Stephanie Gil, Angelia Nedić
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2403.17907
Pdf link: https://arxiv.org/pdf/2403.17907
Abstract In this work, we consider the consensus problem in which legitimate agents share their values over an undirected communication network in the presence of malicious or faulty agents. Different from the previous works, we characterize the conditions that generalize to several scenarios such as intermittent faulty or malicious transmissions, based on trust observations. As the standard trust aggregation approach based on a constant threshold fails to distinguish intermittent malicious/faulty activity, we propose a new detection algorithm utilizing time-varying thresholds and the random trust values available to legitimate agents. Under these conditions, legitimate agents almost surely determine their trusted neighborhood correctly with geometrically decaying misclassification probabilities. We further prove that the consensus process converges almost surely even in the presence of malicious agents. We also derive the probabilistic bounds on the deviation from the nominal consensus value that would have been achieved with no malicious agents in the system. Numerical results verify the convergence among agents and exemplify the deviation under different scenarios.
ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection
Authors: Authors: Mubashir Noman, Mustansar Fiaz, Hisham Cholakkal, Salman Khan, Fahad Shahbaz Khan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.17909
Pdf link: https://arxiv.org/pdf/2403.17909
Abstract Deep learning has shown remarkable success in remote sensing change detection (CD), aiming to identify semantic change regions between co-registered satellite image pairs acquired at distinct time stamps. However, existing convolutional neural network and transformer-based frameworks often struggle to accurately segment semantic change regions. Moreover, transformers-based methods with standard self-attention suffer from quadratic computational complexity with respect to the image resolution, making them less practical for CD tasks with limited training data. To address these issues, we propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions while reducing the model size. Our ELGC-Net comprises a Siamese encoder, fusion modules, and a decoder. The focus of our design is the introduction of an Efficient Local-Global Context Aggregator module within the encoder, capturing enhanced global context and local spatial information through a novel pooled-transpose (PT) attention and depthwise convolution, respectively. The PT attention employs pooling operations for robust feature extraction and minimizes computational cost with transposed attention. Extensive experiments on three challenging CD datasets demonstrate that ELGC-Net outperforms existing methods. Compared to the recent transformer-based CD approach (ChangeFormer), ELGC-Net achieves a 1.4% gain in intersection over union metric on the LEVIR-CD dataset, while significantly reducing trainable parameters. Our proposed ELGC-Net sets a new state-of-the-art performance in remote sensing change detection benchmarks. Finally, we also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings, while achieving comparable performance. Project url https://github.com/techmn/elgcnet.
Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos
Authors: Authors: Akshay Paruchuri, Samuel Ehrenstein, Shuxian Wang, Inbar Fried, Stephen M. Pizer, Marc Niethammer, Roni Sengupta
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.17915
Pdf link: https://arxiv.org/pdf/2403.17915
Abstract Monocular depth estimation in endoscopy videos can enable assistive and robotic surgery to obtain better coverage of the organ and detection of various health issues. Despite promising progress on mainstream, natural image depth estimation, techniques perform poorly on endoscopy images due to a lack of strong geometric features and challenging illumination effects. In this paper, we utilize the photometric cues, i.e., the light emitted from an endoscope and reflected by the surface, to improve monocular depth estimation. We first create two novel loss functions with supervised and self-supervised variants that utilize a per-pixel shading representation. We then propose a novel depth refinement network (PPSNet) that leverages the same per-pixel shading representation. Finally, we introduce teacher-student transfer learning to produce better depth maps from both synthetic data with supervision and clinical data with self-supervision. We achieve state-of-the-art results on the C3VD dataset while estimating high-quality depth maps from clinical data. Our code, pre-trained models, and supplementary materials can be found on our project page: https://ppsnet.github.io/
CMP: Cooperative Motion Prediction with Multi-Agent Communication
Authors: Authors: Zhuoyuan Wu, Yuping Wang, Hengbo Ma, Zhaowei Li, Hang Qiu, Jiachen Li
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2403.17916
Pdf link: https://arxiv.org/pdf/2403.17916
Abstract The confluence of the advancement of Autonomous Vehicles (AVs) and the maturity of Vehicle-to-Everything (V2X) communication has enabled the capability of cooperative connected and automated vehicles (CAVs). Building on top of cooperative perception, this paper explores the feasibility and effectiveness of cooperative motion prediction. Our method, CMP, takes LiDAR signals as input to enhance tracking and prediction capabilities. Unlike previous work that focuses separately on either cooperative perception or motion prediction, our framework, to the best of our knowledge, is the first to address the unified problem where CAVs share information in both perception and prediction modules. Incorporated into our design is the unique capability to tolerate realistic V2X bandwidth limitations and transmission delays, while dealing with bulky perception representations. We also propose a prediction aggregation module, which unifies the predictions obtained by different CAVs and generates the final prediction. Through extensive experiments and ablation studies, we demonstrate the effectiveness of our method in cooperative perception, tracking, and motion prediction tasks. In particular, CMP reduces the average prediction error by 17.2\% with fewer missing detections compared with the no cooperation setting. Our work marks a significant step forward in the cooperative capabilities of CAVs, showcasing enhanced performance in complex scenarios.
AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation
Authors: Authors: Qingping Sun, Yanjun Wang, Ailing Zeng, Wanqi Yin, Chen Wei, Wenjia Wang, Haiyi Mei, Chi Sing Leung, Ziwei Liu, Lei Yang, Zhongang Cai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.17934
Pdf link: https://arxiv.org/pdf/2403.17934
Abstract Expressive human pose and shape estimation (a.k.a. 3D whole-body mesh recovery) involves the human body, hand, and expression estimation. Most existing methods have tackled this task in a two-stage manner, first detecting the human body part with an off-the-shelf detection model and inferring the different human body parts individually. Despite the impressive results achieved, these methods suffer from 1) loss of valuable contextual information via cropping, 2) introducing distractions, and 3) lacking inter-association among different persons and body parts, inevitably causing performance degradation, especially for crowded scenes. To address these issues, we introduce a novel all-in-one-stage framework, AiOS, for multiple expressive human pose and shape recovery without an additional human detection step. Specifically, our method is built upon DETR, which treats multi-person whole-body mesh recovery task as a progressive set prediction problem with various sequential detection. We devise the decoder tokens and extend them to our task. Specifically, we first employ a human token to probe a human location in the image and encode global features for each instance, which provides a coarse location for the later transformer block. Then, we introduce a joint-related token to probe the human joint in the image and encoder a fine-grained local feature, which collaborates with the global feature to regress the whole-body mesh. This straightforward but effective model outperforms previous state-of-the-art methods by a 9% reduction in NMVE on AGORA, a 30% reduction in PVE on EHF, a 10% reduction in PVE on ARCTIC, and a 3% reduction in PVE on EgoBody.
Keyword: face recognition

There is no result

Keyword: augmentation

Exploring the potential of prototype-based soft-labels data distillation for imbalanced data classification
Authors: Authors: Radu-Andrei Rosu, Mihaela-Elena Breaban, Henri Luchian
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2403.17130
Pdf link: https://arxiv.org/pdf/2403.17130
Abstract Dataset distillation aims at synthesizing a dataset by a small number of artificially generated data items, which, when used as training data, reproduce or approximate a machine learning (ML) model as if it were trained on the entire original dataset. Consequently, data distillation methods are usually tied to a specific ML algorithm. While recent literature deals mainly with distillation of large collections of images in the context of neural network models, tabular data distillation is much less represented and mainly focused on a theoretical perspective. The current paper explores the potential of a simple distillation technique previously proposed in the context of Less-than-one shot learning. The main goal is to push further the performance of prototype-based soft-labels distillation in terms of classification accuracy, by integrating optimization steps in the distillation process. The analysis is performed on real-world data sets with various degrees of imbalance. Experimental studies trace the capability of the method to distill the data, but also the opportunity to act as an augmentation method, i.e. to generate new data that is able to increase model accuracy when used in conjunction with - as opposed to instead of - the original data.
HILL: Hierarchy-aware Information Lossless Contrastive Learning for Hierarchical Text Classification
Authors: Authors: He Zhu, Junran Wu, Ruomei Liu, Yue Hou, Ze Yuan, Shangzhe Li, Yicheng Pan, Ke Xu
Subjects: Computation and Language (cs.CL); Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2403.17307
Pdf link: https://arxiv.org/pdf/2403.17307
Abstract Existing self-supervised methods in natural language processing (NLP), especially hierarchical text classification (HTC), mainly focus on self-supervised contrastive learning, extremely relying on human-designed augmentation rules to generate contrastive samples, which can potentially corrupt or distort the original information. In this paper, we tend to investigate the feasibility of a contrastive learning scheme in which the semantic and syntactic information inherent in the input sample is adequately reserved in the contrastive samples and fused during the learning process. Specifically, we propose an information lossless contrastive learning strategy for HTC, namely \textbf{H}ierarchy-aware \textbf{I}nformation \textbf{L}ossless contrastive \textbf{L}earning (HILL), which consists of a text encoder representing the input document, and a structure encoder directly generating the positive sample. The structure encoder takes the document embedding as input, extracts the essential syntactic information inherent in the label hierarchy with the principle of structural entropy minimization, and injects the syntactic information into the text representation via hierarchical representation learning. Experiments on three common datasets are conducted to verify the superiority of HILL.
Leveraging Symmetry in RL-based Legged Locomotion Control
Authors: Authors: Zhi Su, Xiaoyu Huang, Daniel Ordoñez-Apraez, Yunfei Li, Zhongyu Li, Qiayuan Liao, Giulio Turrisi, Massimiliano Pontil, Claudio Semini, Yi Wu, Koushil Sreenath
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2403.17320
Pdf link: https://arxiv.org/pdf/2403.17320
Abstract Model-free reinforcement learning is a promising approach for autonomously solving challenging robotics control problems, but faces exploration difficulty without information of the robot's kinematics and dynamics morphology. The under-exploration of multiple modalities with symmetric states leads to behaviors that are often unnatural and sub-optimal. This issue becomes particularly pronounced in the context of robotic systems with morphological symmetries, such as legged robots for which the resulting asymmetric and aperiodic behaviors compromise performance, robustness, and transferability to real hardware. To mitigate this challenge, we can leverage symmetry to guide and improve the exploration in policy learning via equivariance/invariance constraints. In this paper, we investigate the efficacy of two approaches to incorporate symmetry: modifying the network architectures to be strictly equivariant/invariant, and leveraging data augmentation to approximate equivariant/invariant actor-critics. We implement the methods on challenging loco-manipulation and bipedal locomotion tasks and compare with an unconstrained baseline. We find that the strictly equivariant policy consistently outperforms other methods in sample efficiency and task performance in simulation. In addition, symmetry-incorporated approaches exhibit better gait quality, higher robustness and can be deployed zero-shot in real-world experiments.
Variational Graph Auto-Encoder Based Inductive Learning Method for Semi-Supervised Classification
Authors: Authors: Hanxuan Yang, Zhaoxin Yu, Qingchao Kong, Wei Liu, Wenji Mao
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.17500
Pdf link: https://arxiv.org/pdf/2403.17500
Abstract Graph representation learning is a fundamental research issue in various domains of applications, of which the inductive learning problem is particularly challenging as it requires models to generalize to unseen graph structures during inference. In recent years, graph neural networks (GNNs) have emerged as powerful graph models for inductive learning tasks such as node classification, whereas they typically heavily rely on the annotated nodes under a fully supervised training setting. Compared with the GNN-based methods, variational graph auto-encoders (VGAEs) are known to be more generalizable to capture the internal structural information of graphs independent of node labels and have achieved prominent performance on multiple unsupervised learning tasks. However, so far there is still a lack of work focusing on leveraging the VGAE framework for inductive learning, due to the difficulties in training the model in a supervised manner and avoiding over-fitting the proximity information of graphs. To solve these problems and improve the model performance of VGAEs for inductive graph representation learning, in this work, we propose the Self-Label Augmented VGAE model. To leverage the label information for training, our model takes node labels as one-hot encoded inputs and then performs label reconstruction in model training. To overcome the scarcity problem of node labels for semi-supervised settings, we further propose the Self-Label Augmentation Method (SLAM), which uses pseudo labels generated by our model with a node-wise masking approach to enhance the label information. Experiments on benchmark inductive learning graph datasets verify that our proposed model archives promising results on node classification with particular superiority under semi-supervised learning settings.
The Solution for the CVPR 2023 1st foundation model challenge-Track2
Authors: Authors: Haonan Xu, Yurui Huang, Sishun Pan, Zhihao Guan, Yi Xu, Yang Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.17702
Pdf link: https://arxiv.org/pdf/2403.17702
Abstract In this paper, we propose a solution for cross-modal transportation retrieval. Due to the cross-domain problem of traffic images, we divide the problem into two sub-tasks of pedestrian retrieval and vehicle retrieval through a simple strategy. In pedestrian retrieval tasks, we use IRRA as the base model and specifically design an Attribute Classification to mine the knowledge implied by attribute labels. More importantly, We use the strategy of Inclusion Relation Matching to make the image-text pairs with inclusion relation have similar representation in the feature space. For the vehicle retrieval task, we use BLIP as the base model. Since aligning the color attributes of vehicles is challenging, we introduce attribute-based object detection techniques to add color patch blocks to vehicle images for color data augmentation. This serves as strong prior information, helping the model perform the image-text alignment. At the same time, we incorporate labeled attributes into the image-text alignment loss to learn fine-grained alignment and prevent similar images and texts from being incorrectly separated. Our approach ranked first in the final B-board test with a score of 70.9.
Continual Few-shot Event Detection via Hierarchical Augmentation Networks
Authors: Authors: Chenlong Zhang, Pengfei Cao, Yubo Chen, Kang Liu, Zhiqiang Zhang, Mengshu Sun, Jun Zhao
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2403.17733
Pdf link: https://arxiv.org/pdf/2403.17733
Abstract Traditional continual event detection relies on abundant labeled data for training, which is often impractical to obtain in real-world applications. In this paper, we introduce continual few-shot event detection (CFED), a more commonly encountered scenario when a substantial number of labeled samples are not accessible. The CFED task is challenging as it involves memorizing previous event types and learning new event types with few-shot samples. To mitigate these challenges, we propose a memory-based framework: Hierarchical Augmentation Networks (HANet). To memorize previous event types with limited memory, we incorporate prototypical augmentation into the memory set. For the issue of learning new event types in few-shot scenarios, we propose a contrastive augmentation module for token representations. Despite comparing with previous state-of-the-art methods, we also conduct comparisons with ChatGPT. Experiment results demonstrate that our method significantly outperforms all of these methods in multiple continual few-shot event detection tasks.
Exploring LLMs as a Source of Targeted Synthetic Textual Data to Minimize High Confidence Misclassifications
Authors: Authors: Philip Lippmann, Matthijs Spaan, Jie Yang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2403.17860
Pdf link: https://arxiv.org/pdf/2403.17860
Abstract Natural Language Processing (NLP) models optimized for predictive performance often make high confidence errors and suffer from vulnerability to adversarial and out-of-distribution data. Existing work has mainly focused on mitigation of such errors using either humans or an automated approach. In this study, we explore the usage of large language models (LLMs) for data augmentation as a potential solution to the issue of NLP models making wrong predictions with high confidence during classification tasks. We compare the effectiveness of synthetic data generated by LLMs with that of human data obtained via the same procedure. For mitigation, humans or LLMs provide natural language characterizations of high confidence misclassifications to generate synthetic data, which are then used to extend the training set. We conduct an extensive evaluation of our approach on three classification tasks and demonstrate its effectiveness in reducing the number of high confidence misclassifications present in the model, all while maintaining the same level of accuracy. Moreover, we find that the cost gap between humans and LLMs surpasses an order of magnitude, as LLMs attain human-like performance while being more scalable.

LeeKyungwook / get-arxiv-noti

New submissions for Wed, 27 Mar 24 #1037

Keyword: detection

SynFog: A Photo-realistic Synthetic Fog Dataset based on End-to-end Imaging Simulation for Advancing Real-World Defogging in Autonomous Driving

Task-Agnostic Detector for Insertion-Based Backdoor Attacks

LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning

A Comprehensive Study of the Capabilities of Large Language Models for Vulnerability Detection

SPLICE: A Singleton-Enhanced PipeLIne for Coreference REsolution

Staircase Localization for Autonomous Exploration in Urban Environments

FedMIL: Federated-Multiple Instance Learning for Video Analysis with Optimized DPP Scheduling

AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving

Decoupled Pseudo-labeling for Semi-Supervised Monocular 3D Object Detection

SSF3D: Strict Semi-Supervised 3D Object Detection with Switching Filter

LaRE^2: Latent Reconstruction Error Based Method for Diffusion-Generated Image Detection

Natural Language Requirements Testability Measurement Based on Requirement Smells

FaultGuard: A Generative Approach to Resilient Fault Prediction in Smart Electrical Grids

Detection of Deepfake Environmental Audio

Practical Applications of Advanced Cloud Services and Generative AI Systems in Medical Image Analysis

RuBia: A Russian Language Bias Detection Dataset

Ultrafast Adaptive Primary Frequency Tuning and Secondary Frequency Identification for S/S WPT system

Fake or JPEG? Revealing Common Biases in Generated Image Detection Datasets

Denoising Table-Text Retrieval for Open-Domain Question Answering

UADA3D: Unsupervised Adversarial Domain Adaptation for 3D Object Detection with Sparse LiDAR and Large Domain Gaps

PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition

The Solution for the CVPR 2023 1st foundation model challenge-Track2

Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection

Invisible Gas Detection: An RGB-Thermal Cross Attention Network and A New Benchmark

Deep Learning for Segmentation of Cracks in High-Resolution Images of Steel Bridges

Continual Few-shot Event Detection via Hierarchical Augmentation Networks

Out-of-distribution Rumor Detection via Test-Time Adaptation

LiDAR-Based Crop Row Detection Algorithm for Over-Canopy Autonomous Navigation in Agriculture Fields

Optical Flow Based Detection and Tracking of Moving Objects for Autonomous Vehicles

A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities

Deepfake Generation and Detection: A Benchmark and Survey

Sen2Fire: A Challenging Benchmark Dataset for Wildfire Detection using Sentinel Data

Image-based Novel Fault Detection with Deep Learning Classifiers using Hierarchical Labels

Multi-Agent Resilient Consensus under Intermittent Faulty and Malicious Transmissions (Extended Version)

ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection

Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos

CMP: Cooperative Motion Prediction with Multi-Agent Communication

AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation

Keyword: face recognition

Keyword: augmentation

Exploring the potential of prototype-based soft-labels data distillation for imbalanced data classification

HILL: Hierarchy-aware Information Lossless Contrastive Learning for Hierarchical Text Classification

Leveraging Symmetry in RL-based Legged Locomotion Control

Variational Graph Auto-Encoder Based Inductive Learning Method for Semi-Supervised Classification

The Solution for the CVPR 2023 1st foundation model challenge-Track2

Continual Few-shot Event Detection via Hierarchical Augmentation Networks

Exploring LLMs as a Source of Targeted Synthetic Textual Data to Minimize High Confidence Misclassifications