New submissions for Tue, 2 Jan 24

Keyword: detection

6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation

Authors: Li Xu, Haoxuan Qu, Yujun Cai, Jun Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.00029
Pdf link: https://arxiv.org/pdf/2401.00029
Abstract Estimating the 6D object pose from a single RGB image often involves noise and indeterminacy due to challenges such as occlusions and cluttered backgrounds. Meanwhile, diffusion models have shown appealing performance in generating high-quality images from random noise with high indeterminacy through step-by-step denoising. Inspired by their denoising capability, we propose a novel diffusion-based framework (6D-Diff) to handle the noise and indeterminacy in object pose estimation for better performance. In our framework, to establish accurate 2D-3D correspondence, we formulate 2D keypoints detection as a reverse diffusion (denoising) process. To facilitate such a denoising process, we design a Mixture-of-Cauchy-based forward diffusion process and condition the reverse process on the object features. Extensive experiments on the LM-O and YCB-V datasets demonstrate the effectiveness of our framework.
Generating Enhanced Negatives for Training Language-Based Object Detectors
Authors: Shiyu Zhao, Long Zhao, Vijay Kumar B.G, Yumin Suh, Dimitris N. Metaxas, Manmohan Chandraker, Samuel Schulter
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.00094
Pdf link: https://arxiv.org/pdf/2401.00094
Abstract The recent progress in language-based open-vocabulary object detection can be largely attributed to finding better ways of leveraging large-scale data with free-form text annotations. Training such models with a discriminative objective function has proven successful, but requires good positive and negative samples. However, the free-form nature and the open vocabulary of object descriptions make the space of negatives extremely large. Prior works randomly sample negatives or use rule-based techniques to build them. In contrast, we propose to leverage the vast knowledge built into modern generative models to automatically build negatives that are more relevant to the original data. Specifically, we use large-language-models to generate negative text descriptions, and text-to-image diffusion models to also generate corresponding negative images. Our experimental analysis confirms the relevance of the generated negative data, and its use in language-based detectors improves performance on two complex benchmarks.
Enabling Smart Retrofitting and Performance Anomaly Detection for a Sensorized Vessel: A Maritime Industry Experience
Authors: Mahshid Helali Moghadam, Mateusz Rzymowski, Lukasz Kulas
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.00112
Pdf link: https://arxiv.org/pdf/2401.00112
Abstract The integration of sensorized vessels, enabling real-time data collection and machine learning-driven data analysis, marks a pivotal advancement in the maritime industry. This transformative technology can not only enhance safety, efficiency, and sustainability but also usher in a new era of cost-effective and smart maritime transportation in our increasingly interconnected world. This study presents a deep learning-driven anomaly detection system augmented with interpretable machine learning models for identifying performance anomalies in an industrial sensorized vessel called TUCANA. We leverage a human-in-the-loop unsupervised process that uses standard and Long Short-Term Memory (LSTM) autoencoders augmented with interpretable surrogate models, i.e., random forest and decision tree, to add transparency and interpretability to the results provided by the deep learning models. The interpretable models also enable automated rule generation for translating the inference into human-readable rules. Additionally, the process provides a projection of the results using t-distributed stochastic neighbor embedding (t-SNE), which helps with understanding the structure and relationships within the data and with assessing the identified anomalies. We empirically evaluate the system using real data acquired from the vessel TUCANA, and the LSTM model used in the process achieves over 80% precision and 90% recall. The interpretable models also provide logical rules aligned with expert thinking, and the t-SNE-based projection enhances interpretability. Our system demonstrates that the proposed approach can be used effectively in real-world scenarios, offering transparency and precision in performance anomaly detection.
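Autoencoder-based anomaly detection is commonly scored by thresholding the reconstruction error and comparing the flagged samples with expert labels. The sketch below illustrates that idea and the precision and recall metrics quoted in the abstract. The error values, labels, threshold, and function names are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def flag_anomalies(errors: List[float], threshold: float) -> List[bool]:
    """Flag a sample as anomalous when its reconstruction error exceeds the threshold."""
    return [e > threshold for e in errors]


def precision_recall(predicted: List[bool], actual: List[bool]) -> Tuple[float, float]:
    """Return (precision, recall) for boolean anomaly flags against reference labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical per-window reconstruction errors and expert labels (assumptions).
errors = [0.02, 0.03, 0.40, 0.05, 0.55, 0.31]
labels = [False, False, True, False, True, False]

flags = flag_anomalies(errors, threshold=0.30)
precision, recall = precision_recall(flags, labels)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold typically trades recall for precision, which is one place where a human-in-the-loop review of flagged windows can help.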
Unicron: Economizing Self-Healing LLM Training at Scale
Authors: Tao He, Xue Li, Zhibin Wang, Kun Qian, Jingbo Xu, Wenyuan Yu, Jingren Zhou
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2401.00134
Pdf link: https://arxiv.org/pdf/2401.00134
Abstract Training large-scale language models is increasingly critical in various domains, but it is hindered by frequent failures, leading to significant time and economic costs. Current failure recovery methods in cloud-based settings inadequately address the diverse and complex scenarios that arise, focusing narrowly on erasing downtime for individual tasks without considering the overall cost impact on a cluster. We introduce Unicron, a workload manager designed for efficient self-healing in large-scale language model training. Unicron optimizes the training process by minimizing failure-related costs across multiple concurrent tasks within a cluster. Its key features include in-band error detection for real-time error identification without extra overhead, a dynamic cost-aware plan generation mechanism for optimal reconfiguration, and an efficient transition strategy to reduce downtime during state changes. Deployed on a 128-GPU distributed cluster, Unicron demonstrates up to a 1.9x improvement in training efficiency over state-of-the-art methods, significantly reducing failure recovery costs and enhancing the reliability of large-scale language model training.
SSL-OTA: Unveiling Backdoor Threats in Self-Supervised Learning for Object Detection
Authors: Qiannan Wang, Changchun Yin, Liming Fang, Lu Zhou, Zhe Liu, Run Wang, Chenhao Lin
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.00137
Pdf link: https://arxiv.org/pdf/2401.00137
Abstract The extensive adoption of Self-supervised learning (SSL) has led to an increased security threat from backdoor attacks. While existing research has mainly focused on backdoor attacks in image classification, there has been limited exploration into their implications for object detection. In this work, we propose the first backdoor attack designed for object detection tasks in SSL scenarios, termed Object Transform Attack (SSL-OTA). SSL-OTA employs a trigger capable of altering predictions of the target object to the desired category, encompassing two attacks: Data Poisoning Attack (NA) and Dual-Source Blending Attack (DSBA). NA conducts data poisoning during downstream fine-tuning of the object detector, while DSBA additionally injects backdoors into the pre-trained encoder. We establish appropriate metrics and conduct extensive experiments on benchmark datasets, demonstrating the effectiveness and utility of our proposed attack. Notably, both NA and DSBA achieve high attack success rates (ASR) at extremely low poisoning rates (0.5%). The results underscore the importance of considering backdoor threats in SSL-based object detection and contribute a novel perspective to the field.
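The two quantities quoted above, poisoning rate and attack success rate (ASR), can be illustrated with a minimal sketch. The abstract does not give the authors' exact metric definitions, so the common definition used here (the share of triggered target objects predicted as the attacker's desired category) is an assumption, as are the function names and sample values.

```python
from typing import List


def poisoned_count(dataset_size: int, poisoning_rate: float) -> int:
    """Number of training samples to poison for a given rate (0.005 means 0.5%)."""
    return round(dataset_size * poisoning_rate)


def attack_success_rate(predicted: List[str], target_label: str) -> float:
    """Fraction of triggered objects that the detector assigns to the target label."""
    if not predicted:
        return 0.0
    return sum(label == target_label for label in predicted) / len(predicted)


# Hypothetical numbers: a 10,000-image fine-tuning set poisoned at 0.5%.
n_poisoned = poisoned_count(10_000, 0.005)

# Hypothetical detector outputs on four triggered objects, with "car" as the target.
asr = attack_success_rate(["car", "car", "person", "car"], target_label="car")
print(n_poisoned, asr)
```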
TPatch: A Triggered Physical Adversarial Patch
Authors: Wenjun Zhu, Xiaoyu Ji, Yushi Cheng, Shibo Zhang, Wenyuan Xu
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.00148
Pdf link: https://arxiv.org/pdf/2401.00148
Abstract Autonomous vehicles increasingly utilize the vision-based perception module to acquire information about driving environments and detect obstacles. Correct detection and classification are important to ensure safe driving decisions. Existing works have demonstrated the feasibility of fooling the perception models such as object detectors and image classifiers with printed adversarial patches. However, most of them are indiscriminately offensive to every passing autonomous vehicle. In this paper, we propose TPatch, a physical adversarial patch triggered by acoustic signals. Unlike other adversarial patches, TPatch remains benign under normal circumstances but can be triggered to launch a hiding, creating or altering attack by a designed distortion introduced by signal injection attacks towards cameras. To avoid the suspicion of human drivers and make the attack practical and robust in the real world, we propose a content-based camouflage method and an attack robustness enhancement method to strengthen it. Evaluations with three object detectors, YOLO V3/V5 and Faster R-CNN, and eight image classifiers demonstrate the effectiveness of TPatch in both the simulation and the real world. We also discuss possible defenses at the sensor, algorithm, and system levels.
CamPro: Camera-based Anti-Facial Recognition
Authors: Wenjun Zhu, Yuan Sun, Jiani Liu, Yushi Cheng, Xiaoyu Ji, Wenyuan Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2401.00151
Pdf link: https://arxiv.org/pdf/2401.00151
Abstract The proliferation of images captured from millions of cameras and the advancement of facial recognition (FR) technology have made the abuse of FR a severe privacy threat. Existing works typically rely on obfuscation, synthesis, or adversarial examples to modify faces in images to achieve anti-facial recognition (AFR). However, the unmodified images captured by camera modules that contain sensitive personally identifiable information (PII) could still be leaked. In this paper, we propose a novel approach, CamPro, to capture inborn AFR images. CamPro enables well-packed commodity camera modules to produce images that contain little PII and yet still contain enough information to support other non-sensitive vision applications, such as person detection. Specifically, CamPro tunes the configuration setup inside the camera image signal processor (ISP), i.e., color correction matrix and gamma correction, to achieve AFR, and designs an image enhancer to keep the image quality for possible human viewers. We implemented and validated CamPro on a proof-of-concept camera, and our experiments demonstrate its effectiveness on ten state-of-the-art black-box FR models. The results show that CamPro images can significantly reduce face identification accuracy to 0.3% while having little impact on the targeted non-sensitive vision application. Furthermore, we find that CamPro is resilient to adaptive attackers who have re-trained their FR models using images generated by CamPro, even with full knowledge of privacy-preserving ISP parameters.
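For readers unfamiliar with the two ISP stages named in the abstract, the sketch below shows what a color correction matrix (CCM) and gamma correction do to a single RGB pixel in a generic pipeline. It is not CamPro's tuned configuration: the matrices, the gamma value, and the function names are illustrative assumptions.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def apply_ccm(rgb: Sequence[float], ccm: Matrix) -> List[float]:
    """Multiply a linear RGB pixel (values in [0, 1]) by a 3x3 matrix, then clip."""
    mixed = [sum(ccm[r][c] * rgb[c] for c in range(3)) for r in range(3)]
    return [min(1.0, max(0.0, v)) for v in mixed]


def apply_gamma(rgb: Sequence[float], gamma: float) -> List[float]:
    """Encode each channel as v ** (1 / gamma)."""
    return [v ** (1.0 / gamma) for v in rgb]


# Hypothetical parameters: an identity CCM and a channel-mixing CCM.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mixing = [[0.5, 0.5, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

pixel = [0.25, 0.5, 1.0]
encoded = apply_gamma(apply_ccm(pixel, identity), gamma=2.0)
mixed = apply_ccm(pixel, mixing)
print(encoded, mixed)
```

Because both stages are just parameters of the pipeline, changing them alters every captured image at the source rather than editing images after capture.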
AClassiHonk: A System Framework to Annotate and Classify Vehicular Honk from Road Traffic
Authors: Biswajit Maitya, Abdul Alima, Popuri Sree Rama Charana, Amlan Chakrabartib, Subrata Nandia, Sanghita Bhattacharjeea
Subjects: Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2401.00154
Pdf link: https://arxiv.org/pdf/2401.00154
Abstract Recent studies emphasize that vehicular honking contributes to over 50% of noise pollution in developing urban and suburban areas. Frequent honking negatively impacts health, road safety, and the environment. Recognizing and classifying different vehicle honks could offer valuable insights into environmental noise pollution. Existing research on outdoor sound classification and honk detection lacks the ability to classify honks based on vehicle types, limiting contextual information inference for locations, areas, or traffic. Therefore, it becomes imperative to design a system that can detect and classify honks of different types of vehicles from which we can infer some contextual information. In this paper, we have developed a novel framework AClassiHonk that performs raw vehicular honk sensing, data labeling and classifies the honk into three major groups, i.e., light-weight vehicles, medium-weight vehicles, and heavy-weight vehicles. We collected the raw audio samples of different vehicular honking based on spatio-temporal characteristics and converted them into spectrogram images. We have proposed a deep learning-based Multi-label Autoencoder model (MAE) for automated labeling of the unlabeled data samples, which provides 97.64% accuracy in contrast to existing deep learning-based data labeling methods. Further, we have used various pre-trained models, namely Inception V3, ResNet50, MobileNet, ShuffleNet, and proposed an Ensembled Transfer Learning model (EnTL) for vehicle honks classification and performed comparative analysis. Results reveal that EnTL exhibits the best performance compared to pre-trained models and achieves 96.72% accuracy in our dataset. In addition, we have identified a context of a location based on these classified honk signatures in a city.
Addressing Trust Challenges in Blockchain Oracles Using Asymmetric Byzantine Quorums
Authors: Fahad Rahman, Chafiq Titouna, Farid Nait-Abdesselam
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2401.00175
Pdf link: https://arxiv.org/pdf/2401.00175
Abstract Distributed Computing in Blockchain Technology (BCT) hinges on a trust assumption among independent nodes. Without a third-party interface, known as a Blockchain Oracle, a blockchain cannot interact with the external world. This Oracle plays a crucial role by feeding extrinsic data into the Blockchain, ensuring that Smart Contracts operate accurately in real time. The Oracle problem arises from the inherent difficulty in verifying the truthfulness of the data sourced by these Oracles. The genuineness of a Blockchain Oracle is paramount, as it directly influences the Blockchain's reliability, credibility, and scalability. To tackle these challenges, a strategy rooted in Byzantine fault tolerance φ is introduced. Furthermore, an autonomous system for sustainability and audibility, built on heuristic detection, is put forth. On two real-world datasets, the proposed strategy outperformed existing methods in effectiveness and precision, with the aim of meeting the authenticity standards for Blockchain Oracles.
Auxiliary Network-Enabled Attack Detection and Resilient Control of Islanded AC Microgrid
Authors: Vaibhav Vaishnav, Anoop Jain, Dushyant Sharma
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2401.00180
Pdf link: https://arxiv.org/pdf/2401.00180
Abstract This paper proposes a cyber-resilient distributed control strategy equipped with attack detection capabilities for islanded AC microgrids in the presence of bounded stealthy cyber attacks affecting both frequency and power information exchanged among neighboring distributed generators (DGs). The proposed control methodology relies on the construction of an auxiliary layer and the establishment of effective inter-layer cooperation between the actual DGs in the control layer and the virtual DGs in the auxiliary layer. This cooperation aims to achieve robust frequency restoration and proportional active power-sharing. It is shown that the in situ presence of a concealed auxiliary layer not only guarantees resilience against stealthy bounded attacks on both frequency and power-sharing but also facilitates a network-enabled attack identification mechanism. The paper provides rigorous proof of the stability of the closed-loop system and derives bounds for frequency and power deviations under attack conditions, offering insights into the impact of the attack signal, control and pinning gains, and network connectivity on the system's convergence properties. The performance of the proposed controllers is illustrated by simulating a networked islanded AC microgrid in a Simulink environment showcasing both attributes of attack resilience and attack detection.
A Novel Approach for Defect Detection of Wind Turbine Blade Using Virtual Reality and Deep Learning
Authors: Md Fazle Rabbi, Solayman Hossain Emon, Ehtesham Mahmud Nishat, Tzu-Liang (Bill) Tseng, Atira Ferdoushi, Chun-Che Huang, Md Fashiar Rahman
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2401.00237
Pdf link: https://arxiv.org/pdf/2401.00237
Abstract Wind turbines are subjected to continuous rotational stresses and unusual external forces such as storms, lightning, strikes by flying objects, etc., which may cause defects in turbine blades. Hence, the blades require periodic inspection to ensure proper functionality and avoid catastrophic failure. Inspection is challenging because of the remote locations and the difficulty of access for human inspectors. Prior studies in the literature used images of cropped defects from wind turbines and neglected possible background biases, which may hinder real-time and autonomous defect detection using aerial vehicles such as drones. To overcome such challenges, in this paper, we evaluate defect detection accuracy on images that retain the background, using a two-step deep-learning methodology. In the first step, we develop virtual models of wind turbines to synthesize near-reality images for four types of common defects: cracks, leading edge erosion, bending, and light striking damage. The Unity perception package is used to generate wind turbine blade defect images with variations in background, randomness, camera angle, and light effects. In the second step, a customized U-Net architecture is trained to classify and segment the defects in turbine blades. The outcomes of the U-Net architecture have been thoroughly tested and compared with 5-fold validation datasets. The proposed methodology provides reasonable defect detection accuracy, making it suitable for autonomous and remote inspection through aerial vehicles.
Predicting Evoked Emotions in Conversations
Authors: Enas Altarawneh, Ameeta Agrawal, Michael Jenkin, Manos Papagelis
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.00383
Pdf link: https://arxiv.org/pdf/2401.00383
Abstract Understanding and predicting the emotional trajectory in multi-party multi-turn conversations is of great significance. Such information can be used, for example, to generate empathetic responses in human-machine interaction or to inform models of pre-emptive toxicity detection. In this work, we introduce the novel problem of Predicting Emotions in Conversations (PEC) for the next turn (n+1), given combinations of textual and/or emotion input up to turn n. We systematically approach the problem by modeling three dimensions inherently connected to evoked emotions in dialogues, including (i) sequence modeling, (ii) self-dependency modeling, and (iii) recency modeling. These modeling dimensions are then incorporated into two deep neural network architectures, a sequence model and a graph convolutional network model. The former is designed to capture the sequence of utterances in a dialogue, while the latter captures the sequence of utterances and the network formation of multi-party dialogues. We perform a comprehensive empirical evaluation of the various proposed models for addressing the PEC problem. The results indicate (i) the importance of the self-dependency and recency model dimensions for the prediction task, (ii) the quality of simpler sequence models in short dialogues, and (iii) the importance of the graph neural models in improving the predictions in long dialogues.
Horizontal Federated Computer Vision
Authors: Paul K. Mandal, Cole Leo, Connor Hurley
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.00390
Pdf link: https://arxiv.org/pdf/2401.00390
Abstract In the modern world, the amount of visual data recorded has been rapidly increasing. In many cases, data is stored in geographically distinct locations and thus requires a large amount of time and space to consolidate. Sometimes, there are also regulations for privacy protection which prevent data consolidation. In this work, we present federated implementations for object detection and recognition using a federated Faster R-CNN (FRCNN) and image segmentation using a federated Fully Convolutional Network (FCN). Our FRCNN was trained on 5000 examples of the COCO2017 dataset while our FCN was trained on the entire train set of the CamVid dataset. The proposed federated models address the challenges posed by the increasing volume and decentralized nature of visual data, offering efficient solutions in compliance with privacy regulations.
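The abstract does not say how the federated models combine client updates. A widely used rule in horizontal federated learning is federated averaging (FedAvg), where the server averages client parameters weighted by local dataset size. The sketch below assumes that rule purely for illustration; the flat parameter lists, client sizes, and function name are hypothetical.

```python
from typing import List, Sequence


def federated_average(client_weights: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Average flat parameter vectors, weighting each client by its example count."""
    if len(client_weights) != len(client_sizes):
        raise ValueError("one size is required per client")
    total = sum(client_sizes)
    if total == 0:
        raise ValueError("clients must hold at least one example in total")
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical clients: the second holds three times as much data.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_weights)
```

Only parameters leave each site in this scheme, which is what allows training without consolidating the raw images.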
Generative Model-Driven Synthetic Training Image Generation: An Approach to Cognition in Rail Defect Detection
Authors: Rahatara Ferdousi, Chunsheng Yang, M. Anwar Hossain, Fedwa Laamarti, M. Shamim Hossain, Abdulmotaleb El Saddik
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2401.00393
Pdf link: https://arxiv.org/pdf/2401.00393
Abstract Recent advancements in cognitive computing, with the integration of deep learning techniques, have facilitated the development of intelligent cognitive systems (ICS). This is particularly beneficial in the context of rail defect detection, where the ICS would emulate human-like analysis of image data for defect patterns. Despite the success of Convolutional Neural Networks (CNN) in visual defect classification, the scarcity of large datasets for rail defect detection remains a challenge due to infrequent accident events that would result in defective parts and images. Contemporary researchers have addressed this data scarcity challenge by exploring rule-based and generative data augmentation models. Among these, Variational Autoencoder (VAE) models can generate realistic data without extensive baseline datasets for noise modeling. This study proposes a VAE-based synthetic image generation technique for rail defects, incorporating weight decay regularization and image reconstruction loss to prevent overfitting. The proposed method is applied to create a synthetic dataset for the Canadian Pacific Railway (CPR) with just 50 real samples across five classes. Remarkably, 500 synthetic samples are generated with a minimal reconstruction loss of 0.021. A Visual Transformer (ViT) model underwent fine-tuning using this synthetic CPR dataset, achieving high accuracy rates (98%-99%) in classifying the five defect classes. This research offers a promising solution to the data scarcity challenge in rail defect detection, showcasing the potential for robust ICS development in this domain.
RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models
Authors: Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Cheng Niu, Randy Zhong, Juntong Song, Tong Zhang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.00396
Pdf link: https://arxiv.org/pdf/2401.00396
Abstract Retrieval-augmented generation (RAG) has become a main technique for alleviating hallucinations in large language models (LLMs). Despite the integration of RAG, LLMs may still present claims that are unsupported by or contradict the retrieved contents. In order to develop effective hallucination prevention strategies under RAG, it is important to create benchmark datasets that can measure the extent of hallucination. This paper presents RAGTruth, a corpus tailored for analyzing word-level hallucinations in various domains and tasks within the standard RAG frameworks for LLM applications. RAGTruth comprises nearly 18,000 naturally generated responses from diverse LLMs using RAG. These responses have undergone meticulous manual annotation at both the individual case and word levels, incorporating evaluations of hallucination intensity. We not only benchmark hallucination frequencies across different LLMs, but also critically assess the effectiveness of several existing hallucination detection methodologies. Furthermore, we show that using a high-quality dataset such as RAGTruth, it is possible to finetune a relatively small LLM and achieve a competitive level of performance in hallucination detection when compared to the existing prompt-based approaches using state-of-the-art large language models such as GPT-4.
Low-cost Geometry-based Eye Gaze Detection using Facial Landmarks Generated through Deep Learning
Authors: Esther Enhui Ye, John Enzhou Ye, Joseph Ye, Jacob Ye, Runzhou Ye
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.00406
Pdf link: https://arxiv.org/pdf/2401.00406
Abstract Introduction: In the realm of human-computer interaction and behavioral research, accurate real-time gaze estimation is critical. Traditional methods often rely on expensive equipment or large datasets, which are impractical in many scenarios. This paper introduces a novel, geometry-based approach to address these challenges, utilizing consumer-grade hardware for broader applicability. Methods: We leverage novel face landmark detection neural networks capable of fast inference on consumer-grade chips to generate accurate and stable 3D landmarks of the face and iris. From these, we derive a small set of geometry-based descriptors, forming an 8-dimensional manifold representing the eye and head movements. These descriptors are then used to formulate linear equations for predicting eye-gaze direction. Results: Our approach demonstrates the ability to predict gaze with an angular error of less than 1.9 degrees, rivaling state-of-the-art systems while operating in real-time and requiring negligible computational resources. Conclusion: The developed method marks a significant step forward in gaze estimation technology, offering a highly accurate, efficient, and accessible alternative to traditional systems. It opens up new possibilities for real-time applications in diverse fields, from gaming to psychological research.
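The reported figure is an angular error, i.e. the angle between the predicted and the reference gaze direction. The sketch below shows one way to compute that angle, alongside a generic linear predictor over an 8-dimensional descriptor. The weights, bias, descriptor values, and function names are hypothetical and do not reproduce the paper's equations.

```python
import math
from typing import Sequence


def linear_gaze(descriptors: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Predict one gaze angle as a weighted sum of geometric descriptors plus a bias."""
    return sum(w * d for w, d in zip(weights, descriptors)) + bias


def angular_error_deg(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Angle in degrees between two 3D gaze direction vectors."""
    dot = sum(p * a for p, a in zip(predicted, actual))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_a = math.sqrt(sum(a * a for a in actual))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (norm_p * norm_a)))
    return math.degrees(math.acos(cos_angle))


# Hypothetical 8-dimensional descriptor with made-up coefficients.
yaw = linear_gaze([1.0] * 8, [0.5] * 8, bias=-1.0)

# A reference direction tilted 1.9 degrees away from the predicted one.
tilt = math.radians(1.9)
error = angular_error_deg([0.0, 0.0, 1.0], [0.0, math.sin(tilt), math.cos(tilt)])
print(yaw, round(error, 3))
```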
Is It Possible to Backdoor Face Forgery Detection with Natural Triggers?
Authors: Xiaoxuan Han, Songlin Yang, Wei Wang, Ziwen He, Jing Dong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.00414
Pdf link: https://arxiv.org/pdf/2401.00414
Abstract Deep neural networks have significantly improved the performance of face forgery detection models in discriminating Artificial Intelligence Generated Content (AIGC). However, their security is significantly threatened by the injection of triggers during model training (i.e., backdoor attacks). Although existing backdoor defenses and manual data selection can mitigate those using human-eye-sensitive triggers, such as patches or adversarial noises, the more challenging natural backdoor triggers remain insufficiently researched. To further investigate natural triggers, we propose a novel analysis-by-synthesis backdoor attack against face forgery detection models, which embeds natural triggers in the latent space. We thoroughly study such backdoor vulnerability from two perspectives: (1) Model Discrimination (Optimization-Based Trigger): we adopt a substitute detection model and find the trigger by minimizing the cross-entropy loss; (2) Data Distribution (Custom Trigger): we manipulate the uncommon facial attributes in the long-tailed distribution to generate poisoned samples without supervision from detection models. Furthermore, to comprehensively evaluate the detection models against the latest AIGC, we utilize both state-of-the-art StyleGAN and Stable Diffusion for trigger generation. Finally, these backdoor triggers introduce specific semantic features to the generated poisoned samples (e.g., skin textures and smile), which are more natural and robust. Extensive experiments show that our method is superior at three levels: (1) Attack Success Rate: ours achieves a high attack success rate (over 99%) and incurs a small model accuracy drop (below 0.2%) with a low poisoning rate (less than 3%); (2) Backdoor Defense: ours is more robust when faced with existing backdoor defense methods; (3) Human Inspection: ours is less human-eye-sensitive, according to a comprehensive user study.
From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion
Authors: Xingyuan Li, Yang Zou, Jinyuan Liu, Zhiying Jiang, Long Ma, Xin Fan, Risheng Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.00421
Pdf link: https://arxiv.org/pdf/2401.00421
Abstract With the rapid progression of deep learning technologies, multi-modality image fusion has become increasingly prevalent in object detection tasks. Despite its popularity, the inherent disparities in how different sources depict scene content make fusion a challenging problem. Current fusion methodologies identify shared characteristics between the two modalities and integrate them within this shared domain using either iterative optimization or deep learning architectures, which often neglect the intricate semantic relationships between modalities, resulting in a superficial understanding of inter-modal connections and, consequently, suboptimal fusion outcomes. To address this, we introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images. This method capitalizes on the complementary characteristics of diverse modalities, bolstering both the accuracy and robustness of object detection. The codebook is utilized to enhance a streamlined and concise depiction of the fused intra- and inter-domain dynamics, fine-tuned for optimal performance in detection tasks. We present a bilevel optimization strategy that establishes a nexus between the joint problem of fusion and detection, optimizing both processes concurrently. Furthermore, we introduce the first dataset of paired infrared and visible images accompanied by text prompts, paving the way for future research. Extensive experiments on several datasets demonstrate that our method not only produces visually superior fusion results but also achieves a higher detection mAP over existing methods, achieving state-of-the-art results.
SDIF-DA: A Shallow-to-Deep Interaction Framework with Data Augmentation for Multi-modal Intent Detection
Authors: Shijue Huang, Libo Qin, Bingbing Wang, Geng Tu, Ruifeng Xu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.00424
Pdf link: https://arxiv.org/pdf/2401.00424
Abstract Multi-modal intent detection aims to utilize various modalities to understand the user's intentions, which is essential for the deployment of dialogue systems in real-world scenarios. The two core challenges for multi-modal intent detection are (1) how to effectively align and fuse different features of modalities and (2) the limited labeled multi-modal intent training data. In this work, we introduce a shallow-to-deep interaction framework with data augmentation (SDIF-DA) to address the above challenges. Firstly, SDIF-DA leverages a shallow-to-deep interaction module to progressively and effectively align and fuse features across text, video, and audio modalities. Secondly, we propose a ChatGPT-based data augmentation approach to automatically augment sufficient training data. Experimental results demonstrate that SDIF-DA can effectively align and fuse multi-modal features by achieving state-of-the-art performance. In addition, extensive analyses show that the introduced data augmentation approach can successfully distill knowledge from the large language model.
RainSD: Rain Style Diversification Module for Image Synthesis Enhancement using Feature-Level Style Distribution
Authors: Hyeonjae Jeon, Junghyun Seo, Taesoo Kim, Sungho Son, Jungki Lee, Gyeungho Choi, Yongseob Lim
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.00460
Pdf link: https://arxiv.org/pdf/2401.00460
Abstract Autonomous driving technology now targets level 4 or beyond, but researchers face limitations in developing reliable driving algorithms under diverse challenges. For autonomous vehicles to spread widely, it is important to address the safety issues of this technology. Among various safety concerns, sensor blockage caused by severe weather can be one of the most frequent threats to multi-task learning based perception algorithms during autonomous driving. To handle this problem, generating proper datasets is becoming more important. In this paper, a synthetic road dataset with sensor blockage, generated from the real road dataset BDD100K, is proposed in the BDD100K annotation format. Rain streaks for each frame were made by an experimentally established equation and translated using an image-to-image translation network based on style transfer. Using this dataset, the degradation of diverse multi-task networks for autonomous driving, such as lane detection, driving area segmentation, and traffic object detection, has been thoroughly evaluated and analyzed. The tendency of the performance degradation of deep neural network-based perception systems for autonomous vehicles has been analyzed in depth. Finally, we discuss the limitations and future directions of deep neural network-based perception algorithms and autonomous driving dataset generation based on image-to-image translation.
Blockchain and Deep Learning-Based IDS for Securing SDN-Enabled Industrial IoT Environments
Authors: Samira Kamali Poorazad, Chafika Benzaïd, Tarik Taleb
Subjects: Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2401.00468
Pdf link: https://arxiv.org/pdf/2401.00468
Abstract The industrial Internet of Things (IIoT) involves the integration of Internet of Things (IoT) technologies into industrial settings. However, given the high sensitivity of the industry to the security of industrial control system networks and IIoT, the use of software-defined networking (SDN) technology can provide improved security and automation of communication processes. Despite this, the architecture of SDN can give rise to various security threats. Therefore, it is of paramount importance to consider the impact of these threats on SDN-based IIoT environments. Unlike previous research, which focused on security in IIoT and SDN architectures separately, we propose an integrated method including two components that work together seamlessly for better detecting and preventing security threats associated with SDN-based IIoT architectures. The two components consist of a convolutional neural network-based Intrusion Detection System (IDS) implemented as an SDN application and a Blockchain-based system (BS) to empower application layer and network layer security, respectively. A significant advantage of the proposed method lies in jointly minimizing the impact of attacks such as command injection and rule injection on SDN-based IIoT architecture layers. The proposed IDS exhibits superior classification accuracy in both binary and multiclass categories.
A Reliable Knowledge Processing Framework for Combustion Science using Foundation Models
Authors: Vansh Sharma, Venkat Raman
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.00544
Pdf link: https://arxiv.org/pdf/2401.00544
Abstract This research explores the integration of large language models (LLMs) into scientific data assimilation, focusing on combustion science as a case study. Leveraging foundational models integrated with Retrieval-Augmented Generation (RAG) framework, the study introduces an approach to process diverse combustion research data, spanning experimental studies, simulations, and literature. The multifaceted nature of combustion research emphasizes the critical role of knowledge processing in navigating and extracting valuable information from a vast and diverse pool of sources. The developed approach minimizes computational and economic expenses while optimizing data privacy and accuracy. It incorporates prompt engineering and offline open-source LLMs, offering user autonomy in selecting base models. The study provides a thorough examination of text segmentation strategies, conducts comparative studies between LLMs, and explores various optimized prompts to demonstrate the effectiveness of the framework. By incorporating an external database, the framework outperforms a conventional LLM in generating accurate responses and constructing robust arguments. Additionally, the study delves into the investigation of optimized prompt templates for the purpose of efficient extraction of scientific literature. The research addresses concerns related to hallucinations and false research articles by introducing a custom workflow developed with a detection algorithm to filter out inaccuracies. Despite identified areas for improvement, the framework consistently delivers accurate domain-specific responses with minimal human oversight. The prompt-agnostic approach introduced holds promise for future deliberations. The study underscores the significance of integrating LLMs and knowledge processing techniques in scientific research, providing a foundation for advancements in data assimilation and utilization.
Credible Teacher for Semi-Supervised Object Detection in Open Scene
Authors: Jingyu Zhuang, Kuo Wang, Liang Lin, Guanbin Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.00695
Pdf link: https://arxiv.org/pdf/2401.00695
Abstract Semi-Supervised Object Detection (SSOD) has achieved resounding success by leveraging unlabeled data to improve detection performance. However, in Open Scene Semi-Supervised Object Detection (O-SSOD), unlabeled data may contain unknown objects not observed in the labeled data, which will increase uncertainty in the model's predictions for known objects. This is detrimental to current methods that mainly rely on self-training, as more uncertainty leads to lower localization and classification precision of pseudo labels. To this end, we propose Credible Teacher, an end-to-end framework. Credible Teacher adopts an interactive teaching mechanism using flexible labels to prevent uncertain pseudo labels from misleading the model and gradually reduces its uncertainty through the guidance of other credible pseudo labels. Empirical results demonstrate that our method effectively restrains the adverse effect caused by O-SSOD and significantly outperforms existing counterparts.
The Earth is Flat? Unveiling Factual Errors in Large Language Models
Authors: Wenxuan Wang, Juluan Shi, Zhaopeng Tu, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.00761
Pdf link: https://arxiv.org/pdf/2401.00761
Abstract Large Language Models (LLMs) like ChatGPT are foundational in various applications due to their extensive knowledge from pre-training and fine-tuning. Despite this, they are prone to generating factual and commonsense errors, raising concerns that they may mislead users in critical areas like healthcare, journalism, and education. Current methods for evaluating LLMs' veracity are limited by test data leakage or the need for extensive human labor, hindering efficient and accurate error detection. To tackle this problem, we introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs. This framework involves three main steps: First, it constructs a factual knowledge graph by retrieving fact triplets from a large-scale knowledge database. Then, leveraging the knowledge graph, FactChecker employs a rule-based approach to generate three types of questions (Yes-No, Multiple-Choice, and WH questions) that involve single-hop and multi-hop relations, along with correct answers. Lastly, it assesses the LLMs' responses for accuracy using tailored matching strategies for each question type. Our extensive tests on six prominent LLMs, including text-davinci-002, text-davinci-003, ChatGPT (gpt-3.5-turbo, gpt-4), Vicuna, and LLaMA-2, reveal that FactChecker can trigger factual errors in up to 45% of questions in these models. Moreover, we demonstrate that FactChecker's test cases can improve LLMs' factual accuracy through in-context learning and fine-tuning (e.g., llama-2-13b-chat's accuracy increases from 35.3% to 68.5%). We are making all code, data, and results available for future research endeavors.
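The three steps described in this abstract (fact triplets, rule-based question generation, per-type answer matching) can be illustrated with a toy sketch. This is not the FactChecker implementation: the triplets, the question templates, the function names, and the matching rules are all hypothetical, and only single-hop relations are covered.

```python
import random

# Hypothetical toy fact triplets (subject, relation, object); not taken from the paper.
TRIPLETS = [
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("Rome", "capital_of", "Italy"),
]


def yes_no_question(triplet, distractor=None):
    """Build a Yes-No question; a distractor subject flips the answer to 'no'."""
    subj, rel, obj = triplet
    shown = distractor if distractor is not None else subj
    answer = "yes" if distractor is None else "no"
    return f"Is {shown} the {rel.replace('_', ' ')} {obj}?", answer


def multiple_choice_question(triplet, all_triplets, rng):
    """Build a Multiple-Choice question whose options are subjects of the same relation."""
    subj, rel, obj = triplet
    options = sorted({s for s, r, _ in all_triplets if r == rel})
    rng.shuffle(options)
    text = f"Which of these is the {rel.replace('_', ' ')} {obj}? Options: {', '.join(options)}"
    return text, subj


def wh_question(triplet):
    """Build a WH question whose answer is the triplet's subject."""
    subj, rel, obj = triplet
    return f"What is the {rel.replace('_', ' ')} {obj}?", subj


def matches(question_type, response, answer):
    """Assumed matching rules: first word for Yes-No, substring for the other types."""
    text = response.strip().lower()
    if question_type == "yes_no":
        return bool(text) and text.split()[0].strip(".,!") == answer
    return answer.lower() in text


question, expected = yes_no_question(TRIPLETS[0], distractor="Berlin")
print(question, "->", expected)
```

A real pipeline would draw triplets from a large knowledge base and would need more robust matching than these string checks.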
Unsupervised Outlier Detection using Random Subspace and Subsampling Ensembles of Dirichlet Process Mixtures
Authors: Dongwook Kim, Juyeon Park, Hee Cheol Chung, Seonghyun Jeong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2401.00773
Pdf link: https://arxiv.org/pdf/2401.00773
Abstract Probabilistic mixture models are acknowledged as a valuable tool for unsupervised outlier detection owing to their interpretability and intuitive grounding in statistical principles. Within this framework, Dirichlet process mixture models emerge as a compelling alternative to conventional finite mixture models for both clustering and outlier detection tasks. However, despite their evident advantages, the widespread adoption of Dirichlet process mixture models in unsupervised outlier detection has been hampered by challenges related to computational inefficiency and sensitivity to outliers during the construction of detectors. To tackle these challenges, we propose a novel outlier detection method based on ensembles of Dirichlet process Gaussian mixtures. The proposed method is a fully unsupervised algorithm that capitalizes on random subspace and subsampling ensembles, not only ensuring efficient computation but also enhancing the robustness of the resulting outlier detector. Moreover, the proposed method leverages variational inference for Dirichlet process mixtures to ensure efficient and fast computation. Empirical studies with benchmark datasets demonstrate that our method outperforms existing approaches for unsupervised outlier detection.
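The ensemble idea in this abstract, fitting many detectors on random feature subsets and random subsamples and then averaging their scores, can be sketched as follows. As a loud simplification, each ensemble member here is a single diagonal Gaussian fitted in closed form, standing in for the Dirichlet process Gaussian mixture with variational inference that the paper uses. All function names and parameter defaults are hypothetical.

```python
import math
import random


def fit_diag_gaussian(rows, dims):
    """Per-feature mean and variance on the chosen feature subset (stand-in model)."""
    params = []
    for d in dims:
        vals = [r[d] for r in rows]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals) + 1e-6
        params.append((mean, var))
    return params


def neg_log_likelihood(row, dims, params):
    """Negative log-density of one row under the fitted diagonal Gaussian."""
    return sum(
        0.5 * (math.log(2 * math.pi * var) + (row[d] - mean) ** 2 / var)
        for d, (mean, var) in zip(dims, params)
    )


def ensemble_outlier_scores(data, n_members=20, subspace_size=2,
                            subsample_frac=0.5, seed=0):
    """Average the outlier score over members fitted on random subspaces and subsamples."""
    rng = random.Random(seed)
    n_features = len(data[0])
    n_sub = max(2, int(subsample_frac * len(data)))
    scores = [0.0] * len(data)
    for _ in range(n_members):
        dims = rng.sample(range(n_features), subspace_size)  # random subspace
        rows = rng.sample(data, n_sub)                        # random subsample
        params = fit_diag_gaussian(rows, dims)
        for i, row in enumerate(data):
            scores[i] += neg_log_likelihood(row, dims, params) / n_members
    return scores
```

Higher scores mean lower likelihood under the ensemble. Fitting each member on a subsample limits how much any single outlier can distort the fitted model, which is the robustness argument the abstract makes.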
Keyword: face recognition

Depth Map Denoising Network and Lightweight Fusion Network for Enhanced 3D Face Recognition
Authors: Ruizhuo Xu, Ke Wang, Chao Deng, Mei Wang, Xi Chen, Wenhui Huang, Junlan Feng, Weihong Deng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.00719
Pdf link: https://arxiv.org/pdf/2401.00719
Abstract With the increasing availability of consumer depth sensors, 3D face recognition (FR) has attracted more and more attention. However, the data acquired by these sensors are often coarse and noisy, making them impractical to use directly. In this paper, we introduce an innovative Depth map denoising network (DMDNet) based on the Denoising Implicit Image Function (DIIF) to reduce noise and enhance the quality of facial depth images for low-quality 3D FR. After generating clean depth faces using DMDNet, we further design a powerful recognition network called Lightweight Depth and Normal Fusion network (LDNFNet), which incorporates a multi-branch fusion block to learn unique and complementary features between different modalities such as depth and normal images. Comprehensive experiments conducted on four distinct low-quality databases demonstrate the effectiveness and robustness of our proposed methods. Furthermore, when combining DMDNet and LDNFNet, we achieve state-of-the-art results on the Lock3DFace database.
Keyword: augmentation

A comprehensive framework for occluded human pose estimation
Authors: Linhao Xu, Lin Zhao, Xinxin Sun, Guangyu Li, Kedong Yan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.00155
Pdf link: https://arxiv.org/pdf/2401.00155
Abstract Occlusion presents a significant challenge in human pose estimation. The challenges posed by occlusion can be attributed to the following factors: 1) Data: The collection and annotation of occluded human pose samples are relatively challenging. 2) Feature: Occlusion can cause feature confusion due to the high similarity between the target person and interfering individuals. 3) Inference: Robust inference becomes challenging due to the loss of complete body structural information. The existing methods designed for occluded human pose estimation usually focus on addressing only one of these factors. In this paper, we propose a comprehensive framework DAG (Data, Attention, Graph) to address the performance degradation caused by occlusion. Specifically, we introduce the mask joints with instance paste data augmentation technique to simulate occlusion scenarios. Additionally, an Adaptive Discriminative Attention Module (ADAM) is proposed to effectively enhance the features of target individuals. Furthermore, we present the Feature-Guided Multi-Hop GCN (FGMP-GCN) to fully explore the prior knowledge of body structure and improve pose estimation results. Through extensive experiments conducted on three benchmark datasets for occluded human pose estimation, we demonstrate that the proposed method outperforms existing methods. Code and data will be publicly available.
Probing the Limits and Capabilities of Diffusion Models for the Anatomic Editing of Digital Twins
Authors: Karim Kadry, Shreya Gupta, Farhad R. Nezami, Elazer R. Edelman
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2401.00247
Pdf link: https://arxiv.org/pdf/2401.00247
Abstract Numerical simulations can model the physical processes that govern cardiovascular device deployment. When such simulations incorporate digital twins, computational models of patient-specific anatomy, they can expedite and de-risk the device design process. Nonetheless, the exclusive use of patient-specific data constrains the anatomic variability which can be precisely or fully explored. In this study, we investigate the capacity of Latent Diffusion Models (LDMs) to edit digital twins to create anatomic variants, which we term digital siblings. Digital twins and their corresponding siblings can serve as the basis for comparative simulations, enabling the study of how subtle anatomic variations impact the simulated deployment of cardiovascular devices, as well as the augmentation of virtual cohorts for device assessment. However, while diffusion models have been characterized in their ability to edit natural images, their capacity to anatomically edit digital twins has yet to be studied. Using a case example centered on 3D digital twins of cardiac anatomy, we implement various methods for generating digital siblings and characterize them through morphological and topological analyses. We specifically edit digital twins to introduce anatomic variation at different spatial scales and within localized regions, demonstrating the existence of bias towards common anatomic features. We further show that such anatomic bias can be leveraged for virtual cohort augmentation through selective editing, partially alleviating issues related to dataset imbalance and lack of diversity. Our experimental framework thus delineates the limits and capabilities of using latent diffusion models in synthesizing anatomic variation for in silico trials.
SHARE: Single-view Human Adversarial REconstruction
Authors: Shreelekha Revankar, Shijia Liao, Yu Shen, Junbang Liang, Huaishu Peng, Ming Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.00343
Pdf link: https://arxiv.org/pdf/2401.00343
Abstract The accuracy of 3D Human Pose and Shape reconstruction (HPS) from an image is progressively improving. Yet, no known method is robust across all image distortions. To address issues due to variations of camera poses, we introduce SHARE, a novel fine-tuning method that utilizes adversarial data augmentation to enhance the robustness of existing HPS techniques. We perform a comprehensive analysis of the impact of camera poses on HPS reconstruction outcomes. We first generated large-scale image datasets captured systematically from diverse camera perspectives. We then established a mapping between camera poses and reconstruction errors as a continuous function that characterizes the relationship between camera poses and HPS quality. Leveraging this representation, we introduce RoME (Regions of Maximal Error), a novel sampling technique for our adversarial fine-tuning method. The SHARE framework is generalizable across various single-view HPS methods and we demonstrate its performance on HMR, SPIN, PARE, CLIFF and ExPose. Our results illustrate a reduction in mean joint errors across single-view HPS techniques, for images captured from multiple camera positions without compromising their baseline performance. In many challenging cases, our method surpasses the performance of existing models, highlighting its practical significance for diverse real-world applications.
Generative Model-Driven Synthetic Training Image Generation: An Approach to Cognition in Rail Defect Detection
Authors: Rahatara Ferdousi, Chunsheng Yang, M. Anwar Hossain, Fedwa Laamarti, M. Shamim Hossain, Abdulmotaleb El Saddik
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2401.00393
Pdf link: https://arxiv.org/pdf/2401.00393
Abstract Recent advancements in cognitive computing, with the integration of deep learning techniques, have facilitated the development of intelligent cognitive systems (ICS). This is particularly beneficial in the context of rail defect detection, where the ICS would emulate human-like analysis of image data for defect patterns. Despite the success of Convolutional Neural Networks (CNN) in visual defect classification, the scarcity of large datasets for rail defect detection remains a challenge due to infrequent accident events that would result in defective parts and images. Contemporary researchers have addressed this data scarcity challenge by exploring rule-based and generative data augmentation models. Among these, Variational Autoencoder (VAE) models can generate realistic data without extensive baseline datasets for noise modeling. This study proposes a VAE-based synthetic image generation technique for rail defects, incorporating weight decay regularization and image reconstruction loss to prevent overfitting. The proposed method is applied to create a synthetic dataset for the Canadian Pacific Railway (CPR) with just 50 real samples across five classes. Remarkably, 500 synthetic samples are generated with a minimal reconstruction loss of 0.021. A Visual Transformer (ViT) model underwent fine-tuning using this synthetic CPR dataset, achieving high accuracy rates (98%-99%) in classifying the five defect classes. This research offers a promising solution to the data scarcity challenge in rail defect detection, showcasing the potential for robust ICS development in this domain.
SDIF-DA: A Shallow-to-Deep Interaction Framework with Data Augmentation for Multi-modal Intent Detection
Authors: Shijue Huang, Libo Qin, Bingbing Wang, Geng Tu, Ruifeng Xu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.00424
Pdf link: https://arxiv.org/pdf/2401.00424
Abstract Multi-modal intent detection aims to utilize various modalities to understand the user's intentions, which is essential for the deployment of dialogue systems in real-world scenarios. The two core challenges for multi-modal intent detection are (1) how to effectively align and fuse different features of modalities and (2) the limited labeled multi-modal intent training data. In this work, we introduce a shallow-to-deep interaction framework with data augmentation (SDIF-DA) to address the above challenges. Firstly, SDIF-DA leverages a shallow-to-deep interaction module to progressively and effectively align and fuse features across text, video, and audio modalities. Secondly, we propose a ChatGPT-based data augmentation approach to automatically augment sufficient training data. Experimental results demonstrate that SDIF-DA can effectively align and fuse multi-modal features by achieving state-of-the-art performance. In addition, extensive analyses show that the introduced data augmentation approach can successfully distill knowledge from the large language model.
Predicting Anti-microbial Resistance using Large Language Models
Authors: Hyunwoo Yoo, Bahrad Sokhansanj, James R. Brown, Gail Rosen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.00642
Pdf link: https://arxiv.org/pdf/2401.00642
Abstract During times of increasing antibiotic resistance and the spread of infectious diseases like COVID-19, it is important to classify genes related to antibiotic resistance. As natural language processing has advanced with transformer-based language models, many language models that learn characteristics of nucleotide sequences have also emerged. These models show good performance in classifying various features of nucleotide sequences. When classifying nucleotide sequences, not only the sequence itself, but also various background knowledge is utilized. In this study, we use not only a nucleotide sequence-based language model but also a text language model based on PubMed articles to reflect more biological background knowledge in the model. We propose a method to fine-tune the nucleotide sequence language model and the text language model based on various databases of antibiotic resistance genes. We also propose an LLM-based augmentation technique to supplement the data and an ensemble method to effectively combine the two models. We also propose a benchmark for evaluating the model. Our method achieved better performance than the nucleotide sequence language model in the drug resistance class prediction.
Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation
Authors: Huimeng Wang, Zengrui Jin, Mengzhe Geng, Shujie Hu, Guinan Li, Tianzi Wang, Haoning Xu, Xunying Liu
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2401.00662
Pdf link: https://arxiv.org/pdf/2401.00662
Abstract Automatic recognition of dysarthric speech remains a highly challenging task to date. Neuro-motor conditions and co-occurring physical disabilities create difficulty in large-scale data collection for ASR system development. Adapting SSL pre-trained ASR models to limited dysarthric speech via data-intensive parameter fine-tuning leads to poor generalization. To this end, this paper presents an extensive comparative study of various data augmentation approaches to improve the robustness of pre-trained ASR model fine-tuning to dysarthric speech. These include: a) conventional speaker-independent perturbation of impaired speech; b) speaker-dependent speed perturbation, or GAN-based adversarial perturbation of normal, control speech based on their time alignment against parallel dysarthric speech; c) novel Spectral basis GAN-based adversarial data augmentation operating on non-parallel data. Experiments conducted on the UASpeech corpus suggest GAN-based data augmentation consistently outperforms fine-tuned Wav2vec2.0 and HuBERT models using no data augmentation and speed perturbation across different data expansion operating points by statistically significant word error rate (WER) reductions up to 2.01% and 0.96% absolute (9.03% and 4.63% relative) respectively on the UASpeech test set of 16 dysarthric speakers. After cross-system outputs rescoring, the best system produced the lowest published WER of 16.53% (46.47% on very low intelligibility) on UASpeech.
Refining Pre-Trained Motion Models
Authors: Xinglong Sun, Adam W. Harley, Leonidas J. Guibas
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.00850
Pdf link: https://arxiv.org/pdf/2401.00850
Abstract Given the difficulty of manually annotating motion in video, the current best motion estimation methods are trained with synthetic data, and therefore struggle somewhat due to a train/test gap. Self-supervised methods hold the promise of training directly on real video, but typically perform worse. These include methods trained with warp error (i.e., color constancy) combined with smoothness terms, and methods that encourage cycle-consistency in the estimates (i.e., tracking backwards should yield the opposite trajectory as tracking forwards). In this work, we take on the challenge of improving state-of-the-art supervised models with self-supervised training. We find that when the initialization is supervised weights, most existing self-supervision techniques actually make performance worse instead of better, which suggests that the benefit of seeing the new data is overshadowed by the noise in the training signal. Focusing on obtaining a "clean" training signal from real-world unlabelled video, we propose to separate label-making and training into two distinct stages. In the first stage, we use the pre-trained model to estimate motion in a video, and then select the subset of motion estimates which we can verify with cycle-consistency. This produces a sparse but accurate pseudo-labelling of the video. In the second stage, we fine-tune the model to reproduce these outputs, while also applying augmentations on the input. We complement this boot-strapping method with simple techniques that densify and re-balance the pseudo-labels, ensuring that we do not merely train on "easy" tracks. We show that our method yields reliable gains over fully-supervised methods in real videos, for both short-term (flow-based) and long-range (multi-frame) pixel tracking.
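The first stage described in this abstract, keeping only motion estimates that can be verified by cycle-consistency, can be sketched as a simple filter. This is a minimal illustration, not the authors' code: the function names, the pixel threshold, and the toy trackers are hypothetical. In the paper a single pre-trained model would produce both the forward and the backward tracks.

```python
import math


def select_cycle_consistent(points, track_forward, track_backward, max_error=1.0):
    """Keep (start, end) pairs whose forward-then-backward round trip lands near the start."""
    pseudo_labels = []
    for start in points:
        end = track_forward(start)
        back = track_backward(end)
        if math.dist(start, back) <= max_error:  # round-trip error in pixels
            pseudo_labels.append((start, end))
    return pseudo_labels


# Hypothetical toy trackers: true motion is +2 px in x, but the forward tracker
# drifts by an extra 3 px for points with x > 5, so their round trip fails.
def toy_forward(p):
    x, y = p
    drift = 3.0 if x > 5 else 0.0
    return (x + 2.0 + drift, y)


def toy_backward(p):
    x, y = p
    return (x - 2.0, y)


labels = select_cycle_consistent(
    [(1.0, 1.0), (4.0, 2.0), (6.0, 3.0)], toy_forward, toy_backward
)
print(labels)
```

The surviving pairs would then serve as sparse pseudo-labels for the second, fine-tuning stage.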

LeeKyungwook / get-arxiv-noti

New submissions for Tue, 2 Jan 24 #915
