Abstract
Multimodal Conversational Emotion (MCE) detection, generally spanning across the acoustic, vision and language modalities, has attracted increasing interest in the multimedia community. Previous studies predominantly focus on learning contextual information in conversations with only a few considering the topic information in single language modality, while always neglecting the acoustic and vision topic information. On this basis, we propose a model-agnostic Topic-enriched Diffusion (TopicDiff) approach for capturing multimodal topic information in MCE tasks. Particularly, we integrate the diffusion model into neural topic model to alleviate the diversity deficiency problem of neural topic model in capturing topic information. Detailed evaluations demonstrate the significant improvements of TopicDiff over the state-of-the-art MCE baselines, justifying the importance of multimodal topic information to MCE and the effectiveness of TopicDiff in capturing such information. Furthermore, we observe an interesting finding that the topic information in acoustic and vision is more discriminative and robust compared to the language.
A Data-Driven Two-Phase Multi-Split Causal Ensemble Model for Time Series
Authors: Authors: Zhipeng Ma, Marco Kemmerling, Daniel Buschmann, Chrismarie Enslin, Daniel Lütticke, Robert H. Schmitt
Abstract
Causal inference is a fundamental research topic for discovering the cause-effect relationships in many disciplines. However, not all algorithms are equally well-suited for a given dataset. For instance, some approaches may only be able to identify linear relationships, while others are applicable for non-linearities. Algorithms further vary in their sensitivity to noise and their ability to infer causal information from coupled vs. non-coupled time series. Therefore, different algorithms often generate different causal relationships for the same input. To achieve a more robust causal inference result, this publication proposes a novel data-driven two-phase multi-split causal ensemble model to combine the strengths of different causality base algorithms. In comparison to existing approaches, the proposed ensemble method reduces the influence of noise through a data partitioning scheme in the first phase. To achieve this, the data are initially divided into several partitions and the base algorithms are applied to each partition. Subsequently, Gaussian mixture models are used to identify the causal relationships derived from the different partitions that are likely to be valid. In the second phase, the identified relationships from each base algorithm are then merged based on three combination rules. The proposed ensemble approach is evaluated using multiple metrics, among them a newly developed evaluation index for causal ensemble approaches. We perform experiments using three synthetic datasets with different volumes and complexity, which are specifically designed to test causality detection methods under different circumstances while knowing the ground truth causal relationships. In these experiments, our causality ensemble outperforms each of its base algorithms. In practical applications, the use of the proposed method could hence lead to more robust and reliable causality results.
Enhancing Security in Federated Learning through Adaptive Consensus-Based Model Update Validation
Authors: Authors: Zahir Alsulaimawi
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Abstract
This paper introduces an advanced approach for fortifying Federated Learning (FL) systems against label-flipping attacks. We propose a simplified consensus-based verification process integrated with an adaptive thresholding mechanism. This dynamic thresholding is designed to adjust based on the evolving landscape of model updates, offering a refined layer of anomaly detection that aligns with the real-time needs of distributed learning environments. Our method necessitates a majority consensus among participating clients to validate updates, ensuring that only vetted and consensual modifications are applied to the global model. The efficacy of our approach is validated through experiments on two benchmark datasets in deep learning, CIFAR-10 and MNIST. Our results indicate a significant mitigation of label-flipping attacks, bolstering the FL system's resilience. This method transcends conventional techniques that depend on anomaly detection or statistical validation by incorporating a verification layer reminiscent of blockchain's participatory validation without the associated cryptographic overhead. The innovation of our approach rests in striking an optimal balance between heightened security measures and the inherent limitations of FL systems, such as computational efficiency and data privacy. Implementing a consensus mechanism specifically tailored for FL environments paves the way for more secure, robust, and trustworthy distributed machine learning applications, where safeguarding data integrity and model robustness is critical.
Investigation of the Impact of Synthetic Training Data in the Industrial Application of Terminal Strip Object Detection
Authors: Authors: Nico Baumgart, Markus Lange-Hegermann, Mike Mücke
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
In industrial manufacturing, numerous tasks of visually inspecting or detecting specific objects exist that are currently performed manually or by classical image processing methods. Therefore, introducing recent deep learning models to industrial environments holds the potential to increase productivity and enable new applications. However, gathering and labeling sufficient data is often intractable, complicating the implementation of such projects. Hence, image synthesis methods are commonly used to generate synthetic training data from 3D models and annotate them automatically, although it results in a sim-to-real domain gap. In this paper, we investigate the sim-to-real generalization performance of standard object detectors on the complex industrial application of terminal strip object detection. Combining domain randomization and domain knowledge, we created an image synthesis pipeline for automatically generating the training data. Moreover, we manually annotated 300 real images of terminal strips for the evaluation. The results show the cruciality of the objects of interest to have the same scale in either domain. Nevertheless, under optimized scaling conditions, the sim-to-real performance difference in mean average precision amounts to 2.69 % for RetinaNet and 0.98 % for Faster R-CNN, qualifying this approach for industrial requirements.
Divide and Conquer: High-Resolution Industrial Anomaly Detection via Memory Efficient Tiled Ensemble
Authors: Authors: Blaž Rolih, Samet Akçay, Dick Ameln, Ashwin Vaidya
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Industrial anomaly detection is an important task within computer vision with a wide range of practical use cases. The small size of anomalous regions in many real-world datasets necessitates processing the images at a high resolution. This frequently poses significant challenges concerning memory consumption during the model training and inference stages, leaving some existing methods impractical for widespread adoption. To overcome this challenge, we present the tiled ensemble approach, which reduces memory consumption by dividing the input images into a grid of tiles and training a dedicated model for each tile location. The tiled ensemble is compatible with any existing anomaly detection model without the need for any modification of the underlying architecture. By introducing overlapping tiles, we utilize the benefits of traditional stacking ensembles, leading to further improvements in anomaly detection capabilities beyond high resolution alone. We perform a comprehensive analysis using diverse underlying architectures, including Padim, PatchCore, FastFlow, and Reverse Distillation, on two standard anomaly detection datasets: MVTec and VisA. Our method demonstrates a notable improvement across setups while remaining within GPU memory constraints, consuming only as much GPU memory as a single model needs to process a single tile.
ActFormer: Scalable Collaborative Perception via Active Queries
Abstract
Collaborative perception leverages rich visual observations from multiple robots to extend a single robot's perception ability beyond its field of view. Many prior works receive messages broadcast from all collaborators, leading to a scalability challenge when dealing with a large number of robots and sensors. In this work, we aim to address \textit{scalable camera-based collaborative perception} with a Transformer-based architecture. Our key idea is to enable a single robot to intelligently discern the relevance of the collaborators and their associated cameras according to a learned spatial prior. This proactive understanding of the visual features' relevance does not require the transmission of the features themselves, enhancing both communication and computation efficiency. Specifically, we present ActFormer, a Transformer that learns bird's eye view (BEV) representations by using predefined BEV queries to interact with multi-robot multi-camera inputs. Each BEV query can actively select relevant cameras for information aggregation based on pose information, instead of interacting with all cameras indiscriminately. Experiments on the V2X-Sim dataset demonstrate that ActFormer improves the detection performance from 29.89% to 45.15% in terms of AP@0.7 with about 50% fewer queries, showcasing the effectiveness of ActFormer in multi-agent collaborative 3D object detection.
Robust Surgical Tool Tracking with Pixel-based Probabilities for Projected Geometric Primitives
Authors: Authors: Christopher D'Ambrosia, Florian Richter, Zih-Yun Chiu, Nikhil Shinde, Fei Liu, Henrik I. Christensen, Michael C. Yip
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Controlling robotic manipulators via visual feedback requires a known coordinate frame transformation between the robot and the camera. Uncertainties in mechanical systems as well as camera calibration create errors in this coordinate frame transformation. These errors result in poor localization of robotic manipulators and create a significant challenge for applications that rely on precise interactions between manipulators and the environment. In this work, we estimate the camera-to-base transform and joint angle measurement errors for surgical robotic tools using an image based insertion-shaft detection algorithm and probabilistic models. We apply our proposed approach in both a structured environment as well as an unstructured environment and measure to demonstrate the efficacy of our methods.
Node Centrality Approximation For Large Networks Based On Inductive Graph Neural Networks
Authors: Authors: Yiwei Zou, Ting Li, Zong-fu Luo
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Abstract
Closeness Centrality (CC) and Betweenness Centrality (BC) are crucial metrics in network analysis, providing essential reference for discerning the significance of nodes within complex networks. These measures find wide applications in critical tasks, such as community detection and network dismantling. However, their practical implementation on extensive networks remains computationally demanding due to their high time complexity. To mitigate these computational challenges, numerous approximation algorithms have been developed to expedite the computation of CC and BC. Nevertheless, even these approximations still necessitate substantial processing time when applied to large-scale networks. Furthermore, their output proves sensitive to even minor perturbations within the network structure. In this work, We redefine the CC and BC node ranking problem as a machine learning problem and propose the CNCA-IGE model, which is an encoder-decoder model based on inductive graph neural networks designed to rank nodes based on specified CC or BC metrics. We incorporate the MLP-Mixer model as the decoder in the BC ranking prediction task to enhance the model's robustness and capacity. Our approach is evaluated on diverse synthetic and real-world networks of varying scales, and the experimental results demonstrate that the CNCA-IGE model outperforms state-of-the-art baseline models, significantly reducing execution time while improving performance.
Abstract
Formal methods for guaranteeing that a protocol satisfies a cryptographic security definition have advanced substantially, but such methods are still labor intensive and the need remains for an automated tool that can positively identify an insecure protocol. In this work, we demonstrate that property-based testing, "run it a bunch of times and see if it breaks", is effective for detecting security bugs in secure protocols. We specifically target Secure Multi-Party Computation (MPC), because formal methods targeting this security definition for bit-model implementations are particularly difficult. Using results from the literature for Probabilistic Programming Languages and statistical inference, we devise a test that can detect various flaws in a bit-level implementation of an MPC protocol. The test is grey-box; it requires only transcripts of randomness consumed by the protocol and of the inputs, outputs, and messages. It successfully detects several different mistakes and biases introduced into two different implementations of the classic GMW protocol. Applied to hundreds of randomly generated protocols, it identifies nearly all of them as insecure. We also include an analysis of the parameters of the test, and discussion of what makes detection of MPC (in)security difficult.
RadarDistill: Boosting Radar-based Object Detection Performance via Knowledge Distillation from LiDAR Features
Authors: Authors: Geonho Bang, Kwangjin Choi, Jisong Kim, Dongsuk Kum, Jun Won Choi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
The inherent noisy and sparse characteristics of radar data pose challenges in finding effective representations for 3D object detection. In this paper, we propose RadarDistill, a novel knowledge distillation (KD) method, which can improve the representation of radar data by leveraging LiDAR data. RadarDistill successfully transfers desirable characteristics of LiDAR features into radar features using three key components: Cross-Modality Alignment (CMA), Activation-based Feature Distillation (AFD), and Proposal-based Feature Distillation (PFD). CMA enhances the density of radar features through multiple layers of dilation operations, effectively addressing the challenges of inefficient knowledge transfer from LiDAR to radar. AFD is designed to transfer knowledge from significant areas of the LiDAR features, specifically those regions where activation intensity exceeds a predetermined threshold. PFD guides the radar network to mimic LiDAR network features in the object proposals for accurately detected results while moderating features for misdetected proposals like false positives. Our comparative analyses conducted on the nuScenes datasets demonstrate that RadarDistill achieves state-of-the-art (SOTA) performance for radar-only object detection task, recording 20.5% in mAP and 43.7% in NDS. Also, RadarDistill significantly improves the performance of the camera-radar fusion model.
Simulating Battery-Powered TinyML Systems Optimised using Reinforcement Learning in Image-Based Anomaly Detection
Abstract
Advances in Tiny Machine Learning (TinyML) have bolstered the creation of smart industry solutions, including smart agriculture, healthcare and smart cities. Whilst related research contributes to enabling TinyML solutions on constrained hardware, there is a need to amplify real-world applications by optimising energy consumption in battery-powered systems. The work presented extends and contributes to TinyML research by optimising battery-powered image-based anomaly detection Internet of Things (IoT) systems. Whilst previous work in this area has yielded the capabilities of on-device inferencing and training, there has yet to be an investigation into optimising the management of such capabilities using machine learning approaches, such as Reinforcement Learning (RL), to improve the deployment battery life of such systems. Using modelled simulations, the battery life effects of an RL algorithm are benchmarked against static and dynamic optimisation approaches, with the foundation laid for a hardware benchmark to follow. It is shown that using RL within a TinyML-enabled IoT system to optimise the system operations, including cloud anomaly processing and on-device training, yields an improved battery life of 22.86% and 10.86% compared to static and dynamic optimisation approaches respectively. The proposed solution can be deployed to resource-constrained hardware, given its low memory footprint of 800 B, which could be further reduced. This further facilitates the real-world deployment of such systems, including key sectors such as smart agriculture.
Evaluating Text-to-Image Generative Models: An Empirical Study on Human Image Synthesis
Abstract
In this paper, we present an empirical study introducing a nuanced evaluation framework for text-to-image (T2I) generative models, applied to human image synthesis. Our framework categorizes evaluations into two distinct groups: first, focusing on image qualities such as aesthetics and realism, and second, examining text conditions through concept coverage and fairness. We introduce an innovative aesthetic score prediction model that assesses the visual appeal of generated images and unveils the first dataset marked with low-quality regions in generated human images to facilitate automatic defect detection. Our exploration into concept coverage probes the model's effectiveness in interpreting and rendering text-based concepts accurately, while our analysis of fairness reveals biases in model outputs, with an emphasis on gender, race, and age. While our study is grounded in human imagery, this dual-faceted approach is designed with the flexibility to be applicable to other forms of image generation, enhancing our understanding of generative models and paving the way to the next generation of more sophisticated, contextually aware, and ethically attuned generative models. We will release our code, the data used for evaluating generative models and the dataset annotated with defective areas soon.
LanePtrNet: Revisiting Lane Detection as Point Voting and Grouping on Curves
Abstract
Lane detection plays a critical role in the field of autonomous driving. Prevailing methods generally adopt basic concepts (anchors, key points, etc.) from object detection and segmentation tasks, while these approaches require manual adjustments for curved objects, involve exhaustive searches on predefined anchors, require complex post-processing steps, and may lack flexibility when applied to real-world scenarios.In this paper, we propose a novel approach, LanePtrNet, which treats lane detection as a process of point voting and grouping on ordered sets: Our method takes backbone features as input and predicts a curve-aware centerness, which represents each lane as a point and assigns the most probable center point to it. A novel point sampling method is proposed to generate a set of candidate points based on the votes received. By leveraging features from local neighborhoods, and cross-instance attention score, we design a grouping module that further performs lane-wise clustering between neighboring and seeding points. Furthermore, our method can accommodate a point-based framework, (PointNet++ series, etc.) as an alternative to the backbone. This flexibility enables effortless extension to 3D lane detection tasks. We conduct comprehensive experiments to validate the effectiveness of our proposed approach, demonstrating its superior performance.
Learning Expressive And Generalizable Motion Features For Face Forgery Detection
Authors: Authors: Jingyi Zhang, Peng Zhang, Jingjing Wang, Di Xie, Shiliang Pu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Previous face forgery detection methods mainly focus on appearance features, which may be easily attacked by sophisticated manipulation. Considering the majority of current face manipulation methods generate fake faces based on a single frame, which do not take frame consistency and coordination into consideration, artifacts on frame sequences are more effective for face forgery detection. However, current sequence-based face forgery detection methods use general video classification networks directly, which discard the special and discriminative motion information for face manipulation detection. To this end, we propose an effective sequence-based forgery detection framework based on an existing video classification method. To make the motion features more expressive for manipulation detection, we propose an alternative motion consistency block instead of the original motion features module. To make the learned features more generalizable, we propose an auxiliary anomaly detection block. With these two specially designed improvements, we make a general video classification network achieve promising results on three popular face forgery datasets.
Improving the Successful Robotic Grasp Detection Using Convolutional Neural Networks
Authors: Authors: Hamed Hosseini, Mehdi Tale Masouleh, Ahmad Kalhor
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Robotic grasp should be carried out in a real-time manner by proper accuracy. Perception is the first and significant step in this procedure. This paper proposes an improved pipeline model trying to detect grasp as a rectangle representation for different seen or unseen objects. It helps the robot to start control procedures from nearer to the proper part of the object. The main idea consists in pre-processing, output normalization, and data augmentation to improve accuracy by 4.3 percent without making the system slow. Also, a comparison has been conducted over different pre-trained models like AlexNet, ResNet, Vgg19, which are the most famous feature extractors for image processing in object detection. Although AlexNet has less complexity than other ones, it outperformed them, which helps the real-time property.
SocialPET: Socially Informed Pattern Exploiting Training for Few-Shot Stance Detection in Social Media
Abstract
Stance detection, as the task of determining the viewpoint of a social media post towards a target as 'favor' or 'against', has been understudied in the challenging yet realistic scenario where there is limited labeled data for a certain target. Our work advances research in few-shot stance detection by introducing SocialPET, a socially informed approach to leveraging language models for the task. Our proposed approach builds on the Pattern Exploiting Training (PET) technique, which addresses classification tasks as cloze questions through the use of language models. To enhance the approach with social awareness, we exploit the social network structure surrounding social media posts. We prove the effectiveness of SocialPET on two stance datasets, Multi-target and P-Stance, outperforming competitive stance detection models as well as the base model, PET, where the labeled instances for the target under study is as few as 100. When we delve into the results, we observe that SocialPET is comparatively strong in identifying instances of the `against' class, where baseline models underperform.
MMoE: Robust Spoiler Detection with Multi-modal Information and Domain-aware Mixture-of-Experts
Authors: Authors: Zinan Zeng, Sen Ye, Zijian Cai, Heng Wang, Yuhan Liu, Qinghua Zheng, Minnan Luo
Abstract
Online movie review websites are valuable for information and discussion about movies. However, the massive spoiler reviews detract from the movie-watching experience, making spoiler detection an important task. Previous methods simply focus on reviews' text content, ignoring the heterogeneity of information in the platform. For instance, the metadata and the corresponding user's information of a review could be helpful. Besides, the spoiler language of movie reviews tends to be genre-specific, thus posing a domain generalization challenge for existing methods. To this end, we propose MMoE, a multi-modal network that utilizes information from multiple modalities to facilitate robust spoiler detection and adopts Mixture-of-Experts to enhance domain generalization. MMoE first extracts graph, text, and meta feature from the user-movie network, the review's textual content, and the review's metadata respectively. To handle genre-specific spoilers, we then adopt Mixture-of-Experts architecture to process information in three modalities to promote robustness. Finally, we use an expert fusion layer to integrate the features from different perspectives and make predictions based on the fused embedding. Experiments demonstrate that MMoE achieves state-of-the-art performance on two widely-used spoiler detection datasets, surpassing previous SOTA methods by 2.56\% and 8.41\% in terms of accuracy and F1-score. Further experiments also demonstrate MMoE's superiority in robustness and generalization.
Deep Prompt Multi-task Network for Abuse Language Detection
Authors: Authors: Jian Zhu, Yuping Ruan, Jingfei Chang, Cheng Luo
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Abstract
The detection of abusive language remains a long-standing challenge with the extensive use of social networks. The detection task of abusive language suffers from limited accuracy. We argue that the existing detection methods utilize the fine-tuning technique of the pre-trained language models (PLMs) to handle downstream tasks. Hence, these methods fail to stimulate the general knowledge of the PLMs. To address the problem, we propose a novel Deep Prompt Multi-task Network (DPMN) for abuse language detection. Specifically, DPMN first attempts to design two forms of deep prompt tuning and light prompt tuning for the PLMs. The effects of different prompt lengths, tuning strategies, and prompt initialization methods on detecting abusive language are studied. In addition, we propose a Task Head based on Bi-LSTM and FFN, which can be used as a short text classifier. Eventually, DPMN utilizes multi-task learning to improve detection metrics further. The multi-task network has the function of transferring effective knowledge. The proposed DPMN is evaluated against eight typical methods on three public datasets: OLID, SOLID, and AbuseAnalyzer. The experimental results show that our DPMN outperforms the state-of-the-art methods.
VLM-PL: Advanced Pseudo Labeling approach Class Incremental Object Detection with Vision-Language Model
Authors: Authors: Junsu Kim, Yunhoe Ku, Jihyeon Kim, Junuk Cha, Seungryul Baek
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
In the field of Class Incremental Object Detection (CIOD), creating models that can continuously learn like humans is a major challenge. Pseudo-labeling methods, although initially powerful, struggle with multi-scenario incremental learning due to their tendency to forget past knowledge. To overcome this, we introduce a new approach called Vision-Language Model assisted Pseudo-Labeling (VLM-PL). This technique uses Vision-Language Model (VLM) to verify the correctness of pseudo ground-truths (GTs) without requiring additional model training. VLM-PL starts by deriving pseudo GTs from a pre-trained detector. Then, we generate custom queries for each pseudo GT using carefully designed prompt templates that combine image and text features. This allows the VLM to classify the correctness through its responses. Furthermore, VLM-PL integrates refined pseudo and real GTs from upcoming training, effectively combining new and old knowledge. Extensive experiments conducted on the Pascal VOC and MS COCO datasets not only highlight VLM-PL's exceptional performance in multi-scenario but also illuminate its effectiveness in dual-scenario by achieving state-of-the-art results in both.
Frequency-Adaptive Dilated Convolution for Semantic Segmentation
Authors: Authors: Linwei Chen, Lin Gu, Ying Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Dilated convolution, which expands the receptive field by inserting gaps between its consecutive elements, is widely employed in computer vision. In this study, we propose three strategies to improve individual phases of dilated convolution from the view of spectrum analysis. Departing from the conventional practice of fixing a global dilation rate as a hyperparameter, we introduce Frequency-Adaptive Dilated Convolution (FADC), which dynamically adjusts dilation rates spatially based on local frequency components. Subsequently, we design two plug-in modules to directly enhance effective bandwidth and receptive field size. The Adaptive Kernel (AdaKern) module decomposes convolution weights into low-frequency and high-frequency components, dynamically adjusting the ratio between these components on a per-channel basis. By increasing the high-frequency part of convolution weights, AdaKern captures more high-frequency components, thereby improving effective bandwidth. The Frequency Selection (FreqSelect) module optimally balances high- and low-frequency components in feature representations through spatially variant reweighting. It suppresses high frequencies in the background to encourage FADC to learn a larger dilation, thereby increasing the receptive field for an expanded scope. Extensive experiments on segmentation and object detection consistently validate the efficacy of our approach. The code is publicly available at \url{https://github.com/Linwei-Chen/FADC}.
Spectrogram-Based Detection of Auto-Tuned Vocals in Music Recordings
Authors: Authors: Mahyar Gohari, Paolo Bestagini, Sergio Benini, Nicola Adami
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Abstract
In the domain of music production and audio processing, the implementation of automatic pitch correction of the singing voice, also known as Auto-Tune, has significantly transformed the landscape of vocal performance. While auto-tuning technology has offered musicians the ability to tune their vocal pitches and achieve a desired level of precision, its use has also sparked debates regarding its impact on authenticity and artistic integrity. As a result, detecting and analyzing Auto-Tuned vocals in music recordings has become essential for music scholars, producers, and listeners. However, to the best of our knowledge, no prior effort has been made in this direction. This study introduces a data-driven approach leveraging triplet networks for the detection of Auto-Tuned songs, backed by the creation of a dataset composed of original and Auto-Tuned audio clips. The experimental results demonstrate the superiority of the proposed method in both accuracy and robustness compared to Rawnet2, an end-to-end model proposed for anti-spoofing and widely used for other audio forensic tasks.
Exploring Robust Features for Few-Shot Object Detection in Satellite Imagery
Authors: Authors: Xavier Bou, Gabriele Facciolo, Rafael Grompone von Gioi, Jean-Michel Morel, Thibaud Ehret
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
The goal of this paper is to perform object detection in satellite imagery with only a few examples, thus enabling users to specify any object class with minimal annotation. To this end, we explore recent methods and ideas from open-vocabulary detection for the remote sensing domain. We develop a few-shot object detector based on a traditional two-stage architecture, where the classification block is replaced by a prototype-based classifier. A large-scale pre-trained model is used to build class-reference embeddings or prototypes, which are compared to region proposal contents for label prediction. In addition, we propose to fine-tune prototypes on available training images to boost performance and learn differences between similar classes, such as aircraft types. We perform extensive evaluations on two remote sensing datasets containing challenging and rare objects. Moreover, we study the performance of both visual and image-text features, namely DINOv2 and CLIP, including two CLIP models specifically tailored for remote sensing applications. Results indicate that visual features are largely superior to vision-language models, as the latter lack the necessary domain-specific vocabulary. Lastly, the developed detector outperforms fully supervised and few-shot methods evaluated on the SIMD and DIOR datasets, despite minimal training parameters.
SIRST-5K: Exploring Massive Negatives Synthesis with Self-supervised Learning for Robust Infrared Small Target Detection
Authors: Authors: Yahao Lu, Yupei Lin, Han Wu, Xiaoyu Xian, Yukai Shi, Liang Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Single-frame infrared small target (SIRST) detection aims to recognize small targets from clutter backgrounds. Recently, convolutional neural networks have achieved significant advantages in general object detection. With the development of Transformer, the scale of SIRST models is constantly increasing. Due to the limited training samples, performance has not been improved accordingly. The quality, quantity, and diversity of the infrared dataset are critical to the detection of small targets. To highlight this issue, we propose a negative sample augmentation method in this paper. Specifically, a negative augmentation approach is proposed to generate massive negatives for self-supervised learning. Firstly, we perform a sequential noise modeling technology to generate realistic infrared data. Secondly, we fuse the extracted noise with the original data to facilitate diversity and fidelity in the generated data. Lastly, we proposed a negative augmentation strategy to enrich diversity as well as maintain semantic invariance. The proposed algorithm produces a synthetic SIRST-5K dataset, which contains massive pseudo-data and corresponding labels. With a rich diversity of infrared small target data, our algorithm significantly improves the model performance and convergence speed. Compared with other state-of-the-art (SOTA) methods, our method achieves outstanding performance in terms of probability of detection (Pd), false-alarm rate (Fa), and intersection over union (IoU).
EVD4UAV: An Altitude-Sensitive Benchmark to Evade Vehicle Detection in UAV
Abstract
Vehicle detection in Unmanned Aerial Vehicle (UAV) captured images has wide applications in aerial photography and remote sensing. There are many public benchmark datasets proposed for the vehicle detection and tracking in UAV images. Recent studies show that adding an adversarial patch on objects can fool the well-trained deep neural networks based object detectors, posing security concerns to the downstream tasks. However, the current public UAV datasets might ignore the diverse altitudes, vehicle attributes, fine-grained instance-level annotation in mostly side view with blurred vehicle roof, so none of them is good to study the adversarial patch based vehicle detection attack problem. In this paper, we propose a new dataset named EVD4UAV as an altitude-sensitive benchmark to evade vehicle detection in UAV with 6,284 images and 90,886 fine-grained annotated vehicles. The EVD4UAV dataset has diverse altitudes (50m, 70m, 90m), vehicle attributes (color, type), fine-grained annotation (horizontal and rotated bounding boxes, instance-level mask) in top view with clear vehicle roof. One white-box and two black-box patch based attack methods are implemented to attack three classic deep neural networks based object detectors on EVD4UAV. The experimental results show that these representative attack methods could not achieve the robust altitude-insensitive attack performance.
Authorship Attribution in Bangla Literature (AABL) via Transfer Learning using ULMFiT
Abstract
Authorship Attribution is the task of creating an appropriate characterization of text that captures the authors' writing style to identify the original author of a given piece of text. With increased anonymity on the internet, this task has become increasingly crucial in various security and plagiarism detection fields. Despite significant advancements in other languages such as English, Spanish, and Chinese, Bangla lacks comprehensive research in this field due to its complex linguistic feature and sentence structure. Moreover, existing systems are not scalable when the number of author increases, and the performance drops for small number of samples per author. In this paper, we propose the use of Average-Stochastic Gradient Descent Weight-Dropped Long Short-Term Memory (AWD-LSTM) architecture and an effective transfer learning approach that addresses the problem of complex linguistic features extraction and scalability for authorship attribution in Bangla Literature (AABL). We analyze the effect of different tokenization, such as word, sub-word, and character level tokenization, and demonstrate the effectiveness of these tokenizations in the proposed model. Moreover, we introduce the publicly available Bangla Authorship Attribution Dataset of 16 authors (BAAD16) containing 17,966 sample texts and 13.4+ million words to solve the standard dataset scarcity problem and release six variations of pre-trained language models for use in any Bangla NLP downstream task. For evaluation, we used our developed BAAD16 dataset as well as other publicly available datasets. Empirically, our proposed model outperformed state-of-the-art models and achieved 99.8% accuracy in the BAAD16 dataset. Furthermore, we showed that the proposed system scales much better even with an increasing number of authors, and performance remains steady despite few training samples.
Keyword: face recognition
Federated Learning Method for Preserving Privacy in Face Recognition System
Authors: Authors: Enoch Solomon, Abraham Woubie
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
The state-of-the-art face recognition systems are typically trained on a single computer, utilizing extensive image datasets collected from various number of users. However, these datasets often contain sensitive personal information that users may hesitate to disclose. To address potential privacy concerns, we explore the application of federated learning, both with and without secure aggregators, in the context of both supervised and unsupervised face recognition systems. Federated learning facilitates the training of a shared model without necessitating the sharing of individual private data, achieving this by training models on decentralized edge devices housing the data. In our proposed system, each edge device independently trains its own model, which is subsequently transmitted either to a secure aggregator or directly to the central server. To introduce diverse data without the need for data transmission, we employ generative adversarial networks to generate imposter data at the edge. Following this, the secure aggregator or central server combines these individual models to construct a global model, which is then relayed back to the edge devices. Experimental findings based on the CelebA datasets reveal that employing federated learning in both supervised and unsupervised face recognition systems offers dual benefits. Firstly, it safeguards privacy since the original data remains on the edge devices. Secondly, the experimental results demonstrate that the aggregated model yields nearly identical performance compared to the individual models, particularly when the federated model does not utilize a secure aggregator. Hence, our results shed light on the practical challenges associated with privacy-preserving face image training, particularly in terms of the balance between privacy and accuracy.
Keyword: augmentation
Self-Supervision in Time for Satellite Images(S3-TSS): A novel method of SSL technique in Satellite images
Authors: Authors: Akansh Maurya, Hewan Shrestha, Mohammad Munem Shahriar
Abstract
With the limited availability of labeled data with various atmospheric conditions in remote sensing images, it seems useful to work with self-supervised algorithms. Few pretext-based algorithms, including from rotation, spatial context and jigsaw puzzles are not appropriate for satellite images. Often, satellite images have a higher temporal frequency. So, the temporal dimension of remote sensing data provides natural augmentation without requiring us to create artificial augmentation of images. Here, we propose S3-TSS, a novel method of self-supervised learning technique that leverages natural augmentation occurring in temporal dimension. We compare our results with current state-of-the-art methods and also perform various experiments. We observed that our method was able to perform better than baseline SeCo in four downstream datasets. Code for our work can be found here: https://github.com/hewanshrestha/Why-Self-Supervision-in-Time
PIPsUS: Self-Supervised Dense Point Tracking in Ultrasound
Authors: Authors: Wanwen Chen, Adam Schmidt, Eitan Prisman, Septimiu E Salcudean
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Finding point-level correspondences is a fundamental problem in ultrasound (US), since it can enable US landmark tracking for intraoperative image guidance in different surgeries, including head and neck. Most existing US tracking methods, e.g., those based on optical flow or feature matching, were initially designed for RGB images before being applied to US. Therefore domain shift can impact their performance. Training could be supervised by ground-truth correspondences, but these are expensive to acquire in US. To solve these problems, we propose a self-supervised pixel-level tracking model called PIPsUS. Our model can track an arbitrary number of points in one forward pass and exploits temporal information by considering multiple, instead of just consecutive, frames. We developed a new self-supervised training strategy that utilizes a long-term point-tracking model trained for RGB images as a teacher to guide the model to learn realistic motions and use data augmentation to enforce tracking from US appearance. We evaluate our method on neck and oral US and echocardiography, showing higher point tracking accuracy when compared with fast normalized cross-correlation and tuned optical flow. Code will be available once the paper is accepted.
A 28.6 mJ/iter Stable Diffusion Processor for Text-to-Image Generation with Patch Similarity-based Sparsity Augmentation and Text-based Mixed-Precision
Authors: Authors: Jiwon Choi, Wooyoung Jo, Seongyon Hong, Beomseok Kwon, Wonhoon Park, Hoi-Jun Yoo
Abstract
This paper presents an energy-efficient stable diffusion processor for text-to-image generation. While stable diffusion attained attention for high-quality image synthesis results, its inherent characteristics hinder its deployment on mobile platforms. The proposed processor achieves high throughput and energy efficiency with three key features as solutions: 1) Patch similarity-based sparsity augmentation (PSSA) to reduce external memory access (EMA) energy of self-attention score by 60.3 %, leading to 37.8 % total EMA energy reduction. 2) Text-based important pixel spotting (TIPS) to allow 44.8 % of the FFN layer workload to be processed with low-precision activation. 3) Dual-mode bit-slice core (DBSC) architecture to enhance energy efficiency in FFN layers by 43.0 %. The proposed processor is implemented in 28 nm CMOS technology and achieves 3.84 TOPS peak throughput with 225.6 mW average power consumption. In sum, 28.6 mJ/iteration highly energy-efficient text-to-image generation processor can be achieved at MS-COCO dataset.
PromptIQA: Boosting the Performance and Generalization for No-Reference Image Quality Assessment via Prompts
Authors: Authors: Zewen Chen, Haina Qin, Juan Wang, Chunfeng Yuan, Bing Li, Weiming Hu, Liang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Due to the diversity of assessment requirements in various application scenarios for the IQA task, existing IQA methods struggle to directly adapt to these varied requirements after training. Thus, when facing new requirements, a typical approach is fine-tuning these models on datasets specifically created for those requirements. However, it is time-consuming to establish IQA datasets. In this work, we propose a Prompt-based IQA (PromptIQA) that can directly adapt to new requirements without fine-tuning after training. On one hand, it utilizes a short sequence of Image-Score Pairs (ISP) as prompts for targeted predictions, which significantly reduces the dependency on the data requirements. On the other hand, PromptIQA is trained on a mixed dataset with two proposed data augmentation strategies to learn diverse requirements, thus enabling it to effectively adapt to new requirements. Experiments indicate that the PromptIQA outperforms SOTA methods with higher performance and better generalization. The code will be available.
DiffClass: Diffusion-Based Class Incremental Learning
Authors: Authors: Zichong Meng, Jie Zhang, Changdi Yang, Zheng Zhan, Pu Zhao, Yanzhi WAng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Class Incremental Learning (CIL) is challenging due to catastrophic forgetting. On top of that, Exemplar-free Class Incremental Learning is even more challenging due to forbidden access to previous task data. Recent exemplar-free CIL methods attempt to mitigate catastrophic forgetting by synthesizing previous task data. However, they fail to overcome the catastrophic forgetting due to the inability to deal with the significant domain gap between real and synthetic data. To overcome these issues, we propose a novel exemplar-free CIL method. Our method adopts multi-distribution matching (MDM) diffusion models to unify quality and bridge domain gaps among all domains of training data. Moreover, our approach integrates selective synthetic image augmentation (SSIA) to expand the distribution of the training data, thereby improving the model's plasticity and reinforcing the performance of our method's ultimate component, multi-domain adaptation (MDA). With the proposed integrations, our method then reformulates exemplar-free CIL into a multi-domain adaptation problem to implicitly address the domain gap problem to enhance model stability during incremental training. Extensive experiments on benchmark class incremental datasets and settings demonstrate that our method excels previous exemplar-free CIL methods and achieves state-of-the-art performance.
Unsupervised Graph Neural Architecture Search with Disentangled Self-supervision
Abstract
The existing graph neural architecture search (GNAS) methods heavily rely on supervised labels during the search process, failing to handle ubiquitous scenarios where supervisions are not available. In this paper, we study the problem of unsupervised graph neural architecture search, which remains unexplored in the literature. The key problem is to discover the latent graph factors that drive the formation of graph data as well as the underlying relations between the factors and the optimal neural architectures. Handling this problem is challenging given that the latent graph factors together with architectures are highly entangled due to the nature of the graph and the complexity of the neural architecture search process. To address the challenge, we propose a novel Disentangled Self-supervised Graph Neural Architecture Search (DSGAS) model, which is able to discover the optimal architectures capturing various latent graph factors in a self-supervised fashion based on unlabeled graph data. Specifically, we first design a disentangled graph super-network capable of incorporating multiple architectures with factor-wise disentanglement, which are optimized simultaneously. Then, we estimate the performance of architectures under different factors by our proposed self-supervised training with joint architecture-graph disentanglement. Finally, we propose a contrastive search with architecture augmentations to discover architectures with factor-specific expertise. Extensive experiments on 11 real-world datasets demonstrate that the proposed model is able to achieve state-of-the-art performance against several baseline methods in an unsupervised manner.
Improving the Successful Robotic Grasp Detection Using Convolutional Neural Networks
Authors: Authors: Hamed Hosseini, Mehdi Tale Masouleh, Ahmad Kalhor
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Robotic grasp should be carried out in a real-time manner by proper accuracy. Perception is the first and significant step in this procedure. This paper proposes an improved pipeline model trying to detect grasp as a rectangle representation for different seen or unseen objects. It helps the robot to start control procedures from nearer to the proper part of the object. The main idea consists in pre-processing, output normalization, and data augmentation to improve accuracy by 4.3 percent without making the system slow. Also, a comparison has been conducted over different pre-trained models like AlexNet, ResNet, Vgg19, which are the most famous feature extractors for image processing in object detection. Although AlexNet has less complexity than other ones, it outperformed them, which helps the real-time property.
Cross-lingual Transfer or Machine Translation? On Data Augmentation for Monolingual Semantic Textual Similarity
Abstract
Learning better sentence embeddings leads to improved performance for natural language understanding tasks including semantic textual similarity (STS) and natural language inference (NLI). As prior studies leverage large-scale labeled NLI datasets for fine-tuning masked language models to yield sentence embeddings, task performance for languages other than English is often left behind. In this study, we directly compared two data augmentation techniques as potential solutions for monolingual STS: (a) cross-lingual transfer that exploits English resources alone as training data to yield non-English sentence embeddings as zero-shot inference, and (b) machine translation that coverts English data into pseudo non-English training data in advance. In our experiments on monolingual STS in Japanese and Korean, we find that the two data techniques yield performance on par. Rather, we find a superiority of the Wikipedia domain over the NLI domain for these languages, in contrast to prior studies that focused on NLI as training data. Combining our findings, we demonstrate that the cross-lingual transfer of Wikipedia data exhibits improved performance, and that native Wikipedia data can further improve performance for monolingual STS.
SIRST-5K: Exploring Massive Negatives Synthesis with Self-supervised Learning for Robust Infrared Small Target Detection
Authors: Authors: Yahao Lu, Yupei Lin, Han Wu, Xiaoyu Xian, Yukai Shi, Liang Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Single-frame infrared small target (SIRST) detection aims to recognize small targets from clutter backgrounds. Recently, convolutional neural networks have achieved significant advantages in general object detection. With the development of Transformer, the scale of SIRST models is constantly increasing. Due to the limited training samples, performance has not been improved accordingly. The quality, quantity, and diversity of the infrared dataset are critical to the detection of small targets. To highlight this issue, we propose a negative sample augmentation method in this paper. Specifically, a negative augmentation approach is proposed to generate massive negatives for self-supervised learning. Firstly, we perform a sequential noise modeling technology to generate realistic infrared data. Secondly, we fuse the extracted noise with the original data to facilitate diversity and fidelity in the generated data. Lastly, we proposed a negative augmentation strategy to enrich diversity as well as maintain semantic invariance. The proposed algorithm produces a synthetic SIRST-5K dataset, which contains massive pseudo-data and corresponding labels. With a rich diversity of infrared small target data, our algorithm significantly improves the model performance and convergence speed. Compared with other state-of-the-art (SOTA) methods, our method achieves outstanding performance in terms of probability of detection (Pd), false-alarm rate (Fa), and intersection over union (IoU).
Poly-View Contrastive Learning
Authors: Authors: Amitis Shidani, Devon Hjelm, Jason Ramapuram, Russ Webb, Eeshan Gunesh Dhekane, Dan Busbridge
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (stat.ML)
Abstract
Contrastive learning typically matches pairs of related views among a number of unrelated negative views. Views can be generated (e.g. by augmentations) or be observed. We investigate matching when there are more than two related views which we call poly-view tasks, and derive new representation learning objectives using information maximization and sufficient statistics. We show that with unlimited computation, one should maximize the number of related views, and with a fixed compute budget, it is beneficial to decrease the number of unique samples whilst increasing the number of views of those samples. In particular, poly-view contrastive models trained for 128 epochs with batch size 256 outperform SimCLR trained for 1024 epochs at batch size 4096 on ImageNet1k, challenging the belief that contrastive models require large batch sizes and many training epochs.
Beyond Finite Data: Towards Data-free Out-of-distribution Generalization via Extrapola
Abstract
Out-of-distribution (OOD) generalization is a favorable yet challenging property for deep neural networks. The core challenges lie in the limited availability of source domains that help models learn an invariant representation from the spurious features. Various domain augmentation have been proposed but largely rely on interpolating existing domains and frequently face difficulties in creating truly "novel" domains. Humans, on the other hand, can easily extrapolate novel domains, thus, an intriguing question arises: How can neural networks extrapolate like humans and achieve OOD generalization? We introduce a novel approach to domain extrapolation that leverages reasoning ability and the extensive knowledge encapsulated within large language models (LLMs) to synthesize entirely new domains. Starting with the class of interest, we query the LLMs to extract relevant knowledge for these novel domains. We then bridge the gap between the text-centric knowledge derived from LLMs and the pixel input space of the model using text-to-image generation techniques. By augmenting the training set of domain generalization datasets with high-fidelity, photo-realistic images of these new domains, we achieve significant improvements over all existing methods, as demonstrated in both single and multi-domain generalization across various benchmarks. With the ability to extrapolate any domains for any class, our method has the potential to learn a generalized model for any task without any data. To illustrate, we put forth a much more difficult setting termed, data-free domain generalization, that aims to learn a generalized model in the absence of any collected data. Our empirical findings support the above argument and our methods exhibit commendable performance in this setting, even surpassing the supervised setting by approximately 1-2\% on datasets such as VLCS.
Keyword: detection
TopicDiff: A Topic-enriched Diffusion Approach for Multimodal Conversational Emotion Detection
A Data-Driven Two-Phase Multi-Split Causal Ensemble Model for Time Series
Enhancing Security in Federated Learning through Adaptive Consensus-Based Model Update Validation
Investigation of the Impact of Synthetic Training Data in the Industrial Application of Terminal Strip Object Detection
Divide and Conquer: High-Resolution Industrial Anomaly Detection via Memory Efficient Tiled Ensemble
ActFormer: Scalable Collaborative Perception via Active Queries
Robust Surgical Tool Tracking with Pixel-based Probabilities for Projected Geometric Primitives
Node Centrality Approximation For Large Networks Based On Inductive Graph Neural Networks
DT-SIM: Property-Based Testing for MPC Security
RadarDistill: Boosting Radar-based Object Detection Performance via Knowledge Distillation from LiDAR Features
Simulating Battery-Powered TinyML Systems Optimised using Reinforcement Learning in Image-Based Anomaly Detection
Evaluating Text-to-Image Generative Models: An Empirical Study on Human Image Synthesis
LanePtrNet: Revisiting Lane Detection as Point Voting and Grouping on Curves
Learning Expressive And Generalizable Motion Features For Face Forgery Detection
Improving the Successful Robotic Grasp Detection Using Convolutional Neural Networks
SocialPET: Socially Informed Pattern Exploiting Training for Few-Shot Stance Detection in Social Media
MMoE: Robust Spoiler Detection with Multi-modal Information and Domain-aware Mixture-of-Experts
Deep Prompt Multi-task Network for Abuse Language Detection
VLM-PL: Advanced Pseudo Labeling approach Class Incremental Object Detection with Vision-Language Model
Frequency-Adaptive Dilated Convolution for Semantic Segmentation
Spectrogram-Based Detection of Auto-Tuned Vocals in Music Recordings
Exploring Robust Features for Few-Shot Object Detection in Satellite Imagery
SIRST-5K: Exploring Massive Negatives Synthesis with Self-supervised Learning for Robust Infrared Small Target Detection
EVD4UAV: An Altitude-Sensitive Benchmark to Evade Vehicle Detection in UAV
Authorship Attribution in Bangla Literature (AABL) via Transfer Learning using ULMFiT
Keyword: face recognition
Federated Learning Method for Preserving Privacy in Face Recognition System
Keyword: augmentation
Self-Supervision in Time for Satellite Images(S3-TSS): A novel method of SSL technique in Satellite images
PIPsUS: Self-Supervised Dense Point Tracking in Ultrasound
A 28.6 mJ/iter Stable Diffusion Processor for Text-to-Image Generation with Patch Similarity-based Sparsity Augmentation and Text-based Mixed-Precision
PromptIQA: Boosting the Performance and Generalization for No-Reference Image Quality Assessment via Prompts
DiffClass: Diffusion-Based Class Incremental Learning
Unsupervised Graph Neural Architecture Search with Disentangled Self-supervision
Improving the Successful Robotic Grasp Detection Using Convolutional Neural Networks
Cross-lingual Transfer or Machine Translation? On Data Augmentation for Monolingual Semantic Textual Similarity
SIRST-5K: Exploring Massive Negatives Synthesis with Self-supervised Learning for Robust Infrared Small Target Detection
Poly-View Contrastive Learning
Beyond Finite Data: Towards Data-free Out-of-distribution Generalization via Extrapola