New submissions for Thu, 18 Jan 24

Keyword: detection

Temporal Embeddings: Scalable Self-Supervised Temporal Representation Learning from Spatiotemporal Data for Multimodal Computer Vision

Authors: Authors: Yi Cao, Swetava Ganguli, Vipul Pandey
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.08581
Pdf link: https://arxiv.org/pdf/2401.08581
Abstract There exists a correlation between geospatial activity temporal patterns and type of land use. A novel self-supervised approach is proposed to stratify landscape based on mobility activity time series. First, the time series signal is transformed to the frequency domain and then compressed into task-agnostic temporal embeddings by a contractive autoencoder, which preserves cyclic temporal patterns observed in time series. The pixel-wise embeddings are converted to image-like channels that can be used for task-based, multimodal modeling of downstream geospatial tasks using deep semantic segmentation. Experiments show that temporal embeddings are semantically meaningful representations of time series data and are effective across different tasks such as classifying residential area and commercial areas. Temporal embeddings transform sequential, spatiotemporal motion trajectory data into semantically meaningful image-like tensor representations that can be combined (multimodal fusion) with other data modalities that are or can be transformed into image-like tensor representations (for e.g., RBG imagery, graph embeddings of road networks, passively collected imagery like SAR, etc.) to facilitate multimodal learning in geospatial computer vision. Multimodal computer vision is critical for training machine learning models for geospatial feature detection to keep a geospatial mapping service up-to-date in real-time and can significantly improve user experience and above all, user safety.
Improved Pothole Detection Using YOLOv7 and ESRGAN
Authors: Authors: Nirmal Kumar Rout, Gyanateet Dutta, Varun Sinha, Arghadeep Dey, Subhrangshu Mukherjee, Gopal Gupta
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.08588
Pdf link: https://arxiv.org/pdf/2401.08588
Abstract Potholes are common road hazards that is causing damage to vehicles and posing a safety risk to drivers. The introduction of Convolutional Neural Networks (CNNs) is widely used in the industry for object detection based on Deep Learning methods and has achieved significant progress in hardware improvement and software implementations. In this paper, a unique better algorithm is proposed to warrant the use of low-resolution cameras or low-resolution images and video feed for automatic pothole detection using Super Resolution (SR) through Super Resolution Generative Adversarial Networks (SRGANs). Then we have proceeded to establish a baseline pothole detection performance on low quality and high quality dashcam images using a You Only Look Once (YOLO) network, namely the YOLOv7 network. We then have illustrated and examined the speed and accuracy gained above the benchmark after having upscaling implementation on the low quality images.
Online Anomaly Detection over Live Social Video Streaming
Authors: Authors: Chengkun He, Xiangmin Zhou, Chen Wang, Iqbal Gondal, Jie Shao, Xun Yi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.08615
Pdf link: https://arxiv.org/pdf/2401.08615
Abstract Social video anomaly is an observation in video streams that does not conform to a common pattern of dataset's behaviour. Social video anomaly detection plays a critical role in applications from e-commerce to e-learning. Traditionally, anomaly detection techniques are applied to find anomalies in video broadcasting. However, they neglect the live social video streams which contain interactive talk, speech, or lecture with audience. In this paper, we propose a generic framework for effectively online detecting Anomalies Over social Video LIve Streaming (AOVLIS). Specifically, we propose a novel deep neural network model called Coupling Long Short-Term Memory (CLSTM) that adaptively captures the history behaviours of the presenters and audience, and their mutual interactions to predict their behaviour at next time point over streams. Then we well integrate the CLSTM with a decoder layer, and propose a new reconstruction error-based scoring function $RE_{IA}$ to calculate the anomaly score of each video segment for anomaly detection. After that, we propose a novel model update scheme that incrementally maintains CLSTM and decoder. Moreover, we design a novel upper bound and ADaptive Optimisation Strategy (ADOS) for improving the efficiency of our solution. Extensive experiments are conducted to prove the superiority of AOVLIS.
Immature Green Apple Detection and Sizing in Commercial Orchards using YOLOv8 and Shape Fitting Techniques
Authors: Authors: Ranjan Sapkota, Dawood Ahmed, Martin Churuvija, Manoj Karkee
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.08629
Pdf link: https://arxiv.org/pdf/2401.08629
Abstract Detecting and estimating size of apples during the early stages of growth is crucial for predicting yield, pest management, and making informed decisions related to crop-load management, harvest and post-harvest logistics, and marketing. Traditional fruit size measurement methods are laborious and time-consuming. This study employs the state-of-the-art YOLOv8 object detection and instance segmentation algorithm in conjunction with geometric shape fitting techniques on 3D point cloud data to accurately determine the size of immature green apples (or fruitlet) in a commercial orchard environment. The methodology utilized two RGB-D sensors: the Intel RealSense D435i and the Microsoft Azure Kinect DK. Notably, the YOLOv8 instance segmentation models exhibited proficiency in immature green apple detection, with the YOLOv8m-seg model clinching the highest AP@0.5 and AP@0.75 scores of 0.94 and 0.91, respectively. Leveraging the ellipsoid fitting technique on images from the Azure Kinect, we observed remarkable metrics, including an RMSE of 2.35, MAE of 1.66, MAPE of 6.15, and an R-squared value of 0.9. Challenges such as partial occlusion, where YOLOv8 sometimes misinterpreted immature green apple clusters, were recognized. In a comparison of 102 outdoor samples, the Microsoft Azure Kinect showed better performance than the Intel Realsense D435i, as supported by the MAE data. This study emphasizes the combined effectiveness of shape-fitting methods and 3D sensors in improving fruitlet sizing for agriculture.
Attention Modules Improve Modern Image-Level Anomaly Detection: A DifferNet Case Study
Authors: Authors: André Luiz B. Vieira e Silva, Francisco Simões, Danny Kowerko, Tobias Schlosser, Felipe Battisti, Veronica Teichrieb
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.08686
Pdf link: https://arxiv.org/pdf/2401.08686
Abstract Within (semi-)automated visual inspection, learning-based approaches for assessing visual defects, including deep neural networks, enable the processing of otherwise small defect patterns in pixel size on high-resolution imagery. The emergence of these often rarely occurring defect patterns explains the general need for labeled data corpora. To not only alleviate this issue but to furthermore advance the current state of the art in unsupervised visual inspection, this contribution proposes a DifferNet-based solution enhanced with attention modules utilizing SENet and CBAM as backbone - AttentDifferNet - to improve the detection and classification capabilities on three different visual inspection and anomaly detection datasets: MVTec AD, InsPLAD-fault, and Semiconductor Wafer. In comparison to the current state of the art, it is shown that AttentDifferNet achieves improved results, which are, in turn, highlighted throughout our quantitative as well as qualitative evaluation, indicated by a general improvement in AUC of 94.34 vs. 92.46, 96.67 vs. 94.69, and 90.20 vs. 88.74%. As our variants to AttentDifferNet show great prospects in the context of currently investigated approaches, a baseline is formulated, emphasizing the importance of attention for anomaly detection.
DA-BEV: Unsupervised Domain Adaptation for Bird's Eye View Perception
Authors: Authors: Kai Jiang, Jiaxing Huang, Weiying Xie, Yunsong Li, Ling Shao, Shijian Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.08687
Pdf link: https://arxiv.org/pdf/2401.08687
Abstract Camera-only Bird's Eye View (BEV) has demonstrated great potential in environment perception in a 3D space. However, most existing studies were conducted under a supervised setup which cannot scale well while handling various new data. Unsupervised domain adaptive BEV, which effective learning from various unlabelled target data, is far under-explored. In this work, we design DA-BEV, the first domain adaptive camera-only BEV framework that addresses domain adaptive BEV challenges by exploiting the complementary nature of image-view features and BEV features. DA-BEV introduces the idea of query into the domain adaptation framework to derive useful information from image-view and BEV features. It consists of two query-based designs, namely, query-based adversarial learning (QAL) and query-based self-training (QST), which exploits image-view features or BEV features to regularize the adaptation of the other. Extensive experiments show that DA-BEV achieves superior domain adaptive BEV perception performance consistently across multiple datasets and tasks such as 3D object detection and 3D scene segmentation.
NODI: Out-Of-Distribution Detection with Noise from Diffusion
Authors: Authors: Jingqiu Zhou, Aojun Zou, Hongshen Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.08689
Pdf link: https://arxiv.org/pdf/2401.08689
Abstract Out-of-distribution (OOD) detection is a crucial part of deploying machine learning models safely. It has been extensively studied with a plethora of methods developed in the literature. This problem is tackled with an OOD score computation, however, previous methods compute the OOD scores with limited usage of the in-distribution dataset. For instance, the OOD scores are computed with information from a small portion of the in-distribution data. Furthermore, these methods encode images with a neural image encoder. The robustness of these methods is rarely checked with respect to image encoders of different training methods and architectures. In this work, we introduce the diffusion process into the OOD task. The diffusion model integrates information on the whole training set into the predicted noise vectors. What's more, we deduce a closed-form solution for the noise vector (stable point). Then the noise vector is converted into our OOD score, we test both the deep model predicted noise vector and the closed-form noise vector on the OOD benchmarks \cite{openood}. Our method outperforms previous OOD methods across all types of image encoders (Table. \ref{main}). A $3.5\%$ performance gain is achieved with the MAE-based image encoder. Moreover, we studied the robustness of OOD methods by applying different types of image encoders. Some OOD methods failed to generalize well when switching image encoders from ResNet to Vision Transformers, our method performs exhibits good robustness with all the image encoders.
AiGen-FoodReview: A Multimodal Dataset of Machine-Generated Restaurant Reviews and Images on Social Media
Authors: Authors: Alessandro Gambetti, Qiwei Han
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.08825
Pdf link: https://arxiv.org/pdf/2401.08825
Abstract Online reviews in the form of user-generated content (UGC) significantly impact consumer decision-making. However, the pervasive issue of not only human fake content but also machine-generated content challenges UGC's reliability. Recent advances in Large Language Models (LLMs) may pave the way to fabricate indistinguishable fake generated content at a much lower cost. Leveraging OpenAI's GPT-4-Turbo and DALL-E-2 models, we craft AiGen-FoodReview, a multi-modal dataset of 20,144 restaurant review-image pairs divided into authentic and machine-generated. We explore unimodal and multimodal detection models, achieving 99.80% multimodal accuracy with FLAVA. We use attributes from readability and photographic theories to score reviews and images, respectively, demonstrating their utility as hand-crafted features in scalable and interpretable detection models, with comparable performance. The paper contributes by open-sourcing the dataset and releasing fake review detectors, recommending its use in unimodal and multimodal fake review detection tasks, and evaluating linguistic and visual features in synthetic versus authentic data.
Image Fusion in Remote Sensing: An Overview and Meta Analysis
Authors: Authors: Hessah Albanwan, Rongjun Qin, Yang Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2401.08837
Pdf link: https://arxiv.org/pdf/2401.08837
Abstract Image fusion in Remote Sensing (RS) has been a consistent demand due to its ability to turn raw images of different resolutions, sources, and modalities into accurate, complete, and spatio-temporally coherent images. It greatly facilitates downstream applications such as pan-sharpening, change detection, land-cover classification, etc. Yet, image fusion solutions are highly disparate to various remote sensing problems and thus are often narrowly defined in existing reviews as topical applications, such as pan-sharpening, and spatial-temporal image fusion. Considering that image fusion can be theoretically applied to any gridded data through pixel-level operations, in this paper, we expanded its scope by comprehensively surveying relevant works with a simple taxonomy: 1) many-to-one image fusion; 2) many-to-many image fusion. This simple taxonomy defines image fusion as a mapping problem that turns either a single or a set of images into another single or set of images, depending on the desired coherence, e.g., spectral, spatial/resolution coherence, etc. We show that this simple taxonomy, despite the significant modality difference it covers, can be presented by a conceptually easy framework. In addition, we provide a meta-analysis to review the major papers studying the various types of image fusion and their applications over the years (from the 1980s to date), covering 5,926 peer-reviewed papers. Finally, we discuss the main benefits and emerging challenges to provide open research directions and potential future works.
Exploring Content-Based and Meta-Data Analysis for Detecting Fake News Infodemic: A case study on COVID-19
Authors: Authors: Oluwaseun Ajao, Ashish Garg, Marjory Da Costa-Abreu
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2401.08841
Pdf link: https://arxiv.org/pdf/2401.08841
Abstract The coronavirus pandemic (COVID-19) is probably the most disruptive global health disaster in recent history. It negatively impacted the whole world and virtually brought the global economy to a standstill. However, as the virus was spreading, infecting people and claiming thousands of lives so was the spread and propagation of fake news, misinformation and disinformation about the event. These included the spread of unconfirmed health advice and remedies on social media. In this paper, false information about the pandemic is identified using a content-based approach and metadata curated from messages posted to online social networks. A content-based approach combined with metadata as well as an initial feature analysis is used and then several supervised learning models are tested for identifying and predicting misleading posts. Our approach shows up to 93% accuracy in the detection of fake news related posts about the COVID-19 pandemic
Benchmarking Particle Filter Algorithms for Efficient Velodyne-Based Vehicle Localization
Authors: Authors: Jose Luis Blanco-Claraco, Francisco Mañas-Alvarez, Jose Luis Torres-Moreno, Francisco Rodriguez, Antonio Gimenez-Fernandez
Subjects: Robotics (cs.RO); Applications (stat.AP)
Arxiv link: https://arxiv.org/abs/2401.08870
Pdf link: https://arxiv.org/pdf/2401.08870
Abstract Keeping a vehicle well-localized within a prebuilt-map is at the core of any autonomous vehicle navigation system. In this work, we show that both standard SIR sampling and rejection-based optimal sampling are suitable for efficient (10 to 20 ms) real-time pose tracking without feature detection that is using raw point clouds from a 3D LiDAR. Motivated by the large amount of information captured by these sensors, we perform a systematic statistical analysis of how many points are actually required to reach an optimal ratio between efficiency and positioning accuracy. Furthermore, initialization from adverse conditions, e.g., poor GPS signal in urban canyons, we also identify the optimal particle filter settings required to ensure convergence. Our findings include that a decimation factor between 100 and 200 on incoming point clouds provides a large savings in computational cost with a negligible loss in localization accuracy for a VLP-16 scanner. Furthermore, an initial density of $\sim$2 particles/m$^2$ is required to achieve 100% convergence success for large-scale ($\sim$100,000 m$^2$), outdoor global localization without any additional hint from GPS or magnetic field sensors. All implementations have been released as open-source software.
Learning to detect cloud and snow in remote sensing images from noisy labels
Authors: Authors: Zili Liu, Hao Chen, Wenyuan Li, Keyan Chen, Zipeng Qi, Chenyang Liu, Zhengxia Zou, Zhenwei Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.08932
Pdf link: https://arxiv.org/pdf/2401.08932
Abstract Detecting clouds and snow in remote sensing images is an essential preprocessing task for remote sensing imagery. Previous works draw inspiration from semantic segmentation models in computer vision, with most research focusing on improving model architectures to enhance detection performance. However, unlike natural images, the complexity of scenes and the diversity of cloud types in remote sensing images result in many inaccurate labels in cloud and snow detection datasets, introducing unnecessary noises into the training and testing processes. By constructing a new dataset and proposing a novel training strategy with the curriculum learning paradigm, we guide the model in reducing overfitting to noisy labels. Additionally, we design a more appropriate model performance evaluation method, that alleviates the performance assessment bias caused by noisy labels. By conducting experiments on models with UNet and Segformer, we have validated the effectiveness of our proposed method. This paper is the first to consider the impact of label noise on the detection of clouds and snow in remote sensing images.
AntiPhishStack: LSTM-based Stacked Generalization Model for Optimized Phishing URLs Detection
Authors: Authors: Saba Aslam, Hafsa Aslam, Arslan Manzoor, Chen Hui, Abdur Rasool
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.08947
Pdf link: https://arxiv.org/pdf/2401.08947
Abstract The escalating reliance on revolutionary online web services has introduced heightened security risks, with persistent challenges posed by phishing despite extensive security measures. Traditional phishing systems, reliant on machine learning and manual features, struggle with evolving tactics. Recent advances in deep learning offer promising avenues for tackling novel phishing challenges and malicious URLs. This paper introduces a two-phase stack generalized model named AntiPhishStack, designed to detect phishing sites. The model leverages the learning of URLs and character-level TF-IDF features symmetrically, enhancing its ability to combat emerging phishing threats. In Phase I, features are trained on a base machine learning classifier, employing K-fold cross-validation for robust mean prediction. Phase II employs a two-layered stacked-based LSTM network with five adaptive optimizers for dynamic compilation, ensuring premier prediction on these features. Additionally, the symmetrical predictions from both phases are optimized and integrated to train a meta-XGBoost classifier, contributing to a final robust prediction. The significance of this work lies in advancing phishing detection with AntiPhishStack, operating without prior phishing-specific feature knowledge. Experimental validation on two benchmark datasets, comprising benign and phishing or malicious URLs, demonstrates the model's exceptional performance, achieving a notable 96.04% accuracy compared to existing studies. This research adds value to the ongoing discourse on symmetry and asymmetry in information security and provides a forward-thinking solution for enhancing network security in the face of evolving cyber threats.
Hearing Loss Detection from Facial Expressions in One-on-one Conversations
Authors: Authors: Yufeng Yin, Ishwarya Ananthabhotla, Vamsi Krishna Ithapu, Stavros Petridis, Yu-Hsiang Wu, Christi Miller
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.08972
Pdf link: https://arxiv.org/pdf/2401.08972
Abstract Individuals with impaired hearing experience difficulty in conversations, especially in noisy environments. This difficulty often manifests as a change in behavior and may be captured via facial expressions, such as the expression of discomfort or fatigue. In this work, we build on this idea and introduce the problem of detecting hearing loss from an individual's facial expressions during a conversation. Building machine learning models that can represent hearing-related facial expression changes is a challenge. In addition, models need to disentangle spurious age-related correlations from hearing-driven expressions. To this end, we propose a self-supervised pre-training strategy tailored for the modeling of expression variations. We also use adversarial representation learning to mitigate the age bias. We evaluate our approach on a large-scale egocentric dataset with real-world conversational scenarios involving subjects with hearing loss and show that our method for hearing loss detection achieves superior performance over baselines.
A GAN-based data poisoning framework against anomaly detection in vertical federated learning
Authors: Authors: Xiaolin Chen, Daoguang Zan, Wei Li, Bei Guan, Yongji Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2401.08984
Pdf link: https://arxiv.org/pdf/2401.08984
Abstract In vertical federated learning (VFL), commercial entities collaboratively train a model while preserving data privacy. However, a malicious participant's poisoning attack may degrade the performance of this collaborative model. The main challenge in achieving the poisoning attack is the absence of access to the server-side top model, leaving the malicious participant without a clear target model. To address this challenge, we introduce an innovative end-to-end poisoning framework P-GAN. Specifically, the malicious participant initially employs semi-supervised learning to train a surrogate target model. Subsequently, this participant employs a GAN-based method to produce adversarial perturbations to degrade the surrogate target model's performance. Finally, the generator is obtained and tailored for VFL poisoning. Besides, we develop an anomaly detection algorithm based on a deep auto-encoder (DAE), offering a robust defense mechanism to VFL scenarios. Through extensive experiments, we evaluate the efficacy of P-GAN and DAE, and further analyze the factors that influence their performance.
Knight Watch: Ubiquitous Computing Enhancements To Sleep Quality With Acoustic Analysis
Authors: Authors: Andrew Ajemian, John Knight, Tommy Nguyen, John O'Connor
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2401.08991
Pdf link: https://arxiv.org/pdf/2401.08991
Abstract This project introduces a wearable, non-intrusive device for snoring detection and remediation, designed to be placed under or alongside a pillow. The device uses sensors and machine learning algorithms to detect snoring and employs gentle vibrations to prompt positional changes, thereby reducing snoring episodes. The device is capable of connecting via an API to a cloud-based platform for the analysis of snoring sleep patterns and environmental context. The paper details the development from concept to prototype, emphasizing the technical challenges, solutions, and alignment with ubiquitous computing in sleep quality improvement.
Explain Thyself Bully: Sentiment Aided Cyberbullying Detection with Explanation
Authors: Authors: Krishanu Maity, Prince Jha, Raghav Jain, Sriparna Saha, Pushpak Bhattacharyya
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.09023
Pdf link: https://arxiv.org/pdf/2401.09023
Abstract Cyberbullying has become a big issue with the popularity of different social media networks and online communication apps. While plenty of research is going on to develop better models for cyberbullying detection in monolingual language, there is very little research on the code-mixed languages and explainability aspect of cyberbullying. Recent laws like "right to explanations" of General Data Protection Regulation, have spurred research in developing interpretable models rather than focusing on performance. Motivated by this we develop the first interpretable multi-task model called {\em mExCB} for automatic cyberbullying detection from code-mixed languages which can simultaneously solve several tasks, cyberbullying detection, explanation/rationale identification, target group detection and sentiment analysis. We have introduced {\em BullyExplain}, the first benchmark dataset for explainable cyberbullying detection in code-mixed language. Each post in {\em BullyExplain} dataset is annotated with four labels, i.e., {\em bully label, sentiment label, target and rationales (explainability)}, i.e., which phrases are being responsible for annotating the post as a bully. The proposed multitask framework (mExCB) based on CNN and GRU with word and sub-sentence (SS) level attention is able to outperform several baselines and state of the art models when applied on {\em BullyExplain} dataset.
Enhancing Lidar-based Object Detection in Adverse Weather using Offset Sequences in Time
Authors: Authors: Raphael van Kempen, Tim Rehbronn, Abin Jose, Johannes Stegmaier, Bastian Lampe, Timo Woopen, Lutz Eckstein
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2401.09049
Pdf link: https://arxiv.org/pdf/2401.09049
Abstract Automated vehicles require an accurate perception of their surroundings for safe and efficient driving. Lidar-based object detection is a widely used method for environment perception, but its performance is significantly affected by adverse weather conditions such as rain and fog. In this work, we investigate various strategies for enhancing the robustness of lidar-based object detection by processing sequential data samples generated by lidar sensors. Our approaches leverage temporal information to improve a lidar object detection model, without the need for additional filtering or pre-processing steps. We compare $10$ different neural network architectures that process point cloud sequences including a novel augmentation strategy introducing a temporal offset between frames of a sequence during training and evaluate the effectiveness of all strategies on lidar point clouds under adverse weather conditions through experiments. Our research provides a comprehensive study of effective methods for mitigating the effects of adverse weather on the reliability of lidar-based object detection using sequential data that are evaluated using public datasets such as nuScenes, Dense, and the Canadian Adverse Driving Conditions Dataset. Our findings demonstrate that our novel method, involving temporal offset augmentation through randomized frame skipping in sequences, enhances object detection accuracy compared to both the baseline model (Pillar-based Object Detection) and no augmentation.
Flow Divergence: Comparing Maps of Flows with Relative Entropy
Authors: Authors: Christopher Blöcker, Ingo Scholtes
Subjects: Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
Arxiv link: https://arxiv.org/abs/2401.09052
Pdf link: https://arxiv.org/pdf/2401.09052
Abstract Networks represent how the entities of a system are connected and can be partitioned differently, prompting ways to compare partitions. Common approaches for comparing network partitions include information-theoretic measures based on mutual information and set-theoretic measures such as the Jaccard index. These measures are often based on computing the agreement in terms of overlap between different partitions of the same set. However, they ignore link patterns which are essential for the organisation of networks. We propose flow divergence, an information-theoretic divergence measure for comparing network partitions, inspired by the ideas behind the Kullback-Leibler divergence and the map equation for community detection. Similar to the Kullback-Leibler divergence, flow divergence adopts a coding perspective and compares two network partitions $\mathsf{M}_a$ and $\mathsf{M}_b$ by considering the expected extra number of bits required to describe a random walk on a network using $\mathsf{M}_b$ relative to reference partition $\mathsf{M}_a$. Because flow divergence is based on random walks, it can be used to compare partitions with arbitrary and different depths. We show that flow divergence distinguishes between partitions that traditional measures consider to be equally good when compared to a reference partition. Applied to real networks, we use flow divergence to estimate the cost of overfitting in incomplete networks and to visualise the solution landscape of network partitions.
Your blush gives you away: detecting hidden mental states with remote photoplethysmography and thermal imaging
Authors: Authors: Ivan Liu, Fangyuan Liu, Qi Zhong, Fei Ma, Shiguang Ni
Subjects: Computers and Society (cs.CY)
Arxiv link: https://arxiv.org/abs/2401.09145
Pdf link: https://arxiv.org/pdf/2401.09145
Abstract Multimodal emotion recognition techniques are increasingly essential for assessing mental states. Image-based methods, however, tend to focus predominantly on overt visual cues and often overlook subtler mental state changes. Psychophysiological research has demonstrated that HR and skin temperature are effective in detecting ANS activities, thereby revealing these subtle changes. However, traditional HR tools are generally more costly and less portable, while skin temperature analysis usually necessitates extensive manual processing. Advances in remote-PPG and automatic thermal ROI detection algorithms have been developed to address these issues, yet their accuracy in practical applications remains limited. This study aims to bridge this gap by integrating r-PPG with thermal imaging to enhance prediction performance. Ninety participants completed a 20-minute questionnaire to induce cognitive stress, followed by watching a film aimed at eliciting moral elevation. The results demonstrate that the combination of r-PPG and thermal imaging effectively detects emotional shifts. Using r-PPG alone, the prediction accuracy was 77% for cognitive stress and 61% for moral elevation, as determined by SVM. Thermal imaging alone achieved 79% accuracy for cognitive stress and 78% for moral elevation, utilizing a RF algorithm. An early fusion strategy of these modalities significantly improved accuracies, achieving 87% for cognitive stress and 83% for moral elevation using RF. Further analysis, which utilized statistical metrics and explainable machine learning methods including SHAP, highlighted key features and clarified the relationship between cardiac responses and facial temperature variations. Notably, it was observed that cardiovascular features derived from r-PPG models had a more pronounced influence in data fusion, despite thermal imaging's higher predictive accuracy in unimodal analysis.
DK-SLAM: Monocular Visual SLAM with Deep Keypoints Adaptive Learning, Tracking and Loop-Closing
Authors: Authors: Hao Qu, Lilian Zhang, Jun Mao, Junbo Tie, Xiaofeng He, Xiaoping Hu, Yifei Shi, Changhao Chen
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.09160
Pdf link: https://arxiv.org/pdf/2401.09160
Abstract Unreliable feature extraction and matching in handcrafted features undermine the performance of visual SLAM in complex real-world scenarios. While learned local features, leveraging CNNs, demonstrate proficiency in capturing high-level information and excel in matching benchmarks, they encounter challenges in continuous motion scenes, resulting in poor generalization and impacting loop detection accuracy. To address these issues, we present DK-SLAM, a monocular visual SLAM system with adaptive deep local features. MAML optimizes the training of these features, and we introduce a coarse-to-fine feature tracking approach. Initially, a direct method approximates the relative pose between consecutive frames, followed by a feature matching method for refined pose estimation. To counter cumulative positioning errors, a novel online learning binary feature-based online loop closure module identifies loop nodes within a sequence. Experimental results underscore DK-SLAM's efficacy, outperforms representative SLAM solutions, such as ORB-SLAM3 on publicly available datasets.
Exploring the Role of Convolutional Neural Networks (CNN) in Dental Radiography Segmentation: A Comprehensive Systematic Literature Review
Authors: Authors: Walid Brahmi, Imen Jdey, Fadoua Drira
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.09190
Pdf link: https://arxiv.org/pdf/2401.09190
Abstract In the field of dentistry, there is a growing demand for increased precision in diagnostic tools, with a specific focus on advanced imaging techniques such as computed tomography, cone beam computed tomography, magnetic resonance imaging, ultrasound, and traditional intra-oral periapical X-rays. Deep learning has emerged as a pivotal tool in this context, enabling the implementation of automated segmentation techniques crucial for extracting essential diagnostic data. This integration of cutting-edge technology addresses the urgent need for effective management of dental conditions, which, if left undetected, can have a significant impact on human health. The impressive track record of deep learning across various domains, including dentistry, underscores its potential to revolutionize early detection and treatment of oral health issues. Objective: Having demonstrated significant results in diagnosis and prediction, deep convolutional neural networks (CNNs) represent an emerging field of multidisciplinary research. The goals of this study were to provide a concise overview of the state of the art, standardize the current debate, and establish baselines for future research. Method: In this study, a systematic literature review is employed as a methodology to identify and select relevant studies that specifically investigate the deep learning technique for dental imaging analysis. This study elucidates the methodological approach, including the systematic collection of data, statistical analysis, and subsequent dissemination of outcomes. Conclusion: This work demonstrates how Convolutional Neural Networks (CNNs) can be employed to analyze images, serving as effective tools for detecting dental pathologies. Although this research acknowledged some limitations, CNNs utilized for segmenting and categorizing teeth exhibited their highest level of performance overall.
Cross-Domain AI for Early Attack Detection and Defense Against Malicious Flows in O-RAN
Authors: Authors: Bruno Missi Xavier, Merim Dzaferagic, Irene Vilà, Magnos Martinello, Marco Ruffini
Subjects: Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2401.09204
Pdf link: https://arxiv.org/pdf/2401.09204
Abstract Only the chairs can edit In the fight against cyber attacks, Network Softwarization (NS) is a flexible and adaptable shield, using advanced software to spot malicious activity in regular network traffic. However, the availability of comprehensive datasets for mobile networks, which are fundamental for the development of Machine Learning (ML) solutions for attack detection near their source, is still limited. Cross-Domain Artificial Intelligence (AI) can be the key to address this, although its application in Open Radio Access Network (O-RAN) is still at its infancy. To address these challenges, we deployed an end-to-end O-RAN network, that was used to collect data from the RAN and the transport network. These datasets allow us to combine the knowledge from an in-network ML traffic classifier for attack detection to bolster the training of an ML-based traffic classifier specifically tailored for the RAN. Our results demonstrate the potential of the proposed approach, achieving an accuracy rate of 93%. This approach not only bridges critical gaps in mobile network security but also showcases the potential of cross-domain AI in enhancing the efficacy of network security measures.
Neural Network Equalizers and Successive Interference Cancellation for Bandlimited Channels with a Nonlinearity
Authors: Authors: Daniel Plabst, Tobias Prinz, Francesca Diedolo, Thomas Wiegart, Georg Böcherer, Norbert Hanik, Gerhard Kramer
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2401.09217
Pdf link: https://arxiv.org/pdf/2401.09217
Abstract Neural networks (NNs) inspired by the forward-backward algorithm (FBA) are used as equalizers for bandlimited channels with a memoryless nonlinearity. The NN-equalizers are combined with successive interference cancellation (SIC) to approach the information rates of joint detection and decoding (JDD) with considerably less complexity than JDD and other existing equalizers. Simulations for short-haul optical fiber links with square-law detection illustrate the gains of NNs as compared to the complexity-limited FBA and Gibbs sampling.
Dynamic Relation Transformer for Contextual Text Block Detection
Authors: Authors: Jiawei Wang, Shunchi Zhang, Kai Hu, Chixiang Ma, Zhuoyao Zhong, Lei Sun, Qiang Huo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.09232
Pdf link: https://arxiv.org/pdf/2401.09232
Abstract Contextual Text Block Detection (CTBD) is the task of identifying coherent text blocks within the complexity of natural scenes. Previous methodologies have treated CTBD as either a visual relation extraction challenge within computer vision or as a sequence modeling problem from the perspective of natural language processing. We introduce a new framework that frames CTBD as a graph generation problem. This methodology consists of two essential procedures: identifying individual text units as graph nodes and discerning the sequential reading order relationships among these units as graph edges. Leveraging the cutting-edge capabilities of DQ-DETR for node detection, our framework innovates further by integrating a novel mechanism, a Dynamic Relation Transformer (DRFormer), dedicated to edge generation. DRFormer incorporates a dual interactive transformer decoder that deftly manages a dynamic graph structure refinement process. Through this iterative process, the model systematically enhances the graph's fidelity, ultimately resulting in improved precision in detecting contextual text blocks. Comprehensive experimental evaluations conducted on both SCUT-CTW-Context and ReCTS-Context datasets substantiate that our method achieves state-of-the-art results, underscoring the effectiveness and potential of our graph generation framework in advancing the field of CTBD.
Cross-lingual Offensive Language Detection: A Systematic Review of Datasets, Transfer Approaches and Challenges
Authors: Authors: Aiqi Jiang, Arkaitz Zubiaga
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.09244
Pdf link: https://arxiv.org/pdf/2401.09244
Abstract The growing prevalence and rapid evolution of offensive language in social media amplify the complexities of detection, particularly highlighting the challenges in identifying such content across diverse languages. This survey presents a systematic and comprehensive exploration of Cross-Lingual Transfer Learning (CLTL) techniques in offensive language detection in social media. Our study stands as the first holistic overview to focus exclusively on the cross-lingual scenario in this domain. We analyse 67 relevant papers and categorise these studies across various dimensions, including the characteristics of multilingual datasets used, the cross-lingual resources employed, and the specific CLTL strategies implemented. According to "what to transfer", we also summarise three main CLTL transfer approaches: instance, feature, and parameter transfer. Additionally, we shed light on the current challenges and future research opportunities in this field. Furthermore, we have made our survey resources available online, including two comprehensive tables that provide accessible references to the multilingual datasets and CLTL methods used in the reviewed literature.
PixelDINO: Semi-Supervised Semantic Segmentation for Detecting Permafrost Disturbances
Authors: Authors: Konrad Heidler, Ingmar Nitze, Guido Grosse, Xiao Xiang Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.09271
Pdf link: https://arxiv.org/pdf/2401.09271
Abstract Arctic Permafrost is facing significant changes due to global climate change. As these regions are largely inaccessible, remote sensing plays a crucial rule in better understanding the underlying processes not just on a local scale, but across the Arctic. In this study, we focus on the remote detection of retrogressive thaw slumps (RTS), a permafrost disturbance comparable to landslides induced by thawing. For such analyses from space, deep learning has become an indispensable tool, but limited labelled training data remains a challenge for training accurate models. To improve model generalization across the Arctic without the need for additional labelled data, we present a semi-supervised learning approach to train semantic segmentation models to detect RTS. Our framework called PixelDINO is trained in parallel on labelled data as well as unlabelled data. For the unlabelled data, the model segments the imagery into self-taught pseudo-classes and the training procedure ensures consistency of these pseudo-classes across strong augmentations of the input data. Our experimental results demonstrate that PixelDINO can improve model performance both over supervised baseline methods as well as existing semi-supervised semantic segmentation approaches, highlighting its potential for training robust models that generalize well to regions that were not included in the training data. The project page containing code and other materials for this study can be found at \url{https://khdlr.github.io/PixelDINO/}.
Hot Fixing Software: A Comprehensive Review of Terminology, Techniques, and Applications
Authors: Authors: Carol Hanna, David Clark, Federica Sarro, Justyna Petke
Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/2401.09275
Pdf link: https://arxiv.org/pdf/2401.09275
Abstract A hot fix is an improvement to a specific time-critical issue deployed to a software system in production. While hot fixing is an essential and common activity in software maintenance, it has never been surveyed as a research activity. Thus, such a review is long overdue. In this paper, we conduct a comprehensive literature review of work on hot fixing. We highlight the fields where this topic has been addressed, inconsistencies we identified in the terminology, gaps in the literature, and directions for future work. Our search concluded with 91 papers on the topic between the year 2000 and 2022. The papers found encompass many different research areas such as log analysis, runtime patching (also known as hot patching), and automated repair, as well as various application domains such as security, mobile, and video games. We find that there are many directions that can take hot fix research forward such as unifying existing terminology, establishing a benchmark set of hot fixes, researching costs and frequency of hot fixes, and researching the possibility of end-to-end automation of detection, mitigation, and propagation. We discuss these avenues in detail to inspire the community to systematize hot fixing as a software engineering activity. We hope that this paper streamlines the existing body of work and drives research in the area forward.
Siamese Meets Diffusion Network: SMDNet for Enhanced Change Detection in High-Resolution RS Imagery
Authors: Authors: Jia Jia, Geunho Lee, Zhibo Wang, Lyu Zhi, Yuchu He
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.09325
Pdf link: https://arxiv.org/pdf/2401.09325
Abstract Recently, the application of deep learning to change detection (CD) has significantly progressed in remote sensing images. In recent years, CD tasks have mostly used architectures such as CNN and Transformer to identify these changes. However, these architectures have shortcomings in representing boundary details and are prone to false alarms and missed detections under complex lighting and weather conditions. For that, we propose a new network, Siamese Meets Diffusion Network (SMDNet). This network combines the Siam-U2Net Feature Differential Encoder (SU-FDE) and the denoising diffusion implicit model to improve the accuracy of image edge change detection and enhance the model's robustness under environmental changes. First, we propose an innovative SU-FDE module that utilizes shared weight features to capture differences between time series images and identify similarities between features to enhance edge detail detection. Furthermore, we add an attention mechanism to identify key coarse features to improve the model's sensitivity and accuracy. Finally, the diffusion model of progressive sampling is used to fuse key coarse features, and the noise reduction ability of the diffusion model and the advantages of capturing the probability distribution of image data are used to enhance the adaptability of the model in different environments. Our method's combination of feature extraction and diffusion models demonstrates effectiveness in change detection in remote sensing images. The performance evaluation of SMDNet on LEVIR-CD, DSIFN-CD, and CDD datasets yields validated F1 scores of 90.99%, 88.40%, and 88.47%, respectively. This substantiates the advanced capabilities of our model in accurately identifying variations and intricate details.
Detection of Distributed Denial of Service Attacks Carried Out by Botnets in Software-Defined Networks
Authors: Authors: Jaime Tamayo, Lorena Isabel Barona López, Ángel Leonardo Valdivieso Caraguay
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2401.09358
Pdf link: https://arxiv.org/pdf/2401.09358
Abstract Recent years witnessed a surge in network traffic due to the emergence of new online services, causing periodic saturation and complexity problems. Additionally, the growing number of IoT devices further compounds the problem. Software Defined Network (SDN) is a new architecture which offers innovative advantages that help to reduce saturation problems. Despite its benefits, SDNs not only can be affected by traditional attacks but also introduce new security challenges. In this context, Distributed Denial of Service (DDoS) is one of the most important attacks that can damage an SDN network's normal operation. Furthermore, if these attacks are executed using botnets, they can use thousands of compromised devices to disrupt critical online services. This paper proposes a framework for detecting DDoS attacks generated by a group of botnets in an SDN network. The framework is implemented using open-source tools such as Mininet and OpenDaylight and tested in a centralized network topology using BYOB and SNORT. The results demonstrate real-time attack identification by implementing an intrusion detection mechanism in the victim client. Our proposed solution offers quick and effective detection of DDoS attacks in SDN networks. The framework can successfully differentiate the type of attack with high accuracy in a short time
Deciphering Textual Authenticity: A Generalized Strategy through the Lens of Large Language Semantics for Detecting Human vs. Machine-Generated Text
Authors: Authors: Mazal Bethany, Brandon Wherry, Emet Bethany, Nishant Vishwamitra, Peyman Najafirad
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.09407
Pdf link: https://arxiv.org/pdf/2401.09407
Abstract With the recent proliferation of Large Language Models (LLMs), there has been an increasing demand for tools to detect machine-generated text. The effective detection of machine-generated text face two pertinent problems: First, they are severely limited in generalizing against real-world scenarios, where machine-generated text is produced by a variety of generators, including but not limited to GPT-4 and Dolly, and spans diverse domains, ranging from academic manuscripts to social media posts. Second, existing detection methodologies treat texts produced by LLMs through a restrictive binary classification lens, neglecting the nuanced diversity of artifacts generated by different LLMs. In this work, we undertake a systematic study on the detection of machine-generated text in real-world scenarios. We first study the effectiveness of state-of-the-art approaches and find that they are severely limited against text produced by diverse generators and domains in the real world. Furthermore, t-SNE visualizations of the embeddings from a pretrained LLM's encoder show that they cannot reliably distinguish between human and machine-generated text. Based on our findings, we introduce a novel system, T5LLMCipher, for detecting machine-generated text using a pretrained T5 encoder combined with LLM embedding sub-clustering to address the text produced by diverse generators and domains in the real world. We evaluate our approach across 9 machine-generated text systems and 9 domains and find that our approach provides state-of-the-art generalization ability, with an average increase in F1 score on machine-generated text of 19.6\% on unseen generators and domains compared to the top performing existing approaches and correctly attributes the generator of text with an accuracy of 93.6\%.
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Authors: Authors: Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, Xinggang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.09417
Pdf link: https://arxiv.org/pdf/2401.09417
Abstract Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., Mamba, have shown great potential for long sequence modeling. Building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance of visual representation learning on self-attention is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8$\times$ faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248$\times$1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images and it has great potential to become the next-generation backbone for vision foundation models. Code is available at https://github.com/hustvl/Vim.
Keyword: face recognition

PPR: Enhancing Dodging Attacks while Maintaining Impersonation Attacks on Face Recognition Systems
Authors: Authors: Fengfan Zhou, Heifei Ling
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.08903
Pdf link: https://arxiv.org/pdf/2401.08903
Abstract Adversarial Attacks on Face Recognition (FR) encompass two types: impersonation attacks and evasion attacks. We observe that achieving a successful impersonation attack on FR does not necessarily ensure a successful dodging attack on FR in the black-box setting. Introducing a novel attack method named Pre-training Pruning Restoration Attack (PPR), we aim to enhance the performance of dodging attacks whilst avoiding the degradation of impersonation attacks. Our method employs adversarial example pruning, enabling a portion of adversarial perturbations to be set to zero, while tending to maintain the attack performance. By utilizing adversarial example pruning, we can prune the pre-trained adversarial examples and selectively free up certain adversarial perturbations. Thereafter, we embed adversarial perturbations in the pruned area, which enhances the dodging performance of the adversarial face examples. The effectiveness of our proposed attack method is demonstrated through our experimental results, showcasing its superior performance.
Keyword: augmentation

Contrastive Learning with Negative Sampling Correction
Authors: Authors: Lu Wang, Chao Du, Pu Zhao, Chuan Luo, Zhangchi Zhu, Bo Qiao, Wei Zhang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.08690
Pdf link: https://arxiv.org/pdf/2401.08690
Abstract As one of the most effective self-supervised representation learning methods, contrastive learning (CL) relies on multiple negative pairs to contrast against each positive pair. In the standard practice of contrastive learning, data augmentation methods are utilized to generate both positive and negative pairs. While existing works have been focusing on improving the positive sampling, the negative sampling process is often overlooked. In fact, the generated negative samples are often polluted by positive samples, which leads to a biased loss and performance degradation. To correct the negative sampling bias, we propose a novel contrastive learning method named Positive-Unlabeled Contrastive Learning (PUCL). PUCL treats the generated negative samples as unlabeled samples and uses information from positive samples to correct bias in contrastive loss. We prove that the corrected loss used in PUCL only incurs a negligible bias compared to the unbiased contrastive loss. PUCL can be applied to general contrastive learning problems and outperforms state-of-the-art methods on various image and graph classification tasks. The code of PUCL is in the supplementary file.
On the Effect of Data-Augmentation on Local Embedding Properties in the Contrastive Learning of Music Audio Representations
Authors: Authors: Matthew C. McCallum, Matthew E. P. Davies, Florian Henkel, Jaehun Kim, Samuel E. Sandberg
Subjects: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2401.08889
Pdf link: https://arxiv.org/pdf/2401.08889
Abstract Audio embeddings are crucial tools in understanding large catalogs of music. Typically embeddings are evaluated on the basis of the performance they provide in a wide range of downstream tasks, however few studies have investigated the local properties of the embedding spaces themselves which are important in nearest neighbor algorithms, commonly used in music search and recommendation. In this work we show that when learning audio representations on music datasets via contrastive learning, musical properties that are typically homogeneous within a track (e.g., key and tempo) are reflected in the locality of neighborhoods in the resulting embedding space. By applying appropriate data augmentation strategies, localisation of such properties can not only be reduced but the localisation of other attributes is increased. For example, locality of features such as pitch and tempo that are less relevant to non-expert listeners, may be mitigated while improving the locality of more salient features such as genre and mood, achieving state-of-the-art performance in nearest neighbor retrieval accuracy. Similarly, we show that the optimal selection of data augmentation strategies for contrastive learning of music audio embeddings is dependent on the downstream task, highlighting this as an important embedding design decision.
Similar but Faster: Manipulation of Tempo in Music Audio Embeddings for Tempo Prediction and Search
Authors: Authors: Matthew C. McCallum, Florian Henkel, Jaehun Kim, Samuel E. Sandberg, Matthew E. P. Davies
Subjects: Sound (cs.SD); Digital Libraries (cs.DL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2401.08902
Pdf link: https://arxiv.org/pdf/2401.08902
Abstract Audio embeddings enable large scale comparisons of the similarity of audio files for applications such as search and recommendation. Due to the subjectivity of audio similarity, it can be desirable to design systems that answer not only whether audio is similar, but similar in what way (e.g., wrt. tempo, mood or genre). Previous works have proposed disentangled embedding spaces where subspaces representing specific, yet possibly correlated, attributes can be weighted to emphasize those attributes in downstream tasks. However, no research has been conducted into the independence of these subspaces, nor their manipulation, in order to retrieve tracks that are similar but different in a specific way. Here, we explore the manipulation of tempo in embedding spaces as a case-study towards this goal. We propose tempo translation functions that allow for efficient manipulation of tempo within a pre-existing embedding space whilst maintaining other properties such as genre. As this translation is specific to tempo it enables retrieval of tracks that are similar but have specifically different tempi. We show that such a function can be used as an efficient data augmentation strategy for both training of downstream tempo predictors, and improved nearest neighbor retrieval of properties largely independent of tempo.
Augmenting Math Word Problems via Iterative Question Composing
Authors: Authors: Haoxiong Liu, Andrew Chi-Chih Yao
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.09003
Pdf link: https://arxiv.org/pdf/2401.09003
Abstract Despite recent progress in improving the mathematical reasoning ability of large language models(LLMs), solving competition-level math problems without the use of external tools remains challenging for open-source LLMs. In this work, we introduce the MMIQC dataset, a mixture of processed web data and synthetic question-response pairs, to equip base models with better mathematical reasoning skills. Mistral-7B-MMIQC, the model obtained by fine-tuning Mistral-7B(arXiv:2310.06825) on MMIQC, achieves 36.0\% accuracy on MATH(arXiv:2103.03874), 5.8\% higher than the previous (model size $\sim$7B) SOTA. Our experiments also show that a large part of the improvement attributes to our novel augmentation method IQC(Iterative Question Composing), where we iteratively ask an LLM to compose new questions from the given seed problems and do rejection sampling from another LLM. MMIQC has now been released on https://huggingface.co/datasets/Vivacem/MMIQC.
Enhancing Lidar-based Object Detection in Adverse Weather using Offset Sequences in Time
Authors: Authors: Raphael van Kempen, Tim Rehbronn, Abin Jose, Johannes Stegmaier, Bastian Lampe, Timo Woopen, Lutz Eckstein
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2401.09049
Pdf link: https://arxiv.org/pdf/2401.09049
Abstract Automated vehicles require an accurate perception of their surroundings for safe and efficient driving. Lidar-based object detection is a widely used method for environment perception, but its performance is significantly affected by adverse weather conditions such as rain and fog. In this work, we investigate various strategies for enhancing the robustness of lidar-based object detection by processing sequential data samples generated by lidar sensors. Our approaches leverage temporal information to improve a lidar object detection model, without the need for additional filtering or pre-processing steps. We compare $10$ different neural network architectures that process point cloud sequences including a novel augmentation strategy introducing a temporal offset between frames of a sequence during training and evaluate the effectiveness of all strategies on lidar point clouds under adverse weather conditions through experiments. Our research provides a comprehensive study of effective methods for mitigating the effects of adverse weather on the reliability of lidar-based object detection using sequential data that are evaluated using public datasets such as nuScenes, Dense, and the Canadian Adverse Driving Conditions Dataset. Our findings demonstrate that our novel method, involving temporal offset augmentation through randomized frame skipping in sequences, enhances object detection accuracy compared to both the baseline model (Pillar-based Object Detection) and no augmentation.
Knowledge Pyramid: A Novel Hierarchical Reasoning Structure for Generalized Knowledge Augmentation and Inference
Authors: Authors: Qinghua Huang, Yongzhen Wang
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2401.09070
Pdf link: https://arxiv.org/pdf/2401.09070
Abstract Knowledge graph (KG) based reasoning has been regarded as an effective means for the analysis of semantic networks and is of great usefulness in areas of information retrieval, recommendation, decision-making, and man-machine interaction. It is widely used in recommendation, decision-making, question-answering, search, and other fields. However, previous studies mainly used low-level knowledge in the KG for reasoning, which may result in insufficient generalization and poor robustness of reasoning. To this end, this paper proposes a new inference approach using a novel knowledge augmentation strategy to improve the generalization capability of KG. This framework extracts high-level pyramidal knowledge from low-level knowledge and applies it to reasoning in a multi-level hierarchical KG, called knowledge pyramid in this paper. We tested some medical data sets using the proposed approach, and the experimental results show that the proposed knowledge pyramid has improved the knowledge inference performance with better generalization. Especially, when there are fewer training samples, the inference accuracy can be significantly improved.
Trapped in texture bias? A large scale comparison of deep instance segmentation
Authors: Authors: Johannes Theodoridis, Jessica Hofmann, Johannes Maucher, Andreas Schilling
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.09109
Pdf link: https://arxiv.org/pdf/2401.09109
Abstract Do deep learning models for instance segmentation generalize to novel objects in a systematic way? For classification, such behavior has been questioned. In this study, we aim to understand if certain design decisions such as framework, architecture or pre-training contribute to the semantic understanding of instance segmentation. To answer this question, we consider a special case of robustness and compare pre-trained models on a challenging benchmark for object-centric, out-of-distribution texture. We do not introduce another method in this work. Instead, we take a step back and evaluate a broad range of existing literature. This includes Cascade and Mask R-CNN, Swin Transformer, BMask, YOLACT(++), DETR, BCNet, SOTR and SOLOv2. We find that YOLACT++, SOTR and SOLOv2 are significantly more robust to out-of-distribution texture than other frameworks. In addition, we show that deeper and dynamic architectures improve robustness whereas training schedules, data augmentation and pre-training have only a minor impact. In summary we evaluate 68 models on 61 versions of MS COCO for a total of 4148 evaluations.
An Efficient Generalizable Framework for Visuomotor Policies via Control-aware Augmentation and Privilege-guided Distillation
Authors: Authors: Yinuo Zhao, Kun Wu, Tianjiao Yi, Zhiyuan Xu, Xiaozhu Ju, Zhengping Che, Qinru Qiu, Chi Harold Liu, Jian Tang
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.09258
Pdf link: https://arxiv.org/pdf/2401.09258
Abstract Visuomotor policies, which learn control mechanisms directly from high-dimensional visual observations, confront challenges in adapting to new environments with intricate visual variations. Data augmentation emerges as a promising method for bridging these generalization gaps by enriching data variety. However, straightforwardly augmenting the entire observation shall impose excessive burdens on policy learning and may even result in performance degradation. In this paper, we propose to improve the generalization ability of visuomotor policies as well as preserve training stability from two aspects: 1) We learn a control-aware mask through a self-supervised reconstruction task with three auxiliary losses and then apply strong augmentation only to those control-irrelevant regions based on the mask to reduce the generalization gaps. 2) To address training instability issues prevalent in visual reinforcement learning (RL), we distill the knowledge from a pretrained RL expert processing low-level environment states, to the student visuomotor policy. The policy is subsequently deployed to unseen environments without any further finetuning. We conducted comparison and ablation studies across various benchmarks: the DMControl Generalization Benchmark (DMC-GB), the enhanced Robot Manipulation Distraction Benchmark (RMDB), and a specialized long-horizontal drawer-opening robotic task. The extensive experimental results well demonstrate the effectiveness of our method, e.g., showing a 17\% improvement over previous methods in the video-hard setting of DMC-GB.
PixelDINO: Semi-Supervised Semantic Segmentation for Detecting Permafrost Disturbances
Authors: Authors: Konrad Heidler, Ingmar Nitze, Guido Grosse, Xiao Xiang Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.09271
Pdf link: https://arxiv.org/pdf/2401.09271
Abstract Arctic Permafrost is facing significant changes due to global climate change. As these regions are largely inaccessible, remote sensing plays a crucial rule in better understanding the underlying processes not just on a local scale, but across the Arctic. In this study, we focus on the remote detection of retrogressive thaw slumps (RTS), a permafrost disturbance comparable to landslides induced by thawing. For such analyses from space, deep learning has become an indispensable tool, but limited labelled training data remains a challenge for training accurate models. To improve model generalization across the Arctic without the need for additional labelled data, we present a semi-supervised learning approach to train semantic segmentation models to detect RTS. Our framework called PixelDINO is trained in parallel on labelled data as well as unlabelled data. For the unlabelled data, the model segments the imagery into self-taught pseudo-classes and the training procedure ensures consistency of these pseudo-classes across strong augmentations of the input data. Our experimental results demonstrate that PixelDINO can improve model performance both over supervised baseline methods as well as existing semi-supervised semantic segmentation approaches, highlighting its potential for training robust models that generalize well to regions that were not included in the training data. The project page containing code and other materials for this study can be found at \url{https://khdlr.github.io/PixelDINO/}.
Tri$^{2}$-plane: Volumetric Avatar Reconstruction with Feature Pyramid
Authors: Authors: Luchuan Song, Pinxin Liu, Lele Chen, Celong Liu, Chenliang Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.09386
Pdf link: https://arxiv.org/pdf/2401.09386
Abstract Recent years have witnessed considerable achievements in facial avatar reconstruction with neural volume rendering. Despite notable advancements, the reconstruction of complex and dynamic head movements from monocular videos still suffers from capturing and restoring fine-grained details. In this work, we propose a novel approach, named Tri$^2$-plane, for monocular photo-realistic volumetric head avatar reconstructions. Distinct from the existing works that rely on a single tri-plane deformation field for dynamic facial modeling, the proposed Tri$^2$-plane leverages the principle of feature pyramids and three top-to-down lateral connections tri-planes for details improvement. It samples and renders facial details at multiple scales, transitioning from the entire face to specific local regions and then to even more refined sub-regions. Moreover, we incorporate a camera-based geometry-aware sliding window method as an augmentation in training, which improves the robustness beyond the canonical space, with a particular improvement in cross-identity generation capabilities. Experimental outcomes indicate that the Tri$^2$-plane not only surpasses existing methodologies but also achieves superior performance across both quantitative metrics and qualitative assessments through experiments.

LeeKyungwook / get-arxiv-noti

New submissions for Thu, 18 Jan 24 #939

Keyword: detection

Temporal Embeddings: Scalable Self-Supervised Temporal Representation Learning from Spatiotemporal Data for Multimodal Computer Vision

Improved Pothole Detection Using YOLOv7 and ESRGAN

Online Anomaly Detection over Live Social Video Streaming

Immature Green Apple Detection and Sizing in Commercial Orchards using YOLOv8 and Shape Fitting Techniques

Attention Modules Improve Modern Image-Level Anomaly Detection: A DifferNet Case Study

DA-BEV: Unsupervised Domain Adaptation for Bird's Eye View Perception

NODI: Out-Of-Distribution Detection with Noise from Diffusion

AiGen-FoodReview: A Multimodal Dataset of Machine-Generated Restaurant Reviews and Images on Social Media

Image Fusion in Remote Sensing: An Overview and Meta Analysis

Exploring Content-Based and Meta-Data Analysis for Detecting Fake News Infodemic: A case study on COVID-19

Benchmarking Particle Filter Algorithms for Efficient Velodyne-Based Vehicle Localization

Learning to detect cloud and snow in remote sensing images from noisy labels

AntiPhishStack: LSTM-based Stacked Generalization Model for Optimized Phishing URLs Detection

Hearing Loss Detection from Facial Expressions in One-on-one Conversations

A GAN-based data poisoning framework against anomaly detection in vertical federated learning

Knight Watch: Ubiquitous Computing Enhancements To Sleep Quality With Acoustic Analysis

Explain Thyself Bully: Sentiment Aided Cyberbullying Detection with Explanation

Enhancing Lidar-based Object Detection in Adverse Weather using Offset Sequences in Time

Flow Divergence: Comparing Maps of Flows with Relative Entropy

Your blush gives you away: detecting hidden mental states with remote photoplethysmography and thermal imaging

DK-SLAM: Monocular Visual SLAM with Deep Keypoints Adaptive Learning, Tracking and Loop-Closing

Exploring the Role of Convolutional Neural Networks (CNN) in Dental Radiography Segmentation: A Comprehensive Systematic Literature Review

Cross-Domain AI for Early Attack Detection and Defense Against Malicious Flows in O-RAN

Neural Network Equalizers and Successive Interference Cancellation for Bandlimited Channels with a Nonlinearity

Dynamic Relation Transformer for Contextual Text Block Detection

Cross-lingual Offensive Language Detection: A Systematic Review of Datasets, Transfer Approaches and Challenges

PixelDINO: Semi-Supervised Semantic Segmentation for Detecting Permafrost Disturbances

Hot Fixing Software: A Comprehensive Review of Terminology, Techniques, and Applications

Siamese Meets Diffusion Network: SMDNet for Enhanced Change Detection in High-Resolution RS Imagery

Detection of Distributed Denial of Service Attacks Carried Out by Botnets in Software-Defined Networks

Deciphering Textual Authenticity: A Generalized Strategy through the Lens of Large Language Semantics for Detecting Human vs. Machine-Generated Text

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

Keyword: face recognition

PPR: Enhancing Dodging Attacks while Maintaining Impersonation Attacks on Face Recognition Systems

Keyword: augmentation

Contrastive Learning with Negative Sampling Correction

On the Effect of Data-Augmentation on Local Embedding Properties in the Contrastive Learning of Music Audio Representations

Similar but Faster: Manipulation of Tempo in Music Audio Embeddings for Tempo Prediction and Search

Augmenting Math Word Problems via Iterative Question Composing

Enhancing Lidar-based Object Detection in Adverse Weather using Offset Sequences in Time

Knowledge Pyramid: A Novel Hierarchical Reasoning Structure for Generalized Knowledge Augmentation and Inference

Trapped in texture bias? A large scale comparison of deep instance segmentation

An Efficient Generalizable Framework for Visuomotor Policies via Control-aware Augmentation and Privilege-guided Distillation

PixelDINO: Semi-Supervised Semantic Segmentation for Detecting Permafrost Disturbances

Tri$^{2}$-plane: Volumetric Avatar Reconstruction with Feature Pyramid