Abstract
Face morphing attacks present an emerging threat to face recognition systems. Moreover, printing and scanning morphed images can obscure the artifacts generated during the morphing process, which makes morphed image detection even harder. In this work, we investigate the impact that printing and scanning have on morphing attacks through a series of heterogeneous tests. Our experiments show that we can increase the possibility of a false match by up to 5.64% for DiM and 16.00% for StyleGAN2 when providing an image that has been printed and scanned, regardless of whether it is morphed or bona fide, to a Face Recognition (FR) system. Likewise, using the Fréchet Inception Distance (FID) metric, strictly print-scanned morph attacks performed on average 9.185% stronger than non-print-scanned digital morphs.
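For readers unfamiliar with the metric, the standard definition of FID is given here for reference. It compares Gaussian fits to the Inception features of two image sets; the symbols $\mu_r, \Sigma_r$ (reference set) and $\mu_g, \Sigma_g$ (compared set) are notation chosen for this note and do not come from the abstract.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

A lower value means the two feature distributions are closer. The abstract does not say which image sets were compared.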
MambaAD: Exploring State Space Models for Multi-class Unsupervised Anomaly Detection
Abstract
Recent advancements in anomaly detection have seen the efficacy of CNN- and transformer-based approaches. However, CNNs struggle with long-range dependencies, while transformers are burdened by quadratic computational complexity. Mamba-based models, with their superior long-range modeling and linear efficiency, have garnered substantial attention. This study pioneers the application of Mamba to multi-class unsupervised anomaly detection, presenting MambaAD, which consists of a pre-trained encoder and a Mamba decoder featuring Locality-Enhanced State Space (LSS) modules at multiple scales. The proposed LSS module, integrating parallel cascaded Hybrid State Space (HSS) blocks and multi-kernel convolution operations, effectively captures both long-range and local information. The HSS block, utilizing Hybrid Scanning (HS) encoders, encodes feature maps using five scanning methods and eight directions, thereby strengthening global connections through the State Space Model (SSM). The use of Hilbert scanning and eight directions significantly improves feature sequence modeling. Comprehensive experiments on six diverse anomaly detection datasets and seven metrics demonstrate SoTA performance, substantiating the method's effectiveness.
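To make the idea of Hilbert scanning concrete, the sketch below flattens a square feature map along a Hilbert curve using the textbook index-to-coordinate conversion. It is an illustration only: the function names are hypothetical, the grid side is assumed to be a power of two, and nothing here is taken from the MambaAD implementation.

```python
def hilbert_d2xy(side, d):
    """Map index d along a Hilbert curve to (x, y) on a side x side grid.

    Assumes side is a power of two (textbook formulation).
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the quadrant so consecutive cells stay adjacent.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_scan(feature_map):
    """Flatten a square 2D list into a 1D sequence in Hilbert order."""
    side = len(feature_map)
    order = [hilbert_d2xy(side, d) for d in range(side * side)]
    return [feature_map[y][x] for x, y in order]


grid = [[r * 4 + c for c in range(4)] for r in range(4)]
sequence = hilbert_scan(grid)
```

Unlike a row-by-row raster scan, consecutive elements of `sequence` always come from neighbouring cells of the grid, which is the locality property that makes such a scan attractive for sequence models.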
Detecting Refactoring Commits in Machine Learning Python Projects: A Machine Learning-Based Approach
Abstract
Refactoring enhances software quality without altering its functional behaviors. Understanding the refactoring activities of developers is crucial to improving software maintainability. With the increasing use of machine learning (ML) libraries and frameworks, maximizing their maintainability is crucial. Due to the data-driven nature of ML projects, they often undergo different refactoring operations (e.g., data manipulation), for which existing refactoring tools lack ML-specific detection capabilities. Furthermore, a large number of ML libraries are written in Python, which has limited tools for refactoring detection. PyRef, a rule-based and state-of-the-art tool for Python refactoring detection, can identify 11 types of refactoring operations. In comparison, Rminer can detect 99 types of refactoring for Java projects. We introduce MLRefScanner, a prototype tool that applies machine-learning techniques to detect refactoring commits in ML Python projects. MLRefScanner identifies commits with both ML-specific and general refactoring operations. Evaluating MLRefScanner on 199 ML projects demonstrates its superior performance compared to state-of-the-art approaches, achieving an overall 94% precision and 82% recall. Combining it with PyRef further boosts performance to 95% precision and 99% recall. Our study highlights the potential of ML-driven approaches in detecting refactoring across diverse programming languages and technical domains, addressing the limitations of rule-based detection methods.
FlameFinder: Illuminating Obscured Fire through Smoke with Attentive Deep Metric Learning
Authors: Hossein Rajoli, Sahand Khoshdel, Fatemeh Afghah, Xiaolong Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
FlameFinder is a deep metric learning (DML) framework designed to accurately detect flames, even when obscured by smoke, using thermal images from firefighter drones during wildfire monitoring. Traditional RGB cameras struggle in such conditions, but thermal cameras can capture smoke-obscured flame features. However, they lack absolute thermal reference points, leading to false positives. To address this issue, FlameFinder utilizes paired thermal-RGB images for training. By learning latent flame features from smoke-free samples, the model becomes less biased towards relative thermal gradients. In testing, it identifies flames in smoky patches by analyzing their equivalent thermal-domain distribution. This method improves performance using both supervised and distance-based clustering metrics. The framework incorporates a flame segmentation method and a DML-aided detection framework. This includes utilizing center loss (CL), triplet center loss (TCL), and triplet cosine center loss (TCCL) to identify optimal cluster representatives for classification. However, the dominance of center loss over the other losses leads to the model missing features sensitive to them. To address this limitation, an attention mechanism is proposed. This mechanism allows for non-uniform feature contribution, amplifying the critical role of cosine and triplet loss in the DML framework. Additionally, it improves interpretability and class discrimination and decreases intra-class variance. As a result, the proposed model surpasses the baseline by 4.4% on the FLAME2 dataset and 7% on the FLAME3 dataset for unobscured flame detection accuracy. Moreover, it demonstrates enhanced class separation in obscured scenarios compared to VGG19, ResNet18, and three backbone models tailored for flame detection.
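As a rough illustration of two of the losses named above, the sketch below computes center loss and triplet center loss in one common formulation. Several things are assumptions made for this example rather than details from the abstract: the use of Euclidean distance, the margin values, the class names, and the function names. The cosine variant (TCCL) and the attention mechanism are not shown.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def center_loss(embeddings, labels, centers):
    """Half the mean squared distance of each embedding to its own class center."""
    total = sum(euclidean(e, centers[y]) ** 2 for e, y in zip(embeddings, labels))
    return 0.5 * total / len(embeddings)


def triplet_center_loss(embeddings, labels, centers, margin=1.0):
    """Hinge on (distance to own center) + margin - (distance to nearest other center)."""
    total = 0.0
    for e, y in zip(embeddings, labels):
        d_pos = euclidean(e, centers[y])
        d_neg = min(euclidean(e, c) for k, c in centers.items() if k != y)
        total += max(0.0, d_pos + margin - d_neg)
    return total / len(embeddings)


# Hypothetical 2D embeddings and class centers, for illustration only.
centers = {"fire": (0.0, 0.0), "no_fire": (4.0, 0.0)}
embeddings = [(1.0, 0.0), (3.0, 0.0)]
labels = ["fire", "no_fire"]

cl = center_loss(embeddings, labels, centers)
tcl = triplet_center_loss(embeddings, labels, centers, margin=3.0)
```

Center loss only pulls samples towards their own center, while the triplet variant also pushes them away from the nearest competing center. That difference is one way to read the abstract's remark that a dominant center loss can hide features the other losses are sensitive to.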
Multi-modal Document Presentation Attack Detection With Forensics Trace Disentanglement
Abstract
Document Presentation Attack Detection (DPAD) is an important measure in protecting the authenticity of a document image. However, recent DPAD methods demand additional resources, such as manual effort in collecting additional data or knowledge of the parameters of acquisition devices. This work proposes a DPAD method based on multi-modal disentangled traces (MMDT) without the above drawbacks. We first disentangle the recaptured traces with a self-supervised disentanglement and synthesis network to enhance the generalization capacity on document images with different contents and layouts. Then, unlike existing DPAD approaches that rely only on data in the RGB domain, we propose to explicitly employ the disentangled recaptured traces as new modalities in the transformer backbone through adaptive multi-modal adapters to fuse RGB/trace features efficiently. Visualization of the disentangled traces confirms the effectiveness of the proposed method across different document contents. Extensive experiments on three benchmark datasets demonstrate the superiority of our MMDT method in representing forensic traces of recapturing distortion.
What's Mine becomes Yours: Defining, Annotating and Detecting Context-Dependent Paraphrases in News Interview Dialogs
Authors: Anna Wegmann, Tijs van den Broek, Dong Nguyen
Abstract
Best practices for high conflict conversations like counseling or customer support almost always include recommendations to paraphrase the previous speaker. Although paraphrase classification has received widespread attention in NLP, paraphrases are usually considered independent from context, and common models and datasets are not applicable to dialog settings. In this work, we investigate paraphrases in dialog (e.g., Speaker 1: "That book is mine." becomes Speaker 2: "That book is yours."). We provide an operationalization of context-dependent paraphrases, and develop a training for crowd-workers to classify paraphrases in dialog. We introduce a dataset with utterance pairs from NPR and CNN news interviews annotated for context-dependent paraphrases. To enable analyses on label variation, the dataset contains 5,581 annotations on 600 utterance pairs. We present promising results with in-context learning and with token classification models for automatic paraphrase detection in dialog.
Unsupervised Visible-Infrared ReID via Pseudo-label Correction and Modality-level Alignment
Authors: Yexin Liu, Weiming Zhang, Athanasios V. Vasilakos, Lin Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Unsupervised visible-infrared person re-identification (UVI-ReID) has recently gained great attention due to its potential for enhancing human detection in diverse environments without labeling. Previous methods utilize intra-modality clustering and cross-modality feature matching to achieve UVI-ReID. However, there exist two challenges: 1) noisy pseudo labels might be generated in the clustering process, and 2) cross-modality feature alignment via matching the marginal distributions of the visible and infrared modalities may misalign different identities from the two modalities. In this paper, we first conduct a theoretical analysis in which an interpretable generalization upper bound is introduced. Based on the analysis, we then propose a novel unsupervised cross-modality person re-identification framework (PRAISE). Specifically, to address the first challenge, we propose a pseudo-label correction strategy that utilizes a Beta Mixture Model to predict the probability of mis-clustering based on the network's memory effect and rectifies the correspondence by adding a perceptual term to contrastive learning. Next, we introduce a modality-level alignment strategy that generates paired visible-infrared latent features and reduces the modality gap by aligning the labeling function of visible and infrared features to learn identity-discriminative and modality-invariant features. Experimental results on two benchmark datasets demonstrate that our method outperforms the unsupervised visible-ReID methods and achieves state-of-the-art performance.
Scaling Multi-Camera 3D Object Detection through Weak-to-Strong Eliciting
Abstract
The emergence of Multi-Camera 3D Object Detection (MC3D-Det), facilitated by bird's-eye view (BEV) representation, signifies a notable progression in 3D object detection. Scaling MC3D-Det training effectively accommodates varied camera parameters and urban landscapes, paving the way for the MC3D-Det foundation model. However, the multi-view fusion stage of the MC3D-Det method relies on the ill-posed monocular perception during training rather than surround refinement ability, leading to what we term "surround refinement degradation". To this end, our study presents a weak-to-strong eliciting framework aimed at enhancing surround refinement while maintaining robust monocular perception. Specifically, our framework employs weakly tuned experts trained on distinct subsets, each of which is inherently biased toward specific camera configurations and scenarios. These biased experts can learn the perception of monocular degeneration, which can help the multi-view fusion stage to enhance surround refinement abilities. Moreover, a composite distillation strategy is proposed to integrate the universal knowledge of 2D foundation models and task-specific information. Finally, for MC3D-Det joint training, an elaborate dataset merge strategy is designed to solve the problem of inconsistent camera numbers and camera parameters. We set up a multiple-dataset joint training benchmark for MC3D-Det and adequately evaluate existing methods. Further, we demonstrate that the proposed framework brings a generalized and significant boost over multiple baselines. Our code is at https://github.com/EnVision-Research/Scale-BEV.
Sparse Points to Dense Clouds: Enhancing 3D Detection with Limited LiDAR Data
Abstract
3D detection is a critical task that enables machines to identify and locate objects in three-dimensional space. It has a broad range of applications in several fields, including autonomous driving, robotics, and augmented reality. Monocular 3D detection is attractive as it requires only a single camera; however, it lacks the accuracy and robustness required for real-world applications. High-resolution LiDAR, on the other hand, can be expensive and can lead to interference problems in heavy traffic given its active transmissions. We propose a balanced approach that combines the advantages of monocular and point cloud-based 3D detection. Our method requires only a small number of 3D points, which can be obtained from a low-cost, low-resolution sensor. Specifically, we use only 512 points, which is just 1% of a full LiDAR frame in the KITTI dataset. Our method reconstructs a complete 3D point cloud from this limited 3D information combined with a single image. The reconstructed 3D point cloud and corresponding image can be used by any multi-modal off-the-shelf detector for 3D object detection. By using the proposed network architecture with an off-the-shelf multi-modal 3D detector, the accuracy of 3D detection improves by 20% compared to the state-of-the-art monocular detection methods and by 6% to 9% compared to the baseline multi-modal methods on the KITTI and JackRabbot datasets.
Transferable and Efficient Non-Factual Content Detection via Probe Training with Offline Consistency Checking
Abstract
Detecting non-factual content is a longstanding goal for increasing the trustworthiness of large language model (LLM) generations. Current factuality probes, trained using human-annotated labels, exhibit limited transferability to out-of-distribution content, while online self-consistency checking imposes an extensive computational burden due to the necessity of generating multiple outputs. This paper proposes PINOSE, which trains a probing model on offline self-consistency checking results, thereby circumventing the need for human-annotated data and achieving transferability across diverse data distributions. As the consistency check process is offline, PINOSE avoids the computational burden that online consistency verification incurs by generating multiple responses. Additionally, it examines various aspects of internal states prior to response decoding, contributing to more effective detection of factual inaccuracies. Experimental results on both factuality detection and question answering benchmarks show that PINOSE outperforms existing factuality detection methods. Our code and datasets are publicly available on this anonymized repository.
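The general idea of turning self-consistency into a training label can be sketched as follows. This is a deliberately simplified stand-in, not PINOSE's procedure: exact-match agreement after normalization, the 0.6 threshold, and the function name are all assumptions made for illustration, and the abstract does not describe how consistency is actually judged.

```python
from collections import Counter


def consistency_label(sampled_answers, threshold=0.6):
    """Derive a pseudo-label from agreement among sampled answers.

    Returns (most common answer, agreement ratio, is_consistent).
    The exact-match criterion and the threshold are illustrative assumptions.
    """
    normalized = [a.strip().lower() for a in sampled_answers]
    top_answer, top_count = Counter(normalized).most_common(1)[0]
    agreement = top_count / len(normalized)
    return top_answer, agreement, agreement >= threshold


# Hypothetical answers sampled offline for one question.
answer, agreement, is_consistent = consistency_label(["Paris", "paris ", "Lyon", "Paris"])
```

Labels of this kind can be computed once, offline, and then used to train a lightweight probe, so no extra sampling is needed when a response is checked at inference time.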
MedRG: Medical Report Grounding with Multi-modal Large Language Model
Authors: Ke Zou, Yang Bai, Zhihao Chen, Yang Zhou, Yidi Chen, Kai Ren, Meng Wang, Xuedong Yuan, Xiaojing Shen, Huazhu Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Medical Report Grounding is pivotal in identifying the most relevant regions in medical images based on a given phrase query, a critical aspect in medical image analysis and radiological diagnosis. However, prevailing visual grounding approaches necessitate the manual extraction of key phrases from medical reports, imposing substantial burdens on both system efficiency and physicians. In this paper, we introduce a novel framework, Medical Report Grounding (MedRG), an end-to-end solution that utilizes a multi-modal Large Language Model to predict key phrases by incorporating a unique token, BOX, into the vocabulary to serve as an embedding for unlocking detection capabilities. Subsequently, the vision encoder-decoder jointly decodes the hidden embedding and the input medical image, generating the corresponding grounding box. The experimental results validate the effectiveness of MedRG, which surpasses the performance of the existing state-of-the-art medical phrase grounding methods. This study represents a pioneering exploration of the medical report grounding task, marking the first endeavor in this domain.
Sound Matters: Auditory Detectability of Mobile Robots
Authors: Subham Agrawal, Marlene Wessels, Jorge de Heuvel, Johannes Kraus, Maren Bennewitz
Abstract
Mobile robots are increasingly being used in noisy environments for social purposes, e.g., to provide support in healthcare or public spaces. Since these robots also operate beyond human sight, the question arises as to how different robot types, ambient noise, or cognitive engagement impact the detection of the robots by their sound. To address this research gap, we conducted a user study measuring auditory detection distances for a wheeled robot (Turtlebot 2i) and a quadruped robot (Unitree Go 1), which emit different consequential sounds when moving. We also manipulated background noise levels and participants' engagement in a secondary task during the study. Our results showed that the quadruped robot's sound was detected significantly better (i.e., at a larger distance) than the wheeled one, which demonstrates that the movement mechanism has a meaningful impact on auditory detectability. The detectability of both robots diminished significantly as background noise increased. But even in high background noise, participants detected the quadruped robot at a significantly larger distance. Engagement in a secondary task had hardly any impact. In essence, these findings highlight the critical role of distinguishing auditory characteristics of different robots to improve the smooth human-centered navigation of mobile robots in noisy environments.
SplatPose & Detect: Pose-Agnostic 3D Anomaly Detection
Authors: Mathis Kruse, Marco Rudolph, Dominik Woiwode, Bodo Rosenhahn
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
Detecting anomalies in images has become a well-explored problem in both academia and industry. State-of-the-art algorithms are able to detect defects in increasingly difficult settings and data modalities. However, most current methods are not suited to address 3D objects captured from differing poses. While solutions using Neural Radiance Fields (NeRFs) have been proposed, they suffer from excessive computation requirements, which hinder real-world usability. For this reason, we propose the novel 3D Gaussian splatting-based framework SplatPose which, given multi-view images of a 3D object, accurately estimates the pose of unseen views in a differentiable manner, and detects anomalies in them. We achieve state-of-the-art results in both training and inference speed, and detection performance, even when using less training data than competing methods. We thoroughly evaluate our framework using the recently proposed Pose-agnostic Anomaly Detection benchmark and its multi-pose anomaly detection (MAD) data set.
Beyond Random Inputs: A Novel ML-Based Hardware Fuzzing
Abstract
Modern computing systems heavily rely on hardware as the root of trust. However, their increasing complexity has given rise to security-critical vulnerabilities that cross-layer attacks can exploit. Traditional hardware vulnerability detection methods, such as random regression and formal verification, have limitations. Random regression, while scalable, is slow in exploring hardware, and formal verification techniques are often burdened by manual effort and state explosion. Hardware fuzzing has emerged as an effective approach to exploring and detecting security vulnerabilities in large-scale designs like modern processors. Hardware fuzzers outperform traditional methods regarding coverage, scalability, and efficiency. However, state-of-the-art fuzzers struggle to achieve comprehensive coverage of intricate hardware designs within a practical timeframe, often falling short of a 70% coverage threshold. We propose a novel ML-based hardware fuzzer, ChatFuzz, to address this challenge. Our approach leverages LLMs like ChatGPT to understand processor language, focusing on machine codes and generating assembly code sequences. Reinforcement learning (RL) is integrated to guide the input generation process by rewarding the inputs using code coverage metrics. We use the open-source RISC-V-based RocketCore processor as our testbed. ChatFuzz achieves a condition coverage rate of 75% in just 52 minutes, whereas a state-of-the-art fuzzer requires a lengthy 30-hour window to reach a similar condition coverage. Furthermore, our fuzzer can attain 80% coverage when provided with a limited pool of 10 simulation instances/licenses within a 130-hour window. During this time, it ran a total of 199K test cases, of which 6K produced discrepancies with the processor's golden model. Our analysis identified more than 10 unique mismatches, including two new bugs in the RocketCore and discrepancies from the RISC-V ISA Simulator.
Monocular 3D lane detection for Autonomous Driving: Recent Achievements, Challenges, and Outlooks
Authors: Fulong Ma, Weiqing Qi, Guoyang Zhao, Linwei Zheng, Sheng Wang, Ming Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
3D lane detection plays a crucial role in autonomous driving by extracting structural and traffic information from the road in 3D space to assist the self-driving car in rational, safe, and comfortable path planning and motion control. Due to the consideration of sensor costs and the advantages of visual data in color information, in practical applications, 3D lane detection based on monocular vision is one of the important research directions in the field of autonomous driving, which has attracted more and more attention in both industry and academia. Unfortunately, recent progress in visual perception seems insufficient to develop completely reliable 3D lane detection algorithms, which also hinders the development of vision-based fully autonomous self-driving cars, i.e., achieving level 5 autonomous driving, driving like human-controlled cars. This is one of the conclusions drawn from this review paper: there is still a lot of room for improvement and significant improvements are still needed in the 3D lane detection algorithm for autonomous driving cars using visual sensors. Motivated by this, this review defines, analyzes, and reviews the current achievements in the field of 3D lane detection research, and the vast majority of the current progress relies heavily on computationally complex deep learning models. In addition, this review covers the 3D lane detection pipeline, investigates the performance of state-of-the-art algorithms, analyzes the time complexity of cutting-edge modeling choices, and highlights the main achievements and limitations of current research efforts. The survey also includes a comprehensive discussion of available 3D lane detection datasets and the challenges that researchers have faced but have not yet resolved. Finally, our work outlines future research directions and welcomes researchers and practitioners to enter this exciting field.
Research on Detection of Floating Objects in River and Lake Based on AI Intelligent Image Recognition
Authors: Jingyu Zhang, Ao Xiang, Yu Cheng, Qin Yang, Liyang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
With the rapid advancement of artificial intelligence technology, AI-enabled image recognition has emerged as a potent tool for addressing challenges in traditional environmental monitoring. This study focuses on the detection of floating objects in river and lake environments, exploring an innovative approach based on deep learning. By analyzing the technical pathways for detecting static and dynamic features and considering the characteristics of river and lake debris, a comprehensive image acquisition and processing workflow has been developed. The study highlights the application and performance comparison of three mainstream deep learning models (SSD, Faster-RCNN, and YOLOv5) in debris identification. Additionally, a detection system for floating objects has been designed and implemented, encompassing both hardware platform construction and software framework development. Through rigorous experimental validation, the proposed system has demonstrated its ability to significantly enhance the accuracy and efficiency of debris detection, thus offering a new technological avenue for water quality monitoring in rivers and lakes.
SparseAD: Sparse Query-Centric Paradigm for Efficient End-to-End Autonomous Driving
Abstract
End-to-End paradigms use a unified framework to implement multi-tasks in an autonomous driving system. Despite simplicity and clarity, the performance of end-to-end autonomous driving methods on sub-tasks is still far behind the single-task methods. Meanwhile, the widely used dense BEV features in previous end-to-end methods make it costly to extend to more modalities or tasks. In this paper, we propose a Sparse query-centric paradigm for end-to-end Autonomous Driving (SparseAD), where the sparse queries completely represent the whole driving scenario across space, time and tasks without any dense BEV representation. Concretely, we design a unified sparse architecture for perception tasks including detection, tracking, and online mapping. Moreover, we revisit motion prediction and planning, and devise a more justifiable motion planner framework. On the challenging nuScenes dataset, SparseAD achieves SOTA full-task performance among end-to-end methods and significantly narrows the performance gap between end-to-end paradigms and single-task methods. Codes will be released soon.
SARA: Smart AI Reading Assistant for Reading Comprehension
Abstract
SARA integrates Eye Tracking and state-of-the-art large language models in a mixed reality framework to enhance the reading experience by providing personalized assistance in real-time. By tracking eye movements, SARA identifies the text segments that attract the user's attention the most and potentially indicate uncertain areas and comprehension issues. The process involves these key steps: text detection and extraction, gaze tracking and alignment, and assessment of detected reading difficulty. The results are customized solutions presented directly within the user's field of view as virtual overlays on identified difficult text areas. This support enables users to overcome challenges like unfamiliar vocabulary and complex sentences by offering additional context, rephrased solutions, and multilingual help. SARA's innovative approach demonstrates it has the potential to transform the reading experience and improve reading proficiency.
ChildCIdbLong: Longitudinal Child-Computer Interaction Database and Quantitative Analysis for Child Development
Authors: Juan Carlos Ruiz-Garcia, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez, Javier Ortega-Garcia, Jaime Herreros-Rodriguez
Abstract
This article provides a comprehensive overview of recent research in the area of Child-Computer Interaction (CCI). The main contributions of the present article are two-fold. First, we present a novel longitudinal CCI database named ChildCIdbLong, which comprises over 600 children aged 18 months to 8 years old, acquired continuously over 4 academic years (2019-2023). As a result, ChildCIdbLong comprises over 12K test acquisitions over a tablet device. Different tests are considered in ChildCIdbLong, requiring different touch and stylus gestures, enabling evaluation of skills like hand-eye coordination, fine motor skills, planning, and visual tracking, among others. In addition to the ChildCIdbLong database, we propose a novel quantitative metric called Test Quality (Q), designed to measure the motor and cognitive development of children through their interaction with a tablet device. In order to provide a better comprehension of the proposed Q metric, popular percentile-based growth representations are introduced for each test, providing a two-dimensional space to compare children's development with respect to the typical age skills of the population. The results achieved in the present article highlight the potential of the novel ChildCIdbLong database in conjunction with the proposed Q metric to measure the motor and cognitive development of children as they grow up. The proposed framework could be very useful as an automatic tool to support child experts (e.g., paediatricians, educators, or neurologists) for early detection of potential physical/cognitive impairments during children's development.
V-MAD: Video-based Morphing Attack Detection in Operational Scenarios
Abstract
In response to the rising threat of the face morphing attack, this paper introduces and explores the potential of Video-based Morphing Attack Detection (V-MAD) systems in real-world operational scenarios. While current morphing attack detection methods primarily focus on a single or a pair of images, V-MAD is based on video sequences, exploiting the video streams often acquired by face verification tools available, for instance, at airport gates. Through this study, we show for the first time the advantages that the availability of multiple probe frames can bring to the morphing attack detection task, especially in scenarios where the quality of probe images is varied and might be affected, for instance, by pose or illumination variations. Experimental results on a real operational database demonstrate that video sequences represent valuable information for increasing the robustness and performance of morphing attack detection systems.
Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning
Abstract
Few-shot named entity recognition can identify new types of named entities based on a few labeled examples. Previous methods employing token-level or span-level metric learning suffer from a heavy computational burden and a large number of negative sample spans. In this paper, we propose Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning (MsFNER), which splits general NER into two stages: entity-span detection and entity classification. MsFNER involves three processes: training, finetuning, and inference. In the training process, we separately train the best entity-span detection model and entity classification model on the source domain using meta-learning, where we create a contrastive learning module to enhance entity representations for entity classification. During finetuning, we finetune both models on the support dataset of the target domain. In the inference process, for the unlabeled data, we first detect the entity spans, and then the entity types of the spans are jointly determined by the entity classification model and the KNN. We conduct experiments on the open FewNERD dataset, and the results demonstrate the advantages of MsFNER.
Accurate Tennis Court Line Detection on Amateur Recorded Matches
Abstract
Typically, tennis court line detection is done by running Hough-Line-Detection to find straight lines in the image, and then computing a transformation matrix from the detected lines to create the final court structure. We propose numerous improvements and enhancements to this algorithm, including using pretrained State-of-the-Art shadow-removal and object-detection ML models to make our line-detection more robust. Compared to the original algorithm, our method can accurately detect lines on amateur, dirty courts. When combined with a robust ball-tracking system, our method will enable accurate, automatic refereeing for amateur and professional tennis matches alike.
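The Hough voting step that this pipeline starts from can be sketched in a few lines of plain Python. The sketch assumes edge pixels have already been extracted as (x, y) points and uses a 1-degree angular resolution with integer-rounded distances; the function names are hypothetical, and a real system would more likely call an image-processing library than hand-roll the accumulator.

```python
import math
from collections import Counter


def hough_votes(points, theta_step_deg=1):
    """Accumulate votes for lines written as rho = x*cos(theta) + y*sin(theta)."""
    votes = Counter()
    for x, y in points:
        for theta_deg in range(0, 180, theta_step_deg):
            theta = math.radians(theta_deg)
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(theta_deg, rho)] += 1
    return votes


def strongest_line(points):
    """Return (theta in degrees, rho, vote count) of the most-voted line."""
    (theta_deg, rho), count = hough_votes(points).most_common(1)[0]
    return theta_deg, rho, count


# Five collinear edge points on the diagonal y = x (illustrative data).
diagonal = [(0, 0), (10, 10), (20, 20), (30, 30), (40, 40)]
theta_deg, rho, count = strongest_line(diagonal)
```

Each edge point votes for every line that could pass through it, and collinear points pile their votes into the same (theta, rho) bin. Shadows and dirt on amateur courts add spurious edge points and therefore spurious peaks, which is presumably why the authors add shadow removal and object detection around this step.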
A Computational Analysis of the Dehumanisation of Migrants from Syria and Ukraine in Slovene News Media
Abstract
Dehumanisation involves the perception and/or treatment of a social group's members as less than human. This phenomenon is rarely addressed with computational linguistic techniques. We adapt a recently proposed approach for English, making it easier to transfer to other languages and to evaluate, by introducing a new sentiment resource, the use of zero-shot cross-lingual valence and arousal detection, and a new method for statistical significance testing. We then apply it to study attitudes to migration expressed in Slovene newspapers and to examine changes in the Slovene discourse on migration between the 2015-16 migration crisis following the war in Syria and the 2022-23 period following the war in Ukraine. We find that while this discourse became more negative and more intense over time, it is less dehumanising when specifically addressing Ukrainian migrants compared to others.
On the Performance of IRS-Assisted SSK and RPM over Rician Fading Channels
Authors: Harsh Raj, Ugrasen Singh, B. R. Manoj
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Abstract
This paper presents index modulation schemes, namely space-shift keying (SSK) and reflection phase modulation (RPM), for intelligent reflecting surface (IRS)-assisted wireless networks. The IRS simultaneously reflects the incoming information signal from the base station and explicitly encodes the local information bits in the reflection phase shift of IRS elements. The phase shift of the IRS elements is employed according to local data from the RPM constellation. A joint detection using a maximum-likelihood (ML) decoder is performed for the SSK and RPM symbols over a realistic fading scenario modeled as the Rician fading channel. The pairwise error probability over Rician fading channels is derived and utilized to determine the average bit error rate. In addition, the ergodic capacity of the presented system is derived. The derived analytical results are verified and are in exact agreement with Monte-Carlo simulations.
Identification of Fine-grained Systematic Errors via Controlled Scene Generation
Authors: Valentyn Boreiko, Matthias Hein, Jan Hendrik Metzen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Many safety-critical applications, especially in autonomous driving, require reliable object detectors. They can be very effectively assisted by a method to search for and identify potential failures and systematic errors before these detectors are deployed. Systematic errors are characterized by combinations of attributes such as object location, scale, orientation, and color, as well as the composition of their respective backgrounds. To identify them, one must rely on something other than real images from a test set because they do not account for very rare but possible combinations of attributes. To overcome this limitation, we propose a pipeline for generating realistic synthetic scenes with fine-grained control, allowing the creation of complex scenes with multiple objects. Our approach, BEV2EGO, allows for a realistic generation of the complete scene with road-contingent control that maps 2D bird's-eye view (BEV) scene configurations to a first-person view (EGO). In addition, we propose a benchmark for controlled scene generation to select the most appropriate generative outpainting model for BEV2EGO. We further use it to perform a systematic analysis of multiple state-of-the-art object detection models and discover differences between them.
Meta4XNLI: A Crosslingual Parallel Corpus for Metaphor Detection and Interpretation
Abstract
Metaphors, although occasionally unperceived, are ubiquitous in our everyday language. Thus, it is crucial for Language Models to be able to grasp the underlying meaning of this kind of figurative language. In this work, we present Meta4XNLI, a novel parallel dataset for the tasks of metaphor detection and interpretation that contains metaphor annotations in both Spanish and English. We investigate language models' metaphor identification and understanding abilities through a series of monolingual and cross-lingual experiments by leveraging our proposed corpus. In order to comprehend how these non-literal expressions affect models' performance, we look over the results and perform an error analysis. Additionally, parallel data offers many potential opportunities to investigate metaphor transferability between these languages and the impact of translation on the development of multilingual annotated resources.
Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?
Authors: Mingyu Jin, Qinkai Yu, Jingyuan Huang, Qingcheng Zeng, Zhenting Wang, Wenyue Hua, Haiyan Zhao, Kai Mei, Yanda Meng, Kaize Ding, Fan Yang, Mengnan Du, Yongfeng Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract
This paper studies the phenomenon that different concepts are learned in different layers of large language models, i.e. more difficult concepts are fully acquired with deeper layers. We define the difficulty of concepts by the level of abstraction, and here it is crudely categorized by factual, emotional, and inferential. Each category contains a spectrum of tasks, arranged from simple to complex. For example, within the factual dimension, tasks range from lie detection to categorizing mathematical problems. We employ a probing technique to extract representations from different layers of the model and apply these to classification tasks. Our findings reveal that models tend to efficiently classify simpler tasks, indicating that these concepts are learned in shallower layers. Conversely, more complex tasks may only be discernible at deeper layers, if at all. This paper explores the implications of these findings for our understanding of model learning processes and internal representations. Our implementation is available at https://github.com/Luckfort/CD.
Rethinking Out-of-Distribution Detection for Reinforcement Learning: Advancing Methods for Evaluation and Detection
Authors: Linas Nasvytis, Kai Sandbrink, Jakob Foerster, Tim Franzmeyer, Christian Schroeder de Witt
Abstract
While reinforcement learning (RL) algorithms have been successfully applied across numerous sequential decision-making problems, their generalization to unforeseen testing environments remains a significant concern. In this paper, we study the problem of out-of-distribution (OOD) detection in RL, which focuses on identifying situations at test time that RL agents have not encountered in their training environments. We first propose a clarification of terminology for OOD detection in RL, which aligns it with the literature from other machine learning domains. We then present new benchmark scenarios for OOD detection, which introduce anomalies with temporal autocorrelation into different components of the agent-environment loop. We argue that such scenarios have been understudied in the current literature, despite their relevance to real-world situations. Confirming our theoretical predictions, our experimental results suggest that state-of-the-art OOD detectors are not able to identify such anomalies. To address this problem, we propose a novel method for OOD detection, which we call DEXTER (Detection via Extraction of Time Series Representations). By treating environment observations as time series data, DEXTER extracts salient time series features, and then leverages an ensemble of isolation forest algorithms to detect anomalies. We find that DEXTER can reliably identify anomalies across benchmark scenarios, exhibiting superior performance compared to both state-of-the-art OOD detectors and high-dimensional changepoint detectors adopted from statistics.
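To make the idea of temporally autocorrelated anomalies concrete, the sketch below computes a single time-series feature, the lag-1 autocorrelation, over a window of one observation dimension. It covers only feature extraction: whether DEXTER uses this particular feature is an assumption, the isolation forest ensemble is not reproduced, and the AR(1) coefficient of 0.9 and the window length are illustrative choices.

```python
import random

def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation of a 1-D sequence (0.0 if constant)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0
    cov = sum((series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1))
    return cov / var

rng = random.Random(0)

# Nominal case: independent noise on an observation dimension.
white = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Anomalous case: AR(1) noise, where each step depends on the previous one.
ar1 = [0.0]
for _ in range(499):
    ar1.append(0.9 * ar1[-1] + rng.gauss(0.0, 1.0))

white_feature = lag1_autocorrelation(white)
ar1_feature = lag1_autocorrelation(ar1)
print(f"white noise: {white_feature:.2f}, AR(1) noise: {ar1_feature:.2f}")
```

A detector that looks at each observation in isolation cannot see this difference, because both noise processes can have similar per-step values; a feature computed over a window separates them. In a fuller pipeline such features would be passed to an anomaly detector rather than compared by eye.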
Measuring proximity to standard planes during fetal brain ultrasound scanning
Authors: Chiara Di Vece, Antonio Cirigliano, Meala Le Lous, Raffaele Napolitano, Anna L. David, Donald Peebles, Pierre Jannin, Francisco Vasconcelos, Danail Stoyanov
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Abstract
This paper introduces a novel pipeline designed to bring ultrasound (US) plane pose estimation closer to clinical use for more effective navigation to the standard planes (SPs) in the fetal brain. We propose a semi-supervised segmentation model utilizing both labeled SPs and unlabeled 3D US volume slices. Our model enables reliable segmentation across a diverse set of fetal brain images. Furthermore, the model incorporates a classification mechanism to identify the fetal brain precisely. Our model not only filters out frames lacking the brain but also generates masks for those containing it, enhancing the relevance of plane pose regression in clinical settings. We focus on fetal brain navigation from 2D US video analysis and combine this model with a US plane pose regression network to provide sensorless proximity detection to SP and non-SP planes; we emphasize the importance of proximity detection to SPs for guiding sonographers, offering a substantial advantage over traditional methods by allowing earlier and more precise adjustments during scanning. We demonstrate the practical applicability of our approach through validation on real fetal scan videos obtained from sonographers of varying expertise levels. Our findings demonstrate the potential of our approach to complement existing fetal US technologies and advance prenatal diagnostic practices.
Keyword: face recognition
The Impact of Print-and-Scan in Heterogeneous Morph Evaluation Scenarios
Authors: Richard E. Neddo, Zander W. Blasingame, Chen Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Face morphing attacks present an emerging threat to the face recognition system. On top of that, printing and scanning the morphed images could obscure the artifacts generated during the morphing process, which makes morphed image detection even harder. In this work, we investigate the impact that printing and scanning has on morphing attacks through a series of heterogeneous tests. Our experiments show that we can increase the possibility of a false match by up to 5.64% for DiM and 16.00% for StyleGAN2 when providing an image that has been printed and scanned, regardless of whether it is morphed or bona fide, to a Face Recognition (FR) system. Likewise, using the Frechet Inception Distance (FID) metric, strictly print-scanned morph attacks performed on average 9.185% stronger than non-print-scanned digital morphs.
Keyword: augmentation
Evolving Loss Functions for Specific Image Augmentation Techniques
Authors: Brandon Morgan, Dean Hougen
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Abstract
Previous work in Neural Loss Function Search (NLFS) has shown a lack of correlation between smaller surrogate functions and large convolutional neural networks with massive regularization. We expand upon this research by revealing another disparity: a lack of correlation between different types of image augmentation techniques. We show that different loss functions can perform well on certain image augmentation techniques, while performing poorly on others. We exploit this disparity by performing an evolutionary search on five types of image augmentation techniques in the hopes of finding image augmentation specific loss functions. The best loss functions from each evolution were then taken and transferred to WideResNet-28-10 on CIFAR-10 and CIFAR-100 across each of the five image augmentation techniques. The best from that were then taken and evaluated by fine-tuning EfficientNetV2Small on the CARS, Oxford-Flowers, and Caltech datasets across each of the five image augmentation techniques. Multiple loss functions were found that outperformed cross-entropy across multiple experiments. In the end, we found a single loss function, which we called the inverse Bessel logarithm loss, that was able to outperform cross-entropy across the majority of experiments.
An Animation-based Augmentation Approach for Action Recognition from Discontinuous Video
Abstract
The study of action recognition has attracted considerable attention recently due to its broad applications in multiple areas. However, the issue of discontinuous training video, which not only decreases the performance of action recognition models but also complicates the data augmentation process, remains under-explored. In this study, we introduce the 4A (Action Animation-based Augmentation Approach), an innovative pipeline for data augmentation to address the problem. The main contributions of our work include: (1) we investigate the severe performance decrease of action recognition models trained on discontinuous video, and the limitations of existing augmentation methods in solving this problem; (2) we propose a novel augmentation pipeline, 4A, to address the problem of discontinuous video for training, while achieving a smoother and more natural-looking action representation than the latest data augmentation methodology; (3) we achieve the same performance with only 10% of the original data for training as with all of the original data from the real-world dataset, and a better performance on in-the-wild videos, by employing our data augmentation techniques.
Not All Contexts Are Equal: Teaching LLMs Credibility-aware Generation
Authors: Ruotong Pan, Boxi Cao, Hongyu Lin, Xianpei Han, Jia Zheng, Sirui Wang, Xunliang Cai, Le Sun
Abstract
The rapid development of large language models has led to the widespread adoption of Retrieval-Augmented Generation (RAG), which integrates external knowledge to alleviate knowledge bottlenecks and mitigate hallucinations. However, the existing RAG paradigm inevitably suffers from the impact of flawed information introduced during the retrieval phase, thereby diminishing the reliability and correctness of the generated outcomes. In this paper, we propose Credibility-aware Generation (CAG), a universally applicable framework designed to mitigate the impact of flawed information in RAG. At its core, CAG aims to equip models with the ability to discern and process information based on its credibility. To this end, we propose an innovative data transformation framework that generates data based on credibility, thereby effectively endowing models with the capability of CAG. Furthermore, to accurately evaluate the models' capabilities of CAG, we construct a comprehensive benchmark covering three critical real-world scenarios. Experimental results demonstrate that our model can effectively understand and utilize credibility for generation, significantly outperform other models with retrieval augmentation, and exhibit resilience against the disruption caused by noisy documents, thereby maintaining robust performance. Moreover, our model supports customized credibility, offering a wide range of potential applications.
ORacle: Large Vision-Language Models for Knowledge-Guided Holistic OR Domain Modeling
Authors: Ege Özsoy, Chantal Pellegrini, Matthias Keicher, Nassir Navab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Every day, countless surgeries are performed worldwide, each within the distinct settings of operating rooms (ORs) that vary not only in their setups but also in the personnel, tools, and equipment used. This inherent diversity poses a substantial challenge for achieving a holistic understanding of the OR, as it requires models to generalize beyond their initial training datasets. To reduce this gap, we introduce ORacle, an advanced vision-language model designed for holistic OR domain modeling, which incorporates multi-view and temporal capabilities and can leverage external knowledge during inference, enabling it to adapt to previously unseen surgical scenarios. This capability is further enhanced by our novel data augmentation framework, which significantly diversifies the training dataset, ensuring ORacle's proficiency in applying the provided knowledge effectively. In rigorous testing, in scene graph generation, and downstream tasks on the 4D-OR dataset, ORacle not only demonstrates state-of-the-art performance but does so requiring less data than existing models. Furthermore, its adaptability is displayed through its ability to interpret unseen views, actions, and appearances of tools and equipment. This demonstrates ORacle's potential to significantly enhance the scalability and affordability of OR domain modeling and opens a pathway for future advancements in surgical data science. We will release our code and data upon acceptance.
LaTiM: Longitudinal representation learning in continuous-time models to predict disease progression
Authors: Rachid Zeghlache, Pierre-Henri Conze, Mostafa El Habib Daho, Yihao Li, Hugo Le Boité, Ramin Tadayoni, Pascal Massin, Béatrice Cochener, Alireza Rezaei, Ikram Brahim, Gwenolé Quellec, Mathieu Lamard
Abstract
This work proposes a novel framework for analyzing disease progression using time-aware neural ordinary differential equations (NODE). We introduce a "time-aware head" in a framework trained through self-supervised learning (SSL) to leverage temporal information in latent space for data augmentation. This approach effectively integrates NODEs with SSL, offering significant performance improvements compared to traditional methods that lack explicit temporal integration. We demonstrate the effectiveness of our strategy for diabetic retinopathy progression prediction using the OPHDIAT database. Compared to the baseline, all NODE architectures achieve statistically significant improvements in area under the ROC curve (AUC) and Kappa metrics, highlighting the efficacy of pre-training with SSL-inspired approaches. Additionally, our framework promotes stable training for NODEs, a commonly encountered challenge in time-aware modeling.
Lost in Translation: Modern Neural Networks Still Struggle With Small Realistic Image Transformations
Authors: Ofir Shifman, Yair Weiss
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Deep neural networks that achieve remarkable performance in image classification have previously been shown to be easily fooled by tiny transformations such as a one pixel translation of the input image. In order to address this problem, two approaches have been proposed in recent years. The first approach suggests using huge datasets together with data augmentation in the hope that a highly varied training set will teach the network to learn to be invariant. The second approach suggests using architectural modifications based on sampling theory to deal explicitly with image translations. In this paper, we show that these approaches still fall short in robustly handling 'natural' image translations that simulate a subtle change in camera orientation. Our findings reveal that a mere one-pixel translation can result in a significant change in the predicted image representation for approximately 40% of the test images in state-of-the-art models (e.g. open-CLIP trained on LAION-2B or DINO-v2), while models that are explicitly constructed to be robust to cyclic translations can still be fooled with 1 pixel realistic (non-cyclic) translations 11% of the time. We present Robust Inference by Crop Selection: a simple method that can be proven to achieve any desired level of consistency, although with a modest tradeoff with the model's accuracy. Importantly, we demonstrate how employing this method reduces the ability to fool state-of-the-art models with a 1 pixel translation to less than 5% while suffering from only a 1% drop in classification accuracy. Additionally, we show that our method can be easily adjusted to deal with circular shifts as well. In such a case we achieve 100% robustness to integer shifts with state-of-the-art accuracy, and with no need for any further training.
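The distinction between cyclic and realistic (non-cyclic) translations can be shown on a one-dimensional row of pixels. The sketch below is only an illustration of the two shift types under that simplification; it does not implement Robust Inference by Crop Selection, and the helper names and the toy scene are assumptions.

```python
def cyclic_shift(row, k):
    """Shift right by k with wrap-around: pixels leaving one edge re-enter at the other."""
    k %= len(row)
    return row[-k:] + row[:-k] if k else row[:]

def crop(scene, start, width):
    """Take a fixed-width view of a wider scene, as a camera would."""
    return scene[start:start + width]

scene = list(range(10))        # the wider world: pixel values 0..9
view = crop(scene, 2, 6)       # original image: [2, 3, 4, 5, 6, 7]

cyclic = cyclic_shift(view, 1)  # wrap-around shift of the image itself
realistic = crop(scene, 1, 6)   # camera moved by one pixel: new content enters

print("original: ", view)
print("cyclic:   ", cyclic)
print("realistic:", realistic)
```

Both shifted rows agree everywhere except at the left edge: the cyclic shift recycles pixel 7 from the right edge, while the realistic translation reveals pixel 1, which was not in the original image. Architectures that are exactly invariant to the first kind of shift therefore carry no guarantee for the second.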
Keyword: detection
The Impact of Print-and-Scan in Heterogeneous Morph Evaluation Scenarios
MambaAD: Exploring State Space Models for Multi-class Unsupervised Anomaly Detection
Detecting Refactoring Commits in Machine Learning Python Projects: A Machine Learning-Based Approach
FlameFinder: Illuminating Obscured Fire through Smoke with Attentive Deep Metric Learning
Multi-modal Document Presentation Attack Detection With Forensics Trace Disentanglement
What's Mine becomes Yours: Defining, Annotating and Detecting Context-Dependent Paraphrases in News Interview Dialogs
Unsupervised Visible-Infrared ReID via Pseudo-label Correction and Modality-level Alignment
Scaling Multi-Camera 3D Object Detection through Weak-to-Strong Eliciting
Sparse Points to Dense Clouds: Enhancing 3D Detection with Limited LiDAR Data
Transferable and Efficient Non-Factual Content Detection via Probe Training with Offline Consistency Checking
MedRG: Medical Report Grounding with Multi-modal Large Language Model
Sound Matters: Auditory Detectability of Mobile Robots
SplatPose & Detect: Pose-Agnostic 3D Anomaly Detection
Beyond Random Inputs: A Novel ML-Based Hardware Fuzzing
Monocular 3D lane detection for Autonomous Driving: Recent Achievements, Challenges, and Outlooks
Research on Detection of Floating Objects in River and Lake Based on AI Intelligent Image Recognition
SparseAD: Sparse Query-Centric Paradigm for Efficient End-to-End Autonomous Driving
SARA: Smart AI Reading Assistant for Reading Comprehension
ChildCIdbLong: Longitudinal Child-Computer Interaction Database and Quantitative Analysis for Child Development
V-MAD: Video-based Morphing Attack Detection in Operational Scenarios
Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning
Accurate Tennis Court Line Detection on Amateur Recorded Matches
A Computational Analysis of the Dehumanisation of Migrants from Syria and Ukraine in Slovene News Media
On the Performance of IRS-Assisted SSK and RPM over Rician Fading Channels
Identification of Fine-grained Systematic Errors via Controlled Scene Generation
Meta4XNLI: A Crosslingual Parallel Corpus for Metaphor Detection and Interpretation
Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?
Rethinking Out-of-Distribution Detection for Reinforcement Learning: Advancing Methods for Evaluation and Detection
Measuring proximity to standard planes during fetal brain ultrasound scanning
Keyword: face recognition
The Impact of Print-and-Scan in Heterogeneous Morph Evaluation Scenarios
Keyword: augmentation
Evolving Loss Functions for Specific Image Augmentation Techniques
An Animation-based Augmentation Approach for Action Recognition from Discontinuous Video
Not All Contexts Are Equal: Teaching LLMs Credibility-aware Generation
ORacle: Large Vision-Language Models for Knowledge-Guided Holistic OR Domain Modeling
LaTiM: Longitudinal representation learning in continuous-time models to predict disease progression
Lost in Translation: Modern Neural Networks Still Struggle With Small Realistic Image Transformations