New submissions for Fri, 8 Mar 24

Keyword: detection

Video Relationship Detection Using Mixture of Experts

Authors: Authors: Ala Shaabana, Zahra Gharaee, Paul Fieguth
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.03994
Pdf link: https://arxiv.org/pdf/2403.03994
Abstract Machine comprehension of visual information from images and videos by neural networks faces two primary challenges. Firstly, there exists a computational and inference gap in connecting vision and language, making it difficult to accurately determine which object a given agent acts on and represent it through language. Secondly, classifiers trained by a single, monolithic neural network often lack stability and generalization. To overcome these challenges, we introduce MoE-VRD, a novel approach to visual relationship detection utilizing a mixture of experts. MoE-VRD identifies language triplets in the form of < subject, predicate, object> tuples to extract relationships from visual processing. Leveraging recent advancements in visual relationship detection, MoE-VRD addresses the requirement for action recognition in establishing relationships between subjects (acting) and objects (being acted upon). In contrast to single monolithic networks, MoE-VRD employs multiple small models as experts, whose outputs are aggregated. Each expert in MoE-VRD specializes in visual relationship learning and object tagging. By utilizing a sparsely-gated mixture of experts, MoE-VRD enables conditional computation and significantly enhances neural network capacity without increasing computational complexity. Our experimental results demonstrate that the conditional computation capabilities and scalability of the mixture-of-experts approach lead to superior performance in visual relationship detection compared to state-of-the-art methods.
OpenVPN is Open to VPN Fingerprinting
Authors: Authors: Diwen Xue, Reethika Ramesh, Arham Jain, Michalis Kallitsis, J. Alex Halderman, Jedidiah R. Crandall, Roya Ensafi
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2403.03998
Pdf link: https://arxiv.org/pdf/2403.03998
Abstract VPN adoption has seen steady growth over the past decade due to increased public awareness of privacy and surveillance threats. In response, certain governments are attempting to restrict VPN access by identifying connections using "dual use" DPI technology. To investigate the potential for VPN blocking, we develop mechanisms for accurately fingerprinting connections using OpenVPN, the most popular protocol for commercial VPN services. We identify three fingerprints based on protocol features such as byte pattern, packet size, and server response. Playing the role of an attacker who controls the network, we design a two-phase framework that performs passive fingerprinting and active probing in sequence. We evaluate our framework in partnership with a million-user ISP and find that we identify over 85% of OpenVPN flows with only negligible false positives, suggesting that OpenVPN-based services can be effectively blocked with little collateral damage. Although some commercial VPNs implement countermeasures to avoid detection, our framework successfully identified connections to 34 out of 41 "obfuscated" VPN configurations. We discuss the implications of the VPN fingerprintability for different threat models and propose short-term defenses. In the longer term, we urge commercial VPN providers to be more transparent about their obfuscation approaches and to adopt more principled detection countermeasures, such as those developed in censorship circumvention research.
Human I/O: Towards a Unified Approach to Detecting Situational Impairments
Authors: Authors: Xingyu Bruce Liu, Jiahao Nick Li, David Kim, Xiang 'Anthony' Chen, Ruofei Du
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2403.04008
Pdf link: https://arxiv.org/pdf/2403.04008
Abstract Situationally Induced Impairments and Disabilities (SIIDs) can significantly hinder user experience in contexts such as poor lighting, noise, and multi-tasking. While prior research has introduced algorithms and systems to address these impairments, they predominantly cater to specific tasks or environments and fail to accommodate the diverse and dynamic nature of SIIDs. We introduce Human I/O, a unified approach to detecting a wide range of SIIDs by gauging the availability of human input/output channels. Leveraging egocentric vision, multimodal sensing and reasoning with large language models, Human I/O achieves a 0.22 mean absolute error and a 82% accuracy in availability prediction across 60 in-the-wild egocentric video recordings in 32 different scenarios. Furthermore, while the core focus of our work is on the detection of SIIDs rather than the creation of adaptive user interfaces, we showcase the efficacy of our prototype via a user study with 10 participants. Findings suggest that Human I/O significantly reduces effort and improves user experience in the presence of SIIDs, paving the way for more adaptive and accessible interactive systems in the future.
Three Revisits to Node-Level Graph Anomaly Detection: Outliers, Message Passing and Hyperbolic Neural Networks
Authors: Authors: Jing Gu, Dongmian Zou
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.04010
Pdf link: https://arxiv.org/pdf/2403.04010
Abstract Graph anomaly detection plays a vital role for identifying abnormal instances in complex networks. Despite advancements of methodology based on deep learning in recent years, existing benchmarking approaches exhibit limitations that hinder a comprehensive comparison. In this paper, we revisit datasets and approaches for unsupervised node-level graph anomaly detection tasks from three aspects. Firstly, we introduce outlier injection methods that create more diverse and graph-based anomalies in graph datasets. Secondly, we compare methods employing message passing against those without, uncovering the unexpected decline in performance associated with message passing. Thirdly, we explore the use of hyperbolic neural networks, specifying crucial architecture and loss design that contribute to enhanced performance. Through rigorous experiments and evaluations, our study sheds light on general strategies for improving node-level graph anomaly detection methods.
ZTRAN: Prototyping Zero Trust Security xApps for Open Radio Access Network Deployments
Authors: Authors: Aly S. Abdalla, Joshua Moore, Nisha Adhikari, Vuk Marojevic
Subjects: Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2403.04113
Pdf link: https://arxiv.org/pdf/2403.04113
Abstract The open radio access network (O-RAN) offers new degrees of freedom for building and operating advanced cellular networks. Emphasizing on RAN disaggregation, open interfaces, multi-vendor support, and RAN intelligent controllers (RICs), O-RAN facilitates adaptation to new applications and technology trends. Yet, this architecture introduces new security challenges. This paper proposes leveraging zero trust principles for O-RAN security. We introduce zero trust RAN (ZTRAN), which embeds service authentication, intrusion detection, and secure slicing subsystems that are encapsulated as xApps. We implement ZTRAN on the open artificial intelligence cellular (OAIC) research platform and demonstrate its feasibility and effectiveness in terms of legitimate user throughput and latency figures. Our experimental analysis illustrates how ZTRAN's intrusion detection and secure slicing microservices operate effectively and in concert as part of O-RAN Alliance's containerized near-real time RIC. Research directions include exploring machine learning and additional threat intelligence feeds for improving the performance and extending the scope of ZTRAN.
Scalable and Robust Transformer Decoders for Interpretable Image Classification with Foundation Models
Authors: Authors: Evelyn Mannix, Howard Bondell
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.04125
Pdf link: https://arxiv.org/pdf/2403.04125
Abstract Interpretable computer vision models can produce transparent predictions, where the features of an image are compared with prototypes from a training dataset and the similarity between them forms a basis for classification. Nevertheless these methods are computationally expensive to train, introduce additional complexity and may require domain knowledge to adapt hyper-parameters to a new dataset. Inspired by developments in object detection, segmentation and large-scale self-supervised foundation vision models, we introduce Component Features (ComFe), a novel explainable-by-design image classification approach using a transformer-decoder head and hierarchical mixture-modelling. With only global image labels and no segmentation or part annotations, ComFe can identify consistent image components, such as the head, body, wings and tail of a bird, and the image background, and determine which of these features are informative in making a prediction. We demonstrate that ComFe obtains higher accuracy compared to previous interpretable models across a range of fine-grained vision benchmarks, without the need to individually tune hyper-parameters for each dataset. We also show that ComFe outperforms a non-interpretable linear head across a range of datasets, including ImageNet, and improves performance on generalisation and robustness benchmarks.
An Explainable AI Framework for Artificial Intelligence of Medical Things
Authors: Authors: Al Amin, Kamrul Hasan, Saleh Zein-Sabatto, Deo Chimba, Imtiaz Ahmed, Tariqul Islam
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.04130
Pdf link: https://arxiv.org/pdf/2403.04130
Abstract The healthcare industry has been revolutionized by the convergence of Artificial Intelligence of Medical Things (AIoMT), allowing advanced data-driven solutions to improve healthcare systems. With the increasing complexity of Artificial Intelligence (AI) models, the need for Explainable Artificial Intelligence (XAI) techniques become paramount, particularly in the medical domain, where transparent and interpretable decision-making becomes crucial. Therefore, in this work, we leverage a custom XAI framework, incorporating techniques such as Local Interpretable Model-Agnostic Explanations (LIME), SHapley Additive exPlanations (SHAP), and Gradient-weighted Class Activation Mapping (Grad-Cam), explicitly designed for the domain of AIoMT. The proposed framework enhances the effectiveness of strategic healthcare methods and aims to instill trust and promote understanding in AI-driven medical applications. Moreover, we utilize a majority voting technique that aggregates predictions from multiple convolutional neural networks (CNNs) and leverages their collective intelligence to make robust and accurate decisions in the healthcare system. Building upon this decision-making process, we apply the XAI framework to brain tumor detection as a use case demonstrating accurate and transparent diagnosis. Evaluation results underscore the exceptional performance of the XAI framework, achieving high precision, recall, and F1 scores with a training accuracy of 99% and a validation accuracy of 98%. Combining advanced XAI techniques with ensemble-based deep-learning (DL) methodologies allows for precise and reliable brain tumor diagnoses as an application of AIoMT.
FL-GUARD: A Holistic Framework for Run-Time Detection and Recovery of Negative Federated Learning
Authors: Authors: Hong Lin, Lidan Shou, Ke Chen, Gang Chen, Sai Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2403.04146
Pdf link: https://arxiv.org/pdf/2403.04146
Abstract Federated learning (FL) is a promising approach for learning a model from data distributed on massive clients without exposing data privacy. It works effectively in the ideal federation where clients share homogeneous data distribution and learning behavior. However, FL may fail to function appropriately when the federation is not ideal, amid an unhealthy state called Negative Federated Learning (NFL), in which most clients gain no benefit from participating in FL. Many studies have tried to address NFL. However, their solutions either (1) predetermine to prevent NFL in the entire learning life-cycle or (2) tackle NFL in the aftermath of numerous learning rounds. Thus, they either (1) indiscriminately incur extra costs even if FL can perform well without such costs or (2) waste numerous learning rounds. Additionally, none of the previous work takes into account the clients who may be unwilling/unable to follow the proposed NFL solutions when using those solutions to upgrade an FL system in use. This paper introduces FL-GUARD, a holistic framework that can be employed on any FL system for tackling NFL in a run-time paradigm. That is, to dynamically detect NFL at the early stage (tens of rounds) of learning and then to activate recovery measures when necessary. Specifically, we devise a cost-effective NFL detection mechanism, which relies on an estimation of performance gain on clients. Only when NFL is detected, we activate the NFL recovery process, in which each client learns in parallel an adapted model when training the global model. Extensive experiment results confirm the effectiveness of FL-GUARD in detecting NFL and recovering from NFL to a healthy learning state. We also show that FL-GUARD is compatible with previous NFL solutions and robust against clients unwilling/unable to take any recovery measures.
Dual-path Frequency Discriminators for Few-shot Anomaly Detection
Authors: Authors: Yuhu Bai, Jiangning Zhang, Yuhang Dong, Guanzhong Tian, Yunkang Cao, Yabiao Wang, Chengjie Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.04151
Pdf link: https://arxiv.org/pdf/2403.04151
Abstract Few-shot anomaly detection (FSAD) is essential in industrial manufacturing. However, existing FSAD methods struggle to effectively leverage a limited number of normal samples, and they may fail to detect and locate inconspicuous anomalies in the spatial domain. We further discover that these subtle anomalies would be more noticeable in the frequency domain. In this paper, we propose a Dual-Path Frequency Discriminators (DFD) network from a frequency perspective to tackle these issues. Specifically, we generate anomalies at both image-level and feature-level. Differential frequency components are extracted by the multi-frequency information construction module and supplied into the fine-grained feature construction module to provide adapted features. We consider anomaly detection as a discriminative classification problem, wherefore the dual-path feature discrimination module is employed to detect and locate the image-level and feature-level anomalies in the feature space. The discriminators aim to learn a joint representation of anomalous features and normal features in the latent space. Extensive experiments conducted on MVTec AD and VisA benchmarks demonstrate that our DFD surpasses current state-of-the-art methods. Source code will be available.
Attempt Towards Stress Transfer in Speech-to-Speech Machine Translation
Authors: Authors: Sai Akarsh, Vamshi Raghusimha, Anindita Mondal, Anil Vuppala
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2403.04178
Pdf link: https://arxiv.org/pdf/2403.04178
Abstract The language diversity in India's education sector poses a significant challenge, hindering inclusivity. Despite the democratization of knowledge through online educational content, the dominance of English, as the internet's lingua franca, limits accessibility, emphasizing the crucial need for translation into Indian languages. Despite existing Speech-to-Speech Machine Translation (SSMT) technologies, the lack of intonation in these systems gives monotonous translations, leading to a loss of audience interest and disengagement from the content. To address this, our paper introduces a dataset with stress annotations in Indian English and also a Text-to-Speech (TTS) architecture capable of incorporating stress into synthesized speech. This dataset is used for training a stress detection model, which is then used in the SSMT system for detecting stress in the source speech and transferring it into the target language speech. The TTS architecture is based on FastPitch and can modify the variances based on stressed words given. We present an Indian English-to-Hindi SSMT system that can transfer stress and aim to enhance the overall quality and engagement of educational content.
VAEMax: Open-Set Intrusion Detection based on OpenMax and Variational Autoencoder
Authors: Authors: Zhiyin Qiu, Ding Zhou, Yahui Zhai, Bo Liu, Lei He, Jiuxin Cao
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2403.04193
Pdf link: https://arxiv.org/pdf/2403.04193
Abstract Promptly discovering unknown network attacks is critical for reducing the risk of major loss imposed on system or equipment. This paper aims to develop an open-set intrusion detection model to classify known attacks as well as inferring unknown ones. To achieve this, we employ OpenMax and variational autoencoder to propose a dual detection model, VAEMax. First, we extract flow payload feature based on one-dimensional convolutional neural network. Then, the OpenMax is used to classify flows, during which some unknown attacks can be detected, while the rest are misclassified into a certain class of known flows. Finally, use VAE to perform secondary detection on each class of flows, and determine whether the flow is an unknown attack based on the reconstruction loss. Experiments performed on dataset CIC-IDS2017 and CSE-CIC-IDS2018 show our approach is better than baseline models and can be effectively applied to realistic network environments.
CN-RMA: Combined Network with Ray Marching Aggregation for 3D Indoors Object Detection from Multi-view Images
Authors: Authors: Guanlin Shen, Jingwei Huang, Zhihua Hu, Bin Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.04198
Pdf link: https://arxiv.org/pdf/2403.04198
Abstract This paper introduces CN-RMA, a novel approach for 3D indoor object detection from multi-view images. We observe the key challenge as the ambiguity of image and 3D correspondence without explicit geometry to provide occlusion information. To address this issue, CN-RMA leverages the synergy of 3D reconstruction networks and 3D object detection networks, where the reconstruction network provides a rough Truncated Signed Distance Function (TSDF) and guides image features to vote to 3D space correctly in an end-to-end manner. Specifically, we associate weights to sampled points of each ray through ray marching, representing the contribution of a pixel in an image to corresponding 3D locations. Such weights are determined by the predicted signed distances so that image features vote only to regions near the reconstructed surface. Our method achieves state-of-the-art performance in 3D object detection from multi-view images, as measured by mAP@0.25 and mAP@0.5 on the ScanNet and ARKitScenes datasets. The code and models are released at https://github.com/SerCharles/CN-RMA.
ACC-ViT : Atrous Convolution's Comeback in Vision Transformers
Authors: Authors: Nabil Ibtehaz, Ning Yan, Masood Mortazavi, Daisuke Kihara
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.04200
Pdf link: https://arxiv.org/pdf/2403.04200
Abstract Transformers have elevated to the state-of-the-art vision architectures through innovations in attention mechanism inspired from visual perception. At present two classes of attentions prevail in vision transformers, regional and sparse attention. The former bounds the pixel interactions within a region; the latter spreads them across sparse grids. The opposing natures of them have resulted in a dilemma between either preserving hierarchical relation or attaining a global context. In this work, taking inspiration from atrous convolution, we introduce Atrous Attention, a fusion of regional and sparse attention, which can adaptively consolidate both local and global information, while maintaining hierarchical relations. As a further tribute to atrous convolution, we redesign the ubiquitous inverted residual convolution blocks with atrous convolution. Finally, we propose a generalized, hybrid vision transformer backbone, named ACC-ViT, following conventional practices for standard vision tasks. Our tiny version model achieves $\sim 84 \%$ accuracy on ImageNet-1K, with less than $28.5$ million parameters, which is $0.42\%$ improvement over state-of-the-art MaxViT while having $8.4\%$ less parameters. In addition, we have investigated the efficacy of ACC-ViT backbone under different evaluation settings, such as finetuning, linear probing, and zero-shot learning on tasks involving medical image analysis, object detection, and language-image contrastive learning. ACC-ViT is therefore a strong vision backbone, which is also competitive in mobile-scale versions, ideal for niche applications with small datasets.
Bi-Static Sensing in OFDM Wireless Systems for Indoor Scenarios
Authors: Authors: Vijaya Yajnanarayana, Philipp Geuer
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2403.04201
Pdf link: https://arxiv.org/pdf/2403.04201
Abstract The sixth generation (6G) systems will likely employ orthogonal frequency division multiplexing (OFDM) waveform for performing the joint task of sensing and communication. In this paper, we design an OFDM system for integrated sensing and communication (ISAC) and propose a novel approach for passive target detection in an indoor deployment using a data driven AI approach. The delay-Doppler profile (DDP) and power delay profile (PDP) is used to train the proposed AI-based detector. We analyze the detection performance of the proposed methods under line of sight (LOS) and non-line of sight (NLOS) conditions for various training strategies. We show that the proposed method provides 10 dB performance improvement over the baseline for 80% target detection under LOS conditions and the performance drops by 10-20 dB for NLOS depending on the usecase scenarios.
MKF-ADS: A Multi-Knowledge Fused Anomaly Detection System for Automotive
Authors: Authors: Pengzhou Cheng, Zongru Wu, Gongshen Liu
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2403.04293
Pdf link: https://arxiv.org/pdf/2403.04293
Abstract With the requirements of Intelligent Transport Systems (ITSs) for extensive connectivity of Electronic Control Units (ECUs) to the outside world, safety and security have become stringent problems. Intrusion detection systems (IDSs) are a crucial safety component in remediating Controller Area Network (CAN) bus vulnerabilities. However, supervised-based IDSs fail to identify complexity attacks and anomaly-based IDSs have higher false alarms owing to capability bottleneck. In this paper, we propose a novel multi-knowledge fused anomaly detection model, called MKF-IDS. Specifically, the method designs an integration framework, including spatial-temporal correlation with an attention mechanism (STcAM) module and patch sparse-transformer module (PatchST). The STcAM with fine-pruning uses one-dimensional convolution (Conv1D) to extract spatial features and subsequently utilizes the Bidirectional Long Short Term Memory (Bi-LSTM) to extract the temporal features, where the attention mechanism will focus on the important time steps. Meanwhile, the PatchST captures the combined long-time historical features from independent univariate time series. Finally, the proposed method is based on knowledge distillation to STcAM as a student model for learning intrinsic knowledge and cross the ability to mimic PatchST. In the detection phase, the MKF-ADS only deploys STcAM to maintain efficiency in a resource-limited IVN environment. Moreover, the redundant noisy signal is reduced with bit flip rate and boundary decision estimation. We conduct extensive experiments on six simulation attack scenarios across various CAN IDs and time steps, and two real attack scenarios, which present a competitive prediction and detection performance. Compared with the baseline in the same paradigm, the error rate and FAR are 2.62% and 2.41% and achieve a promising F1-score of 97.3%.
Effectiveness Assessment of Recent Large Vision-Language Models
Authors: Authors: Yao Jiang, Xinyu Yan, Ge-Peng Ji, Keren Fu, Meijun Sun, Huan Xiong, Deng-Ping Fan, Fahad Shahbaz Khan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.04306
Pdf link: https://arxiv.org/pdf/2403.04306
Abstract The advent of large vision-language models (LVLMs) represents a noteworthy advancement towards the pursuit of artificial general intelligence. However, the extent of their efficacy across both specialized and general tasks warrants further investigation. This article endeavors to evaluate the competency of popular LVLMs in specialized and general tasks, respectively, aiming to offer a comprehensive comprehension of these innovative methodologies. To gauge their efficacy in specialized tasks, we tailor a comprehensive testbed comprising three distinct scenarios: natural, healthcare, and industrial, encompassing six challenging tasks. These tasks include salient, camouflaged, and transparent object detection, as well as polyp and skin lesion detection, alongside industrial anomaly detection. We examine the performance of three recent open-source LVLMs -- MiniGPT-v2, LLaVA-1.5, and Shikra -- in the realm of visual recognition and localization. Moreover, we conduct empirical investigations utilizing the aforementioned models alongside GPT-4V, assessing their multi-modal understanding capacities in general tasks such as object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning. Our investigations reveal that these models demonstrate limited proficiency not only in specialized tasks but also in general tasks. We delve deeper into this inadequacy and suggest several potential factors, including limited cognition in specialized tasks, object hallucination, text-to-image interference, and decreased robustness in complex problems. We hope this study would provide valuable insights for the future development of LVLMs, augmenting their power in coping with both general and specialized applications.
AO-DETR: Anti-Overlapping DETR for X-Ray Prohibited Items Detection
Authors: Authors: Mingyuan Li, Tong Jia, Hao Wang, Bowen Ma, Shuyang Lin, Da Cai, Dongyue Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2403.04309
Pdf link: https://arxiv.org/pdf/2403.04309
Abstract Prohibited item detection in X-ray images is one of the most essential and highly effective methods widely employed in various security inspection scenarios. Considering the significant overlapping phenomenon in X-ray prohibited item images, we propose an Anti-Overlapping DETR (AO-DETR) based on one of the state-of-the-art general object detectors, DINO. Specifically, to address the feature coupling issue caused by overlapping phenomena, we introduce the Category-Specific One-to-One Assignment (CSA) strategy to constrain category-specific object queries in predicting prohibited items of fixed categories, which can enhance their ability to extract features specific to prohibited items of a particular category from the overlapping foreground-background features. To address the edge blurring problem caused by overlapping phenomena, we propose the Look Forward Densely (LFD) scheme, which improves the localization accuracy of reference boxes in mid-to-high-level decoder layers and enhances the ability to locate blurry edges of the final layer. Similar to DINO, our AO-DETR provides two different versions with distinct backbones, tailored to meet diverse application requirements. Extensive experiments on the PIXray and OPIXray datasets demonstrate that the proposed method surpasses the state-of-the-art object detectors, indicating its potential applications in the field of prohibited item detection. The source code will be released at https://github.com/Limingyuan001/AO-DETR-test.
Impacts of Color and Texture Distortions on Earth Observation Data in Deep Learning
Authors: Authors: Martin Willbo, Aleksis Pirinen, John Martinsson, Edvin Listo Zec, Olof Mogren, Mikael Nilsson
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.04385
Pdf link: https://arxiv.org/pdf/2403.04385
Abstract Land cover classification and change detection are two important applications of remote sensing and Earth observation (EO) that have benefited greatly from the advances of deep learning. Convolutional and transformer-based U-net models are the state-of-the-art architectures for these tasks, and their performances have been boosted by an increased availability of large-scale annotated EO datasets. However, the influence of different visual characteristics of the input EO data on a model's predictions is not well understood. In this work we systematically examine model sensitivities with respect to several color- and texture-based distortions on the input EO data during inference, given models that have been trained without such distortions. We conduct experiments with multiple state-of-the-art segmentation networks for land cover classification and show that they are in general more sensitive to texture than to color distortions. Beyond revealing intriguing characteristics of widely used land cover classification models, our results can also be used to guide the development of more robust models within the EO domain.
Exploring the Influence of Dimensionality Reduction on Anomaly Detection Performance in Multivariate Time Series
Authors: Authors: Mahsun Altin, Altan Cakir
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.04429
Pdf link: https://arxiv.org/pdf/2403.04429
Abstract This paper presents an extensive empirical study on the integration of dimensionality reduction techniques with advanced unsupervised time series anomaly detection models, focusing on the MUTANT and Anomaly-Transformer models. The study involves a comprehensive evaluation across three different datasets: MSL, SMAP, and SWaT. Each dataset poses unique challenges, allowing for a robust assessment of the models' capabilities in varied contexts. The dimensionality reduction techniques examined include PCA, UMAP, Random Projection, and t-SNE, each offering distinct advantages in simplifying high-dimensional data. Our findings reveal that dimensionality reduction not only aids in reducing computational complexity but also significantly enhances anomaly detection performance in certain scenarios. Moreover, a remarkable reduction in training times was observed, with reductions by approximately 300\% and 650\% when dimensionality was halved and minimized to the lowest dimensions, respectively. This efficiency gain underscores the dual benefit of dimensionality reduction in both performance enhancement and operational efficiency. The MUTANT model exhibits notable adaptability, especially with UMAP reduction, while the Anomaly-Transformer demonstrates versatility across various reduction techniques. These insights provide a deeper understanding of the synergistic effects of dimensionality reduction and anomaly detection, contributing valuable perspectives to the field of time series analysis. The study underscores the importance of selecting appropriate dimensionality reduction strategies based on specific model requirements and dataset characteristics, paving the way for more efficient, accurate, and scalable solutions in anomaly detection.
FriendNet: Detection-Friendly Dehazing Network
Authors: Authors: Yihua Fan, Yongzhen Wang, Mingqiang Wei, Fu Lee Wang, Haoran Xie
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.04443
Pdf link: https://arxiv.org/pdf/2403.04443
Abstract Adverse weather conditions often impair the quality of captured images, inevitably inducing cutting-edge object detection models for advanced driver assistance systems (ADAS) and autonomous driving. In this paper, we raise an intriguing question: can the combination of image restoration and object detection enhance detection performance in adverse weather conditions? To answer it, we propose an effective architecture that bridges image dehazing and object detection together via guidance information and task-driven learning to achieve detection-friendly dehazing, termed FriendNet. FriendNet aims to deliver both high-quality perception and high detection capacity. Different from existing efforts that intuitively treat image dehazing as pre-processing, FriendNet establishes a positive correlation between these two tasks. Clean features generated by the dehazing network potentially contribute to improvements in object detection performance. Conversely, object detection crucially guides the learning process of the image dehazing network under the task-driven learning scheme. We shed light on how downstream tasks can guide upstream dehazing processes, considering both network architecture and learning objectives. We design Guidance Fusion Block (GFB) and Guidance Attention Block (GAB) to facilitate the integration of detection information into the network. Furthermore, the incorporation of the detection task loss aids in refining the optimization process. Additionally, we introduce a new Physics-aware Feature Enhancement Block (PFEB), which integrates physics-based priors to enhance the feature extraction and representation capabilities. Extensive experiments on synthetic and real-world datasets demonstrate the superiority of our method over state-of-the-art methods on both image quality and detection precision. Our source code is available at https://github.com/fanyihua0309/FriendNet.
A Survey of Graph Neural Networks in Real world: Imbalance, Noise, Privacy and OOD Challenges
Authors: Authors: Wei Ju, Siyu Yi, Yifan Wang, Zhiping Xiao, Zhengyang Mao, Hourun Li, Yiyang Gu, Yifang Qin, Nan Yin, Senzhang Wang, Xinwang Liu, Xiao Luo, Philip S. Yu, Ming Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2403.04468
Pdf link: https://arxiv.org/pdf/2403.04468
Abstract Graph-structured data exhibits universality and widespread applicability across diverse domains, such as social network analysis, biochemistry, financial fraud detection, and network security. Significant strides have been made in leveraging Graph Neural Networks (GNNs) to achieve remarkable success in these areas. However, in real-world scenarios, the training environment for models is often far from ideal, leading to substantial performance degradation of GNN models due to various unfavorable factors, including imbalance in data distribution, the presence of noise in erroneous data, privacy protection of sensitive information, and generalization capability for out-of-distribution (OOD) scenarios. To tackle these issues, substantial efforts have been devoted to improving the performance of GNN models in practical real-world scenarios, as well as enhancing their reliability and robustness. In this paper, we present a comprehensive survey that systematically reviews existing GNN models, focusing on solutions to the four mentioned real-world challenges including imbalance, noise, privacy, and OOD in practical scenarios that many existing reviews have not considered. Specifically, we first highlight the four key challenges faced by existing GNNs, paving the way for our exploration of real-world GNN models. Subsequently, we provide detailed discussions on these four aspects, dissecting how these solutions contribute to enhancing the reliability and robustness of GNN models. Last but not least, we outline promising directions and offer future perspectives in the field.
Children Age Group Detection based on Human-Computer Interaction and Time Series Analysis
Authors: Authors: Juan Carlos Ruiz-Garcia, Carlos Hojas, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez, Javier Ortega-Garcia, Jaime Herreros-Rodriguez
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2403.04574
Pdf link: https://arxiv.org/pdf/2403.04574
Abstract This article proposes a novel Children-Computer Interaction (CCI) approach for the task of age group detection. This approach focuses on the automatic analysis of the time series generated from the interaction of the children with mobile devices. In particular, we extract a set of 25 time series related to spatial, pressure, and kinematic information of the children interaction while colouring a tree through a pen stylus tablet, a specific test from the large-scale public ChildCIdb database. A complete analysis of the proposed approach is carried out using different time series selection techniques to choose the most discriminative ones for the age group detection task: i) a statistical analysis, and ii) an automatic algorithm called Sequential Forward Search (SFS). In addition, different classification algorithms such as Dynamic Time Warping Barycenter Averaging (DBA) and Hidden Markov Models (HMM) are studied. Accuracy results over 85% are achieved, outperforming previous approaches in the literature and in more challenging age group conditions. Finally, the approach presented in this study can benefit many children-related applications, for example, towards an age-appropriate environment with the technology.
Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification
Authors: Authors: Ekaterina Fadeeva, Aleksandr Rubashevskii, Artem Shelmanov, Sergey Petrakov, Haonan Li, Hamdy Mubarak, Evgenii Tsymbalov, Gleb Kuzmin, Alexander Panchenko, Timothy Baldwin, Preslav Nakov, Maxim Panov
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2403.04696
Pdf link: https://arxiv.org/pdf/2403.04696
Abstract Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output. Such hallucinations can be dangerous, as occasional factual inaccuracies in the generated text might be obscured by the rest of the output being generally factual, making it extremely hard for the users to spot them. Current services that leverage LLMs usually do not provide any means for detecting unreliable generations. Here, we aim to bridge this gap. In particular, we propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification. Uncertainty scores leverage information encapsulated in the output of a neural network or its layers to detect unreliable predictions, and we show that they can be used to fact-check the atomic claims in the LLM output. Moreover, we present a novel token-level uncertainty quantification method that removes the impact of uncertainty about what claim to generate on the current step and what surface form to use. Our method Claim Conditioned Probability (CCP) measures only the uncertainty of particular claim value expressed by the model. Experiments on the task of biography generation demonstrate strong improvements for CCP compared to the baselines for six different LLMs and three languages. Human evaluation reveals that the fact-checking pipeline based on uncertainty quantification is competitive with a fact-checking tool that leverages external knowledge.
AUFormer: Vision Transformers are Parameter-Efficient Facial Action Unit Detectors
Authors: Authors: Kaishen Yuan, Zitong Yu, Xin Liu, Weicheng Xie, Huanjing Yue, Jingyu Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2403.04697
Pdf link: https://arxiv.org/pdf/2403.04697
Abstract Facial Action Units (AU) is a vital concept in the realm of affective computing, and AU detection has always been a hot research topic. Existing methods suffer from overfitting issues due to the utilization of a large number of learnable parameters on scarce AU-annotated datasets or heavy reliance on substantial additional relevant data. Parameter-Efficient Transfer Learning (PETL) provides a promising paradigm to address these challenges, whereas its existing methods lack design for AU characteristics. Therefore, we innovatively investigate PETL paradigm to AU detection, introducing AUFormer and proposing a novel Mixture-of-Knowledge Expert (MoKE) collaboration mechanism. An individual MoKE specific to a certain AU with minimal learnable parameters first integrates personalized multi-scale and correlation knowledge. Then the MoKE collaborates with other MoKEs in the expert group to obtain aggregated information and inject it into the frozen Vision Transformer (ViT) to achieve parameter-efficient AU detection. Additionally, we design a Margin-truncated Difficulty-aware Weighted Asymmetric Loss (MDWA-Loss), which can encourage the model to focus more on activated AUs, differentiate the difficulty of unactivated AUs, and discard potential mislabeled samples. Extensive experiments from various perspectives, including within-domain, cross-domain, data efficiency, and micro-expression domain, demonstrate AUFormer's state-of-the-art performance and robust generalization abilities without relying on additional relevant data. The code for AUFormer is available at https://github.com/yuankaishen2001/AUFormer.
mmPlace: Robust Place Recognition with Intermediate Frequency Signal of Low-cost Single-chip Millimeter Wave Radar
Authors: Authors: Chengzhen Meng, Yifan Duan, Chenming He, Dequan Wang, Xiaoran Fan, Yanyong Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2403.04703
Pdf link: https://arxiv.org/pdf/2403.04703
Abstract Place recognition is crucial for tasks like loop-closure detection and re-localization. Single-chip millimeter wave radar (single-chip radar in short) emerges as a low-cost sensor option for place recognition, with the advantage of insensitivity to degraded visual environments. However, it encounters two challenges. Firstly, sparse point cloud from single-chip radar leads to poor performance when using current place recognition methods, which assume much denser data. Secondly, its performance significantly declines in scenarios involving rotational and lateral variations, due to limited overlap in its field of view (FOV). We propose mmPlace, a robust place recognition system to address these challenges. Specifically, mmPlace transforms intermediate frequency (IF) signal into range azimuth heatmap and employs a spatial encoder to extract features. Additionally, to improve the performance in scenarios involving rotational and lateral variations, mmPlace employs a rotating platform and concatenates heatmaps in a rotation cycle, effectively expanding the system's FOV. We evaluate mmPlace's performance on the milliSonic dataset, which is collected on the University of Science and Technology of China (USTC) campus, the city roads surrounding the campus, and an underground parking garage. The results demonstrate that mmPlace outperforms point cloud-based methods and achieves 87.37% recall@1 in scenarios involving rotational and lateral variations.
Keyword: face recognition

Explainable Face Verification via Feature-Guided Gradient Backpropagation
Authors: Authors: Yuhang Lu, Zewei Xu, Touradj Ebrahimi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/2403.04549
Pdf link: https://arxiv.org/pdf/2403.04549
Abstract Recent years have witnessed significant advancement in face recognition (FR) techniques, with their applications widely spread in people's lives and security-sensitive areas. There is a growing need for reliable interpretations of decisions of such systems. Existing studies relying on various mechanisms have investigated the usage of saliency maps as an explanation approach, but suffer from different limitations. This paper first explores the spatial relationship between face image and its deep representation via gradient backpropagation. Then a new explanation approach FGGB has been conceived, which provides precise and insightful similarity and dissimilarity saliency maps to explain the "Accept" and "Reject" decision of an FR system. Extensive visual presentation and quantitative measurement have shown that FGGB achieves superior performance in both similarity and dissimilarity maps when compared to current state-of-the-art explainable face verification approaches.
Keyword: augmentation

A data-centric approach to class-specific bias in image data augmentation
Authors: Authors: Athanasios Angelakis, Andrey Rass
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.04120
Pdf link: https://arxiv.org/pdf/2403.04120
Abstract Data augmentation (DA) enhances model generalization in computer vision but may introduce biases, impacting class accuracy unevenly. Our study extends this inquiry, examining DA's class-specific bias across various datasets, including those distinct from ImageNet, through random cropping. We evaluated this phenomenon with ResNet50, EfficientNetV2S, and SWIN ViT, discovering that while residual models showed similar bias effects, Vision Transformers exhibited greater robustness or altered dynamics. This suggests a nuanced approach to model selection, emphasizing bias mitigation. We also refined a "data augmentation robustness scouting" method to manage DA-induced biases more efficiently, reducing computational demands significantly (training 112 models instead of 1860; a reduction of factor 16.2) while still capturing essential bias trends.
UltraWiki: Ultra-fine-grained Entity Set Expansion with Negative Seed Entities
Authors: Authors: Yangning Li, Qingsong Lv, Tianyu Yu, Yinghui Li, Shulin Huang, Tingwei Lu, Xuming Hu, Wenhao JIang, Hai-Tao Zheng, Hui Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2403.04247
Pdf link: https://arxiv.org/pdf/2403.04247
Abstract Entity Set Expansion (ESE) aims to identify new entities belonging to the same semantic class as a given set of seed entities. Traditional methods primarily relied on positive seed entities to represent a target semantic class, which poses challenge for the representation of ultra-fine-grained semantic classes. Ultra-fine-grained semantic classes are defined based on fine-grained semantic classes with more specific attribute constraints. Describing it with positive seed entities alone cause two issues: (i) Ambiguity among ultra-fine-grained semantic classes. (ii) Inability to define "unwanted" semantic. Due to these inherent shortcomings, previous methods struggle to address the ultra-fine-grained ESE (Ultra-ESE). To solve this issue, we first introduce negative seed entities in the inputs, which belong to the same fine-grained semantic class as the positive seed entities but differ in certain attributes. Negative seed entities eliminate the semantic ambiguity by contrast between positive and negative attributes. Meanwhile, it provide a straightforward way to express "unwanted". To assess model performance in Ultra-ESE, we constructed UltraWiki, the first large-scale dataset tailored for Ultra-ESE. UltraWiki encompasses 236 ultra-fine-grained semantic classes, where each query of them is represented with 3-5 positive and negative seed entities. A retrieval-based framework RetExpan and a generation-based framework GenExpan are proposed to comprehensively assess the efficacy of large language models from two different paradigms in Ultra-ESE. Moreover, we devised three strategies to enhance models' comprehension of ultra-fine-grained entities semantics: contrastive learning, retrieval augmentation, and chain-of-thought reasoning. Extensive experiments confirm the effectiveness of our proposed strategies and also reveal that there remains a large space for improvement in Ultra-ESE.
Depth-aware Test-Time Training for Zero-shot Video Object Segmentation
Authors: Authors: Weihuang Liu, Xi Shen, Haolun Li, Xiuli Bi, Bo Liu, Chi-Man Pun, Xiaodong Cun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.04258
Pdf link: https://arxiv.org/pdf/2403.04258
Abstract Zero-shot Video Object Segmentation (ZSVOS) aims at segmenting the primary moving object without any human annotations. Mainstream solutions mainly focus on learning a single model on large-scale video datasets, which struggle to generalize to unseen videos. In this work, we introduce a test-time training (TTT) strategy to address the problem. Our key insight is to enforce the model to predict consistent depth during the TTT process. In detail, we first train a single network to perform both segmentation and depth prediction tasks. This can be effectively learned with our specifically designed depth modulation layer. Then, for the TTT process, the model is updated by predicting consistent depth maps for the same frame under different data augmentations. In addition, we explore different TTT weight updating strategies. Our empirical results suggest that the momentum-based weight initialization and looping-based training scheme lead to more stable improvements. Experiments show that the proposed method achieves clear improvements on ZSVOS. Our proposed video TTT strategy provides significant superiority over state-of-the-art TTT methods. Our code is available at: https://nifangbaage.github.io/DATTT.
SSDRec: Self-Augmented Sequence Denoising for Sequential Recommendation
Authors: Authors: Chi Zhang, Qilong Han, Rui Chen, Xiangyu Zhao, Peng Tang, Hongtao Song
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2403.04278
Pdf link: https://arxiv.org/pdf/2403.04278
Abstract Traditional sequential recommendation methods assume that users' sequence data is clean enough to learn accurate sequence representations to reflect user preferences. In practice, users' sequences inevitably contain noise (e.g., accidental interactions), leading to incorrect reflections of user preferences. Consequently, some pioneer studies have explored modeling sequentiality and correlations in sequences to implicitly or explicitly reduce noise's influence. However, relying on only available intra-sequence information (i.e., sequentiality and correlations in a sequence) is insufficient and may result in over-denoising and under-denoising problems (OUPs), especially for short sequences. To improve reliability, we propose to augment sequences by inserting items before denoising. However, due to the data sparsity issue and computational costs, it is challenging to select proper items from the entire item universe to insert into proper positions in a target sequence. Motivated by the above observation, we propose a novel framework--Self-augmented Sequence Denoising for sequential Recommendation (SSDRec) with a three-stage learning paradigm to solve the above challenges. In the first stage, we empower SSDRec by a global relation encoder to learn multi-faceted inter-sequence relations in a data-driven manner. These relations serve as prior knowledge to guide subsequent stages. In the second stage, we devise a self-augmentation module to augment sequences to alleviate OUPs. Finally, we employ a hierarchical denoising module in the third stage to reduce the risk of false augmentations and pinpoint all noise in raw sequences. Extensive experiments on five real-world datasets demonstrate the superiority of \model over state-of-the-art denoising methods and its flexible applications to mainstream sequential recommendation models. The source code is available at https://github.com/zc-97/SSDRec.
Can Your Model Tell a Negation from an Implicature? Unravelling Challenges With Intent Encoders
Authors: Authors: Yuwei Zhang, Siffi Singh, Sailik Sengupta, Igor Shalyminov, Hang Su, Hwanjun Song, Saab Mansour
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2403.04314
Pdf link: https://arxiv.org/pdf/2403.04314
Abstract Conversational systems often rely on embedding models for intent classification and intent clustering tasks. The advent of Large Language Models (LLMs), which enable instructional embeddings allowing one to adjust semantics over the embedding space using prompts, are being viewed as a panacea for these downstream conversational tasks. However, traditional evaluation benchmarks rely solely on task metrics that don't particularly measure gaps related to semantic understanding. Thus, we propose an intent semantic toolkit that gives a more holistic view of intent embedding models by considering three tasks -- (1) intent classification, (2) intent clustering, and (3) a novel triplet task. The triplet task gauges the model's understanding of two semantic concepts paramount in real-world conversational systems -- negation and implicature. We observe that current embedding models fare poorly in semantic understanding of these concepts. To address this, we propose a pre-training approach to improve the embedding model by leveraging augmentation with data generated by an auto-regressive model and a contrastive loss term. Our approach improves the semantic understanding of the intent embedding model on the aforementioned linguistic dimensions while slightly effecting their performance on downstream task metrics.
Online Adaptation of Language Models with a Memory of Amortized Contexts
Authors: Authors: Jihoon Tack, Jaehyung Kim, Eric Mitchell, Jinwoo Shin, Yee Whye Teh, Jonathan Richard Schwarz
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2403.04317
Pdf link: https://arxiv.org/pdf/2403.04317
Abstract Due to the rapid generation and dissemination of information, large language models (LLMs) quickly run out of date despite enormous development costs. Due to this crucial need to keep models updated, online learning has emerged as a critical necessity when utilizing LLMs for real-world applications. However, given the ever-expanding corpus of unseen documents and the large parameter space of modern LLMs, efficient adaptation is essential. To address these challenges, we propose Memory of Amortized Contexts (MAC), an efficient and effective online adaptation framework for LLMs with strong knowledge retention. We propose an amortized feature extraction and memory-augmentation approach to compress and extract information from new documents into compact modulations stored in a memory bank. When answering questions, our model attends to and extracts relevant knowledge from this memory bank. To learn informative modulations in an efficient manner, we utilize amortization-based meta-learning, which substitutes the optimization process with a single forward pass of the encoder. Subsequently, we learn to choose from and aggregate selected documents into a single modulation by conditioning on the question, allowing us to adapt a frozen language model during test time without requiring further gradient updates. Our experiment demonstrates the superiority of MAC in multiple aspects, including online adaptation performance, time, and memory efficiency. Code is available at: https://github.com/jihoontack/MAC.
Symmetry Considerations for Learning Task Symmetric Robot Policies
Authors: Authors: Mayank Mittal, Nikita Rudin, Victor Klemm, Arthur Allshire, Marco Hutter
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2403.04359
Pdf link: https://arxiv.org/pdf/2403.04359
Abstract Symmetry is a fundamental aspect of many real-world robotic tasks. However, current deep reinforcement learning (DRL) approaches can seldom harness and exploit symmetry effectively. Often, the learned behaviors fail to achieve the desired transformation invariances and suffer from motion artifacts. For instance, a quadruped may exhibit different gaits when commanded to move forward or backward, even though it is symmetrical about its torso. This issue becomes further pronounced in high-dimensional or complex environments, where DRL methods are prone to local optima and fail to explore regions of the state space equally. Past methods on encouraging symmetry for robotic tasks have studied this topic mainly in a single-task setting, where symmetry usually refers to symmetry in the motion, such as the gait patterns. In this paper, we revisit this topic for goal-conditioned tasks in robotics, where symmetry lies mainly in task execution and not necessarily in the learned motions themselves. In particular, we investigate two approaches to incorporate symmetry invariance into DRL -- data augmentation and mirror loss function. We provide a theoretical foundation for using augmented samples in an on-policy setting. Based on this, we show that the corresponding approach achieves faster convergence and improves the learned behaviors in various challenging robotic tasks, from climbing boxes with a quadruped to dexterous manipulation.
Low-Resource Court Judgment Summarization for Common Law Systems
Authors: Authors: Shuaiqi Liu, Jiannong Cao, Yicong Li, Ruosong Yang, Zhiyuan Wen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2403.04454
Pdf link: https://arxiv.org/pdf/2403.04454
Abstract Common law courts need to refer to similar precedents' judgments to inform their current decisions. Generating high-quality summaries of court judgment documents can facilitate legal practitioners to efficiently review previous cases and assist the general public in accessing how the courts operate and how the law is applied. Previous court judgment summarization research focuses on civil law or a particular jurisdiction's judgments. However, judges can refer to the judgments from all common law jurisdictions. Current summarization datasets are insufficient to satisfy the demands of summarizing precedents across multiple jurisdictions, especially when labeled data are scarce for many jurisdictions. To address the lack of datasets, we present CLSum, the first dataset for summarizing multi-jurisdictional common law court judgment documents. Besides, this is the first court judgment summarization work adopting large language models (LLMs) in data augmentation, summary generation, and evaluation. Specifically, we design an LLM-based data augmentation method incorporating legal knowledge. We also propose a legal knowledge enhanced evaluation metric based on LLM to assess the quality of generated judgment summaries. Our experimental results verify that the LLM-based summarization methods can perform well in the few-shot and zero-shot settings. Our LLM-based data augmentation method can mitigate the impact of low data resources. Furthermore, we carry out comprehensive comparative experiments to find essential model components and settings that are capable of enhancing summarization performance.
Exploring the Design Space of Optical See-through AR Head-Mounted Displays to Support First Responders in the Field
Authors: Authors: Kexin Zhang, Brianna Cochran, Ruijia Chen, Lance Hartung, Bryce Sprecher, Ross Tredinnick, Kevin Ponto, Suman Banerjee, Yuhang Zhao
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2403.04660
Pdf link: https://arxiv.org/pdf/2403.04660
Abstract First responders (FRs) navigate hazardous, unfamiliar environments in the field (e.g., mass-casualty incidents), making life-changing decisions in a split second. AR head-mounted displays (HMDs) have shown promise in supporting them due to its capability of recognizing and augmenting the challenging environments in a hands-free manner. However, the design space have not been thoroughly explored by involving various FRs who serve different roles (e.g., firefighters, law enforcement) but collaborate closely in the field. We interviewed 26 first responders in the field who experienced a state-of-the-art optical-see-through AR HMD, as well as its interaction techniques and four types of AR cues (i.e., overview cues, directional cues, highlighting cues, and labeling cues), soliciting their first-hand experiences, design ideas, and concerns. Our study revealed both generic and role-specific preferences and needs for AR hardware, interactions, and feedback, as well as identifying desired AR designs tailored to urgent, risky scenarios (e.g., affordance augmentation to facilitate fast and safe action). While acknowledging the value of AR HMDs, concerns were also raised around trust, privacy, and proper integration with other equipment. Finally, we derived comprehensive and actionable design guidelines to inform future AR systems for in-field FRs.
Delving into the Trajectory Long-tail Distribution for Muti-object Tracking
Authors: Authors: Sijia Chen, En Yu, Jinyang Li, Wenbing Tao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2403.04700
Pdf link: https://arxiv.org/pdf/2403.04700
Abstract Multiple Object Tracking (MOT) is a critical area within computer vision, with a broad spectrum of practical implementations. Current research has primarily focused on the development of tracking algorithms and enhancement of post-processing techniques. Yet, there has been a lack of thorough examination concerning the nature of tracking data it self. In this study, we pioneer an exploration into the distribution patterns of tracking data and identify a pronounced long-tail distribution issue within existing MOT datasets. We note a significant imbalance in the distribution of trajectory lengths across different pedestrians, a phenomenon we refer to as "pedestrians trajectory long-tail distribution". Addressing this challenge, we introduce a bespoke strategy designed to mitigate the effects of this skewed distribution. Specifically, we propose two data augmentation strategies, including Stationary Camera View Data Augmentation (SVA) and Dynamic Camera View Data Augmentation (DVA) , designed for viewpoint states and the Group Softmax (GS) module for Re-ID. SVA is to backtrack and predict the pedestrian trajectory of tail classes, and DVA is to use diffusion model to change the background of the scene. GS divides the pedestrians into unrelated groups and performs softmax operation on each group individually. Our proposed strategies can be integrated into numerous existing tracking systems, and extensive experimentation validates the efficacy of our method in reducing the influence of long-tail distribution on multi-object tracking performance. The code is available at https://github.com/chen-si-jia/Trajectory-Long-tail-Distribution-for-MOT.

LeeKyungwook / get-arxiv-noti

New submissions for Fri, 8 Mar 24 #1010

Keyword: detection

Video Relationship Detection Using Mixture of Experts

OpenVPN is Open to VPN Fingerprinting

Human I/O: Towards a Unified Approach to Detecting Situational Impairments

Three Revisits to Node-Level Graph Anomaly Detection: Outliers, Message Passing and Hyperbolic Neural Networks

ZTRAN: Prototyping Zero Trust Security xApps for Open Radio Access Network Deployments

Scalable and Robust Transformer Decoders for Interpretable Image Classification with Foundation Models

An Explainable AI Framework for Artificial Intelligence of Medical Things

FL-GUARD: A Holistic Framework for Run-Time Detection and Recovery of Negative Federated Learning

Dual-path Frequency Discriminators for Few-shot Anomaly Detection

Attempt Towards Stress Transfer in Speech-to-Speech Machine Translation

VAEMax: Open-Set Intrusion Detection based on OpenMax and Variational Autoencoder

CN-RMA: Combined Network with Ray Marching Aggregation for 3D Indoors Object Detection from Multi-view Images

ACC-ViT : Atrous Convolution's Comeback in Vision Transformers

Bi-Static Sensing in OFDM Wireless Systems for Indoor Scenarios

MKF-ADS: A Multi-Knowledge Fused Anomaly Detection System for Automotive

Effectiveness Assessment of Recent Large Vision-Language Models

AO-DETR: Anti-Overlapping DETR for X-Ray Prohibited Items Detection

Impacts of Color and Texture Distortions on Earth Observation Data in Deep Learning

Exploring the Influence of Dimensionality Reduction on Anomaly Detection Performance in Multivariate Time Series

FriendNet: Detection-Friendly Dehazing Network

A Survey of Graph Neural Networks in Real world: Imbalance, Noise, Privacy and OOD Challenges

Children Age Group Detection based on Human-Computer Interaction and Time Series Analysis

Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification

AUFormer: Vision Transformers are Parameter-Efficient Facial Action Unit Detectors

mmPlace: Robust Place Recognition with Intermediate Frequency Signal of Low-cost Single-chip Millimeter Wave Radar

Keyword: face recognition

Explainable Face Verification via Feature-Guided Gradient Backpropagation

Keyword: augmentation

A data-centric approach to class-specific bias in image data augmentation

UltraWiki: Ultra-fine-grained Entity Set Expansion with Negative Seed Entities

Depth-aware Test-Time Training for Zero-shot Video Object Segmentation

SSDRec: Self-Augmented Sequence Denoising for Sequential Recommendation

Can Your Model Tell a Negation from an Implicature? Unravelling Challenges With Intent Encoders

Online Adaptation of Language Models with a Memory of Amortized Contexts

Symmetry Considerations for Learning Task Symmetric Robot Policies

Low-Resource Court Judgment Summarization for Common Law Systems

Exploring the Design Space of Optical See-through AR Head-Mounted Displays to Support First Responders in the Field

Delving into the Trajectory Long-tail Distribution for Muti-object Tracking