New submissions for Thu, 1 Feb 24

Keyword: detection

Detecting mental disorder on social media: a ChatGPT-augmented explainable approach

Authors: Authors: Loris Belcastro, Riccardo Cantini, Fabrizio Marozzo, Domenico Talia, Paolo Trunfio
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/2401.17477
Pdf link: https://arxiv.org/pdf/2401.17477
Abstract In the digital era, the prevalence of depressive symptoms expressed on social media has raised serious concerns, necessitating advanced methodologies for timely detection. This paper addresses the challenge of interpretable depression detection by proposing a novel methodology that effectively combines Large Language Models (LLMs) with eXplainable Artificial Intelligence (XAI) and conversational agents like ChatGPT. In our methodology, explanations are achieved by integrating BERTweet, a Twitter-specific variant of BERT, into a novel self-explanatory model, namely BERT-XDD, capable of providing both classification and explanations via masked attention. The interpretability is further enhanced using ChatGPT to transform technical explanations into human-readable commentaries. By introducing an effective and modular approach for interpretable depression detection, our methodology can contribute to the development of socially responsible digital platforms, fostering early intervention and support for mental health challenges under the guidance of qualified healthcare professionals.
AdvGPS: Adversarial GPS for Multi-Agent Perception Attack
Authors: Authors: Jinlong Li, Baolu Li, Xinyu Liu, Jianwu Fang, Felix Juefei-Xu, Qing Guo, Hongkai Yu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.17499
Pdf link: https://arxiv.org/pdf/2401.17499
Abstract The multi-agent perception system collects visual data from sensors located on various agents and leverages their relative poses determined by GPS signals to effectively fuse information, mitigating the limitations of single-agent sensing, such as occlusion. However, the precision of GPS signals can be influenced by a range of factors, including wireless transmission and obstructions like buildings. Given the pivotal role of GPS signals in perception fusion and the potential for various interference, it becomes imperative to investigate whether specific GPS signals can easily mislead the multi-agent perception system. To address this concern, we frame the task as an adversarial attack challenge and introduce \textsc{AdvGPS}, a method capable of generating adversarial GPS signals which are also stealthy for individual agents within the system, significantly reducing object detection accuracy. To enhance the success rates of these attacks in a black-box scenario, we introduce three types of statistically sensitive natural discrepancies: appearance-based discrepancy, distribution-based discrepancy, and task-aware discrepancy. Our extensive experiments on the OPV2V dataset demonstrate that these attacks substantially undermine the performance of state-of-the-art methods, showcasing remarkable transferability across different point cloud based 3D detection systems. This alarming revelation underscores the pressing need to address security implications within multi-agent perception systems, thereby underscoring a critical area of research.
Effective Multi-Stage Training Model For Edge Computing Devices In Intrusion Detection
Authors: Authors: Thua Huynh Trong, Thanh Nguyen Hoang
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.17546
Pdf link: https://arxiv.org/pdf/2401.17546
Abstract Intrusion detection poses a significant challenge within expansive and persistently interconnected environments. As malicious code continues to advance and sophisticated attack methodologies proliferate, various advanced deep learning-based detection approaches have been proposed. Nevertheless, the complexity and accuracy of intrusion detection models still need further enhancement to render them more adaptable to diverse system categories, particularly within resource-constrained devices, such as those embedded in edge computing systems. This research introduces a three-stage training paradigm, augmented by an enhanced pruning methodology and model compression techniques. The objective is to elevate the system's effectiveness, concurrently maintaining a high level of accuracy for intrusion detection. Empirical assessments conducted on the UNSW-NB15 dataset evince that this solution notably reduces the model's dimensions, while upholding accuracy levels equivalent to similar proposals.
Local Feature Matching Using Deep Learning: A Survey
Authors: Authors: Shibiao Xu, Shunpeng Chen, Rongtao Xu, Changwei Wang, Peng Lu, Li Guo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.17592
Pdf link: https://arxiv.org/pdf/2401.17592
Abstract Local feature matching enjoys wide-ranging applications in the realm of computer vision, encompassing domains such as image retrieval, 3D reconstruction, and object recognition. However, challenges persist in improving the accuracy and robustness of matching due to factors like viewpoint and lighting variations. In recent years, the introduction of deep learning models has sparked widespread exploration into local feature matching techniques. The objective of this endeavor is to furnish a comprehensive overview of local feature matching methods. These methods are categorized into two key segments based on the presence of detectors. The Detector-based category encompasses models inclusive of Detect-then-Describe, Joint Detection and Description, Describe-then-Detect, as well as Graph Based techniques. In contrast, the Detector-free category comprises CNN Based, Transformer Based, and Patch Based methods. Our study extends beyond methodological analysis, incorporating evaluations of prevalent datasets and metrics to facilitate a quantitative comparison of state-of-the-art techniques. The paper also explores the practical application of local feature matching in diverse domains such as Structure from Motion, Remote Sensing Image Registration, and Medical Image Registration, underscoring its versatility and significance across various fields. Ultimately, we endeavor to outline the current challenges faced in this domain and furnish future research directions, thereby serving as a reference for researchers involved in local feature matching and its interconnected domains.
SPECTRUM: Speaker-Enhanced Pre-Training for Long Dialogue Summarization
Authors: Authors: Sangwoo Cho, Kaiqiang Song, Chao Zhao, Xiaoyang Wang, Dong Yu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.17597
Pdf link: https://arxiv.org/pdf/2401.17597
Abstract Multi-turn dialogues are characterized by their extended length and the presence of turn-taking conversations. Traditional language models often overlook the distinct features of these dialogues by treating them as regular text. In this paper, we propose a speaker-enhanced pre-training method for long dialogue summarization, which leverages the inherent structure of multiple-turn dialogues. To support our study, we curate a diverse dataset that includes transcripts from real-world scenarios, movie or TV show transcripts, and dialogues generated by a Large Language Model. We then perform a pre-training, which encompasses the detection of speaker changes, and masked utterance generation. Experimental results of fine-tuned models demonstrate that our model achieves state-of-the-art performance on downstream benchmarks with long context, surpassing baseline models and highlighting the effectiveness of our approach. Our findings highlight the importance of curating pre-training datasets that exhibit diversity and variations in length distribution to ensure effective alignment with downstream datasets.
Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data
Authors: Authors: Chenhui Zhang, Sherrie Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.17600
Pdf link: https://arxiv.org/pdf/2401.17600
Abstract Large Vision-Language Models (VLMs) have demonstrated impressive performance on complex tasks involving visual input with natural language instructions. However, it remains unclear to what extent capabilities on natural images transfer to Earth observation (EO) data, which are predominantly satellite and aerial images less common in VLM training data. In this work, we propose a comprehensive benchmark to gauge the progress of VLMs toward being useful tools for EO data by assessing their abilities on scene understanding, localization and counting, and change detection tasks. Motivated by real-world applications, our benchmark includes scenarios like urban monitoring, disaster relief, land use, and conservation. We discover that, although state-of-the-art VLMs like GPT-4V possess extensive world knowledge that leads to strong performance on open-ended tasks like location understanding and image captioning, their poor spatial reasoning limits usefulness on object localization and counting tasks. Our benchmark will be made publicly available at https://vleo.danielz.ch/ and on Hugging Face at https://huggingface.co/collections/mit-ei/vleo-benchmark-datasets-65b789b0466555489cce0d70 for easy model evaluation.
Assertion Detection Large Language Model In-context Learning LoRA Fine-tuning
Authors: Authors: Yuelyu Ji, Zeshui Yu, Yanshan Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2401.17602
Pdf link: https://arxiv.org/pdf/2401.17602
Abstract In this study, we aim to address the task of assertion detection when extracting medical concepts from clinical notes, a key process in clinical natural language processing (NLP). Assertion detection in clinical NLP usually involves identifying assertion types for medical concepts in the clinical text, namely certainty (whether the medical concept is positive, negated, possible, or hypothetical), temporality (whether the medical concept is for present or the past history), and experiencer (whether the medical concept is described for the patient or a family member). These assertion types are essential for healthcare professionals to quickly and clearly understand the context of medical conditions from unstructured clinical texts, directly influencing the quality and outcomes of patient care. Although widely used, traditional methods, particularly rule-based NLP systems and machine learning or deep learning models, demand intensive manual efforts to create patterns and tend to overlook less common assertion types, leading to an incomplete understanding of the context. To address this challenge, our research introduces a novel methodology that utilizes Large Language Models (LLMs) pre-trained on a vast array of medical data for assertion detection. We enhanced the current method with advanced reasoning techniques, including Tree of Thought (ToT), Chain of Thought (CoT), and Self-Consistency (SC), and refine it further with Low-Rank Adaptation (LoRA) fine-tuning. We first evaluated the model on the i2b2 2010 assertion dataset. Our method achieved a micro-averaged F-1 of 0.89, with 0.11 improvements over the previous works. To further assess the generalizability of our approach, we extended our evaluation to a local dataset that focused on sleep concept extraction. Our approach achieved an F-1 of 0.74, which is 0.31 higher than the previous method.
Social Robot Navigation with Adaptive Proxemics Based on Emotions
Authors: Authors: Baris Bilen, Hasan Kivrak Pinar Uluer, Hatice Kose
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2401.17663
Pdf link: https://arxiv.org/pdf/2401.17663
Abstract The primary aim of this paper is to investigate the integration of emotions into the social navigation framework to analyse its effect on both navigation and human physiological safety and comfort. The proposed framework uses leg detection to find the whereabouts of people and computes adaptive proxemic zones based on their emotional state. We designed several case studies in a simulated environment and examined 3 different emotions; positive (happy), neutral and negative (angry). A survey study was conducted with 70 participants to explore their impressions about the navigation of the robot and compare the human safety and comfort measurements results. Both survey and simulation results showed that integrating emotions into proxemic zones has a significant effect on the physical safety of a human. The results revealed that when a person is angry, the robot is expected to navigate further than the standard distance to support his/her physiological comfort and safety. The results also showed that reducing the navigation distance is not preferred when a person is happy.
Unified Physical-Digital Face Attack Detection
Authors: Authors: Hao Fang, Ajian Liu, Haocheng Yuan, Junze Zheng, Dingheng Zeng, Yanhong Liu, Jiankang Deng, Sergio Escalera, Xiaoming Liu, Jun Wan, Zhen Lei
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.17699
Pdf link: https://arxiv.org/pdf/2401.17699
Abstract Face Recognition (FR) systems can suffer from physical (i.e., print photo) and digital (i.e., DeepFake) attacks. However, previous related work rarely considers both situations at the same time. This implies the deployment of multiple models and thus more computational burden. The main reasons for this lack of an integrated model are caused by two factors: (1) The lack of a dataset including both physical and digital attacks with ID consistency which means the same ID covers the real face and all attack types; (2) Given the large intra-class variance between these two attacks, it is difficult to learn a compact feature space to detect both attacks simultaneously. To address these issues, we collect a Unified physical-digital Attack dataset, called UniAttackData. The dataset consists of $1,800$ participations of 2 and 12 physical and digital attacks, respectively, resulting in a total of 29,706 videos. Then, we propose a Unified Attack Detection framework based on Vision-Language Models (VLMs), namely UniAttackDetection, which includes three main modules: the Teacher-Student Prompts (TSP) module, focused on acquiring unified and specific knowledge respectively; the Unified Knowledge Mining (UKM) module, designed to capture a comprehensive feature space; and the Sample-Level Prompt Interaction (SLPI) module, aimed at grasping sample-level semantics. These three modules seamlessly form a robust unified attack detection framework. Extensive experiments on UniAttackData and three other datasets demonstrate the superiority of our approach for unified face attack detection.
LoRec: Large Language Model for Robust Sequential Recommendation against Poisoning Attacks
Authors: Authors: Kaike Zhang, Qi Cao, Yunfan Wu, Fei Sun, Huawei Shen, Xueqi Cheng
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2401.17723
Pdf link: https://arxiv.org/pdf/2401.17723
Abstract Sequential recommender systems stand out for their ability to capture users' dynamic interests and the patterns of item-to-item transitions. However, the inherent openness of sequential recommender systems renders them vulnerable to poisoning attacks, where fraudulent users are injected into the training data to manipulate learned patterns. Traditional defense strategies predominantly depend on predefined assumptions or rules extracted from specific known attacks, limiting their generalizability to unknown attack types. To solve the above problems, considering the rich open-world knowledge encapsulated in Large Language Models (LLMs), our research initially focuses on the capabilities of LLMs in the detection of unknown fraudulent activities within recommender systems, a strategy we denote as LLM4Dec. Empirical evaluations demonstrate the substantial capability of LLMs in identifying unknown fraudsters, leveraging their expansive, open-world knowledge. Building upon this, we propose the integration of LLMs into defense strategies to extend their effectiveness beyond the confines of known attacks. We propose LoRec, an advanced framework that employs LLM-Enhanced Calibration to strengthen the robustness of sequential recommender systems against poisoning attacks. LoRec integrates an LLM-enhanced CalibraTor (LCT) that refines the training process of sequential recommender systems with knowledge derived from LLMs, applying a user-wise reweighting to diminish the impact of fraudsters injected by attacks. By incorporating LLMs' open-world knowledge, the LCT effectively converts the limited, specific priors or rules into a more general pattern of fraudsters, offering improved defenses against poisoning attacks. Our comprehensive experiments validate that LoRec, as a general framework, significantly strengthens the robustness of sequential recommender systems.
Haris: an Advanced Autonomous Mobile Robot for Smart Parking Assistance
Authors: Authors: Layth Hamad, Muhammad Asif Khan, Hamid Menouar, Fethi Filali, Amr Mohamed
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.17741
Pdf link: https://arxiv.org/pdf/2401.17741
Abstract This paper presents Haris, an advanced autonomous mobile robot system for tracking the location of vehicles in crowded car parks using license plate recognition. The system employs simultaneous localization and mapping (SLAM) for autonomous navigation and precise mapping of the parking area, eliminating the need for GPS dependency. In addition, the system utilizes a sophisticated framework using computer vision techniques for object detection and automatic license plate recognition (ALPR) for reading and associating license plate numbers with location data. This information is subsequently synchronized with a back-end service and made accessible to users via a user-friendly mobile app, offering effortless vehicle location and alleviating congestion within the parking facility. The proposed system has the potential to improve the management of short-term large outdoor parking areas in crowded places such as sports stadiums. The demo of the robot can be found on https://youtu.be/ZkTCM35fxa0?si=QjggJuN7M1o3oifx.
PF-GNN: Differentiable particle filtering based approximation of universal graph representations
Authors: Authors: Mohammed Haroon Dupty, Yanfei Dong, Wee Sun Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.17752
Pdf link: https://arxiv.org/pdf/2401.17752
Abstract Message passing Graph Neural Networks (GNNs) are known to be limited in expressive power by the 1-WL color-refinement test for graph isomorphism. Other more expressive models either are computationally expensive or need preprocessing to extract structural features from the graph. In this work, we propose to make GNNs universal by guiding the learning process with exact isomorphism solver techniques which operate on the paradigm of Individualization and Refinement (IR), a method to artificially introduce asymmetry and further refine the coloring when 1-WL stops. Isomorphism solvers generate a search tree of colorings whose leaves uniquely identify the graph. However, the tree grows exponentially large and needs hand-crafted pruning techniques which are not desirable from a learning perspective. We take a probabilistic view and approximate the search tree of colorings (i.e. embeddings) by sampling multiple paths from root to leaves of the search tree. To learn more discriminative representations, we guide the sampling process with particle filter updates, a principled approach for sequential state estimation. Our algorithm is end-to-end differentiable, can be applied with any GNN as backbone and learns richer graph representations with only linear increase in runtime. Experimental evaluation shows that our approach consistently outperforms leading GNN models on both synthetic benchmarks for isomorphism detection as well as real-world datasets.
AI-enabled Cyber-Physical In-Orbit Factory -- AI approaches based on digital twin technology for robotic small satellite production
Authors: Authors: Florian Leutert, David Bohlig, Florian Kempf, Klaus Schilling, Maximilian Mühlbauer, Bengisu Ayan, Thomas Hulin, Freek Stulp, Alin Albu-Schäffer, Vladimir Kutscher, Christian Plesker, Thomas Dasbach, Stephan Damm, Reiner Anderl, Benjamin Schleich
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2401.17799
Pdf link: https://arxiv.org/pdf/2401.17799
Abstract With the ever increasing number of active satellites in space, the rising demand for larger formations of small satellites and the commercialization of the space industry (so-called New Space), the realization of manufacturing processes in orbit comes closer to reality. Reducing launch costs and risks, allowing for faster on-demand deployment of individually configured satellites as well as the prospect for possible on-orbit servicing for satellites makes the idea of realizing an in-orbit factory promising. In this paper, we present a novel approach to an in-orbit factory of small satellites covering a digital process twin, AI-based fault detection, and teleoperated robot-control, which are being researched as part of the "AI-enabled Cyber-Physical In-Orbit Factory" project. In addition to the integration of modern automation and Industry 4.0 production approaches, the question of how artificial intelligence (AI) and learning approaches can be used to make the production process more robust, fault-tolerant and autonomous is addressed. This lays the foundation for a later realisation of satellite production in space in the form of an in-orbit factory. Central aspect is the development of a robotic AIT (Assembly, Integration and Testing) system where a small satellite could be assembled by a manipulator robot from modular subsystems. Approaches developed to improving this production process with AI include employing neural networks for optical and electrical fault detection of components. Force sensitive measuring and motion training helps to deal with uncertainties and tolerances during assembly. An AI-guided teleoperated control of the robot arm allows for human intervention while a Digital Process Twin represents process data and provides supervision during the whole production process. Approaches and results towards automated satellite production are presented in detail.
QTFlow: Quantitative Timing-Sensitive Information Flow for Security-Aware Hardware Design on RTL
Authors: Authors: Lennart M. Reimann, Anschul Prashar, Chiara Ghinami, Rebecca Pelke, Dominik Sisejkovic, Farhad Merchant, Rainer Leupers
Subjects: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/2401.17819
Pdf link: https://arxiv.org/pdf/2401.17819
Abstract In contemporary Electronic Design Automation (EDA) tools, security often takes a backseat to the primary goals of power, performance, and area optimization. Commonly, the security analysis is conducted by hand, leading to vulnerabilities in the design remaining unnoticed. Security-aware EDA tools assist the designer in the identification and removal of security threats while keeping performance and area in mind. Cutting-edge methods employ information flow analysis to identify inadvertent information leaks in design structures. Current information leakage detection methods use quantitative information flow analysis to quantify the leaks. However, handling sequential circuits poses challenges for state-of-the-art techniques due to their time-agnostic nature, overlooking timing channels, and introducing false positives. To address this, we introduce QTFlow, a timing-sensitive framework for quantifying hardware information leakages during the design phase. Illustrating its effectiveness on open-source benchmarks, QTFlow autonomously identifies timing channels and diminishes all false positives arising from time-agnostic analysis when contrasted with current state-of-the-art techniques.
Do Object Detection Localization Errors Affect Human Performance and Trust?
Authors: Authors: Sven de Witte, Ombretta Strafforello, Jan van Gemert
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2401.17821
Pdf link: https://arxiv.org/pdf/2401.17821
Abstract Bounding boxes are often used to communicate automatic object detection results to humans, aiding humans in a multitude of tasks. We investigate the relationship between bounding box localization errors and human task performance. We use observer performance studies on a visual multi-object counting task to measure both human trust and performance with different levels of bounding box accuracy. The results show that localization errors have no significant impact on human accuracy or trust in the system. Recall and precision errors impact both human performance and trust, suggesting that optimizing algorithms based on the F1 score is more beneficial in human-computer tasks. Lastly, the paper offers an improvement on bounding boxes in multi-object counting tasks with center dots, showing improved performance and better resilience to localization inaccuracy.
Leveraging Swin Transformer for Local-to-Global Weakly Supervised Semantic Segmentation
Authors: Authors: Rozhan Ahmadi, Shohreh Kasaei
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.17828
Pdf link: https://arxiv.org/pdf/2401.17828
Abstract In recent years, weakly supervised semantic segmentation using image-level labels as supervision has received significant attention in the field of computer vision. Most existing methods have addressed the challenges arising from the lack of spatial information in these labels by focusing on facilitating supervised learning through the generation of pseudo-labels from class activation maps (CAMs). Due to the localized pattern detection of Convolutional Neural Networks (CNNs), CAMs often emphasize only the most discriminative parts of an object, making it challenging to accurately distinguish foreground objects from each other and the background. Recent studies have shown that Vision Transformer (ViT) features, due to their global view, are more effective in capturing the scene layout than CNNs. However, the use of hierarchical ViTs has not been extensively explored in this field. This work explores the use of Swin Transformer by proposing "SWTformer" to enhance the accuracy of the initial seed CAMs by bringing local and global views together. SWTformer-V1 generates class probabilities and CAMs using only the patch tokens as features. SWTformer-V2 incorporates a multi-scale feature fusion mechanism to extract additional information and utilizes a background-aware mechanism to generate more accurate localization maps with improved cross-object discrimination. Based on experiments on the PascalVOC 2012 dataset, SWTformer-V1 achieves a 0.98% mAP higher localization accuracy, outperforming state-of-the-art models. It also yields comparable performance by 0.82% mIoU on average higher than other methods in generating initial localization maps, depending only on the classification network. SWTformer-V2 further improves the accuracy of the generated seed CAMs by 5.32% mIoU, further proving the effectiveness of the local-to-global view provided by the Swin transformer.
Semantic Anything in 3D Gaussians
Authors: Authors: Xu Hu, Yuxi Wang, Lue Fan, Junsong Fan, Junran Peng, Zhen Lei, Qing Li, Zhaoxiang Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.17857
Pdf link: https://arxiv.org/pdf/2401.17857
Abstract 3D Gaussian Splatting has emerged as an alternative 3D representation of Neural Radiance Fields (NeRFs), benefiting from its high-quality rendering results and real-time rendering speed. Considering the 3D Gaussian representation remains unparsed, it is necessary first to execute object segmentation within this domain. Subsequently, scene editing and collision detection can be performed, proving vital to a multitude of applications, such as virtual reality (VR), augmented reality (AR), game/movie production, etc. In this paper, we propose a novel approach to achieve object segmentation in 3D Gaussian via an interactive procedure without any training process and learned parameters. We refer to the proposed method as SA-GS, for Segment Anything in 3D Gaussians. Given a set of clicked points in a single input view, SA-GS can generalize SAM to achieve 3D consistent segmentation via the proposed multi-view mask generation and view-wise label assignment methods. We also propose a cross-view label-voting approach to assign labels from different views. In addition, in order to address the boundary roughness issue of segmented objects resulting from the non-negligible spatial sizes of 3D Gaussian located at the boundary, SA-GS incorporates the simple but effective Gaussian Decomposition scheme. Extensive experiments demonstrate that SA-GS achieves high-quality 3D segmentation results, which can also be easily applied for scene editing and collision detection tasks. Codes will be released soon.
AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error
Authors: Authors: Jonas Ricker, Denis Lukovnikov, Asja Fischer
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.17879
Pdf link: https://arxiv.org/pdf/2401.17879
Abstract With recent text-to-image models, anyone can generate deceptively realistic images with arbitrary contents, fueling the growing threat of visual disinformation. A key enabler for generating high-resolution images with low computational cost has been the development of latent diffusion models (LDMs). In contrast to conventional diffusion models, LDMs perform the denoising process in the low-dimensional latent space of a pre-trained autoencoder (AE) instead of the high-dimensional image space. Despite their relevance, the forensic analysis of LDMs is still in its infancy. In this work we propose AEROBLADE, a novel detection method which exploits an inherent component of LDMs: the AE used to transform images between image and latent space. We find that generated images can be more accurately reconstructed by the AE than real images, allowing for a simple detection approach based on the reconstruction error. Most importantly, our method is easy to implement and does not require any training, yet nearly matches the performance of detectors that rely on extensive training. We empirically demonstrate that AEROBLADE is effective against state-of-the-art LDMs including Stable Diffusion and Midjourney. Beyond detection, our approach allows for the qualitative analysis of images, which can be leveraged for identifying inpainted regions.
Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation
Authors: Authors: Maoyuan Ye, Jing Zhang, Juhua Liu, Chenyu Liu, Baocai Yin, Cong Liu, Bo Du, Dacheng Tao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.17904
Pdf link: https://arxiv.org/pdf/2401.17904
Abstract The Segment Anything Model (SAM), a profound vision foundation model pre-trained on a large-scale dataset, breaks the boundaries of general segmentation and sparks various downstream applications. This paper introduces Hi-SAM, a unified model leveraging SAM for hierarchical text segmentation. Hi-SAM excels in text segmentation across four hierarchies, including stroke, word, text-line, and paragraph, while realizing layout analysis as well. Specifically, we first turn SAM into a high-quality text stroke segmentation (TSS) model through a parameter-efficient fine-tuning approach. We use this TSS model to iteratively generate the text stroke labels in a semi-automatical manner, unifying labels across the four text hierarchies in the HierText dataset. Subsequently, with these complete labels, we launch the end-to-end trainable Hi-SAM based on the TSS architecture with a customized hierarchical mask decoder. During inference, Hi-SAM offers both automatic mask generation (AMG) mode and promptable segmentation mode. In terms of the AMG mode, Hi-SAM segments text stroke foreground masks initially, then samples foreground points for hierarchical text mask generation and achieves layout analysis in passing. As for the promptable mode, Hi-SAM provides word, text-line, and paragraph masks with a single point click. Experimental results show the state-of-the-art performance of our TSS model: 84.86% fgIOU on Total-Text and 88.96% fgIOU on TextSeg for text stroke segmentation. Moreover, compared to the previous specialist for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements: 4.73% PQ and 5.39% F1 on the text-line level, 5.49% PQ and 7.39% F1 on the paragraph level layout analysis, requiring 20x fewer training epochs. The code is available at https://github.com/ymy-k/Hi-SAM.
SubPipe: A Submarine Pipeline Inspection Dataset for Segmentation and Visual-inertial Localization
Authors: Authors: Olaya Álvarez-Tuñón, Luiza Ribeiro Marnet, László Antal, Martin Aubard, Maria Costa, Yury Brodskiy
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2401.17907
Pdf link: https://arxiv.org/pdf/2401.17907
Abstract This paper presents SubPipe, an underwater dataset for SLAM, object detection, and image segmentation. SubPipe has been recorded using a \gls{LAUV}, operated by OceanScan MST, and carrying a sensor suite including two cameras, a side-scan sonar, and an inertial navigation system, among other sensors. The AUV has been deployed in a pipeline inspection environment with a submarine pipe partially covered by sand. The AUV's pose ground truth is estimated from the navigation sensors. The side-scan sonar and RGB images include object detection and segmentation annotations, respectively. State-of-the-art segmentation, object detection, and SLAM methods are benchmarked on SubPipe to demonstrate the dataset's challenges and opportunities for leveraging computer vision algorithms. To the authors' knowledge, this is the first annotated underwater dataset providing a real pipeline inspection scenario. The dataset and experiments are publicly available online at https://github.com/remaro-network/SubPipe-dataset
Source-free Domain Adaptive Object Detection in Remote Sensing Images
Authors: Authors: Weixing Liu, Jun Liu, Xin Su, Han Nie, Bin Luo
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.17916
Pdf link: https://arxiv.org/pdf/2401.17916
Abstract Recent studies have used unsupervised domain adaptive object detection (UDAOD) methods to bridge the domain gap in remote sensing (RS) images. However, UDAOD methods typically assume that the source domain data can be accessed during the domain adaptation process. This setting is often impractical in the real world due to RS data privacy and transmission difficulty. To address this challenge, we propose a practical source-free object detection (SFOD) setting for RS images, which aims to perform target domain adaptation using only the source pre-trained model. We propose a new SFOD method for RS images consisting of two parts: perturbed domain generation and alignment. The proposed multilevel perturbation constructs the perturbed domain in a simple yet efficient form by perturbing the domain-variant features at the image level and feature level according to the color and style bias. The proposed multilevel alignment calculates feature and label consistency between the perturbed domain and the target domain across the teacher-student network, and introduces the distillation of feature prototype to mitigate the noise of pseudo-labels. By requiring the detector to be consistent in the perturbed domain and the target domain, the detector is forced to focus on domaininvariant features. Extensive results of three synthetic-to-real experiments and three cross-sensor experiments have validated the effectiveness of our method which does not require access to source domain RS images. Furthermore, experiments on computer vision datasets show that our method can be extended to other fields as well. Our code will be available at: https://weixliu.github.io/ .
GuardFS: a File System for Integrated Detection and Mitigation of Linux-based Ransomware
Authors: Authors: Jan von der Assen, Chao Feng, Alberto Huertas Celdrán, Róbert Oleš, Gérôme Bovet, Burkhard Stiller
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2401.17917
Pdf link: https://arxiv.org/pdf/2401.17917
Abstract Although ransomware has received broad attention in media and research, this evolving threat vector still poses a systematic threat. Related literature has explored their detection using various approaches leveraging Machine and Deep Learning. While these approaches are effective in detecting malware, they do not answer how to use this intelligence to protect against threats, raising concerns about their applicability in a hostile environment. Solutions that focus on mitigation rarely explore how to prevent and not just alert or halt its execution, especially when considering Linux-based samples. This paper presents GuardFS, a file system-based approach to investigate the integration of detection and mitigation of ransomware. Using a bespoke overlay file system, data is extracted before files are accessed. Models trained on this data are used by three novel defense configurations that obfuscate, delay, or track access to the file system. The experiments on GuardFS test the configurations in a reactive setting. The results demonstrate that although data loss cannot be completely prevented, it can be significantly reduced. Usability and performance analysis demonstrate that the defense effectiveness of the configurations relates to their impact on resource consumption and usability.
CONCORD: Towards a DSL for Configurable Graph Code Representation
Authors: Authors: Mootez Saad, Tushar Sharma
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.17967
Pdf link: https://arxiv.org/pdf/2401.17967
Abstract Deep learning is widely used to uncover hidden patterns in large code corpora. To achieve this, constructing a format that captures the relevant characteristics and features of source code is essential. Graph-based representations have gained attention for their ability to model structural and semantic information. However, existing tools lack flexibility in constructing graphs across different programming languages, limiting their use. Additionally, the output of these tools often lacks interoperability and results in excessively large graphs, making graph-based neural networks training slower and less scalable. We introduce CONCORD, a domain-specific language to build customizable graph representations. It implements reduction heuristics to reduce graphs' size complexity. We demonstrate its effectiveness in code smell detection as an illustrative use case and show that: first, CONCORD can produce code representations automatically per the specified configuration, and second, our heuristics can achieve comparable performance with significantly reduced size. CONCORD will help researchers a) create and experiment with customizable graph-based code representations for different software engineering tasks involving DL, b) reduce the engineering work to generate graph representations, c) address the issue of scalability in GNN models, and d) enhance the reproducibility of experiments in research through a standardized approach to code representation and analysis.
MelNet: A Real-Time Deep Learning Algorithm for Object Detection
Authors: Authors: Yashar Azadvatan, Murat Kurt
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.17972
Pdf link: https://arxiv.org/pdf/2401.17972
Abstract In this study, a novel deep learning algorithm for object detection, named MelNet, was introduced. MelNet underwent training utilizing the KITTI dataset for object detection. Following 300 training epochs, MelNet attained an mAP (mean average precision) score of 0.732. Additionally, three alternative models -YOLOv5, EfficientDet, and Faster-RCNN-MobileNetv3- were trained on the KITTI dataset and juxtaposed with MelNet for object detection. The outcomes underscore the efficacy of employing transfer learning in certain instances. Notably, preexisting models trained on prominent datasets (e.g., ImageNet, COCO, and Pascal VOC) yield superior results. Another finding underscores the viability of creating a new model tailored to a specific scenario and training it on a specific dataset. This investigation demonstrates that training MelNet exclusively on the KITTI dataset also surpasses EfficientDet after 150 epochs. Consequently, post-training, MelNet's performance closely aligns with that of other pre-trained models.
Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study
Authors: Authors: Qirui Jiao, Daoyuan Chen, Yilun Huang, Yaliang Li, Ying Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.17981
Pdf link: https://arxiv.org/pdf/2401.17981
Abstract Despite the impressive capabilities of Multimodal Large Language Models (MLLMs) in integrating text and image modalities, challenges remain in accurately interpreting detailed visual elements. This paper presents an empirical study on enhancing MLLMs with state-of-the-art (SOTA) object detection and Optical Character Recognition models to improve fine-grained image understanding and reduce hallucination in responses. Our research investigates the embedding-based infusion of detection information, the impact of such infusion on the MLLMs' original abilities, and the interchangeability of detection models. We conduct systematic experiments with models such as LLaVA-1.5, DINO, and PaddleOCRv2, revealing that our approach not only refines MLLMs' performance in specific visual tasks but also maintains their original strengths. The resulting enhanced MLLMs outperform SOTA models on 9 out of 10 benchmarks, achieving an improvement of up to 12.99% on the normalized average score, marking a notable advancement in multimodal understanding. We release our codes to facilitate further exploration into the fine-grained multimodal dialogue capabilities of MLLMs.
Development and Adaptation of Robotic Vision in the Real-World: the Challenge of Door Detection
Authors: Authors: Michele Antonazzi, Matteo Luperto, N. Alberto Borghese, Nicola Basilico
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2401.17996
Pdf link: https://arxiv.org/pdf/2401.17996
Abstract Mobile service robots are increasingly prevalent in human-centric, real-world domains, operating autonomously in unconstrained indoor environments. In such a context, robotic vision plays a central role in enabling service robots to perceive high-level environmental features from visual observations. Despite the data-driven approaches based on deep learning push the boundaries of vision systems, applying these techniques to real-world robotic scenarios presents unique methodological challenges. Traditional models fail to represent the challenging perception constraints typical of service robots and must be adapted for the specific environment where robots finally operate. We propose a method leveraging photorealistic simulations that balances data quality and acquisition costs for synthesizing visual datasets from the robot perspective used to train deep architectures. Then, we show the benefits in qualifying a general detector for the target domain in which the robot is deployed, showing also the trade-off between the effort for obtaining new examples from such a setting and the performance gain. In our extensive experimental campaign, we focus on the door detection task (namely recognizing the presence and the traversability of doorways) that, in dynamic settings, is useful to infer the topology of the map. Our findings are validated in a real-world robot deployment, comparing prominent deep-learning models and demonstrating the effectiveness of our approach in practical settings.
Improved Scene Landmark Detection for Camera Localization
Authors: Authors: Tien Do, Sudipta N. Sinha
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2401.18083
Pdf link: https://arxiv.org/pdf/2401.18083
Abstract Camera localization methods based on retrieval, local feature matching, and 3D structure-based pose estimation are accurate but require high storage, are slow, and are not privacy-preserving. A method based on scene landmark detection (SLD) was recently proposed to address these limitations. It involves training a convolutional neural network (CNN) to detect a few predetermined, salient, scene-specific 3D points or landmarks and computing camera pose from the associated 2D-3D correspondences. Although SLD outperformed existing learning-based approaches, it was notably less accurate than 3D structure-based methods. In this paper, we show that the accuracy gap was due to insufficient model capacity and noisy labels during training. To mitigate the capacity issue, we propose to split the landmarks into subgroups and train a separate network for each subgroup. To generate better training labels, we propose using dense reconstructions to estimate visibility of scene landmarks. Finally, we present a compact architecture to improve memory efficiency. Accuracy wise, our approach is on par with state of the art structure based methods on the INDOOR-6 dataset but runs significantly faster and uses less storage. Code and models can be found at https://github.com/microsoft/SceneLandmarkLocalization.
Keyword: face recognition

Unified Physical-Digital Face Attack Detection
Authors: Authors: Hao Fang, Ajian Liu, Haocheng Yuan, Junze Zheng, Dingheng Zeng, Yanhong Liu, Jiankang Deng, Sergio Escalera, Xiaoming Liu, Jun Wan, Zhen Lei
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.17699
Pdf link: https://arxiv.org/pdf/2401.17699
Abstract Face Recognition (FR) systems can suffer from physical (i.e., print photo) and digital (i.e., DeepFake) attacks. However, previous related work rarely considers both situations at the same time. This implies the deployment of multiple models and thus more computational burden. The main reasons for this lack of an integrated model are caused by two factors: (1) The lack of a dataset including both physical and digital attacks with ID consistency which means the same ID covers the real face and all attack types; (2) Given the large intra-class variance between these two attacks, it is difficult to learn a compact feature space to detect both attacks simultaneously. To address these issues, we collect a Unified physical-digital Attack dataset, called UniAttackData. The dataset consists of $1,800$ participations of 2 and 12 physical and digital attacks, respectively, resulting in a total of 29,706 videos. Then, we propose a Unified Attack Detection framework based on Vision-Language Models (VLMs), namely UniAttackDetection, which includes three main modules: the Teacher-Student Prompts (TSP) module, focused on acquiring unified and specific knowledge respectively; the Unified Knowledge Mining (UKM) module, designed to capture a comprehensive feature space; and the Sample-Level Prompt Interaction (SLPI) module, aimed at grasping sample-level semantics. These three modules seamlessly form a robust unified attack detection framework. Extensive experiments on UniAttackData and three other datasets demonstrate the superiority of our approach for unified face attack detection.
Keyword: augmentation

Arabic Tweet Act: A Weighted Ensemble Pre-Trained Transformer Model for Classifying Arabic Speech Acts on Twitter
Authors: Authors: Khadejaa Alshehri, Areej Alhothali, Nahed Alowidi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.17373
Pdf link: https://arxiv.org/pdf/2401.17373
Abstract Speech acts are a speakers actions when performing an utterance within a conversation, such as asking, recommending, greeting, or thanking someone, expressing a thought, or making a suggestion. Understanding speech acts helps interpret the intended meaning and actions behind a speakers or writers words. This paper proposes a Twitter dialectal Arabic speech act classification approach based on a transformer deep learning neural network. Twitter and social media, are becoming more and more integrated into daily life. As a result, they have evolved into a vital source of information that represents the views and attitudes of their users. We proposed a BERT based weighted ensemble learning approach to integrate the advantages of various BERT models in dialectal Arabic speech acts classification. We compared the proposed model against several variants of Arabic BERT models and sequence-based models. We developed a dialectal Arabic tweet act dataset by annotating a subset of a large existing Arabic sentiment analysis dataset (ASAD) based on six speech act categories. We also evaluated the models on a previously developed Arabic Tweet Act dataset (ArSAS). To overcome the class imbalance issue commonly observed in speech act problems, a transformer-based data augmentation model was implemented to generate an equal proportion of speech act categories. The results show that the best BERT model is araBERTv2-Twitter models with a macro-averaged F1 score and an accuracy of 0.73 and 0.84, respectively. The performance improved using a BERT-based ensemble method with a 0.74 and 0.85 averaged F1 score and accuracy on our dataset, respectively.
Anything in Any Scene: Photorealistic Video Object Insertion
Authors: Authors: Chen Bai, Zeman Shao, Guoxiang Zhang, Di Liang, Jie Yang, Zhuorui Zhang, Yujian Guo, Chengzhang Zhong, Yiqiao Qiu, Zhendong Wang, Yichen Guan, Xiaoyin Zheng, Tao Wang, Cheng Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2401.17509
Pdf link: https://arxiv.org/pdf/2401.17509
Abstract Realistic video simulation has shown significant potential across diverse applications, from virtual reality to film production. This is particularly true for scenarios where capturing videos in real-world settings is either impractical or expensive. Existing approaches in video simulation often fail to accurately model the lighting environment, represent the object geometry, or achieve high levels of photorealism. In this paper, we propose Anything in Any Scene, a novel and generic framework for realistic video simulation that seamlessly inserts any object into an existing dynamic video with a strong emphasis on physical realism. Our proposed general framework encompasses three key processes: 1) integrating a realistic object into a given scene video with proper placement to ensure geometric realism; 2) estimating the sky and environmental lighting distribution and simulating realistic shadows to enhance the light realism; 3) employing a style transfer network that refines the final video output to maximize photorealism. We experimentally demonstrate that Anything in Any Scene framework produces simulated videos of great geometric realism, lighting realism, and photorealism. By significantly mitigating the challenges associated with video data generation, our framework offers an efficient and cost-effective solution for acquiring high-quality videos. Furthermore, its applications extend well beyond video data augmentation, showing promising potential in virtual reality, video editing, and various other video-centric applications. Please check our project website https://anythinginanyscene.github.io for access to our project code and more high-resolution video results.
Graph Contrastive Learning with Cohesive Subgraph Awareness
Authors: Authors: Yucheng Wu, Leye Wang, Xiao Han, Han-Jia Ye
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.17580
Pdf link: https://arxiv.org/pdf/2401.17580
Abstract Graph contrastive learning (GCL) has emerged as a state-of-the-art strategy for learning representations of diverse graphs including social and biomedical networks. GCL widely uses stochastic graph topology augmentation, such as uniform node dropping, to generate augmented graphs. However, such stochastic augmentations may severely damage the intrinsic properties of a graph and deteriorate the following representation learning process. We argue that incorporating an awareness of cohesive subgraphs during the graph augmentation and learning processes has the potential to enhance GCL performance. To this end, we propose a novel unified framework called CTAug, to seamlessly integrate cohesion awareness into various existing GCL mechanisms. In particular, CTAug comprises two specialized modules: topology augmentation enhancement and graph learning enhancement. The former module generates augmented graphs that carefully preserve cohesion properties, while the latter module bolsters the graph encoder's ability to discern subgraph patterns. Theoretical analysis shows that CTAug can strictly improve existing GCL mechanisms. Empirical experiments verify that CTAug can achieve state-of-the-art performance for graph representation learning, especially for graphs with high degrees. The code is available at https://doi.org/10.5281/zenodo.10594093, or https://github.com/wuyucheng2002/CTAug.
Propagation and Pitfalls: Reasoning-based Assessment of Knowledge Editing through Counterfactual Tasks
Authors: Authors: Wenyue Hua, Jiang Guo, Mingwen Dong, Henghui Zhu, Patrick Ng, Zhiguo Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
Arxiv link: https://arxiv.org/abs/2401.17585
Pdf link: https://arxiv.org/pdf/2401.17585
Abstract Current approaches of knowledge editing struggle to effectively propagate updates to interconnected facts. In this work, we delve into the barriers that hinder the appropriate propagation of updated knowledge within these models for accurate reasoning. To support our analysis, we introduce a novel reasoning-based benchmark -- ReCoE (Reasoning-based Counterfactual Editing dataset) -- which covers six common reasoning schemes in real world. We conduct a thorough analysis of existing knowledge editing techniques, including input augmentation, finetuning, and locate-and-edit. We found that all model editing methods show notably low performance on this dataset, especially in certain reasoning schemes. Our analysis over the chain-of-thought generation of edited models further uncover key reasons behind the inadequacy of existing knowledge editing methods from a reasoning standpoint, involving aspects on fact-wise editing, fact recall ability, and coherence in generation. We will make our benchmark publicly available.
Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and KiSing-v2
Authors: Authors: Jiatong Shi, Yueqian Lin, Xinyi Bai, Keyi Zhang, Yuning Wu, Yuxun Tang, Yifeng Yu, Qin Jin, Shinji Watanabe
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2401.17619
Pdf link: https://arxiv.org/pdf/2401.17619
Abstract In singing voice synthesis (SVS), generating singing voices from musical scores faces challenges due to limited data availability, a constraint less common in text-to-speech (TTS). This study proposes a new approach to address this data scarcity. We utilize an existing singing voice synthesizer for data augmentation and apply precise manual tuning to reduce unnatural voice synthesis. Our development of two extensive singing voice corpora, ACE-Opencpop and KiSing-v2, facilitates large-scale, multi-singer voice synthesis. Utilizing pre-trained models derived from these corpora, we achieve notable improvements in voice quality, evident in both in-domain and out-of-domain scenarios. The corpora, pre-trained models, and their related training recipes are publicly available at Muskits-ESPnet (https://github.com/espnet/espnet).
ReSLLM: Large Language Models are Strong Resource Selectors for Federated Search
Authors: Authors: Shuai Wang, Shengyao Zhuang, Bevan Koopman, Guido Zuccon
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.17645
Pdf link: https://arxiv.org/pdf/2401.17645
Abstract Federated search, which involves integrating results from multiple independent search engines, will become increasingly pivotal in the context of Retrieval-Augmented Generation pipelines empowering LLM-based applications such as chatbots. These systems often distribute queries among various search engines, ranging from specialized (e.g., PubMed) to general (e.g., Google), based on the nature of user utterances. A critical aspect of federated search is resource selection - the selection of appropriate resources prior to issuing the query to ensure high-quality and rapid responses, and contain costs associated with calling the external search engines. However, current SOTA resource selection methodologies primarily rely on feature-based learning approaches. These methods often involve the labour intensive and expensive creation of training labels for each resource. In contrast, LLMs have exhibited strong effectiveness as zero-shot methods across NLP and IR tasks. We hypothesise that in the context of federated search LLMs can assess the relevance of resources without the need for extensive predefined labels or features. In this paper, we propose ReSLLM. Our ReSLLM method exploits LLMs to drive the selection of resources in federated search in a zero-shot setting. In addition, we devise an unsupervised fine tuning protocol, the Synthetic Label Augmentation Tuning (SLAT), where the relevance of previously logged queries and snippets from resources is predicted using an off-the-shelf LLM and then in turn used to fine-tune ReSLLM with respect to resource selection. Our empirical evaluation and analysis details the factors influencing the effectiveness of LLMs in this context. The results showcase the merits of ReSLLM for resource selection: not only competitive effectiveness in the zero-shot setting, but also obtaining large when fine-tuned using SLAT-protocol.
Harnessing Smartwatch Microphone Sensors for Cough Detection and Classification
Authors: Authors: Pranay Jaiswal, Haroon R. Lone
Subjects: Sound (cs.SD); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/2401.17738
Pdf link: https://arxiv.org/pdf/2401.17738
Abstract This study investigates the potential of using smartwatches with built-in microphone sensors for monitoring coughs and detecting various cough types. We conducted a study involving 32 participants and collected 9 hours of audio data in a controlled manner. Afterward, we processed this data using a structured approach, resulting in 223 positive cough samples. We further improved the dataset through augmentation techniques and employed a specialized 1D CNN model. This model achieved an impressive accuracy rate of 98.49% while non-walking and 98.2% while walking, showing smartwatches can detect cough. Moreover, our research successfully identified four distinct types of coughs using clustering techniques.
SwarmBrain: Embodied agent for real-time strategy game StarCraft II via large language models
Authors: Authors: Xiao Shao, Weifu Jiang, Fei Zuo, Mengqing Liu
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.17749
Pdf link: https://arxiv.org/pdf/2401.17749
Abstract Large language models (LLMs) have recently garnered significant accomplishments in various exploratory tasks, even surpassing the performance of traditional reinforcement learning-based methods that have historically dominated the agent-based field. The purpose of this paper is to investigate the efficacy of LLMs in executing real-time strategy war tasks within the StarCraft II gaming environment. In this paper, we introduce SwarmBrain, an embodied agent leveraging LLM for real-time strategy implementation in the StarCraft II game environment. The SwarmBrain comprises two key components: 1) a Overmind Intelligence Matrix, powered by state-of-the-art LLMs, is designed to orchestrate macro-level strategies from a high-level perspective. This matrix emulates the overarching consciousness of the Zerg intelligence brain, synthesizing strategic foresight with the aim of allocating resources, directing expansion, and coordinating multi-pronged assaults. 2) a Swarm ReflexNet, which is agile counterpart to the calculated deliberation of the Overmind Intelligence Matrix. Due to the inherent latency in LLM reasoning, the Swarm ReflexNet employs a condition-response state machine framework, enabling expedited tactical responses for fundamental Zerg unit maneuvers. In the experimental setup, SwarmBrain is in control of the Zerg race in confrontation with an Computer-controlled Terran adversary. Experimental results show the capacity of SwarmBrain to conduct economic augmentation, territorial expansion, and tactical formulation, and it shows the SwarmBrain is capable of achieving victory against Computer players set at different difficulty levels.
Distillation Enhanced Time Series Forecasting Network with Momentum Contrastive Learning
Authors: Authors: Haozhi Gao, Qianqian Ren, Jinbao Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2401.17802
Pdf link: https://arxiv.org/pdf/2401.17802
Abstract Contrastive representation learning is crucial in time series analysis as it alleviates the issue of data noise and incompleteness as well as sparsity of supervision signal. However, existing constrastive learning frameworks usually focus on intral-temporal features, which fails to fully exploit the intricate nature of time series data. To address this issue, we propose DE-TSMCL, an innovative distillation enhanced framework for long sequence time series forecasting. Specifically, we design a learnable data augmentation mechanism which adaptively learns whether to mask a timestamp to obtain optimized sub-sequences. Then, we propose a contrastive learning task with momentum update to explore inter-sample and intra-temporal correlations of time series to learn the underlying structure feature on the unlabeled time series. Meanwhile, we design a supervised task to learn more robust representations and facilitate the contrastive learning process. Finally, we jointly optimize the above two tasks. By developing model loss from multiple tasks, we can learn effective representations for downstream forecasting task. Extensive experiments, in comparison with state-of-the-arts, well demonstrate the effectiveness of DE-TSMCL, where the maximum improvement can reach to 27.3%.
PALoc: Advancing SLAM Benchmarking with Prior-Assisted 6-DoF Trajectory Generation and Uncertainty Estimation
Authors: Authors: Xiangcheng Hu, Linwei Zheng, Jin Wu, Ruoyu Geng, Yang Yu, Hexiang Wei, Xiaoyu Tang, Lujia Wang, Jianhao Jiao, Ming Liu
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2401.17826
Pdf link: https://arxiv.org/pdf/2401.17826
Abstract Accurately generating ground truth (GT) trajectories is essential for Simultaneous Localization and Mapping (SLAM) evaluation, particularly under varying environmental conditions. This study introduces a systematic approach employing a prior map-assisted framework for generating dense six-degree-of-freedom (6-DoF) GT poses for the first time, enhancing the fidelity of both indoor and outdoor SLAM datasets. Our method excels in handling degenerate and stationary conditions frequently encountered in SLAM datasets, thereby increasing robustness and precision. A significant aspect of our approach is the detailed derivation of covariances within the factor graph, enabling an in-depth analysis of pose uncertainty propagation. This analysis crucially contributes to demonstrating specific pose uncertainties and enhancing trajectory reliability from both theoretical and empirical perspectives. Additionally, we provide an open-source toolbox (https://github.com/JokerJohn/Cloud_Map_Evaluation) for map evaluation criteria, facilitating the indirect assessment of overall trajectory precision. Experimental results show at least a 30\% improvement in map accuracy and a 20\% increase in direct trajectory accuracy compared to the Iterative Closest Point (ICP) \cite{sharp2002icp} algorithm across diverse campus environments, with substantially enhanced robustness. Our open-source solution (https://github.com/JokerJohn/PALoc), extensively applied in the FusionPortable\cite{Jiao2022Mar} dataset, is geared towards SLAM benchmark dataset augmentation and represents a significant advancement in SLAM evaluations.
Optimizing contrastive learning for cortical folding pattern detection
Authors: Authors: Aymeric Gaudin (1), Louise Guillon (1), Clara Fischer (1), Arnaud Cachia (2), Denis Rivière (1), Jean-François Mangin (1), Joël Chavas (1) ((1) Neurospin, Gif-sur-Yvette, France, (2) LaPsyDé, Laboratoire A.Binet-Sorbonne, Paris, France)
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.18035
Pdf link: https://arxiv.org/pdf/2401.18035
Abstract The human cerebral cortex has many bumps and grooves called gyri and sulci. Even though there is a high inter-individual consistency for the main cortical folds, this is not the case when we examine the exact shapes and details of the folding patterns. Because of this complexity, characterizing the cortical folding variability and relating them to subjects' behavioral characteristics or pathologies is still an open scientific problem. Classical approaches include labeling a few specific patterns, either manually or semi-automatically, based on geometric distances, but the recent availability of MRI image datasets of tens of thousands of subjects makes modern deep-learning techniques particularly attractive. Here, we build a self-supervised deep-learning model to detect folding patterns in the cingulate region. We train a contrastive self-supervised model (SimCLR) on both Human Connectome Project (1101 subjects) and UKBioBank (21070 subjects) datasets with topological-based augmentations on the cortical skeletons, which are topological objects that capture the shape of the folds. We explore several backbone architectures (convolutional network, DenseNet, and PointNet) for the SimCLR. For evaluation and testing, we perform a linear classification task on a database manually labeled for the presence of the "double-parallel" folding pattern in the cingulate region, which is related to schizophrenia characteristics. The best model, giving a test AUC of 0.76, is a convolutional network with 6 layers, a 10-dimensional latent space, a linear projection head, and using the branch-clipping augmentation. This is the first time that a self-supervised deep learning model has been applied to cortical skeletons on such a large dataset and quantitatively evaluated. We can now envisage the next step: applying it to other brain regions to detect other biomarkers.
Rank Supervised Contrastive Learning for Time Series Classification
Authors: Authors: Qianying Ren, Dongsheng Luo, Dongjin Song
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2401.18057
Pdf link: https://arxiv.org/pdf/2401.18057
Abstract Recently, various contrastive learning techniques have been developed to categorize time series data and exhibit promising performance. A general paradigm is to utilize appropriate augmentations and construct feasible positive samples such that the encoder can yield robust and discriminative representations by mapping similar data points closer together in the feature space while pushing dissimilar data points farther apart. Despite its efficacy, the fine-grained relative similarity (e.g., rank) information of positive samples is largely ignored, especially when labeled samples are limited. To this end, we present Rank Supervised Contrastive Learning (RankSCL) to perform time series classification. Different from conventional contrastive learning frameworks, RankSCL augments raw data in a targeted way in the embedding space and adopts certain filtering rules to select more informative positive and negative pairs of samples. Moreover, a novel rank loss is developed to assign different weights for different levels of positive samples, enable the encoder to extract the fine-grained information of the same class, and produce a clear boundary among different classes. Thoroughly empirical studies on 128 UCR datasets and 30 UEA datasets demonstrate that the proposed RankSCL can achieve state-of-the-art performance compared to existing baseline methods.

LeeKyungwook / get-arxiv-noti

New submissions for Thu, 1 Feb 24 #959

Keyword: detection

Detecting mental disorder on social media: a ChatGPT-augmented explainable approach

AdvGPS: Adversarial GPS for Multi-Agent Perception Attack

Effective Multi-Stage Training Model For Edge Computing Devices In Intrusion Detection

Local Feature Matching Using Deep Learning: A Survey

SPECTRUM: Speaker-Enhanced Pre-Training for Long Dialogue Summarization

Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data

Assertion Detection Large Language Model In-context Learning LoRA Fine-tuning

Social Robot Navigation with Adaptive Proxemics Based on Emotions

Unified Physical-Digital Face Attack Detection

LoRec: Large Language Model for Robust Sequential Recommendation against Poisoning Attacks

Haris: an Advanced Autonomous Mobile Robot for Smart Parking Assistance

PF-GNN: Differentiable particle filtering based approximation of universal graph representations

AI-enabled Cyber-Physical In-Orbit Factory -- AI approaches based on digital twin technology for robotic small satellite production

QTFlow: Quantitative Timing-Sensitive Information Flow for Security-Aware Hardware Design on RTL

Do Object Detection Localization Errors Affect Human Performance and Trust?

Leveraging Swin Transformer for Local-to-Global Weakly Supervised Semantic Segmentation

Semantic Anything in 3D Gaussians

AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error

Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation

SubPipe: A Submarine Pipeline Inspection Dataset for Segmentation and Visual-inertial Localization

Source-free Domain Adaptive Object Detection in Remote Sensing Images

GuardFS: a File System for Integrated Detection and Mitigation of Linux-based Ransomware

CONCORD: Towards a DSL for Configurable Graph Code Representation

MelNet: A Real-Time Deep Learning Algorithm for Object Detection

Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study

Development and Adaptation of Robotic Vision in the Real-World: the Challenge of Door Detection

Improved Scene Landmark Detection for Camera Localization

Keyword: face recognition

Unified Physical-Digital Face Attack Detection

Keyword: augmentation

Arabic Tweet Act: A Weighted Ensemble Pre-Trained Transformer Model for Classifying Arabic Speech Acts on Twitter

Anything in Any Scene: Photorealistic Video Object Insertion

Graph Contrastive Learning with Cohesive Subgraph Awareness

Propagation and Pitfalls: Reasoning-based Assessment of Knowledge Editing through Counterfactual Tasks

Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and KiSing-v2

ReSLLM: Large Language Models are Strong Resource Selectors for Federated Search

Harnessing Smartwatch Microphone Sensors for Cough Detection and Classification

SwarmBrain: Embodied agent for real-time strategy game StarCraft II via large language models

Distillation Enhanced Time Series Forecasting Network with Momentum Contrastive Learning

PALoc: Advancing SLAM Benchmarking with Prior-Assisted 6-DoF Trajectory Generation and Uncertainty Estimation

Optimizing contrastive learning for cortical folding pattern detection

Rank Supervised Contrastive Learning for Time Series Classification