New submissions for Tuesday, 11 June 2024 (showing 666 of 666 entries )

Keyword: detection

Title:

      Multi-sensor Intrusion Detection System

Authors: Victor Arinde, Liberty Idowu
Subjects: Subjects: Cryptography and Security (cs.CR); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Security, defined as protection against external threats, is a critical concern for homes and offices. Intrusion, characterized by unauthorized access, presents a significant challenge to maintaining security. This research aims to address this issue by designing and implementing an automated intrusion detection system utilizing a combination of sensors and communication technologies. The research introduced an automated intrusion detection system for homes and offices, combining sensors such as a PIR sensor for detecting unauthorized motion, magnetic switches for unauthorized entry detection, and a GSM module for notifying property owners. Employing the ATmega328P microcontroller, sensor data is analysed to generate early intrusion alerts, prompting phone call notifications via the GSM module. Practical implementation involved breadboarding, soldering, and rigorous testing, ensuring proper functionality under real-world conditions. The implemented intrusion detection system effectively utilizes magnetic switches and a Passive Infrared (PIR) sensor to detect unauthorized entry and motion within the premises, respectively. Upon detection, the system promptly analyses the situation and alerts the property owner via phone call, enabling swift response measures. This real-time notification system enhances proactive security management, minimizing the risk of further intrusion and ensuring the safety of the property. The multi-sensor intrusion detection system, incorporating PIR sensors, magnetic switches, and a GSM-based phone call gateway, effectively alerts property owners of unauthorized intrusions in real-time. Demonstrating its efficacy through rigorous testing, the system offers enhanced security for both residential and commercial environments.
Title:
```
  Fight Scene Detection for Movie Highlight Generation System
```
Authors: Aryan Mathur
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In this paper of a research based project, using Bidirectional Long Short-Term Memory (BiLSTM) networks, we provide a novel Fight Scene Detection (FSD) model which can be used for Movie Highlight Generation Systems (MHGS) based on deep learning and Neural Networks . Movies usually have Fight Scenes to keep the audience amazed. For trailer generation, or any other application of Highlight generation, it is very tidious to first identify all such scenes manually and then compile them to generate a highlight serving the purpose. Our proposed FSD system utilises temporal characteristics of the movie scenes and thus is capable to automatically identify fight scenes. Thereby helping in the effective production of captivating movie highlights. We observe that the proposed solution features 93.5% accuracy and is higher than 2D CNN with Hough Forests which being 92% accurate and is significantly higher than 3D CNN which features an accuracy of 65%.
Title:
```
  Improving Logits-based Detector without Logits from Black-box LLMs
```
Authors: Cong Zeng, Shengkun Tang, Xianjun Yang, Yuanzhou Chen, Yiyou Sun, zhiqiang xu, Yao Li, Haifeng Chen, Wei Cheng, Dongkuan Xu
Subjects: Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The advent of Large Language Models (LLMs) has revolutionized text generation, producing outputs that closely mimic human writing. This blurring of lines between machine- and human-written text presents new challenges in distinguishing one from the other a task further complicated by the frequent updates and closed nature of leading proprietary LLMs. Traditional logits-based detection methods leverage surrogate models for identifying LLM-generated content when the exact logits are unavailable from black-box LLMs. However, these methods grapple with the misalignment between the distributions of the surrogate and the often undisclosed target models, leading to performance degradation, particularly with the introduction of new, closed-source models. Furthermore, while current methodologies are generally effective when the source model is identified, they falter in scenarios where the model version remains unknown, or the test set comprises outputs from various source models. To address these limitations, we present Distribution-Aligned LLMs Detection (DALD), an innovative framework that redefines the state-of-the-art performance in black-box text detection even without logits from source LLMs. DALD is designed to align the surrogate model's distribution with that of unknown target LLMs, ensuring enhanced detection capability and resilience against rapid model iterations with minimal training investment. By leveraging corpus samples from publicly accessible outputs of advanced models such as ChatGPT, GPT-4 and Claude-3, DALD fine-tunes surrogate models to synchronize with unknown source model distributions effectively.
Title:
```
  Blended Bots: Infiltration through Identity Deception on Social Media
```
Authors: Samantha C. Phillips, Lynnette Hui Xian Ng, Kathleen M. Carley
Subjects: Subjects: Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Bots are automated social media users that can be used to amplify (mis)information and sow harmful discourse. In order to effectively influence users, bots can be generated to reproduce human user behavior. Indeed, people tend to trust information coming from users with profiles that fit roles they expect to exist, such as users with gender role stereotypes. In this work, we examine differences in the types of identities in profiles of human and bot accounts with a focus on combinations of identities that represent gender role stereotypes. We find that some types of identities differentiate between human and bot profiles, confirming this approach can be a useful in distinguishing between human and bot accounts on social media. However, contrary to our expectations, we reveal that gender bias is expressed more in human accounts than bots overall. Despite having less gender bias overall, we provide examples of identities with strong associations with gender identities in bot profiles, such as those related to technology, finance, sports, and horoscopes. Finally, we discuss implications for designing constructive social media bot detection training materials.
Title:
```
  Active Islanding Detection Using Pulse Compression Probing
```
Authors: Nicholas Piaquadio, N. Eva Wu, Morteza Sarailoo
Subjects: Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract An islanding detection scheme is developed using pulse compression probing (PCP). A state space system realization is taken from the probing output. The nu-gap metric is applied to compare the measured system to fully intact system and classify it as islanded, or grid-connected. The designed detector displays fast operation, accurate islanding detection results under varying grid condition, and is physically implementable at the terminals of an inverter. The method is verified via electro-magnetic transient (EMT) simulation on a modified IEEE 34 bus test system with randomized loads and simultaneous probing at three independent solar plants, with the probing signal directly implemented into the logic of a switching inverter model.
Title:
```
  Autonomous Robotic Assembly: From Part Singulation to Precise Assembly
```
Authors: Kei Ota Devesh K. Jha, Siddarth Jain, Bill Yerazunis, Radu Corcodel, Yash Shukla, Antonia Bronars, Diego Romeres
Subjects: Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Imagine a robot that can assemble a functional product from the individual parts presented in any configuration to the robot. Designing such a robotic system is a complex problem which presents several open challenges. To bypass these challenges, the current generation of assembly systems is built with a lot of system integration effort to provide the structure and precision necessary for assembly. These systems are mostly responsible for part singulation, part kitting, and part detection, which is accomplished by intelligent system design. In this paper, we present autonomous assembly of a gear box with minimum requirements on structure. The assembly parts are randomly placed in a two-dimensional work environment for the robot. The proposed system makes use of several different manipulation skills such as sliding for grasping, in-hand manipulation, and insertion to assemble the gear box. All these tasks are run in a closed-loop fashion using vision, tactile, and Force-Torque (F/T) sensors. We perform extensive hardware experiments to show the robustness of the proposed methods as well as the overall system. See supplementary video at this https URL.
Title:
```
  MemeGuard: An LLM and VLM-based Framework for Advancing Content Moderation via Meme Intervention
```
Authors: Prince Jha, Raghav Jain, Konika Mandal, Aman Chadha, Sriparna Saha, Pushpak Bhattacharyya
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In the digital world, memes present a unique challenge for content moderation due to their potential to spread harmful content. Although detection methods have improved, proactive solutions such as intervention are still limited, with current research focusing mostly on text-based content, neglecting the widespread influence of multimodal content like memes. Addressing this gap, we present \textit{MemeGuard}, a comprehensive framework leveraging Large Language Models (LLMs) and Visual Language Models (VLMs) for meme intervention. \textit{MemeGuard} harnesses a specially fine-tuned VLM, \textit{VLMeme}, for meme interpretation, and a multimodal knowledge selection and ranking mechanism (\textit{MKS}) for distilling relevant knowledge. This knowledge is then employed by a general-purpose LLM to generate contextually appropriate interventions. Another key contribution of this work is the \textit{\textbf{I}ntervening} \textit{\textbf{C}yberbullying in \textbf{M}ultimodal \textbf{M}emes (ICMM)} dataset, a high-quality, labeled dataset featuring toxic memes and their corresponding human-annotated interventions. We leverage \textit{ICMM} to test \textit{MemeGuard}, demonstrating its proficiency in generating relevant and effective responses to toxic memes.
Title:
```
  RAPID: Robust APT Detection and Investigation Using Context-Aware Deep Learning
```
Authors: Yonatan Amaru, Prasanna Wudali, Yuval Elovici, Asaf Shabtai
Subjects: Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Advanced persistent threats (APTs) pose significant challenges for organizations, leading to data breaches, financial losses, and reputational damage. Existing provenance-based approaches for APT detection often struggle with high false positive rates, a lack of interpretability, and an inability to adapt to evolving system behavior. We introduce RAPID, a novel deep learning-based method for robust APT detection and investigation, leveraging context-aware anomaly detection and alert tracing. By utilizing self-supervised sequence learning and iteratively learned embeddings, our approach effectively adapts to dynamic system behavior. The use of provenance tracing both enriches the alerts and enhances the detection capabilities of our approach. Our extensive evaluation demonstrates RAPID's effectiveness and computational efficiency in real-world scenarios. In addition, RAPID achieves higher precision and recall than state-of-the-art methods, significantly reducing false positives. RAPID integrates contextual information and facilitates a smooth transition from detection to investigation, providing security teams with detailed insights to efficiently address APT threats.
Title:
```
  Spiking Neural Networks with Consistent Mapping Relations Allow High-Accuracy Inference
```
Authors: Yang Li, Xiang He, Qingqun Kong, Yi Zeng
Subjects: Subjects: Neural and Evolutionary Computing (cs.NE)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Spike-based neuromorphic hardware has demonstrated substantial potential in low energy consumption and efficient inference. However, the direct training of deep spiking neural networks is challenging, and conversion-based methods still require substantial time delay owing to unresolved conversion errors. We determine that the primary source of the conversion errors stems from the inconsistency between the mapping relationship of traditional activation functions and the input-output dynamics of spike neurons. To counter this, we introduce the Consistent ANN-SNN Conversion (CASC) framework. It includes the Consistent IF (CIF) neuron model, specifically contrived to minimize the influence of the stable point's upper bound, and the wake-sleep conversion (WSC) method, synergistically ensuring the uniformity of neuron behavior. This method theoretically achieves a loss-free conversion, markedly diminishing time delays and improving inference performance in extensive classification and object detection tasks. Our approach offers a viable pathway toward more efficient and effective neuromorphic systems.
Title:
```
  Evaluation of Posits for Spectral Analysis Using a Software-Defined Dataflow Architecture
```
Authors: Sameer Deshmukh, Daniel Khankin, William Killian, John Gustafson, Elad Raz
Subjects: Subjects: Hardware Architecture (cs.AR)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Spectral analysis plays an important role in detection of damage in structures and deep learning. The choice of a floating-point format plays a crucial role in determining the accuracy and performance of spectral analysis. The IEEE Std 754\textsuperscript{TM} floating-point format (IEEE~754 for short) is supported by most major hardware vendors for ``normal'' floats. However, it has several limitations. Previous work has attempted to evaluate posit format with respect to accuracy and performance. The accuracy of the posit has been established over IEEE~754 for a variety of applications. For example, our analysis of the Fast Fourier Transform shows 2x better accuracy when using a 32-bit posit vs. a 32-bit IEEE754 format. For spectral analysis, 32-bit posits are substantially more accurate than 32-bit IEEE~754 floats. Although posit has shown better accuracy than IEEE~754, a fair evaluation of posit with IEEE~754 format using a real hardware implementation has been lacking so far. A software simulation of posit format on an x86 CPU is about $\mathbf{69.3\times}$ slower than native IEEE~754 hardware for normal floats for a Fast Fourier Transform (FFT) of $\mathbf{2^{28}}$ points. We propose the use of a software-defined dataflow architecture to evaluate performance and accuracy of posits in spectral analysis. Our dataflow architecture uses reconfigurable logical elements that express algorithms using only integer operations. Our architecture does not have an FPU, and we express both IEEE~754 and posit arithmetic using the same integer operations within the hardware. On our dataflow architecture, the posit format is only $\mathbf{1.8\times}$ slower than IEEE~754 for a Fast Fourier Transform (FFT) of $\mathbf{2^{28}\approx 268}$ million points. With this implementation, we empirically propose a new lower bound for the performance of posit compared to IEEE~754 format.
Title:
```
  Select-Mosaic: Data Augmentation Method for Dense Small Object Scenes
```
Authors: Hao Zhang, Shuaijie Zhang, Renbin Zou
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Data augmentation refers to the process of applying a series of transformations or expansions to original data to generate new samples, thereby increasing the diversity and quantity of the data, effectively improving the performance and robustness of models. As a common data augmentation method, Mosaic data augmentation technique stitches multiple images together to increase the diversity and complexity of training data, thereby reducing the risk of overfitting. Although Mosaic data augmentation achieves excellent results in general detection tasks by stitching images together, it still has certain limitations for specific detection tasks. This paper addresses the challenge of detecting a large number of densely distributed small objects in aerial images by proposing the Select-Mosaic data augmentation method, which is improved with a fine-grained region selection strategy. The improved Select-Mosaic method demonstrates superior performance in handling dense small object detection tasks, significantly enhancing the accuracy and stability of detection models. Code is available at this https URL.
Title:
```
  Recent advancements in computational morphology : A comprehensive survey
```
Authors: Jatayu Baxi, Brijesh Bhatt
Subjects: Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Computational morphology handles the language processing at the word level. It is one of the foundational tasks in the NLP pipeline for the development of higher level NLP applications. It mainly deals with the processing of words and word forms. Computational Morphology addresses various sub problems such as morpheme boundary detection, lemmatization, morphological feature tagging, morphological reinflection etc. In this paper, we present exhaustive survey of the methods for developing computational morphology related tools. We survey the literature in the chronological order starting from the conventional methods till the recent evolution of deep neural network based approaches. We also review the existing datasets available for this task across the languages. We discuss about the effectiveness of neural model compared with the traditional models and present some unique challenges associated with building the computational morphology tools. We conclude by discussing some recent and open research issues in this field.
Title:
```
  Unsupervised learning of Data-driven Facial Expression Coding System (DFECS) using keypoint tracking
```
Authors: Shivansh Chandra Tripathi, Rahul Garg
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The development of existing facial coding systems, such as the Facial Action Coding System (FACS), relied on manual examination of facial expression videos for defining Action Units (AUs). To overcome the labor-intensive nature of this process, we propose the unsupervised learning of an automated facial coding system by leveraging computer-vision-based facial keypoint tracking. In this novel facial coding system called the Data-driven Facial Expression Coding System (DFECS), the AUs are estimated by applying dimensionality reduction to facial keypoint movements from a neutral frame through a proposed Full Face Model (FFM). FFM employs a two-level decomposition using advanced dimensionality reduction techniques such as dictionary learning (DL) and non-negative matrix factorization (NMF). These techniques enhance the interpretability of AUs by introducing constraints such as sparsity and positivity to the encoding matrix. Results show that DFECS AUs estimated from the DISFA dataset can account for an average variance of up to 91.29 percent in test datasets (CK+ and BP4D-Spontaneous) and also surpass the variance explained by keypoint-based equivalents of FACS AUs in these datasets. Additionally, 87.5 percent of DFECS AUs are interpretable, i.e., align with the direction of facial muscle movements. In summary, advancements in automated facial coding systems can accelerate facial expression analysis across diverse fields such as security, healthcare, and entertainment. These advancements offer numerous benefits, including enhanced detection of abnormal behavior, improved pain analysis in healthcare settings, and enriched emotion-driven interactions. To facilitate further research, the code repository of DFECS has been made publicly accessible.
Title:
```
  Novel Approach to Intrusion Detection: Introducing GAN-MSCNN-BILSTM with LIME Predictions
```
Authors: Asmaa Benchama, Khalid Zebbara
Subjects: Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract This paper introduces an innovative intrusion detection system that harnesses Generative Adversarial Networks (GANs), Multi-Scale Convolutional Neural Networks (MSCNNs), and Bidirectional Long Short-Term Memory (BiLSTM) networks, supplemented by Local Interpretable Model-Agnostic Explanations (LIME) for interpretability. Employing a GAN, the system generates realistic network traffic data, encompassing both normal and attack patterns. This synthesized data is then fed into an MSCNN-BiLSTM architecture for intrusion detection. The MSCNN layer extracts features from the network traffic data at different scales, while the BiLSTM layer captures temporal dependencies within the traffic sequences. Integration of LIME allows for explaining the model's decisions. Evaluation on the Hogzilla dataset, a standard benchmark, showcases an impressive accuracy of 99.16\% for multi-class classification and 99.10\% for binary classification, while ensuring interpretability through LIME. This fusion of deep learning and interpretability presents a promising avenue for enhancing intrusion detection systems by improving transparency and decision support in network security.
Title:
```
  A Novel Generative AI-Based Framework for Anomaly Detection in Multicast Messages in Smart Grid Communications
```
Authors: Aydin Zaboli, Seong Lok Choi, Tai-Jin Song, Junho Hong
Subjects: Subjects: Cryptography and Security (cs.CR); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Cybersecurity breaches in digital substations can pose significant challenges to the stability and reliability of power system operations. To address these challenges, defense and mitigation techniques are required. Identifying and detecting anomalies in information and communication technology (ICT) is crucial to ensure secure device interactions within digital substations. This paper proposes a task-oriented dialogue (ToD) system for anomaly detection (AD) in datasets of multicast messages e.g., generic object oriented substation event (GOOSE) and sampled value (SV) in digital substations using large language models (LLMs). This model has a lower potential error and better scalability and adaptability than a process that considers the cybersecurity guidelines recommended by humans, known as the human-in-the-loop (HITL) process. Also, this methodology significantly reduces the effort required when addressing new cyber threats or anomalies compared with machine learning (ML) techniques, since it leaves the models complexity and precision unaffected and offers a faster implementation. These findings present a comparative assessment, conducted utilizing standard and advanced performance evaluation metrics for the proposed AD framework and the HITL process. To generate and extract datasets of IEC 61850 communications, a hardware-in-the-loop (HIL) testbed was employed.
Title:
```
  ThatiAR: Subjectivity Detection in Arabic News Sentences
```
Authors: Reem Suwaileh, Maram Hasanain, Fatema Hubail, Wajdi Zaghouani, Firoj Alam
Subjects: Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Detecting subjectivity in news sentences is crucial for identifying media bias, enhancing credibility, and combating misinformation by flagging opinion-based content. It provides insights into public sentiment, empowers readers to make informed decisions, and encourages critical thinking. While research has developed methods and systems for this purpose, most efforts have focused on English and other high-resourced languages. In this study, we present the first large dataset for subjectivity detection in Arabic, consisting of ~3.6K manually annotated sentences, and GPT-4o based explanation. In addition, we included instructions (both in English and Arabic) to facilitate LLM based fine-tuning. We provide an in-depth analysis of the dataset, annotation process, and extensive benchmark results, including PLMs and LLMs. Our analysis of the annotation process highlights that annotators were strongly influenced by their political, cultural, and religious backgrounds, especially at the beginning of the annotation process. The experimental results suggest that LLMs with in-context learning provide better performance. We aim to release the dataset and resources for the community.
Title:
```
  NYU CTF Dataset: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security
```
Authors: Minghao Shao, Sofija Jancheska, Meet Udeshi, Brendan Dolan-Gavitt, Haoran Xi, Kimberly Milner, Boyuan Chen, Max Yin, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique
Subjects: Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Large Language Models (LLMs) are being deployed across various domains today. However, their capacity to solve Capture the Flag (CTF) challenges in cybersecurity has not been thoroughly evaluated. To address this, we develop a novel method to assess LLMs in solving CTF challenges by creating a scalable, open-source benchmark database specifically designed for these applications. This database includes metadata for LLM testing and adaptive learning, compiling a diverse range of CTF challenges from popular competitions. Utilizing the advanced function calling capabilities of LLMs, we build a fully automated system with an enhanced workflow and support for external tool calls. Our benchmark dataset and automated framework allow us to evaluate the performance of five LLMs, encompassing both black-box and open-source models. This work lays the foundation for future research into improving the efficiency of LLMs in interactive cybersecurity tasks and automated task planning. By providing a specialized dataset, our project offers an ideal platform for developing, testing, and refining LLM-based approaches to vulnerability detection and resolution. Evaluating LLMs on these challenges and comparing with human performance yields insights into their potential for AI-driven cybersecurity solutions to perform real-world threat management. We make our dataset open source to public this https URL along with our playground automated framework this https URL.
Title:
```
  Deep Learning to Predict Glaucoma Progression using Structural Changes in the Eye
```
Authors: Sayan Mandal
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Glaucoma is a chronic eye disease characterized by optic neuropathy, leading to irreversible vision loss. It progresses gradually, often remaining undiagnosed until advanced stages. Early detection is crucial to monitor atrophy and develop treatment strategies to prevent further vision impairment. Data-centric methods have enabled computer-aided algorithms for precise glaucoma diagnosis. In this study, we use deep learning models to identify complex disease traits and progression criteria, detecting subtle changes indicative of glaucoma. We explore the structure-function relationship in glaucoma progression and predict functional impairment from structural eye deterioration. We analyze statistical and machine learning methods, including deep learning techniques with optical coherence tomography (OCT) scans for accurate progression prediction. Addressing challenges like age variability, data imbalances, and noisy labels, we develop novel semi-supervised time-series algorithms: 1. Weakly-Supervised Time-Series Learning: We create a CNN-LSTM model to encode spatiotemporal features from OCT scans. This approach uses age-related progression and positive-unlabeled data to establish robust pseudo-progression criteria, bypassing gold-standard labels. 2. Semi-Supervised Time-Series Learning: Using labels from Guided Progression Analysis (GPA) in a contrastive learning scheme, the CNN-LSTM architecture learns from potentially mislabeled data to improve prediction accuracy. Our methods outperform conventional and state-of-the-art techniques.
Title:
```
  Heart Sound Segmentation Using Deep Learning Techniques
```
Authors: Manas Madine
Subjects: Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Heart disease remains a leading cause of mortality worldwide. Auscultation, the process of listening to heart sounds, can be enhanced through computer-aided analysis using Phonocardiogram (PCG) signals. This paper presents a novel approach for heart sound segmentation and classification into S1 (LUB) and S2 (DUB) sounds. We employ FFT-based filtering, dynamic programming for event detection, and a Siamese network for robust classification. Our method demonstrates superior performance on the PASCAL heart sound dataset compared to existing approaches.
Title:
```
  SRC-Net: Bi-Temporal Spatial Relationship Concerned Network for Change Detection
```
Authors: Hongjia Chen, Xin Xu, Fangling Pu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Change detection (CD) in remote sensing imagery is a crucial task with applications in environmental monitoring, urban development, and disaster management. CD involves utilizing bi-temporal images to identify changes over time. The bi-temporal spatial relationships between features at the same location at different times play a key role in this process. However, existing change detection networks often do not fully leverage these spatial relationships during bi-temporal feature extraction and fusion. In this work, we propose SRC-Net: a bi-temporal spatial relationship concerned network for CD. The proposed SRC-Net includes a Perception and Interaction Module that incorporates spatial relationships and establishes a cross-branch perception mechanism to enhance the precision and robustness of feature extraction. Additionally, a Patch-Mode joint Feature Fusion Module is introduced to address information loss in current methods. It considers different change modes and concerns about spatial relationships, resulting in more expressive fusion features. Furthermore, we construct a novel network using these two relationship concerned modules and conducted experiments on the LEVIR-CD and WHU Building datasets. The experimental results demonstrate that our network outperforms state-of-the-art (SOTA) methods while maintaining a modest parameter count. We believe our approach sets a new paradigm for change detection and will inspire further advancements in the field. The code and models are publicly available at this https URL.
Title:
```
  Region of Interest Loss for Anonymizing Learned Image Compression
```
Authors: Christoph Liebender, Ranulfo Bezerra, Kazunori Ohno, Satoshi Tadokoro
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The use of AI in public spaces continually raises concerns about privacy and the protection of sensitive data. An example is the deployment of detection and recognition methods on humans, where images are provided by surveillance cameras. This results in the acquisition of great amounts of sensitive data, since the capture and transmission of images taken by such cameras happens unaltered, for them to be received by a server on the network. However, many applications do not explicitly require the identity of a given person in a scene; An anonymized representation containing information of the person's position while preserving the context of them in the scene suffices. We show how using a customized loss function on region of interests (ROI) can achieve sufficient anonymization such that human faces become unrecognizable while persons are kept detectable, by training an end-to-end optimized autoencoder for learned image compression that utilizes the flexibility of the learned analysis and reconstruction transforms for the task of mutating parts of the compression result. This approach enables compression and anonymization in one step on the capture device, instead of transmitting sensitive, nonanonymized data over the network. Additionally, we evaluate how this anonymization impacts the average precision of pre-trained foundation models on detecting faces (MTCNN) and humans (YOLOv8) in comparison to non-ANN based methods, while considering compression rate and latency.
Title:
```
  A DeNoising FPN With Transformer R-CNN for Tiny Object Detection
```
Authors: Hou-I Liu, Yu-Wen Tseng, Kai-Cheng Chang, Pin-Jyun Wang, Hong-Han Shuai, Wen-Huang Cheng
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Despite notable advancements in the field of computer vision, the precise detection of tiny objects continues to pose a significant challenge, largely owing to the minuscule pixel representation allocated to these objects in imagery data. This challenge resonates profoundly in the domain of geoscience and remote sensing, where high-fidelity detection of tiny objects can facilitate a myriad of applications ranging from urban planning to environmental monitoring. In this paper, we propose a new framework, namely, DeNoising FPN with Trans R-CNN (DNTR), to improve the performance of tiny object detection. DNTR consists of an easy plug-in design, DeNoising FPN (DN-FPN), and an effective Transformer-based detector, Trans R-CNN. Specifically, feature fusion in the feature pyramid network is important for detecting multiscale objects. However, noisy features may be produced during the fusion process since there is no regularization between the features of different scales. Therefore, we introduce a DN-FPN module that utilizes contrastive learning to suppress noise in each level's features in the top-down path of FPN. Second, based on the two-stage framework, we replace the obsolete R-CNN detector with a novel Trans R-CNN detector to focus on the representation of tiny objects with self-attention. Experimental results manifest that our DNTR outperforms the baselines by at least 17.4\% in terms of APvt on the AI-TOD dataset and 9.6\% in terms of AP on the VisDrone dataset, respectively. Our code will be available at this \href{this https URL}{this https URL}.
Title:
```
  Vision Mamba: Cutting-Edge Classification of Alzheimer's Disease with 3D MRI Scans
```
Authors: Muthukumar K A, Amit Gurung, Priya Ranjan
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Classifying 3D MRI images for early detection of Alzheimer's disease is a critical task in medical imaging. Traditional approaches using Convolutional Neural Networks (CNNs) and Transformers face significant challenges in this domain. CNNs, while effective in capturing local spatial features, struggle with long-range dependencies and often require extensive computational resources for high-resolution 3D data. Transformers, on the other hand, excel in capturing global context but suffer from quadratic complexity in inference time and require substantial memory, making them less efficient for large-scale 3D MRI data. To address these limitations, we propose the use of Vision Mamba, an advanced model based on State Space Models (SSMs), for the classification of 3D MRI images to detect Alzheimer's disease. Vision Mamba leverages dynamic state representations and the selective scan algorithm, allowing it to efficiently capture and retain important spatial information across 3D volumes. By dynamically adjusting state transitions based on input features, Vision Mamba can selectively retain relevant information, leading to more accurate and computationally efficient processing of 3D MRI data. Our approach combines the parallelizable nature of convolutional operations during training with the efficient, recurrent processing of states during inference. This architecture not only improves computational efficiency but also enhances the model's ability to handle long-range dependencies within 3D medical images. Experimental results demonstrate that Vision Mamba outperforms traditional CNN and Transformer models accuracy, making it a promising tool for the early detection of Alzheimer's disease using 3D MRI data.
Title:
```
  Utilizing Grounded SAM for self-supervised frugal camouflaged human detection
```
Authors: Matthias Pijarowski, Alexander Wolpert, Martin Heckmann, Michael Teutsch
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Visually detecting camouflaged objects is a hard problem for both humans and computer vision algorithms. Strong similarities between object and background appearance make the task significantly more challenging than traditional object detection or segmentation tasks. Current state-of-the-art models use either convolutional neural networks or vision transformers as feature extractors. They are trained in a fully supervised manner and thus need a large amount of labeled training data. In this paper, both self-supervised and frugal learning methods are introduced to the task of Camouflaged Object Detection (COD). The overall goal is to fine-tune two COD reference methods, namely SINet-V2 and HitNet, pre-trained for camouflaged animal detection to the task of camouflaged human detection. Therefore, we use the public dataset CPD1K that contains camouflaged humans in a forest environment. We create a strong baseline using supervised frugal transfer learning for the fine-tuning task. Then, we analyze three pseudo-labeling approaches to perform the fine-tuning task in a self-supervised manner. Our experiments show that we achieve similar performance by pure self-supervision compared to fully supervised frugal learning.
Title:
```
  Learning to utilize gradient information for crisp edge detection
```
Authors: Changsong Liu, Wei Zhang, Yanyan Liu, Yuming Li, Wenlin Li, Yimeng Fan, Liang Zhang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Edge detection is a fundamental task in computer vision and it has made great progress under the development of deep convolutional neural networks (DCNNs), some of them have achieved a beyond human-level performance. However, recent top-performing edge detection methods tend to generate thick and blurred edge lines. In this work, we propose an effective method to solve this problem. Our approach consists of a lightweight pre-trained backbone, multi-scale contextual enhancement module aggregating gradient information (MCGI), boundary correction module (BCM), and boundary refinement module (BRM). In addition to this, we construct a novel hybrid loss function based on the Tversky index for solving the issue of imbalanced pixel distribution. We test our method on three standard benchmarks and the experiment results illustrate that our method improves the visual effect of edge maps and achieves a top performance among several state-of-the-art methods on the BSDS500 dataset (ODS F-score in standard evaluation is 0.829, in crispness evaluation is 0.720), NYUD-V2 dataset (ODS F-score in standard evaluation is 0.768, in crispness evaluation is \textbf{0.546}), and BIPED dataset (ODS F-score in standard evaluation is 0.903).
Title:
```
  OD-DETR: Online Distillation for Stabilizing Training of Detection Transformer
```
Authors: Shengjian Wu, Li Sun, Qingli Li
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract DEtection TRansformer (DETR) becomes a dominant paradigm, mainly due to its common architecture with high accuracy and no post-processing. However, DETR suffers from unstable training dynamics. It consumes more data and epochs to converge compared with CNN-based detectors. This paper aims to stabilize DETR training through the online distillation. It utilizes a teacher model, accumulated by Exponential Moving Average (EMA), and distills its knowledge into the online model in following three aspects. First, the matching relation between object queries and ground truth (GT) boxes in the teacher is employed to guide the student, so queries within the student are not only assigned labels based on their own predictions, but also refer to the matching results from the teacher. Second, the teacher's initial query is given to the online student, and its prediction is directly constrained by the corresponding output from the teacher. Finally, the object queries from teacher's different decoding stages are used to build the auxiliary groups to accelerate the convergence. For each GT, two queries with the least matching costs are selected into this extra group, and they predict the GT box and participate the optimization. Extensive experiments show that the proposed OD-DETR successfully stabilizes the training, and significantly increases the performance without bringing in more parameters.
Title:
```
  SlowPerception: Physical-World Latency Attack against Visual Perception in Autonomous Driving
```
Authors: Chen Ma, Ningfei Wang, Zhengyu Zhao, Qi Alfred Chen, Chao Shen
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Autonomous Driving (AD) systems critically depend on visual perception for real-time object detection and multiple object tracking (MOT) to ensure safe driving. However, high latency in these visual perception components can lead to significant safety risks, such as vehicle collisions. While previous research has extensively explored latency attacks within the digital realm, translating these methods effectively to the physical world presents challenges. For instance, existing attacks rely on perturbations that are unrealistic or impractical for AD, such as adversarial perturbations affecting areas like the sky, or requiring large patches that obscure most of a camera's view, thus making them impossible to be conducted effectively in the real world. In this paper, we introduce SlowPerception, the first physical-world latency attack against AD perception, via generating projector-based universal perturbations. SlowPerception strategically creates numerous phantom objects on various surfaces in the environment, significantly increasing the computational load of Non-Maximum Suppression (NMS) and MOT, thereby inducing substantial latency. Our SlowPerception achieves second-level latency in physical-world settings, with an average latency of 2.5 seconds across different AD perception systems, scenarios, and hardware configurations. This performance significantly outperforms existing state-of-the-art latency attacks. Additionally, we conduct AD system-level impact assessments, such as vehicle collisions, using industry-grade AD systems with production-grade AD simulators with a 97% average rate. We hope that our analyses can inspire further research in this critical domain, enhancing the robustness of AD systems against emerging vulnerabilities.
Title:
```
  SAM-PM: Enhancing Video Camouflaged Object Detection using Spatio-Temporal Attention
```
Authors: Muhammad Nawfal Meeran, Gokul Adethya T, Bhanu Pratyush Mantha
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In the domain of large foundation models, the Segment Anything Model (SAM) has gained notable recognition for its exceptional performance in image segmentation. However, tackling the video camouflage object detection (VCOD) task presents a unique challenge. Camouflaged objects typically blend into the background, making them difficult to distinguish in still images. Additionally, ensuring temporal consistency in this context is a challenging problem. As a result, SAM encounters limitations and falls short when applied to the VCOD task. To overcome these challenges, we propose a new method called the SAM Propagation Module (SAM-PM). Our propagation module enforces temporal consistency within SAM by employing spatio-temporal cross-attention mechanisms. Moreover, we exclusively train the propagation module while keeping the SAM network weights frozen, allowing us to integrate task-specific insights with the vast knowledge accumulated by the large model. Our method effectively incorporates temporal consistency and domain-specific expertise into the segmentation network with an addition of less than 1% of SAM's parameters. Extensive experimentation reveals a substantial performance improvement in the VCOD benchmark when compared to the most recent state-of-the-art techniques. Code and pre-trained weights are open-sourced at this https URL
Title:
```
  ControlLoc: Physical-World Hijacking Attack on Visual Perception in Autonomous Driving
```
Authors: Chen Ma, Ningfei Wang, Zhengyu Zhao, Qian Wang, Qi Alfred Chen, Chao Shen
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Recent research in adversarial machine learning has focused on visual perception in Autonomous Driving (AD) and has shown that printed adversarial patches can attack object detectors. However, it is important to note that AD visual perception encompasses more than just object detection; it also includes Multiple Object Tracking (MOT). MOT enhances the robustness by compensating for object detection errors and requiring consistent object detection results across multiple frames before influencing tracking results and driving decisions. Thus, MOT makes attacks on object detection alone less effective. To attack such robust AD visual perception, a digital hijacking attack has been proposed to cause dangerous driving scenarios. However, this attack has limited effectiveness. In this paper, we introduce a novel physical-world adversarial patch attack, ControlLoc, designed to exploit hijacking vulnerabilities in entire AD visual perception. ControlLoc utilizes a two-stage process: initially identifying the optimal location for the adversarial patch, and subsequently generating the patch that can modify the perceived location and shape of objects with the optimal location. Extensive evaluations demonstrate the superior performance of ControlLoc, achieving an impressive average attack success rate of around 98.1% across various AD visual perceptions and datasets, which is four times greater effectiveness than the existing hijacking attack. The effectiveness of ControlLoc is further validated in physical-world conditions, including real vehicle tests under different conditions such as outdoor light conditions with an average attack success rate of 77.5%. AD system-level impact assessments are also included, such as vehicle collision, using industry-grade AD systems and production-grade AD simulators with an average vehicle collision rate and unnecessary emergency stop rate of 81.3%.
Title:
```
  Wearable Healthcare Devices for Monitoring Stress and Attention Level in Workplace Environments
```
Authors: Peter Traunmuller, Anice Jahanjoo, Soheil Khooyooz, Amin Aminifar, Nima TaheriNejad
Subjects: Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Wearable devices have revolutionized healthcare monitoring, allowing us to track physiological conditions without disrupting daily routines. Whereas monitoring physical health and physical activities have been widely studied, their application and impact on mental health are significantly understudied. This work reviews the state-of-the-art, focusing on stress and concentration levels. These two can play an important role in workplace humanization. For instance, they can guide breaks in high-pressure workplaces, indicating when and how long to take. Those are important to avoid overwork and burn-out, harming employees and employers. To this end, it is necessary to study which sensors can accurately determine stress and attention levels, considering that they should not interfere with their activities and be comfortable to wear. From the software point of view, it is helpful to know the capabilities and performance of various algorithms, especially for uncontrolled workplace environments. This work aims to research, review, and compare commercially available non-intrusive measurement devices, which can be worn during the day and possibly integrated with healthcare systems for stress and concentration assessment. We analyze the performance of various algorithms used for stress and concentration level assessment and discuss future paths for reliable detection of these two parameters.
Title:
```
  PSBD: Prediction Shift Uncertainty Unlocks Backdoor Detection
```
Authors: Wei Li, Pin-Yu Chen, Sijia Liu, Ren Wang
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Deep neural networks are susceptible to backdoor attacks, where adversaries manipulate model predictions by inserting malicious samples into the training data. Currently, there is still a lack of direct filtering methods for identifying suspicious training data to unveil potential backdoor samples. In this paper, we propose a novel method, Prediction Shift Backdoor Detection (PSBD), leveraging an uncertainty-based approach requiring minimal unlabeled clean validation data. PSBD is motivated by an intriguing Prediction Shift (PS) phenomenon, where poisoned models' predictions on clean data often shift away from true labels towards certain other labels with dropout applied during inference, while backdoor samples exhibit less PS. We hypothesize PS results from neuron bias effect, making neurons favor features of certain classes. PSBD identifies backdoor training samples by computing the Prediction Shift Uncertainty (PSU), the variance in probability values when dropout layers are toggled on and off during model inference. Extensive experiments have been conducted to verify the effectiveness and efficiency of PSBD, which achieves state-of-the-art results among mainstream detection methods.
Title:
```
  Multi-Stain Multi-Level Convolutional Network for Multi-Tissue Breast Cancer Image Segmentation
```
Authors: Akash Modi, Sumit Kumar Jha, Purnendu Mishra, Rajiv Kumar, Kiran Aatre, Gursewak Singh, Shubham Mathur
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Digital pathology and microscopy image analysis are widely employed in the segmentation of digitally scanned IHC slides, primarily to identify cancer and pinpoint regions of interest (ROI) indicative of tumor presence. However, current ROI segmentation models are either stain-specific or suffer from the issues of stain and scanner variance due to different staining protocols or modalities across multiple labs. Also, tissues like Ductal Carcinoma in Situ (DCIS), acini, etc. are often classified as Tumors due to their structural similarities and color compositions. In this paper, we proposed a novel convolutional neural network (CNN) based Multi-class Tissue Segmentation model for histopathology whole-slide Breast slides which classify tumors and segments other tissue regions such as Ducts, acini, DCIS, Squamous epithelium, Blood Vessels, Necrosis, etc. as a separate class. Our unique pixel-aligned non-linear merge across spatial resolutions empowers models with both local and global fields of view for accurate detection of various classes. Our proposed model is also able to separate bad regions such as folds, artifacts, blurry regions, bubbles, etc. from tissue regions using multi-level context from different resolutions of WSI. Multi-phase iterative training with context-aware augmentation and increasing noise was used to efficiently train a multi-stain generic model with partial and noisy annotations from 513 slides. Our training pipeline used 12 million patches generated using context-aware augmentations which made our model stain and scanner invariant across data sources. To extrapolate stain and scanner invariance, our model was evaluated on 23000 patches which were for a completely new stain (Hematoxylin and Eosin) from a completely new scanner (Motic) from a different lab. The mean IOU was 0.72 which is on par with model performance on other data sources and scanners.
Title:
```
  Mamba YOLO: SSMs-Based YOLO For Object Detection
```
Authors: Zeyu Wang, Chen Li, Huiying Xu, Xinzhong Zhu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Propelled by the rapid advancement of deep learning technologies, the YOLO series has set a new benchmark for real-time object detectors. Researchers have continuously explored innovative applications of reparameterization, efficient layer aggregation networks, and anchor-free techniques on the foundation of YOLO. To further enhance detection performance, Transformer-based structures have been introduced, significantly expanding the model's receptive field and achieving notable performance gains. However, such improvements come at a cost, as the quadratic complexity of the self-attention mechanism increases the computational burden of the model. Fortunately, the emergence of State Space Models (SSM) as an innovative technology has effectively mitigated the issues caused by quadratic complexity. In light of these advancements, we introduce Mamba-YOLO a novel object detection model based on SSM. Mamba-YOLO not only optimizes the SSM foundation but also adapts specifically for object detection tasks. Given the potential limitations of SSM in sequence modeling, such as insufficient receptive field and weak image locality, we have designed the LSBlock and RGBlock. These modules enable more precise capture of local image dependencies and significantly enhance the robustness of the model. Extensive experimental results on the publicly available benchmark datasets COCO and VOC demonstrate that Mamba-YOLO surpasses the existing YOLO series models in both performance and competitiveness, showcasing its substantial potential and competitive edge.The PyTorch code is available at:\url{this https URL}
Title:
```
  Scaling Graph Convolutions for Mobile Vision
```
Authors: William Avery, Mustafa Munir, Radu Marculescu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract To compete with existing mobile architectures, MobileViG introduces Sparse Vision Graph Attention (SVGA), a fast token-mixing operator based on the principles of GNNs. However, MobileViG scales poorly with model size, falling at most 1% behind models with similar latency. This paper introduces Mobile Graph Convolution (MGC), a new vision graph neural network (ViG) module that solves this scaling problem. Our proposed mobile vision architecture, MobileViGv2, uses MGC to demonstrate the effectiveness of our approach. MGC improves on SVGA by increasing graph sparsity and introducing conditional positional encodings to the graph operation. Our smallest model, MobileViGv2-Ti, achieves a 77.7% top-1 accuracy on ImageNet-1K, 2% higher than MobileViG-Ti, with 0.9 ms inference latency on the iPhone 13 Mini NPU. Our largest model, MobileViGv2-B, achieves an 83.4% top-1 accuracy, 0.8% higher than MobileViG-B, with 2.7 ms inference latency. Besides image classification, we show that MobileViGv2 generalizes well to other tasks. For object detection and instance segmentation on MS COCO 2017, MobileViGv2-M outperforms MobileViG-M by 1.2 $AP^{box}$ and 0.7 $AP^{mask}$, and MobileViGv2-B outperforms MobileViG-B by 1.0 $AP^{box}$ and 0.7 $AP^{mask}$. For semantic segmentation on ADE20K, MobileViGv2-M achieves 42.9% $mIoU$ and MobileViGv2-B achieves 44.3% $mIoU$. Our code can be found at \url{this https URL}.
Title:
```
  Stealthy Targeted Backdoor Attacks against Image Captioning
```
Authors: Wenshu Fan, Hongwei Li, Wenbo Jiang, Meng Hao, Shui Yu, Xiao Zhang
Subjects: Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In recent years, there has been an explosive growth in multimodal learning. Image captioning, a classical multimodal task, has demonstrated promising applications and attracted extensive research attention. However, recent studies have shown that image caption models are vulnerable to some security threats such as backdoor attacks. Existing backdoor attacks against image captioning typically pair a trigger either with a predefined sentence or a single word as the targeted output, yet they are unrelated to the image content, making them easily noticeable as anomalies by humans. In this paper, we present a novel method to craft targeted backdoor attacks against image caption models, which are designed to be stealthier than prior attacks. Specifically, our method first learns a special trigger by leveraging universal perturbation techniques for object detection, then places the learned trigger in the center of some specific source object and modifies the corresponding object name in the output caption to a predefined target name. During the prediction phase, the caption produced by the backdoored model for input images with the trigger can accurately convey the semantic information of the rest of the whole image, while incorrectly recognizing the source object as the predefined target. Extensive experiments demonstrate that our approach can achieve a high attack success rate while having a negligible impact on model clean performance. In addition, we show our method is stealthy in that the produced backdoor samples are indistinguishable from clean samples in both image and text domains, which can successfully bypass existing backdoor defenses, highlighting the need for better defensive mechanisms against such stealthy backdoor attacks.
Title:
```
  Security Vulnerability Detection with Multitask Self-Instructed Fine-Tuning of Large Language Models
```
Authors: Aidan Z.H. Yang, Haoye Tian, He Ye, Ruben Martins, Claire Le Goues
Subjects: Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Software security vulnerabilities allow attackers to perform malicious activities to disrupt software operations. Recent Transformer-based language models have significantly advanced vulnerability detection, surpassing the capabilities of static analysis based deep learning models. However, language models trained solely on code tokens do not capture either the explanation of vulnerability type or the data flow structure information of code, both of which are crucial for vulnerability detection. We propose a novel technique that integrates a multitask sequence-to-sequence LLM with pro-gram control flow graphs encoded as a graph neural network to achieve sequence-to-classification vulnerability detection. We introduce MSIVD, multitask self-instructed fine-tuning for vulnerability detection, inspired by chain-of-thought prompting and LLM self-instruction. Our experiments demonstrate that MSIVD achieves superior performance, outperforming the highest LLM-based vulnerability detector baseline (LineVul), with a F1 score of 0.92 on the BigVul dataset, and 0.48 on the PreciseBugs dataset. By training LLMs and GNNs simultaneously using a combination of code and explanatory metrics of a vulnerable program, MSIVD represents a promising direction for advancing LLM-based vulnerability detection that generalizes to unseen data. Based on our findings, we further discuss the necessity for new labelled security vulnerability datasets, as recent LLMs have seen or memorized prior datasets' held-out evaluation data.
Title:
```
  M2CVD: Multi-Model Collaboration for Code Vulnerability Detection
```
Authors: Ziliang Wang, Ge Li, Jia Li, Yingfei Xiong, Jia Li, Zhi Jin
Subjects: Subjects: Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Large Language Models (LLMs) have strong capabilities in code comprehension, but fine-tuning costs and semantic alignment issues limit their project-specific optimization; conversely, code models such CodeBERT are easy to fine-tune, but it is often difficult to learn vulnerability semantics from complex code languages. To address these challenges, this paper introduces the Multi-Model Collaborative Vulnerability Detection approach (M2CVD) that leverages the strong capability of analyzing vulnerability semantics from LLMs to improve the detection accuracy of code models. M2CVD employs a novel collaborative process: first enhancing the quality of vulnerability semantic description produced by LLMs through the understanding of project code by code models, and then using these improved vulnerability semantic description to boost the detection accuracy of code models. We demonstrated M2CVD's effectiveness on two real-world datasets, where M2CVD significantly outperformed the baseline. In addition, we demonstrate that the M2CVD collaborative method can extend to other different LLMs and code models to improve their accuracy in vulnerability detection tasks.
Title:
```
  SETC: A Vulnerability Telemetry Collection Framework
```
Authors: Ryan Holeman, John Hastings, Varghese Mathew Vaidyan
Subjects: Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract As emerging software vulnerabilities continuously threaten enterprises and Internet services, there is a critical need for improved security research capabilities. This paper introduces the Security Exploit Telemetry Collection (SETC) framework - an automated framework to generate reproducible vulnerability exploit data at scale for robust defensive security research. SETC deploys configurable environments to execute and record rich telemetry of vulnerability exploits within isolated containers. Exploits, vulnerable services, monitoring tools, and logging pipelines are defined via modular JSON configurations and deployed on demand. Compared to current manual processes, SETC enables automated, customizable, and repeatable vulnerability testing to produce diverse security telemetry. This research enables scalable exploit data generation to drive innovations in threat modeling, detection methods, analysis techniques, and remediation strategies. The capabilities of the framework are demonstrated through an example scenario. By addressing key barriers in security data generation, SETC represents a valuable platform to support impactful vulnerability and defensive security research.
Title:
```
  Open-Vocabulary Part-Based Grasping
```
Authors: Tjeard van Oort, Dimity Miller, Will N. Browne, Nicolas Marticorena, Jesse Haviland, Niko Suenderhauf
Subjects: Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Many robotic applications require to grasp objects not arbitrarily but at a very specific object part. This is especially important for manipulation tasks beyond simple pick-and-place scenarios or in robot-human interactions, such as object handovers. We propose AnyPart, a practical system that combines open-vocabulary object detection, open-vocabulary part segmentation and 6DOF grasp pose prediction to infer a grasp pose on a specific part of an object in 800 milliseconds. We contribute two new datasets for the task of open-vocabulary part-based grasping, a hand-segmented dataset containing 1014 object-part segmentations, and a dataset of real-world scenarios gathered during our robot trials for individual objects and table-clearing tasks. We evaluate AnyPart on a mobile manipulator robot using a set of 28 common household objects over 360 grasping trials. AnyPart is capable of producing successful grasps 69.52 %, when ignoring robot-based grasp failures, AnyPart predicts a grasp location on the correct part 88.57 % of the time.
Title:
```
  Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024
```
Authors: Jinwoo Ahn, Junhyeok Park, Min-Jun Kim, Kang-Hyeon Kim, So-Yeong Sohn, Yun-Ji Lee, Du-Seong Chang, Yu-Jung Heo, Eun-Sol Kim
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In this paper, the solution of HYU MLLAB KT Team to the Multimodal Algorithmic Reasoning Task: SMART-101 CVPR 2024 Challenge is presented. Beyond conventional visual question-answering problems, the SMART-101 challenge aims to achieve human-level multimodal understanding by tackling complex visio-linguistic puzzles designed for children in the 6-8 age group. To solve this problem, we suggest two main ideas. First, to utilize the reasoning ability of a large-scale language model (LLM), the given visual cues (images) are grounded in the text modality. For this purpose, we generate highly detailed text captions that describe the context of the image and use these captions as input for the LLM. Second, due to the nature of puzzle images, which often contain various geometric visual patterns, we utilize an object detection algorithm to ensure these patterns are not overlooked in the captioning process. We employed the SAM algorithm, which can detect various-size objects, to capture the visual features of these geometric patterns and used this information as input for the LLM. Under the puzzle split configuration, we achieved an option selection accuracy Oacc of 29.5 on the test set and a weighted option selection accuracy (WOSA) of 27.1 on the challenge set.
Title:
```
  Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook
```
Authors: Yusif Ibrahimov, Tarique Anwar, Tommy Yuan
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Mental health constitutes a complex and pervasive global challenge, affecting millions of lives and often leading to severe consequences. In this paper, we conduct a thorough survey to explore the intersection of data science, artificial intelligence, and mental healthcare, focusing on the recent developments of mental disorder detection through online social media (OSM). A significant portion of the population actively engages in OSM platforms, creating a vast repository of personal data that holds immense potential for mental health analytics. The paper navigates through traditional diagnostic methods, state-of-the-art data- and AI-driven research studies, and the emergence of explainable AI (XAI) models for mental healthcare. We review state-of-the-art machine learning methods, particularly those based on modern deep learning, while emphasising the need for explainability in healthcare AI models. The experimental design section provides insights into prevalent practices, including available datasets and evaluation approaches. We also identify key issues and challenges in the field and propose promising future research directions. As mental health decisions demand transparency, interpretability, and ethical considerations, this paper contributes to the ongoing discourse on advancing XAI in mental healthcare through social media. The comprehensive overview presented here aims to guide researchers, practitioners, and policymakers in developing the area of mental disorder detection.
Title:
```
  fSEAD: a Composable FPGA-based Streaming Ensemble Anomaly Detection Library
```
Authors: Binglei Lou, David Boland, Philip H.W. Leong
Subjects: Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Machine learning ensembles combine multiple base models to produce a more accurate output. They can be applied to a range of machine learning problems, including anomaly detection. In this paper, we investigate how to maximize the composability and scalability of an FPGA-based streaming ensemble anomaly detector (fSEAD). To achieve this, we propose a flexible computing architecture consisting of multiple partially reconfigurable regions, pblocks, which each implement anomaly detectors. Our proof-of-concept design supports three state-of-the-art anomaly detection algorithms: Loda, RS-Hash and xStream. Each algorithm is scalable, meaning multiple instances can be placed within a pblock to improve performance. Moreover, fSEAD is implemented using High-level synthesis (HLS), meaning further custom anomaly detectors can be supported. Pblocks are interconnected via an AXI-switch, enabling them to be composed in an arbitrary fashion before combining and merging results at run-time to create an ensemble that maximizes the use of FPGA resources and accuracy. Through utilizing reconfigurable Dynamic Function eXchange (DFX), the detector can be modified at run-time to adapt to changing environmental conditions. We compare fSEAD to an equivalent central processing unit (CPU) implementation using four standard datasets, with speed-ups ranging from $3\times$ to $8\times$.
Title:
```
  EpiLearn: A Python Library for Machine Learning in Epidemic Modeling
```
Authors: Zewen Liu, Yunxiao Li, Mingyang Wei, Guancheng Wan, Max S.Y. Lau, Wei Jin
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract EpiLearn is a Python toolkit developed for modeling, simulating, and analyzing epidemic data. Although there exist several packages that also deal with epidemic modeling, they are often restricted to mechanistic models or traditional statistical tools. As machine learning continues to shape the world, the gap between these packages and the latest models has become larger. To bridge the gap and inspire innovative research in epidemic modeling, EpiLearn not only provides support for evaluating epidemic models based on machine learning, but also incorporates comprehensive tools for analyzing epidemic data, such as simulation, visualization, transformations, etc. For the convenience of both epidemiologists and data scientists, we provide a unified framework for training and evaluation of epidemic models on two tasks: Forecasting and Source Detection. To facilitate the development of new models, EpiLearn follows a modular design, making it flexible and easy to use. In addition, an interactive web application is also developed to visualize the real-world or simulated epidemic data. Our package is available at this https URL.
Title:
```
  ReCon1M:A Large-scale Benchmark Dataset for Relation Comprehension in Remote Sensing Imagery
```
Authors: Xian Sun, Qiwei Yan, Chubo Deng, Chenglong Liu, Yi Jiang, Zhongyan Hou, Wanxuan Lu, Fanglong Yao, Xiaoyu Liu, Lingxiang Hao, Hongfeng Yu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Scene Graph Generation (SGG) is a high-level visual understanding and reasoning task aimed at extracting entities (such as objects) and their interrelationships from images. Significant progress has been made in the study of SGG in natural images in recent years, but its exploration in the domain of remote sensing images remains very limited. The complex characteristics of remote sensing images necessitate higher time and manual interpretation costs for annotation compared to natural images. The lack of a large-scale public SGG benchmark is a major impediment to the advancement of SGG-related research in aerial imagery. In this paper, we introduce the first publicly available large-scale, million-level relation dataset in the field of remote sensing images which is named as ReCon1M. Specifically, our dataset is built upon Fair1M and comprises 21,392 images. It includes annotations for 859,751 object bounding boxes across 60 different categories, and 1,149,342 relation triplets across 64 categories based on these bounding boxes. We provide a detailed description of the dataset's characteristics and statistical information. We conducted two object detection tasks and three sub-tasks within SGG on this dataset, assessing the performance of mainstream methods on these tasks.
Title:
```
  Supervised Radio Frequency Interference Detection with SNNs
```
Authors: Nicholas J. Pritchard, Andreas Wicenec, Mohammed Bennamoun, Richard Dodson
Subjects: Subjects: Neural and Evolutionary Computing (cs.NE); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Radio Frequency Interference (RFI) poses a significant challenge in radio astronomy, arising from terrestrial and celestial sources, disrupting observations conducted by radio telescopes. Addressing RFI involves intricate heuristic algorithms, manual examination, and, increasingly, machine learning methods. Given the dynamic and temporal nature of radio astronomy observations, Spiking Neural Networks (SNNs) emerge as a promising approach. In this study, we cast RFI detection as a supervised multi-variate time-series segmentation problem. Notably, our investigation explores the encoding of radio astronomy visibility data for SNN inference, considering six encoding schemes: rate, latency, delta-modulation, and three variations of the step-forward algorithm. We train a small two-layer fully connected SNN on simulated data derived from the Hydrogen Epoch of Reionization Array (HERA) telescope and perform extensive hyper-parameter optimization. Results reveal that latency encoding exhibits superior performance, achieving a per-pixel accuracy of 98.8% and an f1-score of 0.761. Remarkably, these metrics approach those of contemporary RFI detection algorithms, notwithstanding the simplicity and compactness of our proposed network architecture. This study underscores the potential of RFI detection as a benchmark problem for SNN researchers, emphasizing the efficacy of SNNs in addressing complex time-series segmentation tasks in radio astronomy.
Title:
```
  RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection
```
Authors: Yujie Chen, Jiangyan Yi, Jun Xue, Chenglong Wang, Xiaohui Zhang, Shunbo Dong, Siding Zeng, Jianhua Tao, Lv Zhao, Cunhang Fan
Subjects: Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Fake artefacts for discriminating between bonafide and fake audio can exist in both short- and long-range segments. Therefore, combining local and global feature information can effectively discriminate between bonafide and fake audio. This paper proposes an end-to-end bidirectional state space model, named RawBMamba, to capture both short- and long-range discriminative information for audio deepfake detection. Specifically, we use sinc Layer and multiple convolutional layers to capture short-range features, and then design a bidirectional Mamba to address Mamba's unidirectional modelling problem and further capture long-range feature information. Moreover, we develop a bidirectional fusion module to integrate embeddings, enhancing audio context representation and combining short- and long-range information. The results show that our proposed RawBMamba achieves a 34.1\% improvement over Rawformer on ASVspoof2021 LA dataset, and demonstrates competitive performance on other datasets.
Title:
```
  Sequential Binary Classification for Intrusion Detection in Software Defined Networks
```
Authors: Ishan Chokshi, Shrihari Vasudevan, Nachiappan Sundaram, Raaghul Ranganathan
Subjects: Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Software-Defined Networks (SDN) are the standard architecture for network deployment. Intrusion Detection Systems (IDS) are a pivotal part of this technology as networks become more vulnerable to new and sophisticated attacks. Machine Learning (ML)-based IDS are increasingly seen as the most effective approach to handle this issue. However, IDS datasets suffer from high class imbalance, which impacts the performance of standard ML models. We propose Sequential Binary Classification (SBC) - an algorithm for multi-class classification to address this issue. SBC is a hierarchical cascade of base classifiers, each of which can be modelled on any general binary classifier. Extensive experiments are reported on benchmark datasets that evaluate the performance of SBC under different scenarios.
Title:
```
  Towards a real-time distributed feedback system for the transportation assistance of PwD
```
Authors: Iosif Polenakis, Vasileios Vouronikos, Maria Chroni, Stavros D. Nikolopoulos
Subjects: Subjects: Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In this work we propose the design principles of an integrated distributed system for the augment of the transportation for people with disabilities inside the road network of a city area utilizing the IT technologies. We propose the basis of our system upon the utilization of a distributed sensor network that will be incorporated by a real-time integrated feedback system. The main components of the proposed architecture include the Inaccessible City Point System, the Live Data Analysis and Response System, and the Obstruction Detection and Prevention System. The incorporation of these subsystems will provide real-time feedback assisting the transportation of individuals with mobility problems informing them on real-time about blocked ramps across the path defined to their destination, being also responsible for the information of the authorities about incidents regarding the collision of accessibility in place where the sensors detect an inaccessible point. The proposed design allows the addition of further extensions regarding the assistance of individuals with mobility problems providing a basis for its further implementation and improvement. In this work we provide the fundamental parts regarding the interconnection of the proposed architecture's components as also its potential deployment regarding the proposed architecture and its application in the area of a city.
Title:
```
  An Effective-Efficient Approach for Dense Multi-Label Action Detection
```
Authors: Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson, Adrian Hilton
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Unlike the sparse label action detection task, where a single action occurs in each timestamp of a video, in a dense multi-label scenario, actions can overlap. To address this challenging task, it is necessary to simultaneously learn (i) temporal dependencies and (ii) co-occurrence action relationships. Recent approaches model temporal information by extracting multi-scale features through hierarchical transformer-based networks. However, the self-attention mechanism in transformers inherently loses temporal positional information. We argue that combining this with multiple sub-sampling processes in hierarchical designs can lead to further loss of positional information. Preserving this information is essential for accurate action detection. In this paper, we address this issue by proposing a novel transformer-based network that (a) employs a non-hierarchical structure when modelling different ranges of temporal dependencies and (b) embeds relative positional encoding in its transformer layers. Furthermore, to model co-occurrence action relationships, current methods explicitly embed class relations into the transformer network. However, these approaches are not computationally efficient, as the network needs to compute all possible pair action class relations. We also overcome this challenge by introducing a novel learning paradigm that allows the network to benefit from explicitly modelling temporal co-occurrence action dependencies without imposing their additional computational costs during inference. We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets and show that our method improves the current state-of-the-art results.
Title:
```
  2DP-2MRC: 2-Dimensional Pointer-based Machine Reading Comprehension Method for Multimodal Moment Retrieval
```
Authors: Jiajun He, Tomoki Toda
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Moment retrieval aims to locate the most relevant moment in an untrimmed video based on a given natural language query. Existing solutions can be roughly categorized into moment-based and clip-based methods. The former often involves heavy computations, while the latter, due to overlooking coarse-grained information, typically underperforms compared to moment-based models. Hence, this paper proposes a novel 2-Dimensional Pointer-based Machine Reading Comprehension for Moment Retrieval Choice (2DP-2MRC) model to address the issue of imprecise localization in clip-based methods while maintaining lower computational complexity than moment-based methods. Specifically, we introduce an AV-Encoder to capture coarse-grained information at moment and video levels. Additionally, a 2D pointer encoder module is introduced to further enhance boundary detection for target moment. Extensive experiments on the HiREST dataset demonstrate that 2DP-2MRC significantly outperforms existing baseline models.
Title:
```
  Federated learning in food research
```
Authors: Zuzanna Fendor, Bas H.M. van der Velden, Xinxin Wang, Andrea Jr. Carnoli, Osman Mutlu, Ali Hürriyetoğlu
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Research in the food domain is at times limited due to data sharing obstacles, such as data ownership, privacy requirements, and regulations. While important, these obstacles can restrict data-driven methods such as machine learning. Federated learning, the approach of training models on locally kept data and only sharing the learned parameters, is a potential technique to alleviate data sharing obstacles. This systematic review investigates the use of federated learning within the food domain, structures included papers in a federated learning framework, highlights knowledge gaps, and discusses potential applications. A total of 41 papers were included in the review. The current applications include solutions to water and milk quality assessment, cybersecurity of water processing, pesticide residue risk analysis, weed detection, and fraud detection, focusing on centralized horizontal federated learning. One of the gaps found was the lack of vertical or transfer federated learning and decentralized architectures.
Title:
```
  UEMM-Air: A Synthetic Multi-modal Dataset for Unmanned Aerial Vehicle Object Detection
```
Authors: Fan Liu, Liang Yao, Shengxiang Xu, Chuanyi Zhang, Xinlei Zhang, Ting Wu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The development of multi-modal object detection for Unmanned Aerial Vehicles (UAVs) typically relies on a large amount of pixel-aligned multi-modal image data. However, existing datasets face challenges such as limited modalities, high construction costs, and imprecise annotations. To this end, we propose a synthetic multi-modal UAV-based object detection dataset, UEMM-Air. Specially, we simulate various UAV flight scenarios and object types using the Unreal Engine (UE). Then we design the UAV's flight logic to automatically collect data from different scenarios, perspectives, and altitudes. Finally, we propose a novel heuristic automatic annotation algorithm to generate accurate object detection labels. In total, our UEMM-Air consists of 20k pairs of images with 5 modalities and precise annotations. Moreover, we conduct numerous experiments and establish new benchmark results on our dataset. We found that models pre-trained on UEMM-Air exhibit better performance on downstream tasks compared to other similar datasets. The dataset is publicly available (this https URL) to support the research of multi-modal UAV object detection models.
Title:
```
  UnSupDLA: Towards Unsupervised Document Layout Analysis
```
Authors: Talha Uddin Sheikh, Tahira Shehzadi, Khurram Azeem Hashmi, Didier Stricker, Muhammad Zeshan Afzal
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Document layout analysis is a key area in document research, involving techniques like text mining and visual analysis. Despite various methods developed to tackle layout analysis, a critical but frequently overlooked problem is the scarcity of labeled data needed for analyses. With the rise of internet use, an overwhelming number of documents are now available online, making the process of accurately labeling them for research purposes increasingly challenging and labor-intensive. Moreover, the diversity of documents online presents a unique set of challenges in maintaining the quality and consistency of these labels, further complicating document layout analysis in the digital era. To address this, we employ a vision-based approach for analyzing document layouts designed to train a network without labels. Instead, we focus on pre-training, initially generating simple object masks from the unlabeled document images. These masks are then used to train a detector, enhancing object detection and segmentation performance. The model's effectiveness is further amplified through several unsupervised training iterations, continuously refining its performance. This approach significantly advances document layout analysis, particularly precision and efficiency, without labels.
Title:
```
  An ODMA-Based Unsourced Random Access Scheme with a Multiple Antenna Receiver
```
Authors: Mert Ozates, Mohammad Kazemi, Tolga M. Duman
Subjects: Subjects: Information Theory (cs.IT); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract We investigate the unsourced random access scheme assuming that the base station is equipped with multiple antennas, and propose a high-performing solution utilizing on-off-division multiple access. We assume that each user spreads its pilot sequence and polar codeword to the pilot and data parts of the transmission frame, respectively, based on a transmission pattern. The iterative receiver operation consists of pilot and pattern detection followed by channel vector and symbol estimation, polar decoding, and successive interference cancellation. Numerical findings demonstrate that the proposed scheme has superior performance compared to the state-of-the-art in various antenna settings.
Title:
```
  An automatic analysis of ultrasound vocalisations for the prediction of interaction context in captive Egyptian fruit bats
```
Authors: Andreas Triantafyllopoulos, Alexander Gebhard, Manuel Milling, Simon Rampp, Björn Schuller
Subjects: Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Prior work in computational bioacoustics has mostly focused on the detection of animal presence in a particular habitat. However, animal sounds contain much richer information than mere presence; among others, they encapsulate the interactions of those animals with other members of their species. Studying these interactions is almost impossible in a naturalistic setting, as the ground truth is often lacking. The use of animals in captivity instead offers a viable alternative pathway. However, most prior works follow a traditional, statistics-based approach to analysing interactions. In the present work, we go beyond this standard framework by attempting to predict the underlying context in interactions between captive \emph{Rousettus Aegyptiacus} using deep neural networks. We reach an unweighted average recall of over 30\% -- more than thrice the chance level -- and show error patterns that differ from our statistical analysis. This work thus represents an important step towards the automatic analysis of states in animals from sound.
Title:
```
  Cascading Unknown Detection with Known Classification for Open Set Recognition
```
Authors: Daniel Brignac, Abhijit Mahalanobis
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Deep learners tend to perform well when trained under the closed set assumption but struggle when deployed under open set conditions. This motivates the field of Open Set Recognition in which we seek to give deep learners the ability to recognize whether a data sample belongs to the known classes trained on or comes from the surrounding infinite world. Existing open set recognition methods typically rely upon a single function for the dual task of distinguishing between knowns and unknowns as well as making known class distinction. This dual process leaves performance on the table as the function is not specialized for either task. In this work, we introduce Cascading Unknown Detection with Known Classification (Cas-DC), where we instead learn specialized functions in a cascading fashion for both known/unknown detection and fine class classification amongst the world of knowns. Our experiments and analysis demonstrate that Cas-DC handily outperforms modern methods in open set recognition when compared using AUROC scores and correct classification rate at various true positive rates.
Title:
```
  Sustained Vowels for Pre- vs Post-Treatment COPD Classification
```
Authors: Andreas Triantafyllopoulos, Anton Batliner, Wolfgang Mayr, Markus Fendler, Florian Pokorny, Maurice Gerczuk, Shahin Amiriparian, Thomas Berghaus, Björn Schuller
Subjects: Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Chronic obstructive pulmonary disease (COPD) is a serious inflammatory lung disease affecting millions of people around the world. Due to an obstructed airflow from the lungs, it also becomes manifest in patients' vocal behaviour. Of particular importance is the detection of an exacerbation episode, which marks an acute phase and often requires hospitalisation and treatment. Previous work has shown that it is possible to distinguish between a pre- and a post-treatment state using automatic analysis of read speech. In this contribution, we examine whether sustained vowels can provide a complementary lens for telling apart these two states. Using a cohort of 50 patients, we show that the inclusion of sustained vowels can improve performance to up to 79\% unweighted average recall, from a 71\% baseline using read speech. We further identify and interpret the most important acoustic features that characterise the manifestation of COPD in sustained vowels.
Title:
```
  UMAD: Unsupervised Mask-Level Anomaly Detection for Autonomous Driving
```
Authors: Daniel Bogdoll, Noël Ollick, Tim Joseph, J. Marius Zöllner
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Dealing with atypical traffic scenarios remains a challenging task in autonomous driving. However, most anomaly detection approaches cannot be trained on raw sensor data but require exposure to outlier data and powerful semantic segmentation models trained in a supervised fashion. This limits the representation of normality to labeled data, which does not scale well. In this work, we revisit unsupervised anomaly detection and present UMAD, leveraging generative world models and unsupervised image segmentation. Our method outperforms state-of-the-art unsupervised anomaly detection.
Title:
```
  MOSA: Music Motion with Semantic Annotation Dataset for Cross-Modal Music Processing
```
Authors: Yu-Fen Huang, Nikki Moran, Simon Coleman, Jon Kelly, Shun-Hwa Wei, Po-Yin Chen, Yun-Hsin Huang, Tsung-Ping Chen, Yu-Chia Kuo, Yu-Chi Wei, Chih-Hsuan Li, Da-Yu Huang, Hsuan-Kai Kao, Ting-Wei Lin, Li Su
Subjects: Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In cross-modal music processing, translation between visual, auditory, and semantic content opens up new possibilities as well as challenges. The construction of such a transformative scheme depends upon a benchmark corpus with a comprehensive data infrastructure. In particular, the assembly of a large-scale cross-modal dataset presents major challenges. In this paper, we present the MOSA (Music mOtion with Semantic Annotation) dataset, which contains high quality 3-D motion capture data, aligned audio recordings, and note-by-note semantic annotations of pitch, beat, phrase, dynamic, articulation, and harmony for 742 professional music performances by 23 professional musicians, comprising more than 30 hours and 570 K notes of data. To our knowledge, this is the largest cross-modal music dataset with note-level annotations to date. To demonstrate the usage of the MOSA dataset, we present several innovative cross-modal music information retrieval (MIR) and musical content generation tasks, including the detection of beats, downbeats, phrase, and expressive contents from audio, video and motion data, and the generation of musicians' body motion from given music audio. The dataset and codes are available alongside this publication (this https URL).
Title:
```
  FPN-IAIA-BL: A Multi-Scale Interpretable Deep Learning Model for Classification of Mass Margins in Digital Mammography
```
Authors: Julia Yang, Alina Jade Barnett, Jon Donnelly, Satvik Kishore, Jerry Fang, Fides Regina Schwartz, Chaofan Chen, Joseph Y. Lo, Cynthia Rudin
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Digital mammography is essential to breast cancer detection, and deep learning offers promising tools for faster and more accurate mammogram analysis. In radiology and other high-stakes environments, uninterpretable ("black box") deep learning models are unsuitable and there is a call in these fields to make interpretable models. Recent work in interpretable computer vision provides transparency to these formerly black boxes by utilizing prototypes for case-based explanations, achieving high accuracy in applications including mammography. However, these models struggle with precise feature localization, reasoning on large portions of an image when only a small part is relevant. This paper addresses this gap by proposing a novel multi-scale interpretable deep learning model for mammographic mass margin classification. Our contribution not only offers an interpretable model with reasoning aligned with radiologist practices, but also provides a general architecture for computer vision with user-configurable prototypes from coarse- to fine-grained prototypes.
Title:
```
  A LoRa-based Energy-efficient Sensing System for Urban Data Collection
```
Authors: Lukas Schulthess, Tiago Salzmann, Christian Vogt, Michele Magno
Subjects: Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Nowadays, cities provide much more than shopping opportunities or working spaces. Individual locations such as parks and squares are used as meeting points and local recreation areas by many people. To ensure that they remain attractive in the future, the design of such squares must be regularly adapted to the needs of the public. These utilization trends can be derived using public data collection. The more diverse and rich the data sets are, the easier it is to optimize public space design through data analysis. Traditional data collection methods such as questionnaires, observations, or videos are either labor intensive or cannot guarantee to preserve the individual's privacy. This work presents a privacy-preserving, low-power, and low-cost smart sensing system that is capable of anonymously collecting data about public space utilization by analyzing the occupancy distribution of public seating. To support future urban planning the sensor nodes are capable of monitoring environmental noise, chair utilization, and their position, temperature, and humidity and provide them over a city-wide Long Range Wide Area Network (LoRaWAN). The final sensing system's robust operation is proven in a trial run at two public squares in a city with 16 sensor nodes over a duration of two months. By consuming 33.65 mWh per day with all subsystems enabled, including sitting detection based on a continuous acceleration measurement operating on a robust and simple threshold algorithm, the custom-designed sensor node achieves continuous monitoring during the 2-month trial run. The evaluation of the experimental results clearly shows how the two locations are used, which confirms the practicability of the proposed solution. All data collected during the field trial is publicly available as open data.
Title:
```
  Hybrid Video Anomaly Detection for Anomalous Scenarios in Autonomous Driving
```
Authors: Daniel Bogdoll, Jan Imhof, Tim Joseph, J. Marius Zöllner
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In autonomous driving, the most challenging scenarios are the ones that can only be detected within their temporal context. Most video anomaly detection approaches focus either on surveillance or traffic accidents, which are only a subfield of autonomous driving. In this work, we present HF$^2$-VAD$_{AD}$, a variation of the HF$^2$-VAD surveillance video anomaly detection method for autonomous driving. We learn a representation of normality from a vehicle's ego perspective and evaluate pixel-wise anomaly detections in rare and critical scenarios.
Title:
```
  Adaptive Opponent Policy Detection in Multi-Agent MDPs: Real-Time Strategy Switch Identification Using Running Error Estimation
```
Authors: Mohidul Haque Mridul, Mohammad Foysal Khan, Redwan Ahmed Rizvee, Md Mosaddek Khan
Subjects: Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In Multi-agent Reinforcement Learning (MARL), accurately perceiving opponents' strategies is essential for both cooperative and adversarial contexts, particularly within dynamic environments. While Proximal Policy Optimization (PPO) and related algorithms such as Actor-Critic with Experience Replay (ACER), Trust Region Policy Optimization (TRPO), and Deep Deterministic Policy Gradient (DDPG) perform well in single-agent, stationary environments, they suffer from high variance in MARL due to non-stationary and hidden policies of opponents, leading to diminished reward performance. Additionally, existing methods in MARL face significant challenges, including the need for inter-agent communication, reliance on explicit reward information, high computational demands, and sampling inefficiencies. These issues render them less effective in continuous environments where opponents may abruptly change their policies without prior notice. Against this background, we present OPS-DeMo (Online Policy Switch-Detection Model), an online algorithm that employs dynamic error decay to detect changes in opponents' policies. OPS-DeMo continuously updates its beliefs using an Assumed Opponent Policy (AOP) Bank and selects corresponding responses from a pre-trained Response Policy Bank. Each response policy is trained against consistently strategizing opponents, reducing training uncertainty and enabling the effective use of algorithms like PPO in multi-agent environments. Comparative assessments show that our approach outperforms PPO-trained models in dynamic scenarios like the Predator-Prey setting, providing greater robustness to sudden policy shifts and enabling more informed decision-making through precise opponent policy insights.
Keyword: face recognition

There is no result

Keyword: augmentation

Title:
```
  Evaluating the Effectiveness of Data Augmentation for Emotion Classification in Low-Resource Settings
```
Authors: Aashish Arora, Elsbeth Turcan
Subjects: Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Data augmentation has the potential to improve the performance of machine learning models by increasing the amount of training data available. In this study, we evaluated the effectiveness of different data augmentation techniques for a multi-label emotion classification task using a low-resource dataset. Our results showed that Back Translation outperformed autoencoder-based approaches and that generating multiple examples per training instance led to further performance improvement. In addition, we found that Back Translation generated the most diverse set of unigrams and trigrams. These findings demonstrate the utility of Back Translation in enhancing the performance of emotion classification models in resource-limited situations.
Title:
```
  TabPFGen -- Tabular Data Generation with TabPFN
```
Authors: Junwei Ma, Apoorv Dankar, George Stein, Guangwei Yu, Anthony Caterini
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Advances in deep generative modelling have not translated well to tabular data. We argue that this is caused by a mismatch in structure between popular generative models and discriminative models of tabular data. We thus devise a technique to turn TabPFN -- a highly performant transformer initially designed for in-context discriminative tabular tasks -- into an energy-based generative model, which we dub TabPFGen. This novel framework leverages the pre-trained TabPFN as part of the energy function and does not require any additional training or hyperparameter tuning, thus inheriting TabPFN's in-context learning capability. We can sample from TabPFGen analogously to other energy-based models. We demonstrate strong results on standard generative modelling tasks, including data augmentation, class-balancing, and imputation, unlocking a new frontier of tabular data generation.
Title:
```
  Select-Mosaic: Data Augmentation Method for Dense Small Object Scenes
```
Authors: Hao Zhang, Shuaijie Zhang, Renbin Zou
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Data augmentation refers to the process of applying a series of transformations or expansions to original data to generate new samples, thereby increasing the diversity and quantity of the data, effectively improving the performance and robustness of models. As a common data augmentation method, Mosaic data augmentation technique stitches multiple images together to increase the diversity and complexity of training data, thereby reducing the risk of overfitting. Although Mosaic data augmentation achieves excellent results in general detection tasks by stitching images together, it still has certain limitations for specific detection tasks. This paper addresses the challenge of detecting a large number of densely distributed small objects in aerial images by proposing the Select-Mosaic data augmentation method, which is improved with a fine-grained region selection strategy. The improved Select-Mosaic method demonstrates superior performance in handling dense small object detection tasks, significantly enhancing the accuracy and stability of detection models. Code is available at this https URL.
Title:
```
  Decision Mamba: A Multi-Grained State Space Model with Self-Evolution Regularization for Offline RL
```
Authors: Qi Lv, Xiang Deng, Gongwei Chen, Michael Yu Wang, Liqiang Nie
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract While the conditional sequence modeling with the transformer architecture has demonstrated its effectiveness in dealing with offline reinforcement learning (RL) tasks, it is struggle to handle out-of-distribution states and actions. Existing work attempts to address this issue by data augmentation with the learned policy or adding extra constraints with the value-based RL algorithm. However, these studies still fail to overcome the following challenges: (1) insufficiently utilizing the historical temporal information among inter-steps, (2) overlooking the local intrastep relationships among states, actions and return-to-gos (RTGs), (3) overfitting suboptimal trajectories with noisy labels. To address these challenges, we propose Decision Mamba (DM), a novel multi-grained state space model (SSM) with a self-evolving policy learning strategy. DM explicitly models the historical hidden state to extract the temporal information by using the mamba architecture. To capture the relationship among state-action-RTG triplets, a fine-grained SSM module is designed and integrated into the original coarse-grained SSM in mamba, resulting in a novel mamba architecture tailored for offline RL. Finally, to mitigate the overfitting issue on noisy trajectories, a self-evolving policy is proposed by using progressive regularization. The policy evolves by using its own past knowledge to refine the suboptimal actions, thus enhancing its robustness on noisy demonstrations. Extensive experiments on various tasks show that DM outperforms other baselines substantially.
Title:
```
  Efficient Topology-aware Data Augmentation for High-Degree Graph Neural Networks
```
Authors: Yurui Lai, Xiaoyang Lin, Renchi Yang, Hongtao Wang
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In recent years, graph neural networks (GNNs) have emerged as a potent tool for learning on graph-structured data and won fruitful successes in varied fields. The majority of GNNs follow the message-passing paradigm, where representations of each node are learned by recursively aggregating features of its neighbors. However, this mechanism brings severe over-smoothing and efficiency issues over high-degree graphs (HDGs), wherein most nodes have dozens (or even hundreds) of neighbors, such as social networks, transaction graphs, power grids, etc. Additionally, such graphs usually encompass rich and complex structure semantics, which are hard to capture merely by feature aggregations in GNNs. Motivated by the above limitations, we propose TADA, an efficient and effective front-mounted data augmentation framework for GNNs on HDGs. Under the hood, TADA includes two key modules: (i) feature expansion with structure embeddings, and (ii) topology- and attribute-aware graph sparsification. The former obtains augmented node features and enhanced model capacity by encoding the graph structure into high-quality structure embeddings with our highly-efficient sketching method. Further, by exploiting task-relevant features extracted from graph structures and attributes, the second module enables the accurate identification and reduction of numerous redundant/noisy edges from the input graph, thereby alleviating over-smoothing and facilitating faster feature aggregations over HDGs. Empirically, TADA considerably improves the predictive performance of mainstream GNN models on 8 real homophilic/heterophilic HDGs in terms of node classification, while achieving efficient training and inference processes.
Title:
```
  Learning Human Detected Differences in Directed Acyclic Graphs
```
Authors: Kathrin Guckes (née Ballweg), Alena Beyer, Prof. Margit Pohl, Prof. Tatiana von Landesberger
Subjects: Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Prior research has shown that human perception of similarity differs from mathematical measures in visual comparison tasks, including those involving directed acyclic graphs. This divergence can lead to missed differences and skepticism about algorithmic results. To address this, we aim to learn the structural differences humans detect in graphs visually. We want to visualize these human-detected differences alongside actual changes, enhancing credibility and aiding users in spotting overlooked differences. Our approach aligns with recent research in machine learning capturing human behavior. We provide a data augmentation algorithm, a dataset, and a machine learning model to support this task. This work fills a gap in learning differences in directed acyclic graphs and contributes to better comparative visualizations.
Title:
```
  Flow of Reasoning: Efficient Training of LLM Policy with Divergent Thinking
```
Authors: Fangxu Yu, Lai Jiang, Haoqiang Kang, Shibo Hao, Lianhui Qin
Subjects: Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Divergent thinking, the cognitive process of generating diverse solutions, is a hallmark of human creativity and problem-solving. For machines, sampling diverse solution trajectories in complex reasoning problems is crucial for robust outcomes, data augmentation, and enhanced model generalization. Large language models (LLMs) often struggle with generating high-quality, diverse reasoning. While supervised fine-tuning helps with quality, it requires extensive supervision data to capture the full diversity of solutions. Alternatively, reinforcement learning methods like PPO aim to find limited highest-reward solutions while neglecting the solution diversity, akin to convergent thinking. To address these limitations, we propose Flow of Reasoning (FoR) -- an efficient LLM training approach enabling diverse reasoning with minimal data. FoR formulates multi-step LLM reasoning as a Markovian flow from an initial state to terminal states. The formulation allows to adapt principled GFlowNet approaches to train the LLM as a policy, which is able to sample multiple reasoning paths with probabilities proportional to the unnormalized reward. Empirical results show that, with limited training data (e.g., 15 examples), FoR can discover diverse high-quality solutions that excel greatly beyond current state-of-the-art methods across three tasks, including embodied reasoning (BlocksWorld), math puzzle solving (Game24), and logical reasoning (PrOntoQA). Code is available at this https URL.
Title:
```
  SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion
```
Authors: Bingsong Bai, Fengping Wang, Yingming Gao, Ya Li
Subjects: Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Diffusion-based singing voice conversion (SVC) models have shown better synthesis quality compared to traditional methods. However, in cross-domain SVC scenarios, where there is a significant disparity in pitch between the source and target voice domains, the models tend to generate audios with hoarseness, posing challenges in achieving high-quality vocal outputs. Therefore, in this paper, we propose a Self-supervised Pitch Augmentation method for Singing Voice Conversion (SPA-SVC), which can enhance the voice quality in SVC tasks without requiring additional data or increasing model parameters. We innovatively introduce a cycle pitch shifting training strategy and Structural Similarity Index (SSIM) loss into our SVC model, effectively enhancing its performance. Experimental results on the public singing datasets M4Singer indicate that our proposed method significantly improves model performance in both general SVC scenarios and particularly in cross-domain SVC scenarios.
Title:
```
  ProFeAT: Projected Feature Adversarial Training for Self-Supervised Learning of Robust Representations
```
Authors: Sravanti Addepalli, Priyam Dey, R. Venkatesh Babu
Subjects: Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The need for abundant labelled data in supervised Adversarial Training (AT) has prompted the use of Self-Supervised Learning (SSL) techniques with AT. However, the direct application of existing SSL methods to adversarial training has been sub-optimal due to the increased training complexity of combining SSL with AT. A recent approach, DeACL, mitigates this by utilizing supervision from a standard SSL teacher in a distillation setting, to mimic supervised AT. However, we find that there is still a large performance gap when compared to supervised adversarial training, specifically on larger models. In this work, investigate the key reason for this gap and propose Projected Feature Adversarial Training (ProFeAT) to bridge the same. We show that the sub-optimal distillation performance is a result of mismatch in training objectives of the teacher and student, and propose to use a projection head at the student, that allows it to leverage weak supervision from the teacher while also being able to learn adversarially robust representations that are distinct from the teacher. We further propose appropriate attack and defense losses at the feature and projector, alongside a combination of weak and strong augmentations for the teacher and student respectively, to improve the training data diversity without increasing the training complexity. Through extensive experiments on several benchmark datasets and models, we demonstrate significant improvements in both clean and robust accuracy when compared to existing SSL-AT methods, setting a new state-of-the-art. We further report on-par/ improved performance when compared to TRADES, a popular supervised-AT method.
Title:
```
  Multi-Stain Multi-Level Convolutional Network for Multi-Tissue Breast Cancer Image Segmentation
```
Authors: Akash Modi, Sumit Kumar Jha, Purnendu Mishra, Rajiv Kumar, Kiran Aatre, Gursewak Singh, Shubham Mathur
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Digital pathology and microscopy image analysis are widely employed in the segmentation of digitally scanned IHC slides, primarily to identify cancer and pinpoint regions of interest (ROI) indicative of tumor presence. However, current ROI segmentation models are either stain-specific or suffer from the issues of stain and scanner variance due to different staining protocols or modalities across multiple labs. Also, tissues like Ductal Carcinoma in Situ (DCIS), acini, etc. are often classified as Tumors due to their structural similarities and color compositions. In this paper, we proposed a novel convolutional neural network (CNN) based Multi-class Tissue Segmentation model for histopathology whole-slide Breast slides which classify tumors and segments other tissue regions such as Ducts, acini, DCIS, Squamous epithelium, Blood Vessels, Necrosis, etc. as a separate class. Our unique pixel-aligned non-linear merge across spatial resolutions empowers models with both local and global fields of view for accurate detection of various classes. Our proposed model is also able to separate bad regions such as folds, artifacts, blurry regions, bubbles, etc. from tissue regions using multi-level context from different resolutions of WSI. Multi-phase iterative training with context-aware augmentation and increasing noise was used to efficiently train a multi-stain generic model with partial and noisy annotations from 513 slides. Our training pipeline used 12 million patches generated using context-aware augmentations which made our model stain and scanner invariant across data sources. To extrapolate stain and scanner invariance, our model was evaluated on 23000 patches which were for a completely new stain (Hematoxylin and Eosin) from a completely new scanner (Motic) from a different lab. The mean IOU was 0.72 which is on par with model performance on other data sources and scanners.
Title:
```
  Solution for CVPR 2024 UG2+ Challenge Track on All Weather Semantic Segmentation
```
Authors: Jun Yu, Yunxiang Zhang, Fengzhao Sun, Leilei Wang, Renjie Lu
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In this report, we present our solution for the semantic segmentation in adverse weather, in UG2+ Challenge at CVPR 2024. To achieve robust and accurate segmentation results across various weather conditions, we initialize the InternImage-H backbone with pre-trained weights from the large-scale joint dataset and enhance it with the state-of-the-art Upernet segmentation method. Specifically, we utilize offline and online data augmentation approaches to extend the train set, which helps us to further improve the performance of the segmenter. As a result, our proposed solution demonstrates advanced performance on the test set and achieves 3rd position in this challenge.
Title:
```
  Causality-inspired Latent Feature Augmentation for Single Domain Generalization
```
Authors: Jian Xu, Chaojie Ji, Yankai Cao, Ye Li, Ruxin Wang
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Single domain generalization (Single-DG) intends to develop a generalizable model with only one single training domain to perform well on other unknown target domains. Under the domain-hungry configuration, how to expand the coverage of source domain and find intrinsic causal features across different distributions is the key to enhancing the models' generalization ability. Existing methods mainly depend on the meticulous design of finite image-level transformation techniques and learning invariant features across domains based on statistical correlation between samples and labels in source domain. This makes it difficult to capture stable semantics between source and target domains, which hinders the improvement of the model's generalization performance. In this paper, we propose a novel causality-inspired latent feature augmentation method for Single-DG by learning the meta-knowledge of feature-level transformation based on causal learning and interventions. Instead of strongly relying on the finite image-level transformation, with the learned meta-knowledge, we can generate diverse implicit feature-level transformations in latent space based on the consistency of causal features and diversity of non-causal features, which can better compensate for the domain-hungry defect and reduce the strong reliance on initial finite image-level transformations and capture more stable domain-invariant causal features for generalization. Extensive experiments on several open-access benchmarks demonstrate the outstanding performance of our model over other state-of-the-art single domain generalization and also multi-source domain generalization methods.
Title:
```
  Enhancing Long-Term Memory using Hierarchical Aggregate Tree for Retrieval Augmented Generation
```
Authors: Aadharsh Aadhithya A, Sachin Kumar S, Soman K.P
Subjects: Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Large language models have limited context capacity, hindering reasoning over long conversations. We propose the Hierarchical Aggregate Tree memory structure to recursively aggregate relevant dialogue context through conditional tree traversals. HAT encapsulates information from children nodes, enabling broad coverage with depth control. We formulate finding best context as optimal tree traversal. Experiments show HAT improves dialog coherence and summary quality over baseline contexts, demonstrating the techniques effectiveness for multi turn reasoning without exponential parameter growth. This memory augmentation enables more consistent, grounded longform conversations from LLMs
Title:
```
  Comparing Data Augmentation Methods for End-to-End Task-Oriented Dialog Systems
```
Authors: Christos Vlachos, Themos Stafylakis, Ion Androutsopoulos
Subjects: Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Creating effective and reliable task-oriented dialog systems (ToDSs) is challenging, not only because of the complex structure of these systems, but also due to the scarcity of training data, especially when several modules need to be trained separately, each one with its own input/output training examples. Data augmentation (DA), whereby synthetic training examples are added to the training data, has been successful in other NLP systems, but has not been explored as extensively in ToDSs. We empirically evaluate the effectiveness of DA methods in an end-to-end ToDS setting, where a single system is trained to handle all processing stages, from user inputs to system outputs. We experiment with two ToDSs (UBAR, GALAXY) on two datasets (MultiWOZ, KVRET). We consider three types of DA methods (word-level, sentence-level, dialog-level), comparing eight DA methods that have shown promising results in ToDSs and other NLP systems. We show that all DA methods considered are beneficial, and we highlight the best ones, also providing advice to practitioners. We also introduce a more challenging few-shot cross-domain ToDS setting, reaching similar conclusions.
Title:
```
  Data Augmentation in Earth Observation: A Diffusion Model Approach
```
Authors: Tiago Sousa, Benoît Ries, Nicolas Guelfi
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract The scarcity of high-quality Earth Observation (EO) imagery poses a significant challenge, despite its critical role in enabling precise analysis and informed decision-making across various sectors. This scarcity is primarily due to atmospheric conditions, seasonal variations, and limited geographical coverage, which complicates the application of Artificial Intelligence (AI) in EO. Data augmentation, a widely used technique in AI that involves generating additional data mainly through parameterized image transformations, has been employed to increase the volume and diversity of data. However, this method often falls short in generating sufficient diversity across key semantic axes, adversely affecting the accuracy of EO applications. To address this issue, we propose a novel four-stage approach aimed at improving the diversity of augmented data by integrating diffusion models. Our approach employs meta-prompts for instruction generation, harnesses general-purpose vision-language models for generating rich captions, fine-tunes an Earth Observation diffusion model, and iteratively augments data. We conducted extensive experiments using four different data augmentation techniques, and our approach consistently demonstrated improvements, outperforming the established augmentation methods, revealing its effectiveness in generating semantically rich and diverse EO images.
Title:
```
  Compute Better Spent: Replacing Dense Layers with Structured Matrices
```
Authors: Shikai Qiu, Andres Potapczynski, Marc Finzi, Micah Goldblum, Andrew Gordon Wilson
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Dense linear layers are the dominant computational bottleneck in foundation models. Identifying more efficient alternatives to dense matrices has enormous potential for building more compute-efficient models, as exemplified by the success of convolutional networks in the image domain. In this work, we systematically explore structured matrices as replacements for dense matrices. We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance, especially as models scale. Using insights from the Maximal Update Parameterization, we determine the optimal scaling for initialization and learning rates of these unconventional layers. Finally, we measure the scaling laws of different structures to compare how quickly their performance improves with compute. We propose a novel matrix family containing Monarch matrices, the Block Tensor-Train (BTT), which we show performs better than dense matrices for the same compute on multiple tasks. On CIFAR-10/100 with augmentation, BTT achieves exponentially lower training loss than dense when training MLPs and ViTs. BTT matches dense ViT-S/32 performance on ImageNet-1k with 3.8 times less compute and is more efficient than dense for training small GPT-2 language models.
Title:
```
  Zero-Shot Audio Captioning Using Soft and Hard Prompts
```
Authors: Yiming Zhang, Xuenan Xu, Ruoyi Du, Haohe Liu, Yuan Dong, Zheng-Hua Tan, Wenwu Wang, Zhanyu Ma
Subjects: Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test sets from the same dataset. Such methods have two limitations. First, these methods are often data-hungry and require time-consuming and expensive human annotations to obtain audio-text pairs. Second, these models often suffer from performance degradation in cross-domain scenarios, i.e., when the input audio comes from a different domain than the training set, which, however, has received little attention. We propose an effective audio captioning method based on the contrastive language-audio pre-training (CLAP) model to address these issues. Our proposed method requires only textual data for training, enabling the model to generate text from the textual feature in the cross-modal semantic this http URL the inference stage, the model generates the descriptive text for the given audio from the audio feature by leveraging the audio-text alignment from CLAP.We devise two strategies to mitigate the discrepancy between text and audio embeddings: a mixed-augmentation-based soft prompt and a retrieval-based acoustic-aware hard prompt. These approaches are designed to enhance the generalization performance of our proposed model, facilitating the model to generate captions more robustly and accurately. Extensive experiments on AudioCaps and Clotho benchmarks show the effectiveness of our proposed method, which outperforms other zero-shot audio captioning approaches for in-domain scenarios and outperforms the compared methods for cross-domain scenarios, underscoring the generalization ability of our method.
Title:
```
  Improving Deep Learning-based Automatic Cranial Defect Reconstruction by Heavy Data Augmentation: From Image Registration to Latent Diffusion Models
```
Authors: Marek Wodzinski, Kamil Kwarciak, Mateusz Daniol, Daria Hemmerling
Subjects: Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Modeling and manufacturing of personalized cranial implants are important research areas that may decrease the waiting time for patients suffering from cranial damage. The modeling of personalized implants may be partially automated by the use of deep learning-based methods. However, this task suffers from difficulties with generalizability into data from previously unseen distributions that make it difficult to use the research outcomes in real clinical settings. Due to difficulties with acquiring ground-truth annotations, different techniques to improve the heterogeneity of datasets used for training the deep networks have to be considered and introduced. In this work, we present a large-scale study of several augmentation techniques, varying from classical geometric transformations, image registration, variational autoencoders, and generative adversarial networks, to the most recent advances in latent diffusion models. We show that the use of heavy data augmentation significantly increases both the quantitative and qualitative outcomes, resulting in an average Dice Score above 0.94 for the SkullBreak and above 0.96 for the SkullFix datasets. Moreover, we show that the synthetically augmented network successfully reconstructs real clinical defects. The work is a considerable contribution to the field of artificial intelligence in the automatic modeling of personalized cranial implants.
Title:
```
  Data Augmentation for Multivariate Time Series Classification: An Experimental Study
```
Authors: Romain Ilbert, Thai V. Hoang, Zonghua Zhang
Subjects: Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/
Pdf link: https://arxiv.org/pdf/
Abstract Our study investigates the impact of data augmentation on the performance of multivariate time series models, focusing on datasets from the UCR archive. Despite the limited size of these datasets, we achieved classification accuracy improvements in 10 out of 13 datasets using the Rocket and InceptionTime models. This highlights the essential role of sufficient data in training effective models, paralleling the advancements seen in computer vision. Our work delves into adapting and applying existing methods in innovative ways to the domain of multivariate time series classification. Our comprehensive exploration of these techniques sets a new standard for addressing data scarcity in time series analysis, emphasizing that diverse augmentation strategies are crucial for unlocking the potential of both traditional and deep learning models. Moreover, by meticulously analyzing and applying a variety of augmentation techniques, we demonstrate that strategic data enrichment can enhance model accuracy. This not only establishes a benchmark for future research in time series analysis but also underscores the importance of adopting varied augmentation approaches to improve model performance in the face of limited data availability.

LeeKyungwook / get-arxiv-noti

New submissions for Tuesday, 11 June 2024 (showing 666 of 666 entries ) #1141

Keyword: detection

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Keyword: face recognition

Keyword: augmentation

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title:

Title: