New submissions for Wed, 10 Apr 24

Keyword: detection

Localizing Moments of Actions in Untrimmed Videos of Infants with Autism Spectrum Disorder

Authors: Halil Ismail Helvaci, Sen-ching Samson Cheung, Chen-Nee Chuah, Sally Ozonoff
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.05849
Pdf link: https://arxiv.org/pdf/2404.05849
Abstract Autism Spectrum Disorder (ASD) presents significant challenges in early diagnosis and intervention, impacting children and their families. With prevalence rates rising, there is a critical need for accessible and efficient screening tools. Leveraging machine learning (ML) techniques, in particular Temporal Action Localization (TAL), holds promise for automating ASD screening. This paper introduces a self-attention based TAL model designed to identify ASD-related behaviors in infant videos. Unlike existing methods, our approach simplifies complex modeling and emphasizes efficiency, which is essential for practical deployment in real-world scenarios. Importantly, this work underscores the importance of developing computer vision methods capable of operating in naturalistic environments with little equipment control, addressing key challenges in ASD screening. This study is the first to conduct end-to-end temporal action localization in untrimmed videos of infants with ASD, offering promising avenues for early intervention and support. We report baseline results of behavior detection using our TAL model. We achieve 70% accuracy for look face, 79% accuracy for look object, 72% for smile and 65% for vocalization.
Towards Improved Semiconductor Defect Inspection for high-NA EUVL based on SEMI-SuperYOLO-NAS
Authors: Ying-Lin Chen, Jacob Deforce, Vic De Ridder, Bappaditya Dey, Victor Blanco, Sandip Halder, Philippe Leray
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.05862
Pdf link: https://arxiv.org/pdf/2404.05862
Abstract Due to potential pitch reduction, the semiconductor industry is adopting High-NA EUVL technology. However, its low depth of focus presents challenges for High Volume Manufacturing. To address this, suppliers are exploring thinner photoresists and new underlayers/hardmasks. These may suffer from poor SNR, complicating defect detection. Vision-based ML algorithms offer a promising solution for semiconductor defect inspection. However, developing a robust ML model across various image resolutions without explicit training remains a challenge for nano-scale defect inspection. This research's goal is to propose a scale-invariant ADCD framework capable of upscaling images, addressing this issue. We propose an improvised ADCD framework as SEMI-SuperYOLO-NAS, which builds upon the baseline YOLO-NAS architecture. This framework integrates an SR-assisted branch to aid in learning HR features by the defect detection backbone, particularly for detecting nano-scale defect instances from LR images. Additionally, the SR-assisted branch can recursively generate upscaled images from their corresponding downscaled counterparts, enabling defect detection inference across various image resolutions without requiring explicit training. Moreover, we investigate an improved data augmentation strategy aimed at generating diverse and realistic training datasets to enhance model performance. We have evaluated our proposed approach using two original FAB datasets obtained from two distinct processes and captured using two different imaging tools. Finally, we demonstrate zero-shot inference for our model on a new dataset, originating from a process condition distinct from the training dataset and possessing different Pitch characteristics. Experimental validation demonstrates that our proposed ADCD framework aids in increasing the throughput of imaging tools for defect inspection by reducing the required image pixel resolutions.
On the Fly Robotic-Assisted Medical Instrument Planning and Execution Using Mixed Reality
Authors: Letian Ai, Yihao Liu, Mehran Armand, Amir Kheradmand, Alejandro Martin-Gomez
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2404.05887
Pdf link: https://arxiv.org/pdf/2404.05887
Abstract Robotic-assisted medical systems (RAMS) have gained significant attention for their advantages in alleviating surgeons' fatigue and improving patients' outcomes. These systems comprise a range of human-computer interactions, including medical scene monitoring, anatomical target planning, and robot manipulation. However, despite its versatility and effectiveness, RAMS demands expertise in robotics, leading to a high learning cost for the operator. In this work, we introduce a novel framework using mixed reality technologies to ease the use of RAMS. The proposed framework achieves real-time planning and execution of medical instruments by providing 3D anatomical image overlay, human-robot collision detection, and robot programming interface. These features, integrated with an easy-to-use calibration method for head-mounted display, improve the effectiveness of human-robot interactions. To assess the feasibility of the framework, two medical applications are presented in this work: 1) coil placement during transcranial magnetic stimulation and 2) drill and injector device positioning during femoroplasty. Results from these use cases demonstrate its potential to extend to a wider range of medical scenarios.
Interference Reduction Design for Improved Multitarget Detection in ISAC Systems
Authors: Mamady Delamou, El Mehdi Amhoud
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2404.05895
Pdf link: https://arxiv.org/pdf/2404.05895
Abstract The advancement of wireless communication systems toward 5G and beyond is spurred by the demand for high data rates, exceedingly dependable low-latency communication, and extensive connectivity that aligns with sensing requisites such as advanced high-resolution sensing and target detection. Consequently, embedding sensing into communication has gained considerable attention. In this work, we propose an alternative approach for optimizing integrated sensing and communication (ISAC) waveform for target detection by concurrently maximizing the power of the communication signal at an intended user and minimizing the multi-user and sensing interference. We formulate the problem as a non-disciplined convex programming (NDCP) optimization and we use a distribution-based approach for interference cancellation. Precisely, we establish the distribution of the communication signal and the multi-user communication interference received by the intended user, and thereafter, we establish that the sensing interference can be distributed as a centralized Chi-squared if the sensing covariance matrix is idempotent. We design such a matrix based on the symmetrical idempotent property. Additionally, we propose a disciplined convex programming (DCP) form of the problem, and using successive convex approximation (SCA), we show that the solutions can reach a stable waveform for efficient target detection. Furthermore, we compare the proposed waveform with state-of-the-art radar-communication waveform designs and demonstrate its superior performance by computer simulations.
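The link between idempotent matrices and the chi-squared distribution mentioned in this abstract can be checked numerically. The sketch below is not the paper's waveform design; it only illustrates the textbook fact that for a real Gaussian vector x ~ N(0, I) and a symmetric idempotent matrix P of rank r, the quadratic form x^T P x follows a central chi-squared distribution with r degrees of freedom (mean r). The function name `symmetric_idempotent`, the dimensions, and the seeds are all assumptions made for illustration.

```python
import numpy as np

def symmetric_idempotent(n, rank, seed=0):
    """Return an n x n symmetric idempotent matrix (an orthogonal projection) of the given rank."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, rank))
    q, _ = np.linalg.qr(a)  # q has shape (n, rank) with orthonormal columns
    return q @ q.T          # projection onto the column space of a

# Hypothetical sizes chosen only for the demonstration.
P = symmetric_idempotent(n=6, rank=3)

# Draw standard Gaussian vectors and evaluate the quadratic form x^T P x for each row.
rng = np.random.default_rng(1)
x = rng.standard_normal((20_000, 6))
quad = np.einsum("ij,jk,ik->i", x, P, x)

print("P @ P == P:", np.allclose(P @ P, P))
print("trace (equals rank):", round(float(np.trace(P)), 6))
print("sample mean of x^T P x:", float(quad.mean()))
```

The sample mean should sit close to the rank of P, which is the mean of a chi-squared variable with that many degrees of freedom.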
ClusterRadar: an Interactive Web-Tool for the Multi-Method Exploration of Spatial Clusters Over Time
Authors: Lee Mason, Blánaid Hicks, Jonas S. Almeida
Subjects: Human-Computer Interaction (cs.HC)
Arxiv link: https://arxiv.org/abs/2404.05897
Pdf link: https://arxiv.org/pdf/2404.05897
Abstract Spatial cluster analysis, the detection of localized patterns of similarity in geospatial data, has a wide range of applications for scientific discovery and practical decision making. One way to detect spatial clusters is by using local indicators of spatial association, such as Local Moran's I or Getis-Ord Gi*. However, different indicators tend to produce substantially different results due to their distinct operational characteristics. Choosing a suitable method or comparing results from multiple methods is a complex task. Furthermore, spatial clusters are dynamic and it is often useful to track their evolution over time, which adds an additional layer of complexity. ClusterRadar is a web-tool designed to address these analytical challenges. The tool allows users to easily perform spatial clustering and analyze the results in an interactive environment, uniquely prioritizing temporal analysis and the comparison of multiple methods. The tool's interactive dashboard presents several visualizations, each offering a distinct perspective of the temporal and methodological aspects of the spatial clustering results. ClusterRadar has several features designed to maximize its utility to a broad user-base, including support for various geospatial formats, and a fully in-browser execution environment to preserve the privacy of sensitive data. Feedback from a varied set of researchers suggests ClusterRadar's potential for enhancing the temporal analysis of spatial clusters.
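For readers unfamiliar with the local indicators named in this abstract, here is a minimal sketch of Local Moran's I. It is not ClusterRadar's code. It assumes the common formulation I_i = (z_i / m2) * sum_j w_ij z_j, with z the mean-centred values, m2 the mean of z squared, and row-standardized weights. The function name `local_morans_i` and the six-region toy data are hypothetical, and the permutation test usually used to judge significance is omitted.

```python
import numpy as np

def local_morans_i(values, weights):
    """Local Moran's I for each region, using row-standardized spatial weights."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Row-standardize so each region's neighbour weights sum to 1;
    # regions without neighbours keep an all-zero row.
    row_sums = w.sum(axis=1, keepdims=True)
    w = np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)
    z = x - x.mean()               # deviations from the global mean
    m2 = (z ** 2).sum() / len(x)   # second moment of the deviations
    return z / m2 * (w @ z)        # z_i times the spatial lag of z, scaled by m2

# Toy example: six regions along a line, each adjacent to its immediate neighbours.
values = [10, 9, 8, 1, 2, 1]
adjacency = np.zeros((6, 6))
for i in range(5):
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1

scores = local_morans_i(values, adjacency)
print(scores)
```

Positive scores mark regions whose neighbours have similar values (high next to high, or low next to low), while negative scores mark regions that differ from their surroundings, such as the boundary between the high and low groups in the toy data.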
Deep Learning-Based Out-of-distribution Source Code Data Identification: How Far We Have Gone?
Authors: Van Nguyen, Xingliang Yuan, Tingmin Wu, Surya Nepal, Marthie Grobler, Carsten Rudolph
Subjects: Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2404.05964
Pdf link: https://arxiv.org/pdf/2404.05964
Abstract Software vulnerabilities (SVs) have become a common, serious, and crucial concern to safety-critical security systems. That leads to significant progress in the use of AI-based methods for software vulnerability detection (SVD). In practice, although AI-based methods have been achieving promising performances in SVD and other domain applications (e.g., computer vision), they are well-known to fail in detecting the ground-truth label of input data (referred to as out-of-distribution, OOD, data) lying far away from the training data distribution (i.e., in-distribution, ID). This drawback leads to serious issues where the models fail to indicate when they are likely mistaken. To address this problem, OOD detectors (i.e., determining whether an input is ID or OOD) have been applied before feeding the input data to the downstream AI-based modules. While OOD detection has been widely designed for computer vision and medical diagnosis applications, automated AI-based techniques for OOD source code data detection have not yet been well-studied and explored. To this end, in this paper, we propose an innovative deep learning-based approach addressing the OOD source code data identification problem. Our method is derived from an information-theoretic perspective with the use of innovative cluster-contrastive learning to effectively learn and leverage source code characteristics, enhancing data representation learning for solving the problem. The rigorous and comprehensive experiments on real-world source code datasets show the effectiveness and advancement of our approach compared to state-of-the-art baselines by a wide margin. In short, on average, our method achieves a significantly higher performance from around 15.27%, 7.39%, and 4.93% on the FPR, AUROC, and AUPR measures, respectively, in comparison with the baselines.
On Achievable Covert Communication Performance under CSI Estimation Error and Feedback Delay
Authors: Jiaqing Bai, Ji He, Yanping Chen, Yulong Shen, Xiaohong Jiang
Subjects: Information Theory (cs.IT)
Arxiv link: https://arxiv.org/abs/2404.05983
Pdf link: https://arxiv.org/pdf/2404.05983
Abstract Covert communication's effectiveness critically depends on precise channel state information (CSI). This paper investigates the impact of imperfect CSI on achievable covert communication performance in a two-hop relay system. Firstly, we introduce a two-hop covert transmission scheme utilizing channel inversion power control (CIPC) to manage opportunistic interference, eliminating the receiver's self-interference. Given that CSI estimation error (CEE) and feedback delay (FD) are the two primary factors leading to imperfect CSI, we construct a comprehensive theoretical model to accurately characterize their effects on CSI quality. With the aid of this model, we then derive closed-form solutions for detection error probability (DEP) and covert rate (CR), establishing an analytical framework to delineate the inherent relationship between CEE, FD, and covert performance. Furthermore, to mitigate the adverse effects of imperfect CSI on achievable covert performance, we investigate the joint optimization of channel inversion power and data symbol length to maximize CR under DEP constraints and propose an iterative alternating algorithm to solve the bi-dimensional non-convex optimization problem. Finally, extensive experimental results validate our theoretical framework and illustrate the impact of imperfect CSI on achievable covert performance.
Boosting Digital Safeguards: Blending Cryptography and Steganography
Authors: Anamitra Maiti, Subham Laha, Rishav Upadhaya, Soumyajit Biswas, Vikas Choudhary, Biplab Kar, Nikhil Kumar, Jaydip Sen
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.05985
Pdf link: https://arxiv.org/pdf/2404.05985
Abstract In today's digital age, the internet is essential for communication and the sharing of information, creating a critical need for sophisticated data security measures to prevent unauthorized access and exploitation. Cryptography encrypts messages into a cipher text that is incomprehensible to unauthorized readers, thus safeguarding data during its transmission. Steganography, on the other hand, originates from the Greek term for "covered writing" and involves the art of hiding data within another medium, thereby facilitating covert communication by making the message invisible. This proposed approach takes advantage of the latest advancements in Artificial Intelligence (AI) and Deep Learning (DL), especially through the application of Generative Adversarial Networks (GANs), to improve upon traditional steganographic methods. By embedding encrypted data within another medium, our method ensures that the communication remains hidden from prying eyes. The application of GANs enables a smart, secure system that utilizes the inherent sensitivity of neural networks to slight alterations in data, enhancing the protection against detection. By merging the encryption techniques of cryptography with the hiding capabilities of steganography, and augmenting these with the strengths of AI, we introduce a comprehensive security system designed to maintain both the privacy and integrity of information. This system is crafted not just to prevent unauthorized access or modification of data, but also to keep the existence of the data hidden. This fusion of technologies tackles the core challenges of data security in the current era of open digital communication, presenting an advanced solution with the potential to transform the landscape of information security.
FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models
Authors: Zhuohao Yu, Chang Gao, Wenjin Yao, Yidong Wang, Zhengran Zeng, Wei Ye, Jindong Wang, Yue Zhang, Shikun Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.06003
Pdf link: https://arxiv.org/pdf/2404.06003
Abstract The rapid development of large language model (LLM) evaluation methodologies and datasets has led to a profound challenge: integrating state-of-the-art evaluation techniques cost-effectively while ensuring reliability, reproducibility, and efficiency. Currently, there is a notable absence of a unified and adaptable framework that seamlessly integrates various evaluation approaches. Moreover, the reliability of evaluation findings is often questionable due to potential data contamination, with the evaluation efficiency commonly overlooked when facing the substantial costs associated with LLM inference. In response to these challenges, we introduce FreeEval, a modular and scalable framework crafted to enable trustworthy and efficient automatic evaluations of LLMs. Firstly, FreeEval's unified abstractions simplify the integration and improve the transparency of diverse evaluation methodologies, encompassing dynamic evaluation that demands sophisticated LLM interactions. Secondly, the framework integrates meta-evaluation techniques like human evaluation and data contamination detection, which, along with dynamic evaluation modules in the platform, enhance the fairness of the evaluation outcomes. Lastly, FreeEval is designed with a high-performance infrastructure, including distributed computation and caching strategies, enabling extensive evaluations across multi-node, multi-GPU clusters for open-source and proprietary LLMs.
Band-Attention Modulated RetNet for Face Forgery Detection
Authors: Zhida Zhang, Jie Cao, Wenkui Yang, Qihang Fan, Kai Zhou, Ran He
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2404.06022
Pdf link: https://arxiv.org/pdf/2404.06022
Abstract The transformer networks are extensively utilized in face forgery detection due to their scalability across large datasets. Despite their success, transformers face challenges in balancing the capture of global context, which is crucial for unveiling forgery clues, with computational complexity. To mitigate this issue, we introduce Band-Attention modulated RetNet (BAR-Net), a lightweight network designed to efficiently process extensive visual contexts while avoiding catastrophic forgetting. Our approach empowers the target token to perceive global information by assigning differential attention levels to tokens at varying distances. We implement self-attention along both spatial axes, thereby maintaining spatial priors and easing the computational burden. Moreover, we present the adaptive frequency Band-Attention Modulation mechanism, which treats the entire Discrete Cosine Transform spectrogram as a series of frequency bands with learnable weights. Together, BAR-Net achieves favorable performance on several face forgery datasets, outperforming current state-of-the-art methods.
Improving Facial Landmark Detection Accuracy and Efficiency with Knowledge Distillation
Authors: Zong-Wei Hong, Yu-Chen Lin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.06029
Pdf link: https://arxiv.org/pdf/2404.06029
Abstract The domain of computer vision has experienced significant advancements in facial-landmark detection, becoming increasingly essential across various applications such as augmented reality, facial recognition, and emotion analysis. Unlike object detection or semantic segmentation, which focus on identifying objects and outlining boundaries, facial-landmark detection aims to precisely locate and track critical facial features. However, deploying deep learning-based facial-landmark detection models on embedded systems with limited computational resources poses challenges due to the complexity of facial features, especially in dynamic settings. Additionally, ensuring robustness across diverse ethnicities and expressions presents further obstacles. Existing datasets often lack comprehensive representation of facial nuances, particularly within populations like those in Taiwan. This paper introduces a novel approach to address these challenges through the development of a knowledge distillation method. By transferring knowledge from larger models to smaller ones, we aim to create lightweight yet powerful deep learning models tailored specifically for facial-landmark detection tasks. Our goal is to design models capable of accurately locating facial landmarks under varying conditions, including diverse expressions, orientations, and lighting environments. The ultimate objective is to achieve high accuracy and real-time performance suitable for deployment on embedded systems. This method was successfully implemented and achieved a top 6th place finish out of 165 participants in the IEEE ICME 2024 PAIR competition.
Gaussian Pancakes: Geometrically-Regularized 3D Gaussian Splatting for Realistic Endoscopic Reconstruction
Authors: Sierra Bonilla, Shuai Zhang, Dimitrios Psychogyios, Danail Stoyanov, Francisco Vasconcelos, Sophia Bano
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.06128
Pdf link: https://arxiv.org/pdf/2404.06128
Abstract Within colorectal cancer diagnostics, conventional colonoscopy techniques face critical limitations, including a limited field of view and a lack of depth information, which can impede the detection of precancerous lesions. Current methods struggle to provide comprehensive and accurate 3D reconstructions of the colonic surface which can help minimize the missing regions and reinspection for pre-cancerous polyps. Addressing this, we introduce 'Gaussian Pancakes', a method that leverages 3D Gaussian Splatting (3D GS) combined with a Recurrent Neural Network-based Simultaneous Localization and Mapping (RNNSLAM) system. By introducing geometric and depth regularization into the 3D GS framework, our approach ensures more accurate alignment of Gaussians with the colon surface, resulting in smoother 3D reconstructions with novel viewing of detailed textures and structures. Evaluations across three diverse datasets show that Gaussian Pancakes enhances novel view synthesis quality, surpassing current leading methods with an 18% boost in PSNR and a 16% improvement in SSIM. It also delivers over 100X faster rendering and more than 10X shorter training times, making it a practical tool for real-time applications. Hence, this holds promise for achieving clinical translation for better detection and diagnosis of colorectal cancer.
SmurfCat at SemEval-2024 Task 6: Leveraging Synthetic Data for Hallucination Detection
Authors: Elisei Rykov, Yana Shishkina, Kseniia Petrushina, Kseniia Titova, Sergey Petrakov, Alexander Panchenko
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.06137
Pdf link: https://arxiv.org/pdf/2404.06137
Abstract In this paper, we present our novel systems developed for the SemEval-2024 hallucination detection task. Our investigation spans a range of strategies to compare model predictions with reference standards, encompassing diverse baselines, the refinement of pre-trained encoders through supervised learning, and ensemble approaches utilizing several high-performing models. Through these explorations, we introduce three distinct methods that exhibit strong performance metrics. To amplify our training data, we generate additional training samples from the unlabelled training subset. Furthermore, we provide a detailed comparative analysis of our approaches. Notably, our premier method achieved a commendable 9th place in the competition's model-agnostic track and 17th place in the model-aware track, highlighting its effectiveness and potential.
Differential Privacy for Anomaly Detection: Analyzing the Trade-off Between Privacy and Explainability
Authors: Fatima Ezzeddine, Mirna Saad, Omran Ayoub, Davide Andreoletti, Martin Gjoreski, Ihab Sbeity, Marc Langheinrich, Silvia Giordano
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.06144
Pdf link: https://arxiv.org/pdf/2404.06144
Abstract Anomaly detection (AD), also referred to as outlier detection, is a statistical process aimed at identifying observations within a dataset that significantly deviate from the expected pattern of the majority of the data. Such a process finds wide application in various fields, such as finance and healthcare. While the primary objective of AD is to yield high detection accuracy, the requirements of explainability and privacy are also paramount. The first ensures the transparency of the AD process, while the second guarantees that no sensitive information is leaked to untrusted parties. In this work, we exploit the trade-off of applying Explainable AI (XAI) through SHapley Additive exPlanations (SHAP) and differential privacy (DP). We perform AD with different models and on various datasets, and we thoroughly evaluate the cost of privacy in terms of decreased accuracy and explainability. Our results show that the enforcement of privacy through DP has a significant impact on detection accuracy and explainability, which depends on both the dataset and the considered AD model. We further show that the visual interpretation of explanations is also influenced by the choice of the AD algorithm.
Enhanced Radar Perception via Multi-Task Learning: Towards Refined Data for Sensor Fusion Applications
Authors: Huawei Sun, Hao Feng, Gianfranco Mauro, Julius Ott, Georg Stettinger, Lorenzo Servadei, Robert Wille
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2404.06165
Pdf link: https://arxiv.org/pdf/2404.06165
Abstract Radar and camera fusion yields robustness in perception tasks by leveraging the strength of both sensors. The typical extracted radar point cloud is 2D without height information due to insufficient antennas along the elevation axis, which challenges the network performance. This work introduces a learning-based approach to infer the height of radar points associated with 3D objects. A novel robust regression loss is introduced to address the sparse target challenge. In addition, a multi-task training strategy is employed, emphasizing important features. The average radar absolute height error decreases from 1.69 to 0.25 meters compared to the state-of-the-art height extension method. The estimated target height values are used to preprocess and enrich radar data for downstream perception tasks. Integrating this refined radar information further enhances the performance of existing radar camera fusion models for object detection and depth estimation tasks.
YOLC: You Only Look Clusters for Tiny Object Detection in Aerial Images
Authors: Chenguang Liu, Guangshuai Gao, Ziyue Huang, Zhenghui Hu, Qingjie Liu, Yunhong Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.06180
Pdf link: https://arxiv.org/pdf/2404.06180
Abstract Detecting objects from aerial images poses significant challenges due to the following factors: 1) Aerial images typically have very large sizes, generally with millions or even hundreds of millions of pixels, while computational resources are limited. 2) Small object size leads to insufficient information for effective detection. 3) Non-uniform object distribution leads to computational resource wastage. To address these issues, we propose YOLC (You Only Look Clusters), an efficient and effective framework that builds on an anchor-free object detector, CenterNet. To overcome the challenges posed by large-scale images and non-uniform object distribution, we introduce a Local Scale Module (LSM) that adaptively searches cluster regions for zooming in for accurate detection. Additionally, we modify the regression loss using Gaussian Wasserstein distance (GWD) to obtain high-quality bounding boxes. Deformable convolution and refinement methods are employed in the detection head to enhance the detection of small objects. We perform extensive experiments on two aerial image datasets, including Visdrone2019 and UAVDT, to demonstrate the effectiveness and superiority of our proposed approach.
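The Gaussian Wasserstein distance mentioned in this abstract can be illustrated for the simplest case of axis-aligned boxes. This sketch is not YOLC's loss. It assumes the common convention of modelling a box (cx, cy, w, h) as a 2D Gaussian with mean (cx, cy) and covariance diag(w^2/4, h^2/4), for which the squared 2-Wasserstein distance has a closed form. The function name `gwd_squared` is hypothetical, and any normalization that turns the distance into a training loss is left out because the abstract does not specify it.

```python
def gwd_squared(box_a, box_b):
    """Squared 2-Wasserstein distance between two axis-aligned boxes modelled as Gaussians.

    Each box is (cx, cy, w, h). With covariance diag(w^2/4, h^2/4), the matrix
    square roots are diag(w/2, h/2), so the distance splits into a centre term
    and a shape term.
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    centre_term = (cxa - cxb) ** 2 + (cya - cyb) ** 2
    shape_term = ((wa - wb) ** 2 + (ha - hb) ** 2) / 4
    return centre_term + shape_term

# Two small boxes that do not overlap: their IoU is 0, but the distance still
# reflects how far apart and how differently shaped they are.
print(gwd_squared((0, 0, 4, 2), (3, 4, 2, 2)))  # 26.0
print(gwd_squared((0, 0, 2, 2), (0, 0, 2, 2)))  # 0.0
```

One reason such a distance suits tiny objects is that it stays informative when boxes barely overlap or do not overlap at all, where IoU-based terms become zero.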
Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection
Authors: Ting Lei, Shaofeng Yin, Yang Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.06194
Pdf link: https://arxiv.org/pdf/2404.06194
Abstract Open-vocabulary human-object interaction (HOI) detection, which is concerned with the problem of detecting novel HOIs guided by natural language, is crucial for understanding human-centric scenes. However, prior zero-shot HOI detectors often employ the same levels of feature maps to model HOIs with varying distances, leading to suboptimal performance in scenes containing human-object pairs with a wide range of distances. In addition, these detectors primarily rely on category names and overlook the rich contextual information that language can provide, which is essential for capturing open vocabulary concepts that are typically rare and not well-represented by category names alone. In this paper, we introduce a novel end-to-end open vocabulary HOI detection framework with conditional multi-level decoding and fine-grained semantic enhancement (CMD-SE), harnessing the potential of Visual-Language Models (VLMs). Specifically, we propose to model human-object pairs with different distances with different levels of feature maps by incorporating a soft constraint during the bipartite matching process. Furthermore, by leveraging large language models (LLMs) such as GPT models, we exploit their extensive world knowledge to generate descriptions of human body part states for various interactions. Then we integrate the generalizable and fine-grained semantics of human body parts to improve interaction recognition. Experimental results on two datasets, SWIG-HOI and HICO-DET, demonstrate that our proposed method achieves state-of-the-art results in open vocabulary HOI detection. The code and models are available at https://github.com/ltttpku/CMD-SE-release.
Unified Physical-Digital Attack Detection Challenge
Authors: Haocheng Yuan, Ajian Liu, Junze Zheng, Jun Wan, Jiankang Deng, Sergio Escalera, Hugo Jair Escalante, Isabelle Guyon, Zhen Lei
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.06211
Pdf link: https://arxiv.org/pdf/2404.06211
Abstract Face Anti-Spoofing (FAS) is crucial to safeguard Face Recognition (FR) Systems. In real-world scenarios, FRs are confronted with both physical and digital attacks. However, existing algorithms often address only one type of attack at a time, which poses significant limitations in real-world scenarios where FR systems face hybrid physical-digital threats. To facilitate the research of Unified Attack Detection (UAD) algorithms, a large-scale UniAttackData dataset has been collected. UniAttackData is the largest public dataset for Unified Attack Detection, with a total of 28,706 videos, where each unique identity encompasses all advanced attack types. Based on this dataset, we organized a Unified Physical-Digital Face Attack Detection Challenge to boost the research in Unified Attack Detections. It attracted 136 teams for the development phase, with 13 qualifying for the final round. The results re-verified by the organizing team were used for the final ranking. This paper comprehensively reviews the challenge, detailing the dataset introduction, protocol definition, evaluation criteria, and a summary of published results. Finally, we focus on the detailed analysis of the highest-performing algorithms and offer potential directions for unified physical-digital attack detection inspired by this competition. Challenge Website: https://sites.google.com/view/face-anti-spoofing-challenge/welcome/challengecvpr2024.
VI-OOD: A Unified Representation Learning Framework for Textual Out-of-distribution Detection
Authors: Li-Ming Zhan, Bo Liu, Xiao-Ming Wu
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.06217
Pdf link: https://arxiv.org/pdf/2404.06217
Abstract Out-of-distribution (OOD) detection plays a crucial role in ensuring the safety and reliability of deep neural networks in various applications. While there has been a growing focus on OOD detection in visual data, the field of textual OOD detection has received less attention. Only a few attempts have been made to directly apply general OOD detection methods to natural language processing (NLP) tasks, without adequately considering the characteristics of textual data. In this paper, we delve into textual OOD detection with Transformers. We first identify a key problem prevalent in existing OOD detection methods: the biased representation learned through the maximization of the conditional likelihood $p(y\mid x)$ can potentially result in subpar performance. We then propose a novel variational inference framework for OOD detection (VI-OOD), which maximizes the likelihood of the joint distribution $p(x, y)$ instead of $p(y\mid x)$. VI-OOD is tailored for textual OOD detection by efficiently exploiting the representations of pre-trained Transformers. Through comprehensive experiments on various text classification tasks, VI-OOD demonstrates its effectiveness and wide applicability. Our code has been released at https://github.com/liam0949/LLM-OOD.
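The difference between the two objectives named in this abstract follows from the product rule of probability. The identity below is a general fact, not the paper's variational bound, which the abstract does not give:

$$\log p(x, y) = \log p(y \mid x) + \log p(x)$$

Maximizing the joint likelihood therefore adds a term, $\log p(x)$, that rewards modelling the inputs themselves, whereas maximizing $p(y\mid x)$ alone only rewards predicting labels for inputs assumed to be in-distribution.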
Automatic Defect Detection in Sewer Network Using Deep Learning Based Object Detector
Authors: Bach Ha, Birgit Schalter, Laura White, Joachim Koehler
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.06219
Pdf link: https://arxiv.org/pdf/2404.06219
Abstract Maintaining sewer systems in large cities is important, but also time and effort consuming, because visual inspections are currently done manually. To reduce this manual work, defects within sewer pipes should be located and classified automatically. In the past, multiple works have attempted to solve this problem using classical image processing, machine learning, or a combination of the two. However, each proposed solution focuses only on detecting a limited set of defect/structure types, such as fissure, root, and/or connection. Furthermore, due to the use of hand-crafted features and small training datasets, generalization is also problematic. To overcome these deficits, a sizable dataset covering 14.7 km of various sewer pipes was annotated by sewer maintenance experts in the scope of this work. In addition, an object detector (EfficientDet-D0) was trained for automatic defect detection. From the results of several experiments, peculiar properties of defects in the context of object detection, which greatly affect the annotation and training process, are identified and discussed. In the end, the final detector was able to detect 83% of defects in the test set; out of the missing 17%, only 0.77% are very severe defects. This work provides an example of applying deep learning-based object detection to an important but quiet engineering field. It also gives some practical pointers on how to annotate peculiar "objects", such as defects.
Aggressive or Imperceptible, or Both: Network Pruning Assisted Hybrid Byzantines in Federated Learning
Authors: Emre Ozfatura, Kerem Ozfatura, Alptekin Kupcu, Deniz Gunduz
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2404.06230
Pdf link: https://arxiv.org/pdf/2404.06230
Abstract Federated learning (FL) has been introduced to enable a large number of clients, possibly mobile devices, to collaboratively train a generalized machine learning model by utilizing a larger number of local samples without sharing them, which offers a degree of privacy to the collaborating clients. However, due to the participation of a large number of clients, it is often difficult to profile and verify each client, which creates a security threat: malicious participants may hamper the accuracy of the trained model by conveying poisoned models during training. Hence, the aggregation framework at the parameter server also needs to minimize the detrimental effects of these malicious clients. A plethora of attack and defence strategies have been analyzed in the literature. However, the Byzantine problem is often analyzed solely from the outlier detection perspective, being oblivious to the topology of neural networks (NNs). In the scope of this work, we argue that by extracting certain side information specific to the NN topology, one can design stronger attacks. Hence, inspired by sparse neural networks, we introduce a hybrid sparse Byzantine attack that is composed of two parts: one exhibiting a sparse nature and attacking only certain NN locations with higher sensitivity, and the other being more silent but accumulating over time. Each part ideally targets a different type of defence mechanism, and together they form a strong but imperceptible attack. Finally, we show through extensive simulations that the proposed hybrid Byzantine attack is effective against 8 different defence methods.
Towards Robust Domain Generation Algorithm Classification
Authors: Arthur Drichel, Marc Meyer, Ulrike Meyer
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2404.06236
Pdf link: https://arxiv.org/pdf/2404.06236
Abstract In this work, we conduct a comprehensive study on the robustness of domain generation algorithm (DGA) classifiers. We implement 32 white-box attacks, 19 of which are very effective and induce a false-negative rate (FNR) of approximately 100% on unhardened classifiers. To defend the classifiers, we evaluate different hardening approaches and propose a novel training scheme that leverages adversarial latent space vectors and discretized adversarial domains to significantly improve robustness. In our study, we highlight a pitfall to avoid when hardening classifiers and uncover training biases that can be easily exploited by attackers to bypass detection, but which can be mitigated by adversarial training (AT). We do not observe any trade-off between robustness and performance; on the contrary, hardening improves a classifier's detection performance for known and unknown DGAs. We implement all attacks and defenses discussed in this paper as a standalone library, which we make publicly available to facilitate hardening of DGA classifiers: https://gitlab.com/rwth-itsec/robust-dga-detection
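For readers unfamiliar with the metric, the false-negative rate quoted in this abstract is the share of truly malicious samples that a classifier labels as benign. A minimal sketch follows, assuming binary labels where 1 means a DGA-generated domain and 0 means benign; the function name is hypothetical and is not taken from the authors' library.

```python
def false_negative_rate(y_true, y_pred):
    """FNR = FN / (FN + TP), computed over samples whose true label is 1."""
    # False negatives: malicious samples predicted as benign.
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # True positives: malicious samples correctly flagged.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    if fn + tp == 0:
        raise ValueError("FNR is undefined without positive samples")
    return fn / (fn + tp)
```

An FNR close to 1.0 means that nearly every adversarial domain slips past the classifier, which is the situation the abstract reports for unhardened models.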
Label-Efficient 3D Object Detection For Road-Side Units
Authors: Minh-Quan Dao, Holger Caesar, Julie Stephany Berrio, Mao Shan, Stewart Worrall, Vincent Frémont, Ezio Malis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2404.06256
Pdf link: https://arxiv.org/pdf/2404.06256
Abstract Occlusion presents a significant challenge for safety-critical applications such as autonomous driving. Collaborative perception has recently attracted a large research interest thanks to the ability to enhance the perception of autonomous vehicles via deep information fusion with intelligent roadside units (RSU), thus minimizing the impact of occlusion. While significant advancement has been made, the data-hungry nature of these methods creates a major hurdle for their real-world deployment, particularly due to the need for annotated RSU data. Manually annotating the vast amount of RSU data required for training is prohibitively expensive, given the sheer number of intersections and the effort involved in annotating point clouds. We address this challenge by devising a label-efficient object detection method for RSU based on unsupervised object discovery. Our paper introduces two new modules: one for object discovery based on a spatial-temporal aggregation of point clouds, and another for refinement. Furthermore, we demonstrate that fine-tuning on a small portion of annotated data allows our object discovery models to narrow the performance gap with, or even surpass, fully supervised models. Extensive experiments are carried out in simulated and real-world datasets to evaluate our method.
Robust feature knowledge distillation for enhanced performance of lightweight crack segmentation models
Authors: Zhaohui Chen, Elyas Asadi Shamsabadi, Sheng Jiang, Luming Shen, Daniel Dias-da-Costa
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.06258
Pdf link: https://arxiv.org/pdf/2404.06258
Abstract Vision-based crack detection faces deployment challenges due to the size of robust models and edge device limitations. These can be addressed with lightweight models trained with knowledge distillation (KD). However, state-of-the-art (SOTA) KD methods compromise anti-noise robustness. This paper develops Robust Feature Knowledge Distillation (RFKD), a framework to improve robustness while retaining the precision of light models for crack segmentation. RFKD distils knowledge from a teacher model's logit layers and intermediate feature maps while leveraging mixed clean and noisy images to transfer robust patterns to the student model, improving its precision, generalisation, and anti-noise performance. To validate the proposed RFKD, a lightweight crack segmentation model, PoolingCrack Tiny (PCT), with only 0.5 M parameters, is also designed and used as the student to run the framework. The results show a significant enhancement in noisy images, with RFKD reaching a 62% enhanced mean Dice score (mDS) compared to SOTA KD methods.
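The mean Dice score (mDS) reported in this abstract is an overlap measure between predicted and ground-truth segmentation masks. A minimal sketch of the usual definition follows, assuming flat binary masks given as lists of 0/1 values; the function names and the smoothing constant `eps` are assumptions for illustration, not details from the paper.

```python
def dice_score(pred, target, eps=1e-7):
    """Dice = 2 * |P intersect T| / (|P| + |T|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # eps avoids division by zero; two empty masks score 1.0.
    return (2.0 * intersection + eps) / (total + eps)


def mean_dice(preds, targets):
    """Average the per-image Dice scores over a set of mask pairs."""
    scores = [dice_score(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)
```

A score of 1.0 means the predicted crack pixels coincide exactly with the annotation, and a score near 0 means almost no overlap.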
Learning Embeddings with Centroid Triplet Loss for Object Identification in Robotic Grasping
Authors: Anas Gouda, Max Schwarz, Christopher Reining, Sven Behnke, Alice Kirchheim
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.06277
Pdf link: https://arxiv.org/pdf/2404.06277
Abstract Foundation models are a strong trend in deep learning and computer vision. These models serve as a base for applications as they require minor or no further fine-tuning by developers to integrate into their applications. Foundation models for zero-shot object segmentation such as Segment Anything (SAM) output segmentation masks from images without any further object information. When they are followed in a pipeline by an object identification model, they can perform object detection without training. Here, we focus on training such an object identification model. A crucial practical aspect for an object identification model is to be flexible in input size. As object identification is an image retrieval problem, a suitable method should handle multi-query multi-gallery situations without constraining the number of input images (e.g. by having fixed-size aggregation layers). The key solution to train such a model is the centroid triplet loss (CTL), which aggregates image features to their centroids. CTL yields high accuracy, avoids misleading training signals and keeps the model input size flexible. In our experiments, we establish a new state of the art on the ArmBench object identification task, which shows general applicability of our model. We furthermore demonstrate an integrated unseen object detection pipeline on the challenging HOPE dataset, which requires fine-grained detection. There, our pipeline matches and surpasses related methods which have been trained on dataset-specific data.
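The idea of aggregating image features to their centroids can be illustrated with a small sketch. It assumes Euclidean distance and a fixed margin, and every name in it is hypothetical; the abstract does not give the paper's exact formulation (distance metric, sampling strategy, margin), so this is not the authors' implementation.

```python
import math


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid_triplet_loss(anchor, positives, negatives, margin=0.5):
    """Hinge loss that pulls the anchor toward the positive-class centroid
    and pushes it away from the negative-class centroid."""
    c_pos = centroid(positives)
    c_neg = centroid(negatives)
    return max(0.0, euclidean(anchor, c_pos) - euclidean(anchor, c_neg) + margin)
```

Because each class is reduced to a single centroid, the positive and negative sets may contain any number of images, which matches the flexible input size the abstract emphasizes.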
Experimental System Design of an Active Fault-Tolerant Quadrotor
Authors: Jennifer Yeom, Roshan Balu T M B, Guanrui Li, Giuseppe Loianno
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2404.06340
Pdf link: https://arxiv.org/pdf/2404.06340
Abstract Quadrotors have gained popularity over the last decade, aiding humans in complex tasks such as search and rescue, mapping and exploration. Despite their mechanical simplicity and versatility compared to other types of aerial vehicles, they remain vulnerable to rotor failures. In this paper, we propose an algorithmic and mechanical approach to addressing the quadrotor fault-tolerant problem in case of rotor failures. First, we present a fault-tolerant detection and control scheme that includes various attitude error metrics. The scheme transitions to a fault-tolerant control mode by surrendering the yaw control. Subsequently, to ensure compatibility with platform sensing constraints, we investigate how variations in robot rotational drag, achieved through a modular mechanical design appendage, result in yaw rates within sensor limits. This analysis offers a platform-agnostic framework for designing more reliable and robust quadrotors in the event of rotor failures. Extensive experimental results validate the proposed approach, providing insights into successfully designing a cost-effective quadrotor capable of fault-tolerant control. The overall design enhances safety in scenarios of faulty rotors, without the need for additional sensors or computational resources.
Generalizable Sarcasm Detection Is Just Around The Corner, Of Course!
Authors: Hyewon Jang, Diego Frassinelli
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2404.06357
Pdf link: https://arxiv.org/pdf/2404.06357
Abstract We tested the robustness of sarcasm detection models by examining their behavior when fine-tuned on four sarcasm datasets containing varying characteristics of sarcasm: label source (authors vs. third-party), domain (social media/online vs. offline conversations/dialogues), style (aggressive vs. humorous mocking). We tested their prediction performance on the same dataset (intra-dataset) and across different datasets (cross-dataset). For intra-dataset predictions, models consistently performed better when fine-tuned with third-party labels rather than with author labels. For cross-dataset predictions, most models failed to generalize well to the other datasets, implying that one type of dataset cannot represent all sorts of sarcasm with different styles and domains. Compared to the existing datasets, models fine-tuned on the new dataset we release in this work showed the highest generalizability to other datasets. With a manual inspection of the datasets and post-hoc analysis, we attributed the difficulty in generalization to the fact that sarcasm actually comes in different domains and styles. We argue that future sarcasm research should take the broad scope of sarcasm into account.
Keyword: face recognition

Greedy-DiM: Greedy Algorithms for Unreasonably Effective Face Morphs
Authors: Zander W. Blasingame, Chen Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.06025
Pdf link: https://arxiv.org/pdf/2404.06025
Abstract Morphing attacks, which aim to create a single image that contains the biometric information of multiple identities, are an emerging threat to state-of-the-art Face Recognition (FR) systems. Diffusion Morphs (DiM) are a recently proposed morphing attack that has achieved state-of-the-art performance for representation-based morphing attacks. However, none of the existing research on DiMs has leveraged the iterative nature of DiMs; instead, it has left the DiM model as a black box, treating it no differently than one would a Generative Adversarial Network (GAN) or Variational AutoEncoder (VAE). We propose a greedy strategy on the iterative sampling process of DiM models which searches for an optimal step guided by an identity-based heuristic function. We compare our proposed algorithm against ten other state-of-the-art morphing algorithms using the open-source SYN-MAD 2022 competition dataset. We find that our proposed algorithm is unreasonably effective, fooling all of the tested FR systems with an MMPMR of 100%, outperforming all other morphing algorithms compared.
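MMPMR (Mated Morph Presentation Match Rate) is commonly defined as the share of morphs whose comparison score against every contributing identity clears the verification threshold. A minimal sketch follows, assuming similarity scores where higher means a closer match; the function name and the input layout are hypothetical and do not come from the paper.

```python
def mmpmr(scores_per_morph, threshold):
    """Fraction of morphs that match ALL contributing identities.

    scores_per_morph: one list per morph, holding its similarity score
    against each identity that was blended into it.
    """
    if not scores_per_morph:
        raise ValueError("at least one morph is required")
    # A morph counts only if its weakest identity match exceeds the threshold.
    matched = sum(1 for scores in scores_per_morph if min(scores) > threshold)
    return matched / len(scores_per_morph)
```

Under this definition, an MMPMR of 100% means that every morph in the evaluation was accepted as each of its contributing identities.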
Unified Physical-Digital Attack Detection Challenge
Authors: Haocheng Yuan, Ajian Liu, Junze Zheng, Jun Wan, Jiankang Deng, Sergio Escalera, Hugo Jair Escalante, Isabelle Guyon, Zhen Lei
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.06211
Pdf link: https://arxiv.org/pdf/2404.06211
Abstract Face Anti-Spoofing (FAS) is crucial to safeguard Face Recognition (FR) Systems. In real-world scenarios, FRs are confronted with both physical and digital attacks. However, existing algorithms often address only one type of attack at a time, which poses significant limitations in real-world scenarios where FR systems face hybrid physical-digital threats. To facilitate the research of Unified Attack Detection (UAD) algorithms, a large-scale UniAttackData dataset has been collected. UniAttackData is the largest public dataset for Unified Attack Detection, with a total of 28,706 videos, where each unique identity encompasses all advanced attack types. Based on this dataset, we organized a Unified Physical-Digital Face Attack Detection Challenge to boost the research in Unified Attack Detections. It attracted 136 teams for the development phase, with 13 qualifying for the final round. The results re-verified by the organizing team were used for the final ranking. This paper comprehensively reviews the challenge, detailing the dataset introduction, protocol definition, evaluation criteria, and a summary of published results. Finally, we focus on the detailed analysis of the highest-performing algorithms and offer potential directions for unified physical-digital attack detection inspired by this competition. Challenge Website: https://sites.google.com/view/face-anti-spoofing-challenge/welcome/challengecvpr2024.
Keyword: augmentation

LLM-Augmented Retrieval: Enhancing Retrieval Models Through Language Models and Doc-Level Embedding
Authors: Mingrui Wu, Sheng Cao
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2404.05825
Pdf link: https://arxiv.org/pdf/2404.05825
Abstract Recently, embedding-based retrieval, or dense retrieval, has shown state-of-the-art results compared with traditional sparse or bag-of-words-based approaches. This paper introduces a model-agnostic doc-level embedding framework through large language model (LLM) augmentation. It also improves some important components in the retrieval model training process, such as negative sampling and the loss function. By implementing this LLM-augmented retrieval framework, we have been able to significantly improve the effectiveness of widely-used retriever models such as Bi-encoders (Contriever, DRAGON) and late-interaction models (ColBERTv2), thereby achieving state-of-the-art results on LoTTE datasets and BEIR datasets.
Towards Improved Semiconductor Defect Inspection for high-NA EUVL based on SEMI-SuperYOLO-NAS
Authors: Ying-Lin Chen, Jacob Deforce, Vic De Ridder, Bappaditya Dey, Victor Blanco, Sandip Halder, Philippe Leray
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.05862
Pdf link: https://arxiv.org/pdf/2404.05862
Abstract Due to potential pitch reduction, the semiconductor industry is adopting High-NA EUVL technology. However, its low depth of focus presents challenges for High Volume Manufacturing. To address this, suppliers are exploring thinner photoresists and new underlayers/hardmasks. These may suffer from poor SNR, complicating defect detection. Vision-based ML algorithms offer a promising solution for semiconductor defect inspection. However, developing a robust ML model across various image resolutions without explicit training remains a challenge for nano-scale defect inspection. The goal of this research is to propose a scale-invariant ADCD framework capable of upscaling images to address this issue. We propose an improvised ADCD framework, SEMI-SuperYOLO-NAS, which builds upon the baseline YOLO-NAS architecture. This framework integrates an SR-assisted branch to help the defect detection backbone learn HR features, particularly for detecting nano-scale defect instances in LR images. Additionally, the SR-assisted branch can recursively generate upscaled images from their corresponding downscaled counterparts, enabling defect detection inference across various image resolutions without requiring explicit training. Moreover, we investigate an improved data augmentation strategy aimed at generating diverse and realistic training datasets to enhance model performance. We have evaluated our proposed approach using two original FAB datasets obtained from two distinct processes and captured using two different imaging tools. Finally, we demonstrate zero-shot inference for our model on a new dataset, originating from a process condition distinct from the training dataset and possessing different pitch characteristics. Experimental validation demonstrates that our proposed ADCD framework aids in increasing the throughput of imaging tools for defect inspection by reducing the required image pixel resolutions.
Automated National Urban Map Extraction
Authors: Hasan Nasrallah, Abed Ellatif Samhat, Cristiano Nattero, Ali J. Ghandour
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2404.06202
Pdf link: https://arxiv.org/pdf/2404.06202
Abstract Developing countries usually lack the proper governance means to generate and regularly update a national rooftop map. Using traditional photogrammetry and surveying methods to produce a building map at the federal level is costly and time consuming. Using earth observation and deep learning methods, we can bridge this gap and propose an automated pipeline to produce such national urban maps. This paper aims to exploit the power of fully convolutional neural networks for multi-class building instance segmentation to achieve high object-wise accuracy. Building instance segmentation from sub-meter high-resolution satellite images can be achieved with relatively high pixel-wise metric scores. We detail all engineering steps needed to replicate this work and ensure highly accurate results in the dense and slum areas found in regions of the Global South that lack proper urban planning. We applied the proposed pipeline in a case study of Lebanon and successfully produced the first comprehensive national building footprint map, with approximately 1 million units at 84% accuracy. The proposed architecture relies on advanced augmentation techniques to overcome dataset scarcity, which is often the case in developing countries.
